- Project Overview
- Live Application
- Dataset Description
- Approach
- Results
- Key Findings
- Feature Importance
- Installation Guide
- Running the Application
- Jupyter Book
- Note on Dependencies
- Pydantic Deprecation Warnings
- Deployment
- API Endpoints
- User Interface
- Future Improvements
This project focuses on developing, deploying, and serving a machine learning model for credit risk prediction using the Home Credit dataset. The goal is to create an interpretable, deployable, and financially sound model that effectively identifies potential loan defaulters while maintaining a balance between precision and recall.
The application is deployed and accessible at: https://retail-bank-risk-app-562777194669.us-central1.run.app/
The Home Credit dataset contains information about loan applications, including:
- Applicant demographics
- Financial history
- Loan specifics
- External data sources
The main data tables used are:
- application_train.csv
- application_test.csv
- Data Preprocessing:
  - Loaded and cleaned raw data
  - Performed memory optimization
  - Handled missing values and outliers
  - Created derived features
- Feature Engineering:
  - Binned continuous variables (age, income, credit amount)
  - Created financial ratios (debt-to-income, credit-to-goods, annuity-to-income)
  - Engineered time-based features
- Model Development:
  - Used the XGBoost algorithm
  - Optimized hyperparameters using Optuna (200 trials)
  - Selected 40 key features for interpretability and relevance
- Evaluation Metrics:
  - Focused on recall and F2-score
  - Analyzed precision-recall trade-offs
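The ratio and binning features listed under Feature Engineering can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`AMT_CREDIT`, `AMT_GOODS_PRICE`, `AMT_ANNUITY`, `AMT_INCOME_TOTAL`, `DAYS_BIRTH`) come from the Home Credit application tables, while the helper names, the age bin edges, and the use of credit amount as the "debt" in debt-to-income are assumptions made for the example.

```python
# Hypothetical age bin edges (upper bound in years, label); not from the project.
AGE_BINS = [
    (30, "under_30"),
    (45, "30_to_44"),
    (60, "45_to_59"),
    (float("inf"), "60_plus"),
]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def age_group(days_birth):
    """Bin age, given DAYS_BIRTH (negative days relative to the application date)."""
    age_years = -days_birth / 365.25
    for upper_bound, label in AGE_BINS:
        if age_years < upper_bound:
            return label
    return AGE_BINS[-1][1]


def engineer_features(row):
    """Build illustrative ratio and bin features from one application record."""
    income = row.get("AMT_INCOME_TOTAL")
    return {
        # Assumption: "debt" is approximated here by the requested credit amount.
        "debt_to_income": safe_ratio(row.get("AMT_CREDIT"), income),
        "credit_to_goods": safe_ratio(row.get("AMT_CREDIT"), row.get("AMT_GOODS_PRICE")),
        "annuity_to_income": safe_ratio(row.get("AMT_ANNUITY"), income),
        "age_group": age_group(row["DAYS_BIRTH"]),
    }
```

Returning `None` for an undefined ratio keeps missing or zero incomes visible to later steps instead of silently producing infinities.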
- Enhanced Financial Analysis:
  - Incorporated existing mortgage and loan payments
  - Implemented a comprehensive debt-to-income ratio calculation
  - Set a 40% threshold for total debt-to-income ratio
- Improved Risk Assessment:
  - Adjusted default probability based on debt-to-income ratio
  - Implemented a more nuanced risk level determination
- Realistic Financial Assumptions:
  - Used a 5% annual interest rate for loan calculations
  - Improved monthly payment calculations
- Expanded Anomaly Detection:
  - Set specific bounds for key financial variables
  - Flagged and reported anomalies in model output
- Enhanced Error Handling and Logging:
  - Improved input validation and error messaging
  - Added detailed logging of financial ratios and decision points
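The financial checks described above (a 5% annual rate, a total debt-to-income ratio with a 40% threshold, and bounds on key variables) could look something like the sketch below. It assumes a standard fixed-rate amortization formula, which the project may or may not use; the function names, the 60-month default term, and the example bounds are hypothetical. How the project adjusts the default probability from this ratio is not specified, so that step is left out.

```python
DTI_THRESHOLD = 0.40      # 40% total debt-to-income threshold (from the text above)
ANNUAL_RATE = 0.05        # 5% annual interest rate (from the text above)

# Hypothetical bounds for anomaly checks; the project's real bounds are not stated.
EXAMPLE_BOUNDS = {
    "monthly_income": (0.0, 1_000_000.0),
    "loan_amount": (0.0, 5_000_000.0),
}


def monthly_payment(principal, annual_rate=ANNUAL_RATE, term_months=60):
    """Fixed monthly payment for an amortized loan (standard annuity formula)."""
    if term_months <= 0:
        raise ValueError("term_months must be positive")
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


def total_debt_to_income(monthly_income, existing_monthly_debt, new_payment):
    """Share of monthly income consumed by existing debt plus the new payment."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return (existing_monthly_debt + new_payment) / monthly_income


def exceeds_dti_threshold(dti, threshold=DTI_THRESHOLD):
    """True when the total debt-to-income ratio is above the threshold."""
    return dti > threshold


def find_anomalies(values, bounds=EXAMPLE_BOUNDS):
    """Return the sorted names of variables that fall outside their bounds."""
    flagged = []
    for name, (lower, upper) in bounds.items():
        if name in values and not (lower <= values[name] <= upper):
            flagged.append(name)
    return sorted(flagged)
```

For instance, a 10,000 loan over 60 months at 5% gives a payment of roughly 188.71 per month under this formula; added to 1,200 of existing monthly debt on a 5,000 monthly income, the ratio stays below the 40% threshold.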
- Kaggle Competition Score: 67%
- Test Set Performance:
  - Recall: 74.42%
  - Precision: 11.23%
  - F1-Score: 19.52%
  - F2-Score: 35.02%
  - AUC-ROC: 0.6754
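The F1 and F2 scores follow from precision and recall through the general F-beta formula, where a beta above 1 weights recall more heavily than precision. The short sketch below (the function name is illustrative) recomputes both from the rounded precision and recall reported above, so the results match the listed scores only up to rounding.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


precision, recall = 0.1123, 0.7442  # rounded values from the results above
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

Because F2 counts recall four times as heavily as precision in the denominator, it rewards this model's high recall much more than F1 does, which is why it was chosen as a focus metric alongside recall.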
These metrics show that the final model is not the strongest in terms of raw performance. Our manual fine-tuning process, however, improved the model's behavior on edge cases, although this improvement has not yet been quantified. Incorporating domain knowledge and financial best practices lets the model make more nuanced decisions in complex scenarios that may not be well represented in the general test set.
- The model demonstrates high recall (74.42%) for detecting defaults, crucial in credit risk management.
- This high recall comes at the cost of low precision (11.23%), indicating a tendency to overpredict defaults.
- The model errs on the side of caution, which may be acceptable if the cost of missing a default significantly outweighs the cost of false alarms.
- The precision-recall curve suggests the model performs moderately well but is dealing with imbalanced data.
- Manual fine-tuning improved the model's alignment with real-world financial decision-making processes, particularly for edge cases and complex scenarios.
Top features influencing the model's predictions include:
- External source scores
- Age
- Income-related features
- Loan amount and goods price
- Various derived financial ratios
- Python 3.10+
- pip
- virtualenv (optional but recommended)
- Clone the repository:
git clone https://github.com/vytautas-bunevicius/retail-bank-risk-evaluation.git
cd retail-bank-risk-evaluation
- (Optional) Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install the project and its dependencies:
pip install -e .
To run the application, use the following command:
uvicorn app.main:app
This command:
- Uses `uvicorn` to run the FastAPI application
- Specifies `app.main:app` as the application import string, where:
  - `app.main` is the Python module path
  - `app` is the FastAPI application instance within that module
By default, this will run the server on http://127.0.0.1:8000. If you need to specify a different host or port, you can use the `--host` and `--port` options:
uvicorn app.main:app --host 0.0.0.0 --port 8080
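To make the `module:attribute` import string concrete, the sketch below resolves such a string with the standard library. It is a conceptual illustration of what a server has to do with `app.main:app`, not uvicorn's own code, and the function name is hypothetical.

```python
import importlib


def resolve_import_string(import_string):
    """Split 'package.module:attribute' and return the named attribute."""
    module_path, _, attribute = import_string.partition(":")
    if not module_path or not attribute:
        raise ValueError(
            f"Expected an import string like 'module:attribute', got {import_string!r}"
        )
    module = importlib.import_module(module_path)  # e.g. imports app.main
    return getattr(module, attribute)              # e.g. returns the app object


# Demonstrated with a standard-library module so it runs anywhere:
dumps = resolve_import_string("json:dumps")
print(dumps({"status": "ok"}))
```

Run from the project root, `resolve_import_string("app.main:app")` would return the FastAPI instance, provided the project's dependencies are installed.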
To view all notebooks in one place, you can run the Jupyter Book in the root directory. Follow these steps:
- Ensure you have Jupyter Book installed (it should already be present if you installed the dependencies from `requirements.txt`):
pip install jupyter-book
- Build the book:
jupyter-book build .
- Open the generated `_build/html/index.html` file in your web browser to view the compiled book.
This Jupyter Book provides a comprehensive view of all project notebooks, making it easier to navigate and understand the entire workflow.
The `setup.py` file in this project is configured to read and install dependencies from `requirements.txt`. When you run `pip install -e .`, it installs both the project and all dependencies listed in `requirements.txt`.
If you make changes to `requirements.txt`, you may need to run `pip install -e .` again to update the installed dependencies.
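A `setup.py` that reads its dependencies from `requirements.txt` typically does so with a small parsing helper along these lines. This is a generic sketch of the pattern, not a copy of this project's `setup.py`; the helper name is hypothetical, and the `setup(...)` call is shown only as a comment.

```python
def parse_requirements(text):
    """Extract requirement specifiers from the contents of a requirements file."""
    requirements = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        line = line.split(" #", 1)[0].strip()  # drop trailing inline comments
        if line.startswith("-"):
            continue  # skip pip options such as -r or -e
        requirements.append(line)
    return requirements


# In a setup.py, the result would typically be passed to setuptools, e.g.:
#
#     from pathlib import Path
#     from setuptools import setup
#
#     setup(
#         ...,
#         install_requires=parse_requirements(Path("requirements.txt").read_text()),
#     )
```

Because the list is read when `pip install -e .` runs, edits to `requirements.txt` take effect only after the command is run again.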
You may see deprecation warnings related to Pydantic validators. These are not errors, but suggestions to update to the newer Pydantic V2 style validators. Consider updating these in future maintenance of the project.
The application is deployed on Google Cloud Platform using Cloud Run. To deploy your own instance:
- Install and set up the Google Cloud SDK
- Authenticate with Google Cloud:
gcloud auth login
- Set your project ID:
gcloud config set project YOUR_PROJECT_ID
- Build and deploy the application using Cloud Build:
gcloud builds submit --config cloudbuild.yaml .
This command uses the `cloudbuild.yaml` configuration file to build and deploy the application, ensuring consistency and reproducibility in the deployment process.
- `/`: Serves the loan application form (GET)
- `/predict`: Makes a loan risk prediction (POST)
- `/health`: Health check endpoint (GET)
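As a quick way to confirm that a running instance is reachable, the health endpoint can be queried with the standard library. The sketch below is only an illustration: the helper names are hypothetical, the response body format is not documented here so it is returned as raw text, and the request schema for `/predict` is not shown because it is not specified in this document.

```python
import urllib.request
from urllib.parse import urljoin


def endpoint_url(base_url, path):
    """Join a base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def check_health(base_url, timeout=5.0):
    """GET /health and return (status_code, body_text).

    urllib raises urllib.error.HTTPError for non-2xx responses and
    urllib.error.URLError when the server cannot be reached.
    """
    request = urllib.request.Request(endpoint_url(base_url, "/health"), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Example, assuming the server was started locally on uvicorn's default address:
#     status, body = check_health("http://127.0.0.1:8000")
```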
The application features a user-friendly interface for loan applications.
- Incorporate additional data sources beyond the application data
- Explore advanced ensemble techniques
- Consider adding features like AMT_CREDIT_SUM_DEBT from the "bureau.csv" file
- Further refine the financial analysis based on industry feedback
- Continuously monitor and update anomaly detection thresholds
- Conduct more extensive testing on edge cases to quantify the improvements from manual fine-tuning