💳 German Credit Risk Prediction & Explainability
Created by Saksham Sharma
This repository contains a comprehensive data science project focused on predicting credit risk using the German Credit Dataset. It covers the entire pipeline from Exploratory Data Analysis (EDA) and data preprocessing to model training, rigorous evaluation (with a focus on imbalanced data), and in-depth model interpretability using various feature importance techniques.
The goal is not just to build a predictive model, but also to understand why it makes certain decisions, providing valuable, actionable insights for credit evaluation processes.
Project Report
Video-based Project Demo
🌟 Project Highlights
- Comprehensive EDA: Deep dive into the dataset to uncover distributions, relationships, and patterns across various features (Age, Credit Amount, Duration, Housing, Job, Purpose, Savings, Checking Account Status, Sex) in relation to credit risk.
- Robust Preprocessing: Handling missing values and transforming categorical data for model readiness.
- Powerful Modeling: Implementation of a highly effective XGBoost classifier, known for its performance on tabular data.
- Rigorous Evaluation: Moving beyond simple accuracy to focus on metrics critical for imbalanced datasets (Precision, Recall, F1-score, Confusion Matrix). Detailed explanation of why these metrics are essential in a credit risk context.
- Model Explainability (XAI): Understanding the model's decision-making process using:
- XGBoost's internal feature importance.
- Permutation Feature Importance (model-agnostic).
- SHapley Additive exPlanations (SHAP) for both global and local insights.
- Actionable Insights: Deriving practical takeaways from feature importance analysis to inform and potentially improve real-world credit assessment strategies.
🎯 Problem Definition
Credit risk assessment is a fundamental challenge for financial institutions. Accurately predicting whether a loan applicant is likely to default ("Bad Credit") or repay ("Good Credit") is vital for financial stability and responsible lending. The German Credit Dataset presents this binary classification problem with features describing borrower demographics, financial status, and loan characteristics. A key challenge is the dataset's class imbalance (more "Good" than "Bad" credit instances), which requires careful model evaluation and interpretation.
📊 Dataset
The project utilizes the German Credit Dataset from the UCI Machine Learning Repository.
- Source: UCI Machine Learning Repository: German Credit Data
- Instances: 1000
- Features: 20 (mixture of numerical and categorical) + 1 target variable ("Risk").
- Target Variable: "Risk" (Good Credit vs. Bad Credit).
- Key Characteristic: The dataset is imbalanced, with a majority of instances belonging to the "Good Credit" class.
🔍 Exploratory Data Analysis (EDA)
The notebook includes an in-depth EDA section exploring the relationships between various features and the target variable, with all code included for reproducibility.
A browser-viewable project report with interactive graphs is also available.
- Initial Look: Data types, missing values, and unique values.
- Target Variable Distribution: Visualization of the class imbalance.
- Age Analysis: Distribution of age by credit risk, including grouping ages into categories for further analysis.
- Housing Status Analysis: Credit amount and age distributions by housing status and risk.
- Sex Distribution: Counts and credit amount distributions by sex and risk.
- Job Category Analysis: Counts, credit amount, and age distributions by job category and risk.
- Purpose of Loan: Counts, age, and credit amount distributions by loan purpose and risk.
- Duration of Loan: Counts, average credit amount, and distribution frequency by loan duration and risk.
- Savings and Checking Account Status: Distribution of counts, credit amount, and age by savings and checking account status and risk.
- Correlation Heatmap: Visualization of linear correlations between numerical and one-hot encoded categorical features.
Key EDA Findings:
- The dataset is indeed imbalanced, with ~70% Good Credit and ~30% Bad Credit instances.
- Features like Credit amount, Duration, Age, Checking account status, and Housing show notable patterns related to credit risk.
- Interestingly, within certain categories, "Bad Credit" borrowers tend to have applied for higher credit amounts or longer durations than "Good Credit" borrowers in the same category.
- The "no checking account" status seems strongly linked to higher risk.
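As a minimal illustration, the imbalance check behind the first finding can be sketched with pandas (the toy frame below stands in for the real 1000-row dataset, which the notebook loads from CSV):

```python
import pandas as pd

# Toy stand-in reproducing the ~70/30 class ratio noted above.
df = pd.DataFrame({"Risk": ["good"] * 700 + ["bad"] * 300})

# Class distribution as proportions -- the imbalance visualized in the EDA.
dist = df["Risk"].value_counts(normalize=True)
print(dist)
```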
🧹 Data Preprocessing
The preprocessing steps prepared the data for the machine learning model:
- Missing Value Imputation: Missing values in "Saving accounts" and "Checking account" were filled with a placeholder category ("no_inf"), so that the absence of information is preserved as its own signal.
- Categorical Feature Encoding: One-Hot Encoding was applied to all remaining categorical features to convert them into a numerical format suitable for XGBoost.
- Target Variable Encoding: The "Risk" variable was encoded (e.g., Good=0, Bad=1).
- Feature and Target Split: The dataset was split into features (X) and the target variable (y).
- Train-Test Split: The data was divided into training and testing sets (e.g., 75% train, 25% test).
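A minimal sketch of these preprocessing steps (the toy frame and column names such as `Saving accounts` follow the dataset description; the notebook's actual code may differ in detail):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real 1000-row dataset.
df = pd.DataFrame({
    "Saving accounts": ["little", None, "rich", None, "moderate", "little", None, "rich"],
    "Housing": ["own", "rent", "own", "free", "rent", "own", "free", "rent"],
    "Credit amount": [1000, 2500, 700, 4000, 1200, 900, 3100, 600],
    "Risk": ["good", "bad", "good", "bad", "good", "good", "bad", "bad"],
})

# 1. Impute missing categoricals with a placeholder category.
df["Saving accounts"] = df["Saving accounts"].fillna("no_inf")

# 2. One-hot encode the remaining categorical features.
X = pd.get_dummies(df.drop(columns="Risk"))

# 3. Encode the target: Good=0, Bad=1.
y = df["Risk"].map({"good": 0, "bad": 1})

# 4. 75/25 train-test split, stratified to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```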
🤖 Model Training
- Model Choice: XGBoost (Extreme Gradient Boosting) was selected due to its proven performance on tabular data, robustness to multicollinearity, handling of non-linear relationships, built-in regularization, and feature importance capabilities.
- Hyperparameter Tuning: GridSearchCV was used to find the optimal hyperparameters for the XGBoost classifier, maximizing the weighted F1-score on the cross-validation folds.
- Training: The XGBoost model was trained on the preprocessed training data.
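The tuning setup can be sketched as below. To keep the example self-contained, scikit-learn's `GradientBoostingClassifier` stands in for `xgboost.XGBClassifier` (the notebook uses XGBoost itself), and the grid values and synthetic data are illustrative assumptions, not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for xgboost.XGBClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic ~70/30 imbalanced data standing in for the encoded credit features.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.7], random_state=0)

# Illustrative grid; the notebook's real search space may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",  # select by weighted F1, as described above
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```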
📈 Model Evaluation
Given the imbalanced nature of the dataset, evaluation focused on metrics beyond simple accuracy:
- Confusion Matrix: Visualizing True Positives, True Negatives, False Positives, and False Negatives.
- Classification Report: Providing Precision, Recall, and F1-score for both the "Good" and "Bad" credit classes, as well as overall accuracy and weighted/macro averages.
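With scikit-learn, both artifacts come from two calls; the labels and predictions below are made-up stand-ins for the trained model's output on the test set:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative values only; the real ones come from the fitted XGBoost model.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Rows = actual class, columns = predicted class (0 = Good, 1 = Bad).
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision/recall/F1 plus accuracy and weighted/macro averages.
print(classification_report(y_true, y_pred, target_names=["good", "bad"]))
```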
Evaluation Results:
🧠 Model Interpretability (XAI)
Understanding why the model makes predictions is crucial in credit risk. Various methods were used to gain insights into feature importance:
- XGBoost Internal Feature Importance: Based on metrics like âGainâ or âWeightâ within the tree structure during training.
- Permutation Feature Importance: Measures the decrease in model performance (weighted F1-score) when a feature's values are randomly shuffled on the test set. A larger decrease indicates higher importance.
- SHapley Additive exPlanations (SHAP): A game-theory approach that assigns an importance value to each feature for each prediction, explaining its contribution to the final output. Aggregating SHAP values provides global feature importance and direction of impact.
XGB Internal Feature Importance
Permutation Feature Importance (Impact on Weighted F1 Score)
SHAP Summary Plot (Overall Impact and Direction)
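The permutation-importance step can be sketched with scikit-learn's `permutation_importance` on synthetic data (again using `GradientBoostingClassifier` as a stand-in for the XGBoost model; the SHAP step is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the XGBoost model
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded credit features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in weighted F1.
result = permutation_importance(
    model, X_test, y_test, scoring="f1_weighted", n_repeats=10, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

Because the shuffling happens on held-out data, this measures what the model actually relies on at prediction time, which is why it complements XGBoost's train-time internal importances.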
Key Interpretability Findings (Consistent across methods):
The most consistently important features driving the modelâs predictions are:
- Credit amount: Higher amounts strongly increase predicted risk.
- Checking account status (especially "no_inf" / no info): Having no information about the checking account is a strong indicator of higher risk.
- Duration: Longer loan durations increase predicted risk.
- Age: Younger applicants tend to be associated with higher risk.
Other features like Housing_own, Sex_male, Job, and certain Purpose categories show moderate importance. Features like Savings account status generally have lower influence in this model.
This analysis confirms intuitive risk factors and highlights the critical role of checking account information.
🚀 How to Get Started
To run this project locally:
- Clone the repository:
  ```bash
  git clone https://github.com/AlexFierro9/German-Credit-EDA.git
  cd German-Credit-EDA
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  # Using venv
  python -m venv venv
  # On Windows
  .\venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Run the Jupyter Notebook: Open the notebook in your preferred Jupyter environment (Jupyter Lab, VS Code, etc.) and run the cells sequentially to reproduce the analysis and model training.
🛠️ Dependencies
```text
pandas
numpy
seaborn
matplotlib
plotly
scikit-learn
xgboost
shap
```
All dependencies are listed in requirements.txt.
📹 Demo
A video demo showcasing the notebook analysis is available here:
[Link to Demo Recording]
🙏 Acknowledgements
- The German Credit Dataset is made available by the UCI Machine Learning Repository.
- Thanks to the developers of the open-source libraries used in this project (pandas, numpy, scikit-learn, xgboost, shap, seaborn, matplotlib, plotly).