German-Credit-EDA

💳 German Credit Risk Prediction & Explainability

Created by Saksham Sharma

This repository contains a comprehensive data science project focused on predicting credit risk using the German Credit Dataset. It covers the entire pipeline from Exploratory Data Analysis (EDA) and data preprocessing to model training, rigorous evaluation (with a focus on imbalanced data), and in-depth model interpretability using various feature importance techniques.

The goal is not just to build a predictive model, but also to understand why it makes certain decisions, providing valuable, actionable insights for credit evaluation processes.

Project Report

Video-based Project Demo

🌟 Project Highlights

Credit risk assessment is a fundamental challenge for financial institutions. Accurately predicting whether a loan applicant is likely to default (‘Bad Credit’) or repay (‘Good Credit’) is vital for financial stability and responsible lending. The German Credit Dataset presents this binary classification problem with features describing borrower demographics, financial status, and loan characteristics. A key challenge is the dataset’s class imbalance (more ‘Good’ than ‘Bad’ credit instances), which requires careful model evaluation and interpretation.

📊 Dataset

The project utilizes the German Credit Dataset from the UCI Machine Learning Repository.
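A minimal loading sketch is shown below. The file name and the `Risk` column name are assumptions for illustration; the notebook may load the data differently (for example via the OpenML mirror "credit-g" of the original UCI dataset).

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the dataset is stored locally.
df = pd.read_csv("german_credit_data.csv")

print(df.shape)
print(df["Risk"].value_counts())  # more "good" than "bad" -> class imbalance ("Risk" column name assumed)
```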

🔍 Exploratory Data Analysis (EDA)

The notebook includes an in-depth EDA section exploring the relationships between the various features and the target variable; all code is included so the analysis is fully reproducible.

A project report with interactive graphs, viewable in any web browser, is also available.
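As one illustrative example of the kind of plot produced in the EDA, the sketch below shows the target distribution across checking-account status. The column names (`Checking account`, `Risk`) are assumptions based on the features discussed in the interpretability section.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("german_credit_data.csv")  # hypothetical file name, as above

# Distribution of credit risk across checking-account status (column names assumed).
sns.countplot(data=df, x="Checking account", hue="Risk")
plt.title("Credit risk by checking account status")
plt.tight_layout()
plt.show()
```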

Key EDA Plot Example

Key EDA Findings:

🧹 Data Preprocessing

The preprocessing steps prepared the data for the machine learning model:
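A minimal sketch of typical steps, under the same column-name assumptions as above (the notebook's exact steps may differ): fill missing account information with an explicit "no_inf" category, one-hot encode categorical features, and create a stratified train/test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("german_credit_data.csv")  # hypothetical file name, as above

# Treat missing checking/saving account info as its own category ("no_inf"),
# matching the category referenced in the interpretability section.
df["Checking account"] = df["Checking account"].fillna("no_inf")
df["Saving accounts"] = df["Saving accounts"].fillna("no_inf")

# Binary target: 1 = bad credit, 0 = good credit (label values assumed).
y = (df["Risk"] == "bad").astype(int)

# One-hot encode categorical features (produces columns such as Housing_own, Sex_male).
X = pd.get_dummies(df.drop(columns=["Risk"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```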

🤖 Model Training
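The classifier referenced in the interpretability section is XGBoost. A hedged training sketch follows, reusing the names from the preprocessing sketch above; the hyperparameters are illustrative, not the notebook's exact settings.

```python
from xgboost import XGBClassifier

# scale_pos_weight compensates for the class imbalance (roughly 70% good vs 30% bad).
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric="logloss",
    random_state=42,
)
model.fit(X_train, y_train)
```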

📈 Model Evaluation

Given the imbalanced nature of the dataset, evaluation focused on metrics beyond simple accuracy:
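A short sketch of computing such imbalance-aware metrics (per-class precision/recall, weighted F1, and a confusion matrix), assuming the `model`, `X_test`, and `y_test` names from the sketches above:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_pred = model.predict(X_test)

# Per-class metrics are more informative than accuracy on an imbalanced dataset.
print(classification_report(y_test, y_pred, target_names=["good", "bad"]))
print("Weighted F1:", f1_score(y_test, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```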

Confusion Matrix Screenshot

Evaluation Results:

🧠 Model Interpretability (XAI)

Understanding why the model makes predictions is crucial in credit risk assessment. Various methods were used to gain insight into feature importance; each is sketched in the code example after this list:

  1. XGBoost Internal Feature Importance: Based on metrics like “Gain” or “Weight” within the tree structure during training.
  2. Permutation Feature Importance: Measures the decrease in model performance (weighted F1-score) when a feature’s values are randomly shuffled on the test set. A larger decrease indicates higher importance.
  3. SHapley Additive exPlanations (SHAP): A game-theory approach that assigns an importance value to each feature for each prediction, explaining its contribution to the final output. Aggregating SHAP values provides global feature importance and direction of impact.
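A minimal sketch of all three methods, reusing the `model`, `X_test`, and `y_test` names from the sketches above; the notebook's exact calls and plot styling may differ.

```python
import shap
from xgboost import plot_importance
from sklearn.inspection import permutation_importance

# 1. XGBoost internal importance, based on "Gain" within the trained trees.
plot_importance(model, importance_type="gain", max_num_features=10)

# 2. Permutation importance on the test set, scored with weighted F1: a large drop in the
# score when a feature is shuffled means the feature matters.
perm = permutation_importance(
    model, X_test, y_test, scoring="f1_weighted", n_repeats=10, random_state=42
)
ranking = sorted(zip(X_test.columns, perm.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
print(ranking[:10])

# 3. SHAP values via TreeExplainer (the usual choice for tree ensembles such as XGBoost).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```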

XGB Internal Feature Importance
XGB Internal Feature Importance

Permutation Feature Importance
Permutation Feature Importance (Impact on Weighted F1 Score)

SHAP Summary Plot Screenshot
SHAP Summary Plot (Overall Impact and Direction)

Key Interpretability Findings (consistent across methods): The features that most strongly drive the model's predictions are:

  1. Credit amount: Higher amounts strongly increase predicted risk.
  2. Checking account status (especially ‘no_inf’ / no info): Having no information about the checking account is a strong indicator of higher risk.
  3. Duration: Longer loan durations increase predicted risk.
  4. Age: Younger applicants tend to be associated with higher risk.

Other features like Housing_own, Sex_male, Job, and certain Purpose categories show moderate importance. Features like Savings account status generally have lower influence in this model.

This analysis confirms intuitive risk factors and highlights the critical role of checking account information.

🚀 How to Get Started

To run this project locally:

  1. Clone the repository:
    git clone https://github.com/AlexFierro9/German-Credit-EDA.git
    cd German-Credit-EDA
    
  2. Create and activate a virtual environment (recommended):
    # Using venv
    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
    
  3. Install dependencies:
    pip install -r requirements.txt
    
  4. Run the Jupyter Notebook: Open the notebook in your preferred Jupyter environment (Jupyter Lab, VS Code, etc.) and run the cells sequentially to reproduce the analysis and model training.

🛠️ Dependencies

📹 Demo

A video demo showcasing the notebook analysis is available here:

[Link to Demo Recording]

🙏 Acknowledgements