Loan Default Prediction Data
Fraud Detection & Risk Management
About
This dataset is designed to help banks predict loan defaults using machine learning. Lending is a primary revenue source for banks, but it carries the inherent risk that borrowers will default. To mitigate this risk, banks are looking to machine learning to develop robust models that classify whether a new borrower is likely to default.
The dataset is substantial and includes numerous explanatory factors such as the borrower's income, gender, and loan purpose. Users should be aware that the dataset exhibits strong multicollinearity and contains empty values, both of which complicate model development. The primary objective is to clean and understand the dataset, build a classification model to predict loan defaults, fine-tune its hyperparameters, and compare the evaluation metrics of various classification algorithms.
Columns
- ID: Unique identifier for each record.
- year: The year the data was recorded. This dataset primarily covers 2019.
- loan_limit: Indicates the type or limit of the loan, with 'cf' (91%) and 'ncf' (7%) as common values.
- Gender: The gender of the loan applicant, including categories like 'Male' (28%) and 'Joint' (28%).
- approv_in_adv: Status of approval in advance, primarily 'nopre' (84%) and 'pre' (16%).
- loan_type: Categorisation of the loan type, with 'type1' being the most common (76%).
- loan_purpose: The stated purpose of the loan, with 'p3' (38%) and 'p4' (37%) being frequent.
- Credit_Worthiness: Reflects the borrower's credit standing, largely 'l1' (96%).
- open_credit: Status of open credit, almost entirely 'nopc' (rounds to 100%).
- business_or_commercial: Indicates if the loan is for business or commercial purposes, mainly 'nob/c' (86%).
- loan_amount: The value of the loan requested, ranging from £16.5k to £3.58m, with a mean of £331k.
- rate_of_interest: The interest rate applied to the loan, with values ranging from 0 to 8, and a mean of 4.05.
- Interest_rate_spread: The spread in the interest rate, ranging from -3.64 to 3.36, with a mean of 0.44.
- Upfront_charges: Any upfront charges associated with the loan, ranging from £0 to £60k, with a mean of £3.22k.
- term: The term of the loan, predominantly 360 units (e.g., months), with a mean of 335.
- Neg_ammortization: Indicates if negative amortisation is present, mostly 'not_neg' (90%).
- interest_only: Specifies if the loan is interest-only, primarily 'not_int' (95%).
- lump_sum_payment: Indicates if a lump sum payment is involved, largely 'not_lpsm' (98%).
- property_value: The value of the property associated with the loan, ranging from £8k to £16.5m, with a mean of £498k.
- construction_type: The type of construction, exclusively 'sb' (100%).
- occupancy_type: The occupancy type of the property, primarily 'pr' (93%).
- Secured_by: How the loan is secured, exclusively 'home' (100%).
- total_units: The number of units, mostly '1U' (99%).
- income: The borrower's income, ranging from £0 to £579k, with a mean of £6.96k.
- credit_type: The type of credit, with 'CIB' (32%) and 'CRIF' (30%) being common.
- Credit_Score: The borrower's credit score, ranging from 500 to 900, with a mean of 700.
- co-applicant_credit_type: The co-applicant's credit type, split evenly between 'CIB' (50%) and 'EXP' (50%).
- age: The age range of the borrower, with '45-54' (23%) and '35-44' (22%) being common.
- submission_of_application: How the application was submitted, mostly 'to_inst' (64%).
- LTV: Loan-to-Value ratio, with values ranging from 0.97 to 7.83k, and a mean of 72.7.
- Region: The geographic region, with 'North' (50%) and 'south' (43%) being common.
- Security_Type: The type of security, exclusively 'direct' (100%).
- Status: The target variable, indicating loan default status (0 or 1).
- dtir1: Debt-to-income ratio (Dtir 1), ranging from 5 to 61, with a mean of 37.7.
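Several columns in the list above (for example Secured_by and Security_Type) are reported as taking a single value in about 100% of rows, so they are unlikely to help a classifier. The following is a minimal sketch of how such columns might be flagged with pandas. The function name `near_constant_columns`, the 0.99 threshold, and the sample values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def near_constant_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns whose most frequent value covers at least
    `threshold` of the non-missing rows."""
    flagged = []
    for name in df.columns:
        # value_counts sorts by frequency, so the first entry is the mode's share.
        shares = df[name].value_counts(normalize=True, dropna=True)
        # An all-missing column has no shares; it is skipped rather than flagged.
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(name)
    return flagged


# Tiny made-up frame that reuses column names from the list above.
sample = pd.DataFrame({
    "Secured_by": ["home"] * 5,
    "Security_Type": ["direct"] * 5,
    "loan_type": ["type1", "type1", "type2", "type3", "type1"],
    "Status": [0, 1, 0, 0, 1],
})
constant_like = near_constant_columns(sample)
print(constant_like)
```

On the real file, the flagged columns would be candidates for dropping before modelling, though that decision is left to the analyst.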
Distribution
The dataset is provided as a CSV file with a size of 28.48 MB. It comprises 34 columns and approximately 149,000 records. Most columns are complete, but 'rate_of_interest', 'Interest_rate_spread', 'Upfront_charges', 'property_value', 'income', and 'dtir1' have between 2% and 27% of their values missing.
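The share of missing values per column can be measured before choosing an imputation strategy. This is a small sketch using pandas; the helper name `missing_share` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first.
    Columns without any missing values are omitted."""
    share = df.isna().mean()
    return share[share > 0].sort_values(ascending=False)


# Made-up rows that reuse column names from this dataset.
sample = pd.DataFrame({
    "loan_amount": [116500, 206500, 406500, 456500],
    "rate_of_interest": [np.nan, 4.56, np.nan, 4.25],
    "income": [1740.0, 4980.0, 9480.0, np.nan],
})
report = missing_share(sample)
print(report)
```

Applied to the full CSV, the same call should surface the six columns named above.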
Usage
This dataset is ideally suited for:
- Developing and testing machine learning classification models to predict loan defaults.
- Risk assessment in the financial sector to identify potentially defaulting borrowers.
- Performing data cleaning and preprocessing techniques to handle multicollinearity and missing values.
- Hyperparameter tuning and comparing performance of various classification algorithms.
- Building predictive analytics solutions for banking and lending institutions.
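The modelling uses listed above could be approached along the following lines. This is a hedged sketch with scikit-learn, not a recommended or reference solution. The imputation strategies, the two candidate classifiers, the ROC AUC metric, and the synthetic data are all assumptions chosen for illustration, and the helper names `build_pipeline` and `compare_models` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols, classifier):
    """Impute and scale numeric columns, impute and one-hot encode
    categorical columns, then fit the given classifier."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", classifier)])


def compare_models(X, y, numeric_cols, categorical_cols, cv=3):
    """Return the mean cross-validated ROC AUC for each candidate model."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    scores = {}
    for name, clf in candidates.items():
        pipe = build_pipeline(numeric_cols, categorical_cols, clf)
        scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
    return scores


# Synthetic stand-in for the real CSV; the values and the target rule are made up.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "loan_amount": rng.uniform(16_500, 500_000, n),
    "Credit_Score": rng.integers(500, 901, n).astype(float),
    "loan_type": rng.choice(["type1", "type2", "type3"], n),
})
X.loc[::10, "loan_amount"] = np.nan  # inject some missing values
y = (X["Credit_Score"] < 650).astype(int)  # toy target, not the real Status column

results = compare_models(X, y, ["loan_amount", "Credit_Score"], ["loan_type"])
print(results)
```

Keeping preprocessing inside the pipeline means imputation and scaling are refit on each training fold, which avoids leaking information from the validation fold. For hyperparameter tuning, the same pipeline object can be passed to scikit-learn's `GridSearchCV` with parameter names such as `model__C`.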
Coverage
The data covers the year 2019. It includes demographic and financial factors such as borrower gender, income, age, credit score, loan type, and loan purpose. No geographic coverage is detailed beyond general 'Region' categories such as 'North' and 'south'. Some attributes exhibit multicollinearity and missing values, both of which should be addressed during analysis.
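One simple way to look for multicollinearity is to list pairs of numeric columns with a high absolute Pearson correlation. The sketch below assumes a 0.9 cut-off and uses synthetic values in which property_value is deliberately constructed from loan_amount; neither the threshold nor the relationship is taken from the real data, and `highly_correlated_pairs` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (column_a, column_b, |correlation|) for each pair of numeric
    columns whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Only the upper triangle is scanned, so each pair appears once.
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Synthetic example: property_value is built to track loan_amount closely.
rng = np.random.default_rng(0)
loan_amount = rng.uniform(16_500, 500_000, 200)
sample = pd.DataFrame({
    "loan_amount": loan_amount,
    "property_value": loan_amount * 1.4 + rng.normal(0, 5_000, 200),
    "Credit_Score": rng.integers(500, 901, 200),
})
pairs = highly_correlated_pairs(sample)
print(pairs)
```

Pairwise correlation only catches linear relationships between two columns at a time. Variance inflation factors are a common complement when several columns are jointly redundant.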
License
CC0: Public Domain
Who Can Use It
This dataset is intended for:
- Data Scientists and Machine Learning Engineers for building and validating predictive models.
- Financial Analysts and Risk Managers within banking institutions for assessing credit risk.
- Researchers and Academics studying financial stability, credit behaviour, and predictive modelling.
- Students and Beginners in data science looking to gain practical experience with a real-world classification problem.
Dataset Name Suggestions
- Loan Default Prediction Data
- Bank Loan Risk Classification
- Borrower Default Likelihood
- Credit Risk Assessment Dataset
- Financial Default Predictor
Attributes
Original Data Source: Loan Default Prediction Data