Imbalanced Client Risk Prediction Data
Data Science and Analytics
Free
About
This starter dataset is for client risk classification on class-imbalanced data. Class imbalance, where observations are distributed disproportionately across categories, is a common and difficult problem in machine learning classification. It makes accuracy an unreliable measure of model performance, because a model that labels every client as low-risk would still be correct for most records, and it complicates model training. The dataset is useful for developing predictive models in domains where the event of interest is rare, such as anti-fraud and anti-spam systems.
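To illustrate why accuracy misleads here, the following minimal sketch scores a hypothetical classifier that always predicts the majority class. It uses only the class counts reported in the Distribution section (1,527 records for class 0 and 196 for class 1), not the real file, and the helper functions are written out for illustration rather than taken from any library.

```python
# Class counts taken from the dataset description (bad_client_target).
N_MAJORITY = 1527  # label 0
N_MINORITY = 196   # label 1

y_true = [0] * N_MAJORITY + [1] * N_MINORITY
# Hypothetical "model" that ignores its input and always predicts class 0.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of records of the given true class that were predicted as that class."""
    predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_class) / len(predictions_for_class)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    labels = sorted(set(y_true))
    return sum(recall(y_true, y_pred, label) for label in labels) / len(labels)


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, 1)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"accuracy={acc:.3f}")
print(f"minority recall={minority_recall:.3f}")
print(f"balanced accuracy={bal_acc:.3f}")
```

The always-majority model reaches roughly 89% accuracy while finding none of the high-risk clients, which is why metrics such as minority-class recall or balanced accuracy are usually more informative on data like this.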
Columns
- month: The month of the associated purchase.
- credit_amount: The amount requested for the loan.
- credit_term: The duration or terms of the loan.
- age: The customer's age (ranging from 18 to 90).
- sex: The customer's gender (Male or Female).
- education: The level of education attained by the customer (e.g., Secondary special education, Higher education).
- product_type: The category of the purchased product (e.g., Cell phones, Household appliances).
- having_children_flg: A binary flag indicating the presence of children associated with the client.
- region: The customer location category.
- income: The customer's total income (ranging up to 401k).
- family_status: The client's familial status (e.g., Married).
- phone_operator: The mobile operator category used by the client.
- is_client: A flag indicating if the individual is an existing client of the institution.
- bad_client_target: The classification target variable, indicating whether the client is high-risk.
Distribution
The data is delivered as a CSV file named clients.csv and contains 1,723 valid records across 14 columns. The dataset is entirely clean, with zero missing or mismatched entries for all attributes. Its defining structural feature is the significant class imbalance within the bad_client_target variable: 1,527 records (about 89%) belong to the majority class (0), while only 196 records (about 11%) belong to the minority class (1).
Usage
This dataset suits practitioners who want to practise mitigating class imbalance. Typical applications include benchmarking classification algorithms on skewed data, experimenting with cost-sensitive training methods, applying sampling techniques (such as up-sampling the minority class or down-sampling the majority class), and developing predictive risk assessment models using tree-based algorithms. It is built specifically for identifying rare or high-risk events.
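The sampling and cost-sensitive ideas mentioned above can be sketched with the standard library alone. The example below is a minimal illustration under stated assumptions: the rows are synthetic placeholders that reproduce only the reported class counts (1,527 versus 196), the `row_id` field and the seed are hypothetical, and the weight formula is the common "balanced" heuristic `n_samples / (n_classes * class_count)`, not something prescribed by the dataset.

```python
import random

# Synthetic stand-in rows: only the class counts come from the dataset
# description; "row_id" is a hypothetical placeholder for the real columns.
rows = [{"row_id": i, "bad_client_target": 0} for i in range(1527)]
rows += [{"row_id": 1527 + i, "bad_client_target": 1} for i in range(196)]

majority = [r for r in rows if r["bad_client_target"] == 0]
minority = [r for r in rows if r["bad_client_target"] == 1]

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Up-sampling: draw minority rows WITH replacement until both classes match.
upsampled = majority + rng.choices(minority, k=len(majority))

# Down-sampling: keep a random subset of majority rows WITHOUT replacement.
downsampled = rng.sample(majority, k=len(minority)) + minority

# Cost-sensitive alternative: leave the data unchanged and weight each class
# inversely to its frequency ("balanced" heuristic).
counts = {0: len(majority), 1: len(minority)}
class_weight = {
    label: len(rows) / (len(counts) * count) for label, count in counts.items()
}

print(len(upsampled), len(downsampled))
print({label: round(weight, 3) for label, weight in class_weight.items()})
```

Resampling is normally applied to the training split only, so that evaluation still reflects the original class ratio. With the weights above, each class contributes the same total weight to a weighted loss, which is the goal of cost-sensitive training.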
Coverage
The data covers various customer demographics, including customer age (from 18 up to 90), gender (54% male), and diverse categories of education (Secondary special education is the most common at 49%) and family status. Financially, it spans loan requests between 5,000 and 301k and customer incomes up to 401k. The temporal scope is defined by the month of purchase and includes categorised customer location data (region).
License
CC0: Public Domain
Who Can Use It
- Machine Learning Specialists: Seeking to refine performance metrics and modelling strategies for classification problems where data classes are heavily skewed.
- Banking and Finance Researchers: Analysing how customer profiles relate to the probability of default or loan risk.
- Data Scientists: Learning practical methods for dealing with real-world complexities like highly unbalanced class distributions.
- Risk Management Developers: Building and testing predictive engines to identify high-risk clients or potential fraudulent activities.
Dataset Name Suggestions
- Imbalanced Client Risk Prediction Data
- Starter Credit Classification Set
- Financial Risk Scoring Data with Skewed Classes
- Imbalanced Customer Profile Data
Attributes
Original Data Source: Imbalanced Client Risk Prediction Data
