Synthetic Bank Customer Churn Data
Synthetic Data Generation
Free
About
This dataset supports binary classification of bank customer attrition. The synthetic records were created for the Playground Series S4 E1 competition and are suited to developing and testing machine learning models that predict whether a customer will exit the bank. The data includes both raw attributes and several engineered features intended to aid model performance.
Columns
- Surname: Label-encoded surnames.
- Surname_tfidf_0 through Surname_tfidf_4: Five features derived by applying a TF-IDF vectorizer to the surnames.
- Credit Score: A numerical value indicating the customer's credit score, ranging from 350 to 850.
- Geography: The customer's country of residence (France, Spain, or Germany).
- Gender: The customer's gender (Male or Female).
- Age: The customer's age, spanning 18 to 92 years.
- Tenure: The number of years the customer has maintained an account with the bank, from 0 to 10 years.
- Balance: The customer's account balance, with a maximum of approximately 251,000.
- NumOfProducts: The quantity of bank products utilized (e.g., savings account, credit card), ranging from 1 to 4.
- HasCrCard: Binary indicator (1 = yes, 0 = no) showing credit card ownership.
- IsActiveMember: Binary indicator (1 = yes, 0 = no) showing active membership status.
- EstimatedSalary: The customer's estimated salary, up to 200,000.
- Exited: The target variable, indicating customer churn (1 = yes, 0 = no).
- Germany, France, Spain: One-hot encoded geography features.
- Male, Female: One-hot encoded gender features.
- Mem__no__Products: Engineered feature calculated as NumOfProducts multiplied by IsActiveMember.
- Cred_Bal_Sal: Engineered feature calculated as (Credit Score * Balance) / EstimatedSalary.
- Bal_sal: Engineered feature calculated as Balance / EstimatedSalary.
- Tenure_Age: Engineered feature calculated as Tenure / Age.
- Age_Tenure_product: Engineered feature calculated as Age * Tenure.
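The five engineered columns can be reproduced from the raw attributes. The following is a minimal sketch with pandas, not the original generation code. The helper name `add_engineered_features`, the `credit_col` parameter (the listing writes the column as "Credit Score", while the file may spell it `CreditScore`), and the two sample rows are assumptions made for illustration.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame, credit_col: str = "CreditScore") -> pd.DataFrame:
    """Return a copy of df with the five engineered columns listed above.

    credit_col is an assumption: pass "Credit Score" if that is how the
    column is spelled in your copy of the file.
    """
    out = df.copy()
    out["Mem__no__Products"] = out["NumOfProducts"] * out["IsActiveMember"]
    out["Cred_Bal_Sal"] = (out[credit_col] * out["Balance"]) / out["EstimatedSalary"]
    out["Bal_sal"] = out["Balance"] / out["EstimatedSalary"]
    out["Tenure_Age"] = out["Tenure"] / out["Age"]
    out["Age_Tenure_product"] = out["Age"] * out["Tenure"]
    return out


# Two made-up rows, for illustration only (not taken from the dataset).
sample = pd.DataFrame(
    {
        "CreditScore": [600, 700],
        "Balance": [100000.0, 0.0],
        "EstimatedSalary": [50000.0, 80000.0],
        "NumOfProducts": [2, 1],
        "IsActiveMember": [1, 0],
        "Tenure": [5, 0],
        "Age": [40, 30],
    }
)

features = add_engineered_features(sample)
print(features[["Mem__no__Products", "Cred_Bal_Sal", "Bal_sal", "Tenure_Age", "Age_Tenure_product"]])
```

The two ratio features divide by EstimatedSalary, so a salary of zero would produce infinite or undefined values in pandas. Whether such rows occur in the file is not stated here, so it is worth checking before relying on these columns.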
Distribution
The file is provided in CSV format and is approximately 36.27 MB in size. It contains 25 columns and 175,000 records. No values are missing, and the underlying data is entirely synthetic.
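The stated shape and completeness are easy to verify after download. The sketch below assumes pandas; the helper `summarize` and the tiny inline CSV are hypothetical stand-ins so the example runs without the real file, whose name is not given in this listing.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Report row count, column count, and total number of missing cells."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
    }


# Stand-in for the real file: three made-up rows in CSV form.
# With the actual download, replace this with pd.read_csv(<path to the CSV>).
csv_text = "Age,Tenure,Exited\n18,0,0\n45,3,1\n92,10,0\n"
demo = pd.read_csv(io.StringIO(csv_text))

summary = summarize(demo)
print(summary)
```

Run against the full file, the same summary should be comparable with the row count, column count, and missing-value claim stated above.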
Usage
This collection is suited to developing binary classification models, specifically for predicting customer churn risk in the banking sector. It works well for educational purposes, machine learning competitions, and initial exploration of classification techniques on tabular data. It can also support investment-related analyses regarding banking stability.
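As a starting point for the classification use case, a baseline might look like the sketch below. It assumes scikit-learn is installed. The logistic regression model, the ROC AUC metric, and the helper `fit_baseline` are choices made for this example, not something the listing prescribes. The toy frame is randomly generated stand-in data with an arbitrary labeling rule and says nothing about the real records.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_baseline(df, feature_cols, target_col="Exited", seed=0):
    """Fit a scaled logistic regression and return (model, held-out ROC AUC)."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols],
        df[target_col],
        test_size=0.25,
        stratify=df[target_col],  # keep the churn ratio similar in both splits
        random_state=seed,
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc


# Toy stand-in data using a few of the column names from the listing.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame(
    {
        "Age": rng.integers(18, 93, size=n),
        "Tenure": rng.integers(0, 11, size=n),
        "IsActiveMember": rng.integers(0, 2, size=n),
    }
)
# Arbitrary rule so the toy target is learnable; not a claim about real churn drivers.
toy["Exited"] = (toy["Age"] + rng.normal(0, 10, size=n) > 55).astype(int)

model, auc = fit_baseline(toy, ["Age", "Tenure", "IsActiveMember"])
print(f"Held-out ROC AUC on toy data: {auc:.3f}")
```

On the real file, the same function could be called with the numeric, one-hot, and engineered columns as `feature_cols`. A tree-based model may be a stronger second step, but that is left to the reader.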
Coverage
The geographical scope covers customer records from France, Spain, and Germany. Customer ages range from 18 to 92 years, and bank tenure spans 0 to 10 years. Both male and female customers are represented. This is a static collection with no expected future updates.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Data Scientists and ML Engineers: For training, testing, and benchmarking binary classification algorithms focused on retention strategies.
- Students and Beginners: Ideal for learning core machine learning concepts due to its clear structure and synthetic nature.
- Banking Analysts: For simulating and understanding the key drivers behind customer attrition risk within financial institutions.
Dataset Name Suggestions
- Synthetic Bank Customer Churn Data
- Financial Attrition Binary Classification
- Bank Customer Exit Prediction Dataset
- S4 E1 Churn Analysis Data
Attributes
Original Data Source: Synthetic Bank Customer Churn Data