Financial Transaction Fraud Detection Dataset
Finance & Banking Analytics
Free
About
This dataset is designed to support building, training, and evaluating machine learning models that detect fraudulent financial transactions and, more broadly, identify anomalies in financial activity. The data is severely imbalanced: less than one per cent of transactions are classified as fraudulent, which makes detection challenging. To enhance detection capabilities, the dataset includes several supplementary files containing location-based scores, proprietary grouping weights, network turn-around times, and vulnerability scores, all of which can be merged with the main transaction data.
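To see why this imbalance matters, consider a baseline that labels every transaction as clean. The sketch below uses synthetic labels with an assumed fraud rate of 0.5 per cent (the description above states only that the rate is below one per cent), so the counts are illustrative and not taken from the dataset.

```python
# Illustrative only: synthetic labels with an ASSUMED fraud rate of 0.5 %.
# The dataset description says only that fraud is below 1 % of transactions.
n_total = 100_000
n_fraud = 500  # hypothetical count, chosen for round arithmetic

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # trivial baseline: predict "clean" for everything

# Accuracy: share of all transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall: share of actual fraud cases that the baseline catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.995
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The baseline looks excellent on accuracy while catching no fraud at all, which is why accuracy alone is a poor guide for data like this.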
Columns
The dataset comprises several CSV files, each with distinct columns:
- train.csv (28 columns):
  - id (integer): A masked, unique identifier for each transaction.
  - Group (string): A masked grouping label.
  - Per1 to Per9 (float): Nine masked numeric features.
  - Dem1 to Dem9 (float): Nine additional masked numeric features, likely demographic.
  - Cred1 to Cred6 (float): Six masked credit or risk-related features.
  - Normalised_FNT (float): A masked numeric field.
  - Target (integer): The fraud indicator, where 1 signifies fraud and 0 indicates a clean transaction.
- test_share.csv (27 columns):
  - Identical to train.csv but lacks the Target column; intended for making predictions.
- Geo_scores.csv:
  - id (integer)
  - geo_score (float): Geospatial location scores associated with transactions.
- Lambda_wts.csv:
  - Group (string)
  - lambda_wt (float): Proprietary weights or scores for each group.
- Qset_tats.csv:
  - id (integer)
  - qsets_normalized_tat (float): Network turn-around times (TAT) for transactions.
- instance_scores.csv:
  - id (integer)
  - instance_scores (float): Vulnerability or risk qualification scores.
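Because the supplementary files share the id or Group key with train.csv, they can be joined onto the main table. The following is a minimal sketch using pandas, not a prescribed pipeline. The small frames stand in for the real CSV files and their values are invented. Collapsing repeated ids with a mean is an assumption made for illustration; the description does not say whether ids repeat in the supplementary files.

```python
import pandas as pd

# Tiny stand-in frames with invented values; real use would load the files
# with pd.read_csv("train.csv"), pd.read_csv("Geo_scores.csv"), and so on.
train = pd.DataFrame({
    "id": [101, 102, 103],
    "Group": ["Grp1", "Grp2", "Grp1"],
    "Target": [0, 1, 0],
})
geo = pd.DataFrame({"id": [101, 101, 103], "geo_score": [0.2, 0.4, -1.0]})
lambda_wts = pd.DataFrame({"Group": ["Grp1", "Grp2"], "lambda_wt": [1.5, 0.7]})

# ASSUMPTION: a supplementary file may hold several rows per id, so collapse
# to one row per id first (the mean is an arbitrary choice) so that the join
# does not duplicate transactions.
geo_per_id = geo.groupby("id", as_index=False)["geo_score"].mean()

# Left joins keep every transaction and leave NaN where no score exists.
# validate="many_to_one" raises if the right-hand key is not unique.
merged = (
    train
    .merge(geo_per_id, on="id", how="left", validate="many_to_one")
    .merge(lambda_wts, on="Group", how="left", validate="many_to_one")
)

print(merged)
```

Qset_tats.csv and instance_scores.csv could be joined on id in the same way. Checking that the merged row count equals the original row count is a cheap guard against accidental duplication.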
Distribution
The data files are provided in CSV format. The primary transaction data is split into training and testing sets: train.csv contains approximately 227,845 rows and 28 columns, while test_share.csv contains around 56,962 rows and 27 columns. Additional CSV files (Geo_scores.csv, Lambda_wts.csv, Qset_tats.csv, instance_scores.csv) supply extra features. These supplementary files can be merged with the main transaction data using the id or Group columns to enrich the feature set. The train.csv file exhibits a notable class imbalance, with a very small percentage of transactions marked as fraudulent.
Usage
This dataset is ideal for:
- Developing, training, and evaluating machine learning models for financial fraud detection.
- Identifying anomalous patterns in financial transactions.
- Practising with imbalanced datasets and applying techniques such as SMOTE, random oversampling, or class weighting.
- Training various classification models, including Random Forest, XGBoost, or LightGBM.
- Performing feature engineering to determine the predictive value of additional scores like geospatial, lambda, and instance scores.
- Evaluating model performance using appropriate metrics for imbalanced data, such as Precision, Recall, F1-score, and ROC-AUC.
- Generating predictions on new, unseen transaction data (test_share.csv).
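Two of the ideas in the list above, class weighting and imbalance-aware metrics, can be sketched in plain Python. The weighting formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for class_weight="balanced". The helper names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 2 fraud cases among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

weights = balanced_class_weights(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(weights)                # {0: 0.625, 1: 2.5}
print(precision, recall, f1)  # 0.5 0.5 0.5
```

The rare class receives the larger weight, so a model that accepts per-class weights is penalised more for missing fraud. In practice, library implementations of these metrics would normally be preferred over hand-written ones.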
Coverage
The dataset includes location-based geospatial scores, indicating a geographic dimension to the transactions, though specific regions or countries are not detailed. It consists of historical transaction data; however, the precise time range (e.g., specific years or months) is not specified. Some features, such as Dem1 to Dem9 and Group, are masked, which limits the ability to infer specific demographic or grouping details. The primary focus is on financial transactions, with a distinct emphasis on the detection of a small minority of fraudulent cases.
License
CC BY-NC-SA 4.0
Who Can Use It
- Machine Learning Practitioners and Data Scientists: For developing and refining fraud detection algorithms.
- Data Analysts: To explore transaction data and uncover insights into financial anomalies.
- Beginners in Machine Learning: The dataset is considered beginner-friendly, offering a practical scenario to learn about handling imbalanced data and integrating multiple data sources.
- Researchers: To investigate new methods for fraud detection, particularly in settings with skewed class distributions.
Dataset Name Suggestions
- Financial Transaction Fraud Detection Dataset
- Imbalanced Financial Fraud Data
- Machine Learning Fraud Analytics Dataset
- Transaction Anomaly Prediction Data
- Integrated Financial Risk Dataset
Attributes
Original Data Source: Financial Transaction Fraud Detection Dataset