Opendatabay APP

Financial Transaction Fraud Detection Dataset

Finance & Banking Analytics

Tags and Keywords

Fraud, Financial, Transactions, Machine Learning, Imbalance

Free

About

This dataset is designed to support building, training, and evaluating machine learning models that detect fraudulent financial transactions and, more broadly, for identifying anomalies in financial activity. Its main challenge is severe class imbalance: less than one per cent of transactions are classified as fraudulent. To improve detection, the dataset includes several supplementary files containing location-based scores, proprietary grouping weights, network turn-around times, and vulnerability scores, all of which can be merged with the main transaction data.

Columns

The dataset comprises several CSV files, each with distinct columns:
  • train.csv (28 columns):
    • id (integer): A masked, unique identifier for each transaction.
    • Group (string): A masked grouping label.
    • Per1 to Per9 (float): Nine masked numeric features.
    • Dem1 to Dem9 (float): Nine additional masked numeric features, likely demographic.
    • Cred1 to Cred6 (float): Six masked credit or risk-related features.
    • Normalised_FNT (float): A masked numeric field.
    • Target (integer): The fraud indicator, where 1 signifies fraud and 0 indicates a clean transaction.
  • test_share.csv (27 columns):
    • Same columns as train.csv except that Target is omitted; intended for generating predictions.
  • Geo_scores.csv:
    • id (integer)
    • geo_score (float): Geospatial location scores associated with transactions.
  • Lambda_wts.csv:
    • Group (string)
    • lambda_wt (float): Proprietary weights or scores for each group.
  • Qset_tats.csv:
    • id (integer)
    • qsets_normalized_tat (float): Network turn-around times (TAT) for transactions.
  • instance_scores.csv:
    • id (integer)
    • instance_scores (float): Vulnerability or risk qualification scores.
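
The column layout above can be written down as a quick sanity check. The following is a minimal sketch in Python. The constant names and the missing_columns helper are hypothetical and not part of the dataset, and the sketch assumes the CSV headers match the names in this listing exactly (the column order in the real files is not confirmed here).

```python
# Expected column names, built from the listing's description of train.csv.
PER = [f"Per{i}" for i in range(1, 10)]    # Per1 .. Per9
DEM = [f"Dem{i}" for i in range(1, 10)]    # Dem1 .. Dem9
CRED = [f"Cred{i}" for i in range(1, 7)]   # Cred1 .. Cred6

# Assumption: names only; the order in the actual file may differ.
TRAIN_COLUMNS = ["id", "Group", *PER, *DEM, *CRED, "Normalised_FNT", "Target"]
TEST_COLUMNS = [c for c in TRAIN_COLUMNS if c != "Target"]


def missing_columns(actual, expected):
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [c for c in expected if c not in present]


# The counts agree with the listing: 28 columns for train, 27 for test.
print(len(TRAIN_COLUMNS), len(TEST_COLUMNS))
# A test-style header checked against the train layout lacks only Target.
print(missing_columns(TEST_COLUMNS, TRAIN_COLUMNS))
```

Passing the header row of a loaded file as `actual` would show at a glance whether any documented column is absent.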

Distribution

All files are provided in CSV format. The primary transaction data is split into a training set and a test set: train.csv contains approximately 227,845 rows and 28 columns, and test_share.csv contains approximately 56,962 rows and 27 columns. Four additional files (Geo_scores.csv, Lambda_wts.csv, Qset_tats.csv, instance_scores.csv) supply extra features. Geo_scores.csv, Qset_tats.csv, and instance_scores.csv can be joined to the transaction data on id, and Lambda_wts.csv on Group. In train.csv, less than one per cent of transactions are marked as fraudulent.
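
One way the merge described here might look with pandas is sketched below. The merge_supplementary function and the toy frames are hypothetical and use made-up values, not real data. The sketch also assumes that a supplementary file may hold several rows per id and that averaging them is acceptable; neither is confirmed by the listing.

```python
import pandas as pd


def merge_supplementary(main, geo, lambda_wts, qset, instance):
    """Left-join the four supplementary tables onto the transaction table."""
    out = main
    for extra in (geo, qset, instance):
        # Assumption: collapse repeated ids to their mean score so that
        # the join cannot duplicate transaction rows.
        per_id = extra.groupby("id", as_index=False).mean(numeric_only=True)
        out = out.merge(per_id, on="id", how="left", validate="many_to_one")
    # Lambda weights are keyed by Group, not by id.
    per_group = lambda_wts.drop_duplicates(subset="Group")
    return out.merge(per_group, on="Group", how="left", validate="many_to_one")


# Toy stand-ins for the real files (illustrative values only).
main = pd.DataFrame({"id": [1, 2, 3], "Group": ["A", "B", "A"]})
geo = pd.DataFrame({"id": [1, 1, 2], "geo_score": [0.2, 0.4, 1.0]})
lam = pd.DataFrame({"Group": ["A", "B"], "lambda_wt": [0.5, 0.7]})
qset = pd.DataFrame({"id": [1, 2, 3], "qsets_normalized_tat": [0.1, 0.2, 0.3]})
inst = pd.DataFrame({"id": [1, 3], "instance_scores": [0.9, 0.8]})

merged = merge_supplementary(main, geo, lam, qset, inst)
print(merged)
```

A left join keeps every transaction even when a supplementary score is missing, so gaps surface as NaN values to be imputed or flagged rather than as silently dropped rows.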

Usage

This dataset is ideal for:
  • Developing, training, and evaluating machine learning models for financial fraud detection.
  • Identifying anomalous patterns in financial transactions.
  • Practising with imbalanced datasets and applying techniques such as SMOTE, random oversampling, or class weighting.
  • Training various classification models, including Random Forest, XGBoost, or LightGBM.
  • Performing feature engineering to determine the predictive value of additional scores like geospatial, lambda, and instance scores.
  • Evaluating model performance using appropriate metrics for imbalanced data, such as Precision, Recall, F1-score, and ROC-AUC.
  • Generating predictions on new, unseen transaction data (test_share.csv).
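
To illustrate why class weighting and metrics beyond accuracy matter for data this skewed, here is a minimal pure-Python sketch. The helper names are hypothetical, and the labels are a toy example, not taken from the dataset. The weighting formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: 98 clean transactions and 2 fraudulent ones.
y_true = [0] * 98 + [1] * 2
weights = balanced_class_weights(y_true)

# A "model" that always predicts clean looks accurate but finds no fraud.
always_clean = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
scores = precision_recall_f1(y_true, always_clean)
print(weights, accuracy, scores)
```

In this toy case the fraud class receives a much larger weight than the clean class, and the always-clean baseline scores high on accuracy while its recall is zero, which is why Precision, Recall, and F1-score are the more informative figures here.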

Coverage

The dataset includes location-based geospatial scores, indicating a geographic dimension to the transactions, though specific regions or countries are not detailed. It consists of historical transaction data; however, the precise time range (e.g., specific years or months) is not specified. Some features, such as Dem1 to Dem9 and Group, are masked, which limits the ability to infer specific demographic or grouping details. The primary focus is on financial transactions, with a distinct emphasis on the detection of a small minority of fraudulent cases.

License

CC BY-NC-SA 4.0

Who Can Use It

  • Machine Learning Practitioners and Data Scientists: For developing and refining fraud detection algorithms.
  • Data Analysts: To explore transaction data and uncover insights into financial anomalies.
  • Beginners in Machine Learning: The dataset is considered beginner-friendly, offering a practical scenario to learn about handling imbalanced data and integrating multiple data sources.
  • Researchers: To investigate new methods for fraud detection, particularly in settings with skewed class distributions.

Dataset Name Suggestions

  • Financial Transaction Fraud Detection Dataset
  • Imbalanced Financial Fraud Data
  • Machine Learning Fraud Analytics Dataset
  • Transaction Anomaly Prediction Data
  • Integrated Financial Risk Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 08/09/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
