
Data Leak Detection Simulator

Data Science and Analytics

Tags and Keywords

Leakage, Artificial, Tabular, Features, Analysis

Free

About

This dataset was created as artificial data for a hands-on workshop on data analysis and feature engineering. It is structured as a binary classification task, and its feature columns contain 10 data leakage properties. Using only the raw features provided, a standard binary classifier typically achieves an Area Under the Curve (AUC) score of around 0.75. Successful Exploratory Data Analysis (EDA), followed by crafting 10 features designed to capture those leaks, can raise the score to nearly 0.95 AUC. The dataset is useful for demonstrating the dangers and impact of preparing data incorrectly for Machine Learning tasks.
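
For readers who want to see what the AUC figures above measure, here is a minimal sketch of a rank-based AUC calculation using only the Python standard library. The function name `auc_score` and the toy labels and scores are illustrative assumptions; the listing does not say which classifier or tooling produced the 0.75 and 0.95 figures.

```python
def auc_score(labels, scores):
    """Rank-based AUC: the chance that a random positive outscores a random negative."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example (hypothetical values, not taken from the dataset).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the gap between 0.75 and 0.95 in context.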

Columns

The dataset has 11 columns in total: 10 feature columns (col0 to col9) and one target variable used for validation. Together, the feature columns contain the 10 data leakage properties.
  • col0: A numerical feature column exhibiting high skew. 24,941 records fall within the lowest range (0.00 to 0.69).
  • col2: A numerical feature column with a large range of values, spanning from a minimum of -40.1 to a maximum of 42.8.
  • col3: A numerical feature column that is extremely concentrated in its lower range; nearly 48,000 records fall within the 0.01 to 5.60 interval.
  • col7: A numerical feature column centred around a mean of 1.01. The highest concentration of records is found between -0.16 and 2.30.
  • col8: A numerical feature column that is clustered in the upper middle values, with high record counts between 91.05 and 103.50.
  • target: The Target column used for validation only, which supports a binary classification task (0 or 1).
  • Remaining Columns: Additional feature columns (e.g., col1, col4, col5, col6, col9) that hold the remaining leakage properties.
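
The ranges quoted in the column descriptions read like equal-width histogram bins, though the listing does not say how they were computed. Assuming that interpretation, the sketch below shows one way to reproduce such per-bin record counts for a single column. The function name `bin_counts`, the bin count, and the sample values are hypothetical.

```python
def bin_counts(values, n_bins=10):
    """Split the column's range into equal-width bins and count records per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a single degenerate bin.
        return [(lo, hi, len(values))]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is folded into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Hypothetical skewed sample standing in for a column such as col0.
sample = [0.0, 0.1, 0.2, 0.3, 5.0, 10.0]
for low, high, count in bin_counts(sample, n_bins=2):
    print(f"{low:.2f} to {high:.2f}: {count} records")
```

A heavily skewed column shows most of its records in the first bin, which is the pattern described for col0 and col3.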

Distribution

The data is delivered as a table in CSV format (e.g., test.csv, 8.23 MB). It contains 50,000 valid records across all 11 columns, with zero mismatched values and zero missing values, giving 100% usability.
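
The integrity figures above can be checked after download with a short script. The following is a sketch under stated assumptions: every column is expected to be numeric, an empty cell counts as missing, and a non-numeric cell counts as mismatched. The function name `integrity_report` and the inline sample rows are hypothetical; the sample is deliberately dirty to show what the check reports.

```python
import csv
import io


def integrity_report(handle):
    """Count rows, columns, missing cells, and non-numeric cells in a CSV stream."""
    reader = csv.DictReader(handle)
    report = {
        "rows": 0,
        "columns": len(reader.fieldnames or []),
        "missing": 0,
        "mismatched": 0,
    }
    for row in reader:
        report["rows"] += 1
        for value in row.values():
            if value is None or (isinstance(value, str) and value.strip() == ""):
                report["missing"] += 1  # empty cell or short row
            elif not isinstance(value, str):
                report["mismatched"] += 1  # extra cells beyond the header
            else:
                try:
                    float(value)
                except ValueError:
                    report["mismatched"] += 1  # text where a number was expected
    return report


# Hypothetical sample with one missing and one non-numeric cell.
SAMPLE = "col0,col1,target\n0.5,1.2,0\n,3.4,1\nabc,2.0,0\n"
print(integrity_report(io.StringIO(SAMPLE)))
```

For the real file, pass an open file handle instead, for example `open("test.csv", newline="")`. If the listing's figures hold, the report would show 50,000 rows, 11 columns, and zero missing or mismatched cells.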

Usage

  • Feature Engineering Practice: Ideal for challenging data scientists to create novel features that explicitly capture non-obvious relationships.
  • Exploratory Data Analysis (EDA): Sharpens deep-analysis techniques for uncovering unusual patterns or specific properties hidden in unexpected places.
  • Machine Learning Education: Excellent resource for demonstrating the impact of data preparation and the dangers associated with inadvertently incorporating leakage into training data.
  • Model Validation: Used to test model robustness and the sensitivity of algorithms to artificially strong features.
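
As a starting point for the feature engineering and validation uses listed above, the sketch below derives a few candidate features from one column and ranks them by absolute Pearson correlation with the target. The listing does not disclose what the 10 leaks are, so every candidate here (fractional part, value frequency, row position) is a hypothetical guess at where a leak could hide, and the toy data is invented. Correlation is only a linear screen and can miss non-linear leaks.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def candidate_features(column):
    """Build hypothetical leak-hunting features from a single numeric column."""
    freq = Counter(column)
    return {
        "raw": list(column),
        "fractional_part": [v - math.floor(v) for v in column],
        "value_frequency": [freq[v] for v in column],  # how often each value repeats
        "row_position": list(range(len(column))),      # leak through row order
    }


def rank_candidates(column, target):
    """Sort candidate features by absolute correlation with the target, strongest first."""
    scored = {
        name: abs(pearson(values, target))
        for name, values in candidate_features(column).items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy data (invented): the target is 1 exactly when the value is repeated.
toy_column = [1.5, 2.5, 2.5, 3.5, 4.5, 4.5]
toy_target = [0, 1, 1, 0, 1, 1]
for name, score in rank_candidates(toy_column, toy_target):
    print(f"{name}: {score:.3f}")
```

A candidate that scores far above the raw column is a sign of an artificially strong feature and is worth examining before it goes into a model.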

Coverage

This is an artificially generated dataset, created purely for statistical training purposes. It is theoretical in nature and does not possess any geographic, temporal, or demographic scope.

License

CC0: Public Domain

Who Can Use It

  • Data Scientists: For advanced feature engineering practice and preparing robust models.
  • ML Engineers: To validate data pipelines and ensure training data quality.
  • Academic Researchers: As a controlled environment for testing data quality metrics and data leak detection algorithms.
  • Data Analysts: For mastering pattern recognition skills required in initial data exploration stages.

Dataset Name Suggestions

  1. Artificial Data Leak Challenge
  2. Feature Engineering Test Bench
  3. Data Leak Detection Simulator
  4. Exploratory Data Analysis Training Set

Attributes

Original Data Source: Data Leak Detection Simulator

Listing Stats

Views: 0
Downloads: 0
Listed: 29/11/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
