Synthetic Spam Detection Dataset

Synthetic Data Generation

Tags and Keywords

Spam, Detection, Email, Machine, Classification

About

This is a synthetic dataset designed for the development and evaluation of spam email detection models. It comprises 20,000 simulated email samples, each characterised by five distinct features and a binary label indicating whether the email is spam. The dataset is generated using a rule-based formula with added noise to simulate real-world uncertainties, making it suitable for training various machine learning algorithms.

Columns

num_links: An integer representing the number of links found within the email body. This feature is drawn from a Poisson distribution; a higher count generally indicates a greater likelihood of spam.
num_words: An integer denoting the total word count in the email. Values range between 20 and 200, serving as a relatively neutral feature, though very short or very long emails can sometimes appear suspicious.
has_offer: A binary (0 or 1) indicator showing whether the email contains the word “offer”. This feature is simulated with a 30% chance of being true, reflecting common marketing language often found in spam.
sender_score: A float value between 0 and 1, representing a simulated reputation score for the email sender. Lower scores suggest a less trustworthy sender, increasing the probability of spam. The scores are normally distributed around 0.7.
all_caps: A binary (0 or 1) flag indicating if the subject line is written entirely in capital letters. This is simulated with a 10% chance of being true, as all-caps subject lines are frequently used in spam to grab attention.
is_spam: The binary target label (0 or 1), specifying whether the email is categorised as spam. This label is determined by a rule-based formula where certain feature values (e.g., links > 2, presence of "offer", low sender score, all-caps subject) increase the spam probability, combined with Gaussian randomness.
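The column descriptions above can be turned into a rough generator. This is only a sketch of how such data might be produced, not the provider's actual code: the listing states the 30% and 10% rates, the 20 to 200 word range, the 0.7 mean, and the links > 2 rule, but the Poisson rate, the standard deviations, the clipping to [0, 1], the "low sender score" cutoff, the weights, and the decision threshold are all assumptions made for illustration. The function names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def make_email(rng):
    """Simulate one row. Constants marked ASSUMED are not given in the listing."""
    num_links = poisson(rng, 1.5)               # ASSUMED Poisson rate
    num_words = rng.randint(20, 200)            # range stated in the listing
    has_offer = int(rng.random() < 0.30)        # 30% chance, as stated
    raw_score = rng.gauss(0.7, 0.15)            # mean 0.7 stated; sd ASSUMED
    sender_score = min(1.0, max(0.0, raw_score))  # clipping to [0, 1] ASSUMED
    all_caps = int(rng.random() < 0.10)         # 10% chance, as stated

    # ASSUMED weights: the listing names these signals but not their sizes.
    score = (
        0.4 * (num_links > 2)                   # "links > 2" is from the listing
        + 0.3 * has_offer
        + 0.3 * (sender_score < 0.4)            # ASSUMED cutoff for a "low" score
        + 0.2 * all_caps
        + rng.gauss(0.0, 0.1)                   # Gaussian noise; sd ASSUMED
    )
    return {
        "num_links": num_links,
        "num_words": num_words,
        "has_offer": has_offer,
        "sender_score": sender_score,
        "all_caps": all_caps,
        "is_spam": int(score > 0.5),            # ASSUMED decision threshold
    }


def make_dataset(n, seed=0):
    """Return n simulated rows as a list of dicts, reproducible per seed."""
    rng = random.Random(seed)
    return [make_email(rng) for _ in range(n)]


rows = make_dataset(1000)
spam_rate = sum(r["is_spam"] for r in rows) / len(rows)
print(f"spam rate in this sketch: {spam_rate:.2%}")
```

Because every constant marked ASSUMED is a guess, the spam rate and feature distributions from this sketch will not match the published file; it is meant only to make the "rule plus noise" labelling idea concrete.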

Distribution

The dataset is provided as a CSV data file, named spam_detection_dataset.csv, with a size of approximately 610.36 kB. It contains 20,000 records (email samples) and consists of 6 columns. All data points across the columns are valid, with no mismatched or missing entries.
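The claims about record count and validity are easy to check after download. The sketch below uses only the Python standard library; the helper name `check_rows` is hypothetical, and it assumes the CSV header uses exactly the six column names listed under Columns (in any order) and that binary flags are stored as 0 or 1. A small in-memory sample stands in for the real file so the block runs on its own.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "num_links", "num_words", "has_offer",
    "sender_score", "all_caps", "is_spam",
]
BINARY_COLUMNS = ("has_offer", "all_caps", "is_spam")


def check_rows(handle):
    """Return (row_count, problems) for a CSV opened as a text handle."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    if sorted(header) != sorted(EXPECTED_COLUMNS):
        return 0, [f"unexpected header: {header}"]
    problems = []
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        count += 1
        try:
            # Short rows yield None; blank cells yield empty strings.
            if any(row[c] is None or row[c].strip() == "" for c in EXPECTED_COLUMNS):
                raise ValueError("missing value")
            if any(float(row[c]) not in (0.0, 1.0) for c in BINARY_COLUMNS):
                raise ValueError("non-binary flag")
            if not 0.0 <= float(row["sender_score"]) <= 1.0:
                raise ValueError("sender_score outside [0, 1]")
            if float(row["num_links"]) < 0 or float(row["num_words"]) < 0:
                raise ValueError("negative count")
        except ValueError as exc:
            problems.append(f"line {line_no}: {exc}")
    return count, problems


# Tiny stand-in for the real file, with made-up values.
sample = io.StringIO(
    "num_links,num_words,has_offer,sender_score,all_caps,is_spam\n"
    "3,120,1,0.35,0,1\n"
    "0,45,0,0.82,0,0\n"
)
count, problems = check_rows(sample)
print(count, problems)
```

To run the same check on the downloaded file, open it with `open("spam_detection_dataset.csv", newline="")` and pass the handle to `check_rows`; a count of 20,000 and an empty problem list would agree with the description above.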

Usage

This dataset is ideal for a variety of applications in machine learning and data science, including:

Training and testing binary classification algorithms such as Logistic Regression, Decision Trees, Random Forests, and Neural Networks for spam detection.
Feature importance analysis, helping users understand which features most significantly influence spam prediction.
Testing model robustness against data with noisy, rule-based labels.
Building and evaluating explainable AI (XAI) models, given that the underlying rules for spam generation are known.
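As one illustration of the classification and feature-importance uses listed above, the following sketch fits a small logistic regression by batch gradient descent and prints the learned weight per feature. It is written in plain Python so it runs without extra packages; in practice a library such as scikit-learn would be the usual choice. The toy data, the four indicator features, their rates, and the labelling weights are all assumptions made for the example and are not taken from the dataset file.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on mean log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(X, y, w, b):
    hits = sum(int(predict_proba(xi, w, b) >= 0.5) == yi for xi, yi in zip(X, y))
    return hits / len(y)


# Toy stand-in data: four binary indicators and a noisy rule (all ASSUMED).
NAMES = ["links_gt_2", "has_offer", "low_sender_score", "all_caps"]
RATES = [0.2, 0.3, 0.1, 0.1]          # ASSUMED chance that each indicator is 1
RULE = [0.4, 0.3, 0.3, 0.2]           # ASSUMED rule weights
rng = random.Random(0)
X, y = [], []
for _ in range(500):
    x = [int(rng.random() < p) for p in RATES]
    score = sum(r * v for r, v in zip(RULE, x)) + rng.gauss(0.0, 0.1)
    X.append(x)
    y.append(int(score > 0.5))

split = int(0.8 * len(X))             # simple 80/20 train/test split
w, b = fit_logistic(X[:split], y[:split])
print("test accuracy:", accuracy(X[split:], y[split:], w, b))
for name, weight in zip(NAMES, w):
    print(f"{name}: {weight:+.2f}")
```

Because the labelling rule is known for synthetic data of this kind, the learned weights can be compared against the rule itself, which is what makes the dataset convenient for explainability experiments. Rows read from the real CSV could be mapped to similar indicator features, or used as raw values after scaling.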

Coverage

As a synthetic dataset, it does not possess specific geographic, time range, or demographic coverage. Its scope is purely conceptual, designed to simulate typical email characteristics relevant to spam detection scenarios without reliance on real-world personal or temporal data.

License

CC0: Public Domain

Who Can Use It

This dataset is particularly beneficial for:

Machine learning engineers developing and refining spam classification models.
Data scientists seeking to explore feature engineering and model performance in binary classification tasks.
Researchers focused on email security, anti-spam techniques, and the interpretability of AI models.
Students and educators for learning and demonstrating machine learning concepts in a practical context.

Dataset Name Suggestions

Synthetic Spam Detection Dataset
Email Spam Classifier Training Data
AI Spam Prevention Dataset
Rule-Based Email Spam Corpus
Digital Spam Identification Data

Attributes

Original Data Source: Synthetic Spam Detection Dataset

Listing Stats

Listed: 13/08/2025
Region: Global
Quality: 5 / 5
Version: 1.0
