Synthetic Spam Detection Dataset
Synthetic Data Generation
Free
About
This is a synthetic dataset designed for the development and evaluation of spam email detection models. It comprises 20,000 simulated email samples, each characterised by five distinct features and a binary label indicating whether the email is spam. The dataset is generated using a rule-based formula with added noise to simulate real-world uncertainties, making it suitable for training various machine learning algorithms.
Columns
- num_links: An integer representing the number of links found within the email body. This feature is generated using a Poisson distribution, with a higher count often indicating a greater likelihood of spam.
- num_words: An integer denoting the total word count in the email. Values range between 20 and 200, serving as a relatively neutral feature, though very short or very long emails can sometimes appear suspicious.
- has_offer: A binary (0 or 1) indicator showing whether the email contains the word “offer”. This feature is simulated with a 30% chance of being true, reflecting common marketing language often found in spam.
- sender_score: A float value between 0 and 1, representing a simulated reputation score for the email sender. Lower scores suggest a less trustworthy sender, increasing the probability of spam. The scores are normally distributed around 0.7.
- all_caps: A binary (0 or 1) flag indicating if the subject line is written entirely in capital letters. This is simulated with a 10% chance of being true, as all-caps subject lines are frequently used in spam to grab attention.
- is_spam: The binary target label (0 or 1), specifying whether the email is categorised as spam. The label comes from a rule-based formula in which certain feature values (e.g., more than two links, presence of "offer", a low sender score, an all-caps subject) raise the spam probability, with Gaussian noise added.
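The column descriptions above can be turned into a small generator. The following is a minimal sketch, not the script that produced the published file: the Poisson rate, the spread of sender_score, the clipping to [0, 1], the uniform draw for num_words, the rule weights, the noise level, and the 0.5 threshold are all assumptions chosen for illustration, and generate_spam_dataset is a hypothetical name.

```python
import numpy as np


def generate_spam_dataset(n_samples=20_000, seed=0):
    """Return a dict of column name -> NumPy array following the column list."""
    rng = np.random.default_rng(seed)

    # Distribution families and the 30% / 10% / 0.7 figures come from the
    # column descriptions; lam=1.5, scale=0.15, the clipping, and the uniform
    # draw for num_words are assumed.
    num_links = rng.poisson(lam=1.5, size=n_samples)
    num_words = rng.integers(20, 201, size=n_samples)  # 20..200 inclusive
    has_offer = rng.binomial(1, 0.30, size=n_samples)
    sender_score = np.clip(rng.normal(0.7, 0.15, size=n_samples), 0.0, 1.0)
    all_caps = rng.binomial(1, 0.10, size=n_samples)

    # Hypothetical rule: each suspicious signal adds a fixed weight to a score.
    score = (
        0.3 * (num_links > 2)
        + 0.2 * has_offer
        + 0.3 * (sender_score < 0.5)
        + 0.2 * all_caps
    )
    # Gaussian noise makes the labels an imperfect function of the features.
    score = score + rng.normal(0.0, 0.1, size=n_samples)
    is_spam = (score > 0.5).astype(int)  # assumed decision threshold

    return {
        "num_links": num_links,
        "num_words": num_words,
        "has_offer": has_offer,
        "sender_score": sender_score,
        "all_caps": all_caps,
        "is_spam": is_spam,
    }
```

Changing the assumed weights or the noise level shifts the spam rate and how separable the classes are, so the output will not match the published file row for row.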
Distribution
The dataset is provided as a single CSV file, spam_detection_dataset.csv, approximately 610.36 kB in size. It contains 20,000 records (email samples) across 6 columns. All values are valid, with no mismatched or missing entries.
Usage
This dataset is ideal for a variety of applications in machine learning and data science, including:
- Training and testing binary classification algorithms such as Logistic Regression, Decision Trees, Random Forests, and Neural Networks for spam detection.
- Feature importance analysis, helping users understand which features most significantly influence spam prediction.
- Testing model robustness against data with noisy, rule-based labels.
- Building and evaluating explainable AI (XAI) models, given that the underlying rules for spam generation are known.
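For the classification and feature-importance uses listed above, a minimal sketch follows. It fits a logistic regression by plain gradient descent in NumPy, standing in for a library implementation, and treats the standardised coefficients as a rough importance signal. The helper names (load_csv, fit_logistic, predict), the learning rate, and the iteration count are illustrative assumptions, as is the assumption that the CSV header row uses the column names listed earlier.

```python
import numpy as np

FEATURES = ["num_links", "num_words", "has_offer", "sender_score", "all_caps"]


def load_csv(path):
    """Load features X and labels y; assumes the header matches the column list."""
    data = np.genfromtxt(path, delimiter=",", names=True)
    X = np.column_stack([data[name] for name in FEATURES])
    y = data["is_spam"].astype(int)
    return X, y


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression by batch gradient descent on standardised features."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    Xs = (X - mean) / std

    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))  # predicted spam probability
        w -= lr * (Xs.T @ (p - y)) / len(y)      # gradient of the mean log-loss
        b -= lr * np.mean(p - y)
    return w, b, mean, std


def predict(X, w, b, mean, std):
    """Return 0/1 predictions using a 0.5 probability cut-off."""
    return (((X - mean) / std) @ w + b > 0).astype(int)
```

Calling load_csv("spam_detection_dataset.csv") and passing the result to fit_logistic would, under these assumptions, give one weight per feature. Larger absolute weights suggest stronger influence on the prediction, and their signs can be compared against the known generation rules, for example a negative weight on sender_score.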
Coverage
As a synthetic dataset, it does not possess specific geographic, time range, or demographic coverage. Its scope is purely conceptual, designed to simulate typical email characteristics relevant to spam detection scenarios without reliance on real-world personal or temporal data.
License
CC0: Public Domain
Who Can Use It
This dataset is particularly beneficial for:
- Machine learning engineers developing and refining spam classification models.
- Data scientists seeking to explore feature engineering and model performance in binary classification tasks.
- Researchers focused on email security, anti-spam techniques, and the interpretability of AI models.
- Students and educators for learning and demonstrating machine learning concepts in a practical context.
Dataset Name Suggestions
- Synthetic Spam Detection Dataset
- Email Spam Classifier Training Data
- AI Spam Prevention Dataset
- Rule-Based Email Spam Corpus
- Digital Spam Identification Data
Attributes
Original Data Source: Synthetic Spam Detection Dataset