Email Filtering Machine Learning Dataset

Data Science and Analytics

Tags and Keywords

Spam, Email, Text, Classification, Detection

Free

About

This dataset is a collection of emails categorised into two main classes: spam and not spam. It is engineered to aid in the development and evaluation of systems for spam detection or email filtering. The spam emails included are typically unsolicited, unwanted messages designed to promote products, spread malware, or deceive recipients for malicious ends, often featuring misleading subject lines, excessive advertising, or unauthorised links. Conversely, the non-spam emails are genuine communications from individuals or organisations, such as personal or professional messages, newsletters, or transaction receipts. The dataset reflects the natural variety of email communication, encompassing diverse lengths, languages, and writing styles, which helps in training algorithms that can generalise effectively across different email types and resist varied spammer tactics.

Columns

  • title: Represents the subject or title of the email.
  • text: Contains the full body text of the email.
  • type: Indicates the classification of the email, either 'spam' or 'not spam'.
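A minimal sketch of reading a file with this three-column layout using Python's standard csv module. It assumes the header row uses exactly the names title, text, and type, and that labels are spelled 'spam' and 'not spam' as described above; the helper name load_emails and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["title", "text", "type"]
VALID_LABELS = {"spam", "not spam"}  # assumed spellings, per the column description


def load_emails(fileobj):
    """Read email records from a CSV file object and check the expected schema."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for record_no, row in enumerate(reader, start=1):
        label = (row["type"] or "").strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"record {record_no}: unexpected type {row['type']!r}")
        rows.append({"title": row["title"], "text": row["text"], "type": label})
    return rows


# Made-up rows for illustration only; they do not come from email_spam.csv.
SAMPLE_CSV = (
    "title,text,type\n"
    '"Team lunch","Are we still on for Friday, at noon?",not spam\n'
    '"You won!","Claim your prize now",spam\n'
)

sample_rows = load_emails(io.StringIO(SAMPLE_CSV))

# With the real file (the UTF-8 encoding is an assumption):
# with open("email_spam.csv", newline="", encoding="utf-8") as f:
#     rows = load_emails(f)
```

Passing newline="" to open lets the csv module handle line breaks inside quoted email bodies, which matters for a column holding full message text.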

Distribution

The dataset is provided as a CSV file named email_spam.csv, with a size of 75.79 kB and 3 columns. The sample shows 84 valid records in each column, of which approximately 69% are labelled 'not spam' and 31% 'spam'. The total number of rows beyond this sample is not stated, so the file contains at least 84 records.
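The class shares above can be checked with a few lines of Python. This is a sketch under stated assumptions: the listing does not give exact per-class counts, so the 58 and 26 used below are hypothetical values chosen only because they total 84 and round to the reported 69% and 31%.

```python
from collections import Counter


def label_distribution(labels):
    """Return the percentage share of each label in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical counts (not stated in the listing): 58 + 26 = 84 records.
labels = ["not spam"] * 58 + ["spam"] * 26
shares = label_distribution(labels)

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())
```

Because roughly 69% of the sample is 'not spam', a model that always predicts 'not spam' would already reach about 69% accuracy, so precision and recall on the 'spam' class are more informative than accuracy alone.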

Usage

This dataset is ideal for a range of applications, including:
  • Spam detection systems.
  • Fraud detection analysis.
  • Developing and enhancing email filtering systems.
  • Applications in customer support automation.
  • Tasks related to natural language processing (NLP).
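As one illustration of the spam detection use case, the sketch below implements a small multinomial naive Bayes text classifier with Laplace smoothing in plain Python. It is a baseline chosen for illustration, not a method prescribed by the dataset; the class name NaiveBayes, the tokenizer, and the four training sentences are all made up and do not come from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples for illustration only.
train_texts = [
    "win a free prize now",
    "cheap pills free offer",
    "meeting agenda for monday",
    "your invoice for last month is attached",
]
train_labels = ["spam", "spam", "not spam", "not spam"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction_a = model.predict("free prize offer")
prediction_b = model.predict("agenda for the monday meeting")
```

With the real data, one option is to concatenate the title and text columns into a single string per email before calling fit. With a sample of only 84 records, any held-out evaluation will be noisy, so results from a sketch like this should be read as indicative only.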

Coverage

The dataset spans emails of varying lengths, languages (primarily English in the examples observed), and writing styles, reflecting the heterogeneity of email communication. No specific geographic, temporal, or demographic scope is provided. The dataset is not expected to be updated. Custom email spam collection can be arranged on request to meet specific requirements.

License

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Who Can Use It

This dataset is suitable for:
  • Researchers and developers working on spam detection and email filtering systems.
  • Data scientists and machine learning engineers focused on text classification, natural language processing, and cybersecurity.
  • Organisations looking to improve their email security protocols or customer communication channels.
  • Anyone interested in analysing patterns in email content to distinguish between legitimate and malicious messages.

Dataset Name Suggestions

  • Email Spam and Ham Classification Dataset
  • Email Text Classification for Spam Detection
  • Spam/Non-Spam Email Corpus
  • Email Filtering Machine Learning Dataset
  • Cybersecurity Email Text Dataset

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 22/08/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0