Opendatabay APP

Classified Email and SMS Spam Repository

Data Science and Analytics

Tags and Keywords

Spam

Email

Classification

Text

Internet

Free

About

Distinguishing between legitimate messages and unsolicited, fraudulent communications is a foundational task in digital security. This collection of classified email and SMS text provides the raw material needed to develop automated filters that can identify unwanted content with high accuracy. By providing a binary classification of "spam" and "ham," these records enable the exploration of linguistic patterns that separate personal or professional correspondence from commercial promotions and scams. The inclusion of real-world subject lines and message bodies makes this a vital tool for understanding the nuances of modern internet-based messaging.

Columns

  • text: The raw content of the message, which may include subject lines and the main body of the email or SMS. This field contains a diverse array of words, phrases, and symbols typically found in real-world electronic correspondence.
  • spam: A numerical classification label where a value of 1 signifies that the message is unsolicited spam, while a value of 0 indicates the message is legitimate, non-spam "ham."
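The two-column layout described above can be read and sanity-checked with the Python standard library alone. The following is a minimal sketch, not a reference loader: the function name load_messages is hypothetical, the sample rows are invented stand-ins rather than records from the dataset, and it assumes the file's header row uses exactly the column names text and spam.

```python
import csv
import io


def load_messages(source):
    """Read (text, label) pairs from a CSV stream with 'text' and 'spam' columns."""
    reader = csv.DictReader(source)
    if reader.fieldnames is None or not {"text", "spam"} <= set(reader.fieldnames):
        raise ValueError("expected 'text' and 'spam' columns")
    rows = []
    for record_no, row in enumerate(reader, start=1):
        # A short row yields None for the missing field, so guard before stripping.
        label = (row.get("spam") or "").strip()
        if label not in {"0", "1"}:
            raise ValueError(f"record {record_no}: unexpected label {label!r}")
        rows.append((row["text"], int(label)))
    return rows


# Tiny in-memory stand-in for emails.csv (invented rows, not taken from the dataset).
sample = io.StringIO(
    "text,spam\n"
    '"Subject: win a free prize now",1\n'
    '"Subject: meeting moved to 3pm",0\n'
)
messages = load_messages(sample)
print(messages)
```

With a downloaded copy, the same function could be called on open("emails.csv", newline="", encoding="utf-8"); the encoding is an assumption, since the listing does not state one.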

Distribution

The data is delivered as a single CSV file, emails.csv, with a size of 8.95 MB. It contains 5,728 records across 2 columns, with no missing or mismatched entries reported (a 100% validity rate). The text column holds 5,695 unique values, which implies that 33 records repeat a message found elsewhere in the file. Updates are expected annually.
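The counts quoted above (total records, unique text values) are easy to verify on a downloaded copy, and the gap between them is the number of duplicate records. Below is a small sketch under the assumption that the rows have already been loaded as (text, label) pairs; the helper name summarise and the three example rows are illustrative only.

```python
from collections import Counter


def summarise(rows):
    """Return record, unique-text, duplicate, and per-label counts for (text, label) pairs."""
    texts = [text for text, _ in rows]
    unique = len(set(texts))
    return {
        "records": len(rows),
        "unique_texts": unique,
        # Records beyond the first occurrence of each distinct text.
        "duplicate_records": len(rows) - unique,
        "label_counts": dict(Counter(label for _, label in rows)),
    }


# Invented rows for illustration; the third repeats the first.
rows = [("win cash now", 1), ("lunch at noon?", 0), ("win cash now", 1)]
stats = summarise(rows)
print(stats)
```

The label_counts entry also shows the spam-to-ham balance, which the listing does not report and which is worth checking before choosing an evaluation metric.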

Usage

This resource is ideal for training supervised machine learning models, specifically for binary text classification tasks. It is well-suited for practicing natural language processing techniques such as tokenisation, vectorisation, and the implementation of Naive Bayes algorithms. Additionally, developers can use the records to benchmark the performance of spam filters and to study the common characteristics of fraudulent or promotional internet traffic.
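To make the tokenisation, vectorisation, and Naive Bayes steps mentioned above concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier over word counts with add-one smoothing. It is one possible approach rather than a method prescribed by the dataset: the class and function names are hypothetical, the training sentences are invented, and the regular-expression tokeniser is a deliberately simple assumption.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        labels = list(labels)
        if set(labels) != {0, 1}:
            raise ValueError("training data must contain both labels 0 and 1")
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        # Vectorisation step: accumulate bag-of-words counts per class.
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenise(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_score(self, tokens, label):
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # log prior
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        tokens = tokenise(text)
        return 1 if self._log_score(tokens, 1) > self._log_score(tokens, 0) else 0


# Invented training examples using the dataset's convention: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "meeting moved to 3pm",
    "please review the attached schedule",
]
train_labels = [1, 1, 0, 0]
model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(model.predict("claim your free prize"), model.predict("schedule for the meeting"))
```

Working in log space avoids numeric underflow on long messages, and the add-one term keeps a word seen in only one class from driving the other class's probability to zero. On the full dataset, a held-out test split would be needed before reading anything into the predictions.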

Coverage

The scope encompasses a wide variety of electronic text messages found across the internet, reflecting real-world communication styles and diverse linguistic structures. While specific geographic demographics are not explicitly detailed, the content focuses on English-language messaging typical of global internet users. The data provides a snapshot of various message types, from professional scheduling requests to commercial solicitations.

License

CC0: Public Domain

Who Can Use It

Data science beginners can leverage these records to learn the fundamentals of classification and supervised learning. Machine learning engineers may utilise the text samples to develop and refine more sophisticated NLP models for messaging platforms. Furthermore, cybersecurity researchers can use these patterns to better understand the evolving nature of unsolicited messaging and the tactics used in fraudulent digital communications.

Dataset Name Suggestions

  • Classified Email and SMS Spam Repository
  • Supervised Text Classification: Spam vs Ham
  • Electronic Messaging Security Archive
  • Internet Spam Detection Training Set
  • Binary Email Labelling and Text Metrics

Attributes

Listing Stats

VIEWS

4

DOWNLOADS

3

LISTED

30/12/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
