Opendatabay APP

Spam and Ham Message Classification Data

Data Science and Analytics

Tags and Keywords

  • Spam
  • Email
  • NLP
  • Text
  • Classification

Free

About

A collection of email messages designed for identifying and classifying unsolicited mail. It comprises full email texts that are pre-labelled as either spam or non-spam (often referred to as 'ham'). It is intended for researchers and data scientists developing and testing spam detection algorithms, particularly those working in natural language processing (NLP), machine learning, and email filtering. The content covers a wide range of topics and communication styles drawn from real-world examples.

Columns

  • text: The full content of the individual email message. This column contains 5,695 unique values across the 5,728 records, which implies that 33 rows repeat a text found elsewhere in the file.
  • spam: A binary label indicating the classification of the email. A value of 1 signifies a spam message, while 0 indicates a non-spam (ham) message.
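As a minimal sketch of how the two columns might be read and checked, the snippet below parses rows with Python's standard csv module and counts unique texts. The helper name load_emails and the sample rows are invented for illustration and are not taken from the dataset; the comma delimiter is an assumption that should be confirmed against the actual file.

```python
import csv
import io

def load_emails(fileobj, delimiter=","):
    """Read rows with 'text' and 'spam' columns into (text, label) tuples."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    rows = []
    for record in reader:
        label = int(record["spam"])
        # The listing documents only two label values: 1 (spam) and 0 (ham).
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["text"], label))
    return rows

# Invented sample standing in for emails.csv; not real dataset content.
sample = io.StringIO(
    "text,spam\n"
    '"Subject: win a free prize now",1\n'
    '"Subject: meeting notes, attached",0\n'
    '"Subject: win a free prize now",1\n'
)

rows = load_emails(sample)
unique_texts = {text for text, _ in rows}
print(len(rows), len(unique_texts))  # 3 rows, 2 unique texts
```

For the real file, the same function could be called on open("emails.csv", newline="", encoding="utf-8"), assuming UTF-8 encoding, with delimiter="\t" if the file turns out to be tab-separated.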

Distribution

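The listing describes the dataset as tab-delimited, although the file carries a .csv extension, so it is worth confirming the delimiter on import. Either way, it is a two-column delimited text file that loads easily into common data analysis tools.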
  • Size and Structure: The dataset contains 5,728 observations (records) in 2 columns. The file is named emails.csv and is 8.95 MB in size.
  • Data Integrity: All 5,728 records are reported as valid, with zero missing or mismatched values in either column.
  • Label Distribution: The mean 'spam' value is 0.24, meaning roughly 24% of messages are spam. The distribution includes 4,360 non-spam labels (0) and 1,368 spam labels (1).
  • Update Frequency: The expected update frequency is "Never."
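The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in this listing; the variable names are illustrative. It also derives the accuracy of a classifier that always predicts ham, which is a useful floor when judging models on imbalanced labels.

```python
# Counts as stated in the listing.
TOTAL = 5728
HAM = 4360
SPAM = 1368
UNIQUE_TEXTS = 5695

# The two label counts should account for every record.
assert HAM + SPAM == TOTAL

spam_rate = SPAM / TOTAL            # mean of the 'spam' column
majority_baseline = HAM / TOTAL     # accuracy of always predicting ham
surplus_rows = TOTAL - UNIQUE_TEXTS # rows beyond the unique texts

print(f"spam rate: {spam_rate:.4f}")                  # spam rate: 0.2388
print(f"majority baseline: {majority_baseline:.4f}")  # majority baseline: 0.7612
print(f"surplus rows: {surplus_rows}")                # surplus rows: 33
```

Because about 76% of messages are ham, plain accuracy can look high for a model that rarely flags spam, so precision and recall on the spam class are usually more informative.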

Usage

This data product is ideally applied in the following areas:
  • Developing and testing machine learning models specifically for binary classification tasks.
  • Training and evaluating effective spam detection algorithms.
  • Refining natural language processing techniques tailored to message content analysis.
  • Building effective email filtering systems and anti-spam technologies.
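To make the binary classification use case concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is one common baseline for spam filtering, not a method prescribed by the listing. The function names and the four training messages are invented for illustration; real use would train on the text and spam columns.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train_naive_bayes(rows):
    """Count words and documents per label from (text, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return 1 (spam) or 0 (ham) by comparing log-probability scores.

    Assumes both labels appeared at least once during training.
    """
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy messages, not taken from the dataset.
toy_rows = [
    ("win a free prize now", 1),
    ("free money claim your prize", 1),
    ("meeting agenda for monday", 0),
    ("please review the attached report", 0),
]
model = train_naive_bayes(toy_rows)
print(predict(model, "claim your free prize"))   # 1
print(predict(model, "monday meeting report"))   # 0
```

On the full dataset, a held-out test split that preserves the spam-to-ham ratio would give a fairer estimate of performance than scoring on the training messages.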

Coverage

The emails cover a wide range of topics and styles, reflecting real-world examples of both spam and non-spam correspondence. The listing does not specify a geographic, time-range, or demographic scope for the message contents.

License

CC0: Public Domain

Who Can Use It

This data is suited to professionals and academics working in data analysis and machine learning:
  • Data Scientists: To train and test classification and prediction models.
  • Machine Learning Engineers: To develop and optimise email filtering algorithms.
  • Researchers: To study natural language processing, text analytics, and the characteristics of unsolicited electronic mail.

Dataset Name Suggestions

  • Email Spam Detection Corpus
  • Spam and Ham Message Classification Data
  • ML Email Filtering Training Set
  • NLP Spam Messages

Attributes

Listing Stats

  • Views: 4
  • Downloads: 0
  • Listed: 07/11/2025
  • Region: Global
  • UDQS Quality (Universal Data Quality Score): 5 / 5
  • Version: 1.0
