Opendatabay APP

Ham & Spam SMS Dataset

Telecommunications & Network Data

Tags and Keywords

Text

Email

Intermediate

NLP

Binary

Free

About

This dataset contains 5,574 SMS messages, each labelled as either 'ham' (legitimate) or 'spam'. It was compiled for SMS spam research and supports the development of models that classify text messages. It is a practical starting point for natural language processing (NLP) and binary classification projects in the telecommunications domain.

Columns

  • v1: This column contains the label for each SMS message, indicating whether it is 'ham' (legitimate) or 'spam'. It comprises two distinct classes.
  • v2: This column holds the raw text content of the SMS message. It is the core textual data for analysis.
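
As a minimal sketch of how the two columns might be read, the snippet below parses a small in-memory sample with Python's standard csv module. It assumes a comma-separated file with a header row of v1,v2, which the listing does not confirm. The sample rows and the helper name load_messages are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the assumed layout: header row, then label,text.
SAMPLE = '''v1,v2
ham,"Ok, see you at the library at 5"
spam,"WINNER! Claim your free prize now, reply YES"
ham,Running late. Start without me
'''

def load_messages(handle):
    """Return a list of (label, text) pairs from a v1/v2 CSV stream."""
    rows = []
    for record in csv.DictReader(handle):
        label = record["v1"].strip().lower()
        # The listing describes exactly two classes, so reject anything else.
        if label not in ("ham", "spam"):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((label, record["v2"]))
    return rows

messages = load_messages(io.StringIO(SAMPLE))
```

For a file on disk, the same function can take an open file handle instead of the StringIO object. The listing does not state the file's encoding, so that would need to be checked before opening it.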

Distribution

The dataset consists of 5,574 SMS messages, one per line, each with two columns: the label and the message text. The data is distributed in a text-based format. About 87% of the messages are 'ham' (legitimate) and about 13% are 'spam'. The text column holds 5,171 unique values, so 403 rows repeat a message that already appears elsewhere in the dataset.
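
The figures in this section can be checked with a few lines of Python. The sketch below is illustrative only: the function name summarize and the toy rows are hypothetical, and the input is assumed to be a list of (label, text) pairs.

```python
from collections import Counter

def summarize(rows):
    """Compute class shares and the duplicate count for (label, text) rows."""
    rows = list(rows)
    total = len(rows)
    label_counts = Counter(label for label, _ in rows)
    shares = {label: count / total for label, count in label_counts.items()}
    unique_texts = len({text for _, text in rows})
    return {
        "total": total,
        "shares": shares,
        # Rows whose text already appears elsewhere in the data.
        "duplicates": total - unique_texts,
        # Accuracy of always predicting the most common label.
        "majority_baseline": max(shares.values()),
    }

# Figures quoted in the listing: 5,574 rows and 5,171 unique texts.
listed_duplicates = 5574 - 5171

# Hypothetical toy rows, not taken from the dataset.
toy = [("ham", "ok"), ("ham", "ok"), ("ham", "see you"), ("spam", "free prize")]
stats = summarize(toy)
```

With the listed split of roughly 87% ham, a model that always predicts 'ham' would already be right about 87% of the time, so precision and recall on the spam class are more informative than accuracy alone. Removing duplicate texts before splitting into training and test sets also avoids the same message appearing on both sides.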

Usage

This dataset is ideally suited for:
  • Developing and training machine learning models for SMS spam detection.
  • Conducting research in Natural Language Processing (NLP), particularly for text categorisation.
  • Implementing binary classification algorithms to distinguish between legitimate and unsolicited messages.
  • Exploring text analytics and pattern recognition in short message services.
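
To make the first and third use cases concrete, here is a minimal sketch of one common baseline, a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is not a method prescribed by the listing or by the dataset's compilers. The class name NaiveBayes, the tokenizer, and the toy training rows are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows):
        self.doc_counts = Counter()   # messages seen per label
        self.word_counts = {}         # label -> Counter of token counts
        self.vocab = set()
        for label, text in rows:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)   # log prior
            for token in tokenize(text):
                if token in self.vocab:   # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy messages, not taken from the dataset.
train = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("ham", "see you at the lecture tomorrow"),
    ("spam", "win a free prize call now"),
    ("spam", "free entry claim your cash prize now"),
]
model = NaiveBayes().fit(train)
```

Calling model.predict("claim your free prize") on this toy model returns 'spam', and model.predict("see you at lunch") returns 'ham'. On the real data, the model would be trained on the v2 texts with the v1 labels and evaluated on held-out messages.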

Coverage

The messages come from several sources, including a UK forum where users reported SMS spam and a large collection of legitimate messages written mainly by Singaporean university students. Although the messages were collected in specific regions, the dataset is listed as global in scope and can be used for research and applications elsewhere. The available information does not state when the messages were collected.

License

CC0

Who Can Use It

This dataset is beneficial for:
  • Data Scientists: Building and evaluating machine learning models for text classification and spam filtering.
  • Machine Learning Engineers: Developing and deploying automated spam detection systems in telecommunications.
  • Researchers: Studying natural language processing, data mining, and communication security.
  • Students: Working on academic projects that require text analysis and classification.

Dataset Name Suggestions

  • SMS Message Spam-Ham Classification
  • Text Message Spam Detection Dataset
  • Mobile SMS Content Classifier
  • Ham & Spam SMS Dataset
  • Short Message Service Categorisation Data

Attributes

Original Data Source: Ham & Spam Messages Dataset

Listing Stats

Views: 0
Downloads: 0
Listed: 17/06/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
