Ham & Spam SMS Dataset
Telecommunications & Network Data
Free
About
This dataset contains 5,574 SMS messages, each labelled as either 'ham' (legitimate) or 'spam'. It was compiled for SMS spam research and supports the development of predictive models that classify text messages. It suits natural language processing (NLP) projects and binary classification tasks in the telecommunications domain.
Columns
- v1: The label for each SMS message, one of two classes: 'ham' (legitimate) or 'spam'.
- v2: The raw text of the SMS message. This is the main input for analysis.
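As a minimal sketch of reading the two columns, the snippet below parses a small in-memory sample with Python's standard csv module. It assumes a comma-separated layout with a v1,v2 header row, which the description does not confirm. The sample rows are invented for illustration and the read_messages helper is hypothetical.

```python
import csv
import io

# Invented rows for illustration only; they are not taken from the dataset.
# Assumption: comma-separated values with a "v1,v2" header row.
SAMPLE = """v1,v2
ham,"Running late, see you at 6"
spam,"WIN a free prize! Reply YES now"
ham,Ok thanks
"""

def read_messages(handle):
    """Return a list of (label, text) pairs from a CSV with v1 and v2 columns."""
    rows = []
    for record in csv.DictReader(handle):
        label = record["v1"].strip().lower()
        if label not in {"ham", "spam"}:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((label, record["v2"]))
    return rows

messages = read_messages(io.StringIO(SAMPLE))
print(messages[0])  # ('ham', 'Running late, see you at 6')
```

For a file on disk, the same function should accept an open file handle, for example one created with open(path, newline=""). The correct encoding and delimiter depend on the copy of the data you download.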
Distribution
The dataset consists of 5,574 SMS messages, one per line, each with two columns: the label (v1) and the message text (v2). The data files are typically in a text-based format. Approximately 87% of the messages are 'ham' (legitimate) and 13% are 'spam'. The text column holds 5,171 unique values, so 403 entries repeat a message that appears elsewhere in the dataset.
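The figures above can be checked with a few lines of Python. The sketch below computes label shares, the number of repeated texts, and the accuracy of a classifier that always predicts the most common label. With roughly 87% 'ham', such a classifier would already score about 87%, so accuracy alone is a weak measure on this data. The summarise helper and the toy rows are hypothetical and not part of the dataset.

```python
from collections import Counter

def summarise(rows):
    """Summarise (label, text) pairs: label shares, repeated texts, majority baseline."""
    labels = Counter(label for label, _ in rows)
    total = sum(labels.values())
    if total == 0:
        raise ValueError("no rows to summarise")
    shares = {label: count / total for label, count in labels.items()}
    unique_texts = len({text for _, text in rows})
    return {
        "total": total,
        "shares": shares,
        "repeated": total - unique_texts,          # rows whose text occurs earlier
        "majority_baseline": max(shares.values()),  # accuracy of always guessing the top label
    }

# Invented toy rows, not taken from the dataset.
toy = [("ham", "ok"), ("ham", "ok"), ("ham", "on my way"), ("spam", "free prize")]
summary = summarise(toy)
print(summary)

# Using the counts quoted in the description: total rows minus unique texts.
repeated_in_dataset = 5574 - 5171
print(repeated_in_dataset)  # 403
```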
Usage
This dataset is ideally suited for:
- Developing and training machine learning models for SMS spam detection.
- Conducting research in Natural Language Processing (NLP), particularly for text categorisation.
- Implementing binary classification algorithms to distinguish between legitimate and unsolicited messages.
- Exploring text analytics and pattern recognition in short message services.
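To illustrate the binary classification use case, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. Naive Bayes is chosen here as a common baseline for text classification; the description does not prescribe any particular model. The training rows are invented, and the tokenize function and NaiveBayes class are hypothetical names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows):
        self.doc_counts = Counter()   # number of messages per label
        self.word_counts = {}         # label -> Counter of token frequencies
        self.vocab = set()
        for label, text in rows:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training rows, not taken from the dataset.
train = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("ham", "thanks see you tomorrow"),
    ("spam", "win a free prize claim now"),
    ("spam", "free entry text win to claim your prize"),
]
model = NaiveBayes().fit(train)
print(model.predict("claim your free prize"))  # spam
print(model.predict("see you at lunch"))       # ham
```

On the real data, a held-out test split and per-class metrics such as precision and recall for 'spam' would give a fairer picture than accuracy, given the class imbalance.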
Coverage
The messages come from several sources, including a UK forum where users reported SMS spam and a large collection of legitimate messages, mostly from Singaporean university students. Although the messages were collected in specific regions, the dataset remains relevant to research and applications elsewhere. The available information does not specify a time range for the original data collection.
License
CC0
Who Can Use It
This dataset is beneficial for:
- Data Scientists: Building and evaluating machine learning models for text classification and spam filtering.
- Machine Learning Engineers: Developing and deploying automated spam detection systems in telecommunications.
- Researchers: Studying natural language processing, data mining, and communication security.
- Students: Working on academic projects that require text analysis and classification.
Dataset Name Suggestions
- SMS Message Spam-Ham Classification
- Text Message Spam Detection Dataset
- Mobile SMS Content Classifier
- Ham & Spam SMS Dataset
- Short Message Service Categorisation Data
Attributes
Original Data Source: Ham & Spam Messages Dataset