Text Message Spam/Ham Dataset
Data Science and Analytics
Free
About
This dataset supports training machine learning models that classify SMS messages as spam or not spam (the latter often called 'ham'). It is a collection of real, non-encoded English SMS messages, each labelled as legitimate or unsolicited. This makes it useful for mobile phone spam research: building automated tools that identify and block spam, studying the characteristics of spam messages, and devising strategies for avoiding them.
Columns
- sms: This column contains the actual text content of the SMS message. (String)
- label: This column provides the classification for each SMS message, indicating whether it is 'ham' (legitimate) or 'spam' (unsolicited). (String)
- There are 5,171 unique SMS message texts.
- Label counts: 4,827 messages are labelled 'ham' and 747 are labelled 'spam'.
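As a quick sanity check, the label counts listed above can be recomputed from the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with columns named sms and label, as described above. The function name label_counts and the three in-memory sample rows are hypothetical, made up purely for illustration.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count rows per label in a CSV that has 'sms' and 'label' columns."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"].strip().lower() for row in reader)


# Tiny in-memory stand-in for the real file (made-up rows, illustration only).
sample = io.StringIO(
    "sms,label\n"
    '"Are we still meeting at 6?",ham\n'
    '"WINNER! Claim your free prize now, reply YES",spam\n'
    '"Ok, see you then",ham\n'
)
counts = label_counts(sample)
print(counts)

# With the real file, assuming it is named train.csv:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     print(label_counts(f))
```

If the counts from the real file differ from the figures listed here, the column names or label encoding of that particular copy are worth checking first.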
Distribution
The dataset is typically provided in CSV format, for example as train.csv. It contains 5,574 individual SMS messages. Each record has two fields: the message text and its corresponding label (ham or spam).
Usage
- Training machine learning models to effectively distinguish between legitimate and spam SMS messages.
- Developing tools capable of automatically identifying and blocking unwanted messages on mobile phones.
- Conducting academic or industry research into the evolving nature and characteristics of spam messages.
- Formulating strategies and preventative measures for users to identify and avoid unsolicited communications.
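To make the first use case concrete, here is a minimal sketch of one common baseline for this kind of task: a multinomial Naive Bayes classifier with add-one (Laplace) smoothing over word counts. It is not a method prescribed by the dataset, only an illustration written with the Python standard library. The class name NaiveBayesSpamFilter, the tokenize helper, and the five training messages are all hypothetical. In practice the training data would come from the sms and label columns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.doc_counts = Counter(labels)  # number of messages per class
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen in training.
        self.vocab = set()
        for counter in self.word_counts.values():
            self.vocab.update(counter)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training rows, for illustration only.
messages = [
    "Win a free cash prize now",
    "Urgent claim your free prize today",
    "Are you coming home for dinner",
    "See you at the meeting tomorrow",
    "Call me when you get home",
]
labels = ["spam", "spam", "ham", "ham", "ham"]

clf = NaiveBayesSpamFilter().fit(messages, labels)
print(clf.predict("claim your free cash prize"))
print(clf.predict("see you at home for dinner"))
```

Because roughly one message in seven is spam, accuracy alone can look good for a model that rarely predicts spam. Per-class precision and recall on a held-out split give a more honest picture.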
Coverage
This dataset covers SMS messages globally. The messages are in English, representing real and non-encoded content. While a specific time range for data collection isn't provided, it is a public set collected for mobile phone spam research.
License
CC0
Who Can Use It
- Data Scientists and Machine Learning Engineers: For developing and refining text classification models.
- Mobile Security Developers: To create or enhance spam filtering applications.
- Academic Researchers: For studies on unsolicited communication patterns and natural language processing.
- Analysts: To gain insights into the properties of spam messages.
Dataset Name Suggestions
- SMS Spam Collection
- SMS Message Classifier Data
- Mobile Spam Detection Dataset
- Text Message Spam/Ham Data
Attributes
Original Data Source: SMS Spam Collection (Text Classification)