Classified Email and SMS Spam Repository
Data Science and Analytics
Free
About
Distinguishing between legitimate messages and unsolicited, fraudulent communications is a foundational task in digital security. This collection of labelled email and SMS text provides the raw material for developing and evaluating automated filters that identify unwanted content. Each record carries a binary "spam" or "ham" label, which enables the exploration of linguistic patterns that separate personal or professional correspondence from commercial promotions and scams. The inclusion of real-world subject lines and message bodies makes it useful for studying the nuances of modern internet-based messaging.
Columns
- text: The raw content of the message, which may include subject lines and the main body of the email or SMS. This field contains a diverse array of words, phrases, and symbols typically found in real-world electronic correspondence.
- spam: A numerical classification label where a value of 1 signifies that the message is unsolicited spam, while a value of 0 indicates the message is legitimate, non-spam "ham."
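The two-column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the helper names load_messages and summarise are hypothetical, the sample rows are invented for illustration, and the file path and encoding in the trailing comment are assumptions.

```python
import csv
import io


def load_messages(handle):
    """Read (text, label) pairs from a CSV with 'text' and 'spam' columns."""
    rows = []
    for record in csv.DictReader(handle):
        # The 'spam' column holds 1 for spam and 0 for ham.
        rows.append((record["text"], int(record["spam"])))
    return rows


def summarise(rows):
    """Count records, distinct texts, and the spam/ham split."""
    texts = [text for text, _ in rows]
    return {
        "records": len(rows),
        "unique_texts": len(set(texts)),
        "spam": sum(1 for _, label in rows if label == 1),
        "ham": sum(1 for _, label in rows if label == 0),
    }


# Invented sample rows standing in for the real file.
SAMPLE = io.StringIO(
    "text,spam\n"
    '"Subject: win a free prize now",1\n'
    '"Subject: meeting moved to 3pm",0\n'
    '"Subject: win a free prize now",1\n'
)

rows = load_messages(SAMPLE)
summary = summarise(rows)
print(summary)

# With the downloaded file (path and encoding are assumptions):
# with open("emails.csv", newline="", encoding="utf-8") as f:
#     rows = load_messages(f)
```

Running summarise on the real file is a quick way to compare the record and unique-text counts against the figures reported under Distribution.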
Distribution
The information is delivered in a single CSV file titled emails.csv with a size of 8.95 MB. It consists of 5,728 valid records structured across 2 distinct columns. The data maintains a 100% validity rate with no missing or mismatched entries reported. The repository features 5,695 unique text values, and updates are expected to occur on an annual basis.
Usage
This resource is ideal for training supervised machine learning models, specifically for binary text classification tasks. It is well-suited for practicing natural language processing techniques such as tokenisation, vectorisation, and the implementation of Naive Bayes algorithms. Additionally, developers can use the records to benchmark the performance of spam filters and to study the common characteristics of fraudulent or promotional internet traffic.
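As an illustration of the tokenisation and Naive Bayes steps mentioned above, here is a minimal standard-library sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is one possible approach, not a method prescribed by the dataset. The function names (tokenise, train, predict, accuracy) are hypothetical, the training and test messages are invented, and the sketch assumes both classes appear in the training data.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(rows):
    """Collect per-class word counts and document counts from (text, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenise(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return 1 (spam) or 0 (ham), whichever class has the higher log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior; assumes each class has at least one training document.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenise(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, rows):
    """Fraction of (text, label) pairs the model labels correctly."""
    return sum(predict(model, text) == label for text, label in rows) / len(rows)


# Invented toy messages; real use would split emails.csv into train and test sets.
TRAIN = [
    ("win a free prize now", 1),
    ("free cash offer click now", 1),
    ("meeting moved to monday morning", 0),
    ("please review the attached schedule", 0),
]
TEST = [
    ("claim your free prize", 1),
    ("schedule for monday meeting", 0),
]

model = train(TRAIN)
print(predict(model, "claim your free prize"))
print(accuracy(model, TEST))
```

For benchmarking on the full file, hold out a portion of the records for testing and consider reporting precision and recall per class alongside accuracy, since the spam and ham classes may not be evenly balanced.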
Coverage
The scope encompasses a wide variety of electronic text messages found across the internet, reflecting real-world communication styles and diverse linguistic structures. While specific geographic demographics are not explicitly detailed, the content focuses on English-language messaging typical of global internet users. The data provides a snapshot of various message types, from professional scheduling requests to commercial solicitations.
License
CC0: Public Domain
Who Can Use It
Data science beginners can leverage these records to learn the fundamentals of classification and supervised learning. Machine learning engineers may utilise the text samples to develop and refine more sophisticated NLP models for messaging platforms. Furthermore, cybersecurity researchers can use these patterns to better understand the evolving nature of unsolicited messaging and the tactics used in fraudulent digital communications.
Dataset Name Suggestions
- Classified Email and SMS Spam Repository
- Supervised Text Classification: Spam vs Ham
- Electronic Messaging Security Archive
- Internet Spam Detection Training Set
- Binary Email Labelling and Text Metrics
Attributes
Original Data Source: Classified Email and SMS Spam Repository
