Spam and Ham Message Classification Data
Data Science and Analytics
Free
About
A collection of email messages designed for identifying and classifying unsolicited messages. The resource comprises full email texts that are pre-labelled as either spam or non-spam (often called 'ham'). It is aimed at researchers and data scientists who develop and test spam detection algorithms, particularly those working in natural language processing (NLP), machine learning, and email filtering. The content covers a wide range of topics and communication styles drawn from real-world examples.
Columns
- text: The full content of the individual email message. This column contains 5,695 unique values.
- spam: A binary label indicating the classification of the email. A value of 1 signifies a spam message, while 0 indicates a non-spam (ham) message.
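As a minimal sketch of how the two columns above might be read, the snippet below parses a delimited file with a `text` and a `spam` header using only the Python standard library. The helper name `load_messages`, the default comma delimiter, and the sample messages are assumptions made for illustration; they are not part of the dataset.

```python
import csv
import io


def load_messages(handle, delimiter=","):
    """Read (text, label) pairs from a delimited file with 'text' and 'spam' columns.

    The delimiter is a parameter because it is an assumption here; pass "\t"
    if the file turns out to be tab-separated.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for record in reader:
        label = int(record["spam"])
        if label not in (0, 1):
            raise ValueError(f"unexpected spam label: {label!r}")
        rows.append((record["text"], label))
    return rows


# Tiny in-memory stand-in for the real file (made-up messages, not real data).
sample = io.StringIO(
    "text,spam\n"
    '"Subject: win a prize, click now",1\n'
    '"Subject: meeting notes attached",0\n'
)
rows = load_messages(sample)
print(rows)
```

With the real file, the same function could be called on `open("emails.csv", newline="", encoding="utf-8")`; the encoding is likewise an assumption.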
Distribution
The dataset is described as being in a tab-delimited format, although the file itself carries a .csv extension; either way, it imports easily into common data analysis tools.
- Size and Structure: The dataset contains 5,728 total observations (records), making it suitable for robust analysis. The file is named emails.csv and has a size of 8.95 MB. It consists of 2 columns.
- Data Integrity: Data integrity is high, with 5,728 valid records reported and zero missing or mismatched values in either column.
- Label Distribution: The data shows a mean 'spam' value of 0.24. The distribution includes 4,360 non-spam labels (0) and 1,368 spam labels (1).
- Update Frequency: The expected update frequency is "Never."
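The figures above can be cross-checked with a few lines of arithmetic. The sketch below rebuilds a label list from the reported counts (not from the actual file) and confirms that they are consistent with the stated total and mean. The function name `summarise_labels` is hypothetical.

```python
from collections import Counter


def summarise_labels(labels):
    """Return the total, per-class counts, and spam rate for a list of 0/1 labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels supplied")
    return {
        "total": total,
        "ham": counts[0],
        "spam": counts[1],
        "spam_rate": counts[1] / total,
    }


# Labels reconstructed from the reported counts, not read from the file.
labels = [0] * 4360 + [1] * 1368
summary = summarise_labels(labels)
print(summary["total"], round(summary["spam_rate"], 2))  # 5728 0.24

# If "unique values" means distinct texts, this many records repeat an earlier text.
repeated_texts = 5728 - 5695
print(repeated_texts)  # 33
```

One practical consequence of the imbalance is that a classifier which always predicts ham would already score about 76% accuracy (4,360 of 5,728), so accuracy alone is a weak yardstick for models trained on this data.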
Usage
This data product is ideally applied in the following areas:
- Developing and testing machine learning models specifically for binary classification tasks.
- Training and evaluating effective spam detection algorithms.
- Refining natural language processing techniques tailored to message content analysis.
- Building effective email filtering systems and anti-spam technologies.
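As one possible starting point for the binary classification use cases listed above, here is a small multinomial naive Bayes baseline written with the standard library only. It is a sketch under stated assumptions, not a recommended production filter: the tokenizer, the add-one smoothing, and the four training messages are all illustrative choices, and both classes are assumed to appear in the training data.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


def train(rows):
    """Count words per class from (text, label) pairs, label 0 = ham, 1 = spam."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the higher log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior; assumes each class has at least one training message.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training messages for illustration only.
train_rows = [
    ("win a free prize now", 1),
    ("free money click here", 1),
    ("meeting agenda for monday", 0),
    ("please review the attached report", 0),
]
model = train(train_rows)
print(predict(model, "claim your free prize"))
print(predict(model, "monday meeting report"))
```

On the real data, the (text, spam) pairs would replace `train_rows`, with a portion of the records held out for evaluation.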
Coverage
The emails cover a wide range of topics and styles, so the content reflects real-world examples of both spam and non-spam correspondence. The source does not specify the geographic, temporal, or demographic scope of the message contents.
License
CC0: Public Domain
Who Can Use It
This data is highly suitable for professionals and academics engaged in technology and analysis:
- Data Scientists: For training and testing classification and prediction models.
- Machine Learning Engineers: To develop and optimise complex email filtering algorithms.
- Researchers: Studying natural language processing, text analytics, and the characteristics of unsolicited electronic mail.
Dataset Name Suggestions
- Email Spam Detection Corpus
- Spam and Ham Message Classification Data
- ML Email Filtering Training Set
- NLP Spam Messages
Attributes
Original Data Source: Spam and Ham Message Classification Data
