Email Spam Message Classification Dataset
Data Science and Analytics
Free
About
This dataset consists of raw mail messages intended for spam classification and natural language processing (NLP) pre-processing, and for building models that identify unwanted emails. The messages mix plain text, headers, and some HTML tags, so they exercise a range of NLP techniques. The dataset suits newcomers to NLP who want to practise tokenisation, stop word removal, stemming, and HTML tag parsing, and it can be used with common NLP libraries for vectorisation and analysis.
Columns
- Category: Binary label for each mail message, where '1' indicates spam and '0' indicates not spam.
- Message: Contains the raw text content of the mail message, which may combine plain text, headers, and HTML tags.
- File_Name: Provides a unique identifier for each message within the dataset.
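As a minimal sketch of reading these three columns, the snippet below parses a small in-memory CSV with Python's standard `csv` module. The sample rows, the `msg_0001` style identifiers, and the column order are hypothetical; only the column names Category, Message, and File_Name come from the description above.

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout; the rows,
# identifiers, and column order are assumptions for illustration only.
SAMPLE_CSV = (
    "Category,Message,File_Name\n"
    '0,"Subject: lunch\n\nAre we still on for noon?",msg_0001\n'
    '1,"<html><body>WIN a FREE prize now!</body></html>",msg_0002\n'
)


def load_messages(handle):
    """Read rows and convert Category to an int label (1 = spam, 0 = not spam)."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {
                "label": int(row["Category"]),
                "message": row["Message"],
                "file_name": row["File_Name"],
            }
        )
    return rows


rows = load_messages(io.StringIO(SAMPLE_CSV))
for r in rows:
    # Print the identifier, the label, and the message length in characters.
    print(r["file_name"], r["label"], len(r["message"]))
```

When reading a real file, passing `newline=""` to `open()` lets the `csv` module handle line breaks inside quoted messages. The file's encoding is not stated in the listing, so any value given for `encoding` would be an assumption to verify against the data.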
Distribution
The dataset is typically provided as a CSV file containing 5,796 records: 3,900 messages classified as 'not spam' (Category 0) and 1,896 classified as 'spam' (Category 1). All 5,796 file names are unique, but there are only 5,625 unique message entries, so 171 records repeat message text found in another record.
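The published counts imply a spam share of roughly 32.7% (1,896 of 5,796) and 171 rows whose message text repeats another row (5,796 minus 5,625). The sketch below shows one way to compute the same summary from loaded rows and to drop repeats before training. The row structure (dicts with `label` and `message` keys) and the toy data are assumptions for illustration, and the listing does not say whether repeated messages always carry the same label.

```python
from collections import Counter


def summarise(rows):
    """Return class counts, spam share, and how many rows repeat an earlier message."""
    labels = Counter(r["label"] for r in rows)
    total = len(rows)
    unique_messages = len({r["message"] for r in rows})
    return {
        "total": total,
        "not_spam": labels[0],
        "spam": labels[1],
        "spam_share": labels[1] / total if total else 0.0,
        "repeated_rows": total - unique_messages,
    }


def drop_repeated(rows):
    """Keep the first row for each distinct message text."""
    seen = set()
    kept = []
    for r in rows:
        if r["message"] not in seen:
            seen.add(r["message"])
            kept.append(r)
    return kept


# Toy rows (hypothetical) to exercise the helpers.
toy = [
    {"label": 0, "message": "hello"},
    {"label": 1, "message": "win cash"},
    {"label": 1, "message": "win cash"},
]
summary = summarise(toy)
deduplicated = drop_repeated(toy)
print(summary)
print(len(deduplicated))

# Figures implied by the published counts.
published_spam_share = round(1896 / 5796, 3)
published_repeated_rows = 5796 - 5625
print(published_spam_share, published_repeated_rows)
```

With about two 'not spam' messages for every spam message, the classes are moderately imbalanced, so precision and recall for the spam class are likely to be more informative than accuracy alone.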
Usage
This dataset is ideal for a variety of applications, including:
- Spam Detection and Classification: Training machine learning models to identify and filter spam emails.
- Natural Language Processing (NLP) Pre-processing: Practising fundamental NLP steps such as tokenisation, removing stop words, stemming, and parsing HTML tags.
- Text Classification: Building and evaluating text classification algorithms.
- Library Compatibility: Working with NLP libraries for tasks like vectorisation and feature extraction.
- Educational Purposes: A valuable resource for individuals entering the field of NLP and data science.
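The pre-processing steps named above can be sketched with the Python standard library alone. This is an illustrative pipeline under stated assumptions, not a method prescribed by the dataset: the stop-word list is a tiny hypothetical sample, the tokeniser keeps only alphabetic tokens, and stemming is left out (a library stemmer such as NLTK's `PorterStemmer` is one option for that step).

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "on", "the", "to", "we", "you"}


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of a message with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def tokenise(text):
    """Lowercase and split into alphabetic tokens (apostrophes kept inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def preprocess(message):
    """Strip HTML, tokenise, and remove stop words."""
    tokens = tokenise(strip_html(message))
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Term-frequency vector as a token -> count mapping."""
    return Counter(tokens)


# Hypothetical spam-like message for demonstration.
sample = "<html><body><b>WIN</b> a FREE prize! Claim the prize now.</body></html>"
tokens = preprocess(sample)
print(tokens)
print(bag_of_words(tokens).most_common(1))
```

One caveat for raw mail: header values written in angle brackets, such as an address in a From line, may be treated as tags by the HTML parser and dropped, so it can be worth separating headers from the body before stripping HTML.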
Coverage
The dataset's region of coverage is global. Specific time ranges or demographic scopes for the raw mail messages are not detailed in the available information.
License
CC0
Who Can Use It
This dataset is particularly useful for:
- NLP Beginners: Individuals learning the fundamentals of natural language processing.
- Data Scientists: Developing and testing machine learning models for text classification and spam detection.
- Analytics Professionals: Understanding and applying text analytics techniques.
- Researchers: Studying patterns in email communication and developing advanced spam filtering algorithms.
Dataset Name Suggestions
- Spam Classification for Basic NLP
- Spam Detection Dataset: Message Classification
- Email Spam Message Classification
- Raw Mail NLP Dataset
- Text Spam Detector Data
Attributes
Original Data Source: Spam Classification for Basic NLP