Email Spam Message Classification Dataset

Data Science and Analytics

Tags and Keywords

Classification

NLP

Binary

Naive

About

This dataset consists of raw email messages intended for spam classification and natural language processing (NLP) pre-processing tasks. It provides a basis for developing models that identify unwanted email. The messages contain plain text, headers, and some HTML tags, so they exercise a range of NLP techniques. The dataset suits newcomers to NLP who want to practise tokenisation, stop word removal, stemming, and HTML tag parsing, and it can be used with common NLP libraries for vectorisation and analysis.
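The pre-processing steps named above (HTML tag parsing, tokenisation, stop word removal, stemming) can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: the stop-word list is a tiny made-up sample, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and email headers are not handled at all.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption); real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "you", "your"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of raw, with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps words on either side of a tag apart.
    return " ".join(parser.parts)


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw):
    """Strip HTML, lowercase, tokenise, drop stop words, then stem."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z0-9']+", text)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("<p>Claim your <b>FREE</b> prizes now</p>"))
# ['claim', 'free', 'prize', 'now']
```

An NLP library would normally replace each of these helpers; the sketch only shows the order in which the steps are usually applied.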

Columns

Category: Specifies whether a message is spam. This is a binary label, where '1' indicates spam and '0' indicates not spam.
Message: Contains the raw text of the email message, which may combine plain text, headers, and HTML tags.
File_Name: Provides a unique identifier for each message within the dataset.
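As a rough sketch of how these three columns might be read, the snippet below parses CSV text with Python's `csv` module. The column names come from the listing; the sample rows are made up, and the function name `load_messages` is hypothetical.

```python
import csv
import io


def load_messages(fileobj):
    """Read rows with Category, Message and File_Name columns into dicts."""
    records = []
    for row in csv.DictReader(fileobj):
        records.append(
            {
                "category": int(row["Category"]),  # 1 = spam, 0 = not spam
                "message": row["Message"],
                "file_name": row["File_Name"],
            }
        )
    return records


# Made-up sample rows standing in for the real file.
SAMPLE_CSV = (
    "Category,Message,File_Name\n"
    '1,"<html>Win a prize, click here</html>",msg_0001\n'
    '0,"Subject: lunch\nAre we still on for noon?",msg_0002\n'
)

records = load_messages(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["category"], records[1]["file_name"])
```

For the downloaded file, the same function should work on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` and the encoding are assumptions. Very long raw messages may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.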

Distribution

The dataset is provided as a CSV file containing 5,796 records: 3,900 messages labelled 'not spam' (Category 0) and 1,896 labelled 'spam' (Category 1). It contains 5,796 unique file names but only 5,625 unique message entries, so 171 records repeat a message that appears elsewhere in the file.
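The figures above imply a class imbalance and a number of repeated messages, both worth checking before training. The sketch below assumes records shaped as dicts with `category` and `message` keys (a hypothetical layout, not something the listing defines) and shows one way to count classes and drop repeated message texts.

```python
from collections import Counter


def summarise(records):
    """Return class counts, unique-message count and redundant-row count."""
    by_class = Counter(r["category"] for r in records)
    unique_messages = len({r["message"] for r in records})
    return {
        "total": len(records),
        "not_spam": by_class[0],
        "spam": by_class[1],
        "spam_share": by_class[1] / len(records) if records else 0.0,
        "unique_messages": unique_messages,
        "redundant_rows": len(records) - unique_messages,
    }


def drop_duplicate_messages(records):
    """Keep the first record for each distinct message text."""
    seen = set()
    kept = []
    for r in records:
        if r["message"] not in seen:
            seen.add(r["message"])
            kept.append(r)
    return kept


# Figures quoted in the listing, checked by arithmetic.
assert 3900 + 1896 == 5796
print(f"spam share: {1896 / 5796:.1%}")   # spam share: 32.7%
print(f"redundant rows: {5796 - 5625}")   # redundant rows: 171
```

Whether to drop the repeated messages is a judgment call; leaving them in can let the same text appear in both training and test splits.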

Usage

This dataset is ideal for a variety of applications, including:

Spam Detection and Classification: Training machine learning models to identify and filter spam emails.
Natural Language Processing (NLP) Pre-processing: Practising fundamental NLP steps such as tokenisation, removing stop words, stemming, and parsing HTML tags.
Text Classification: Building and evaluating text classification algorithms.
Library Compatibility: Working with NLP libraries for tasks like vectorisation and feature extraction.
Educational Purposes: A valuable resource for individuals entering the field of NLP and data science.
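To illustrate the spam detection use case, here is a minimal multinomial Naive Bayes classifier with add-one smoothing, written from scratch in standard-library Python. The listing does not prescribe a model; Naive Bayes is chosen here only as a common baseline, and the class name, tokeniser, and training messages are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase word tokens; deliberately minimal."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenise(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label), smoothed by one.
        total = sum(self.word_counts[label].values())
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for token in tokens:
            if token in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        tokens = tokenise(text)
        return max(self.class_counts, key=lambda label: self._log_score(tokens, label))


# Made-up training messages; 1 = spam, 0 = not spam, as in the Category column.
train_texts = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "lunch on monday at noon",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))          # 1
print(model.predict("agenda for the monday meeting"))  # 0
```

On the real data, the `Message` and `Category` columns would take the place of `train_texts` and `train_labels`, ideally after pre-processing and with a held-out split for evaluation.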

Coverage

Coverage is global. The listing does not state the time range or demographic scope of the messages.

License

CC0

Who Can Use It

This dataset is particularly useful for:

NLP Beginners: Learning the fundamentals of natural language processing.
Data Scientists: Developing and testing machine learning models for text classification and spam detection.
Analytics Professionals: Understanding and applying text analytics techniques.
Researchers: Studying patterns in email communication and developing advanced spam filtering algorithms.

Dataset Name Suggestions

Spam Classification for Basic NLP
Spam Detection Dataset: Message Classification
Email Spam Message Classification
Raw Mail NLP Dataset
Text Spam Detector Data

Attributes

Original Data Source: Spam Classification for Basic NLP

Listing Stats

Listed: 11/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free
