FREE DATASET LIBRARY

Verified Data Provider

£0

Synthetic NLP Email Classification Dataset

Name: Synthetic NLP Email Classification Dataset
Creator: FREE DATASET LIBRARY
Published: 5 June 2025 (Unix timestamp in milliseconds: 1749120652863)
Keywords: Synthetic, Email, Messages, Labeled

Fraud Detection & Risk Management

Free

About

This dataset provides 1,000 synthetic email messages, each labelled as either 'spam' or 'ham' (non-spam). It is intended for developing and evaluating text classification models with basic natural language processing (NLP) techniques, and it lets users build email classifiers without needing real personal email data.

Columns

email_text: This column contains the full text content of the synthetic email message.
label: This column indicates the classification of the email, with possible values being 'spam' for unwanted emails and 'ham' for legitimate emails.

Distribution

The dataset is provided as a CSV file containing 1,000 records. Each row holds one email message and its corresponding label. The file size is not stated.
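As a minimal sketch of reading a file with this layout, the snippet below uses only Python's standard library. The function name `load_emails` and the three sample rows are made up for illustration. The column names `email_text` and `label` and the label values come from the listing, but the actual file name, encoding, and quoting style are not documented, so treat those as assumptions.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["email_text", "label"]  # column names taken from the listing
VALID_LABELS = {"spam", "ham"}


def load_emails(handle):
    """Read (email_text, label) pairs from an open CSV file and validate them."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row_no, row in enumerate(reader, start=1):
        label = (row["label"] or "").strip()
        if label not in VALID_LABELS:
            raise ValueError(f"data row {row_no}: unexpected label {row['label']!r}")
        rows.append((row["email_text"], label))
    return rows


# Tiny made-up sample standing in for the real file (hypothetical content).
SAMPLE_CSV = '''email_text,label
"Win a free prize now, click here",spam
"Meeting moved to 3pm tomorrow",ham
"Cheap loans, approved instantly",spam
'''

emails = load_emails(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in emails)
print(len(emails), dict(label_counts))
```

With the downloaded file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. Checking that `len(emails) == 1000` is a quick way to confirm the record count stated above.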

Usage

This dataset suits a range of machine learning and NLP tasks, including:

Training a spam filter using algorithms such as Naive Bayes, Support Vector Machines (SVM), or Logistic Regression.
Practising essential text preprocessing techniques, including text cleaning, tokenisation, and TF-IDF (Term Frequency-Inverse Document Frequency) vectorisation.
Developing and testing email classification models in a controlled environment.
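To make the first two items concrete, here is a small sketch of text cleaning, tokenisation, and a multinomial Naive Bayes classifier with add-one smoothing, written from scratch with the standard library. It is not the listing's own method. The `tokenise` function, the `NaiveBayes` class, and the four training sentences are all hypothetical, and the sketch works on raw word counts instead of TF-IDF weights. In practice a library implementation (for example scikit-learn's `TfidfVectorizer` and `MultinomialNB`) would usually be a better choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenise(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = tokenise(text)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training examples; the real dataset would supply these columns.
train_texts = [
    "Win a free prize now",
    "Claim your free cash reward today",
    "Lunch at noon tomorrow?",
    "Please review the attached project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction_spam = model.predict("free prize reward")
prediction_ham = model.predict("project report for tomorrow")
print(prediction_spam, prediction_ham)
```

Scores are summed in log space to avoid underflow on longer emails, and words outside the training vocabulary are ignored so that they do not affect either class.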

Coverage

The dataset's coverage is global, with no regional restrictions. Although the data is synthetic, it aims to represent general email content relevant to spam detection. The listing rates data quality at 5 / 5. The dataset was listed on 5 June 2025.

License

CCO-BY-SA

Who Can Use It

This dataset is intended for a range of users interested in data science, machine learning, and natural language processing. Ideal users include:

Data scientists and machine learning engineers developing spam detection solutions.
Students and researchers learning about text classification and NLP fundamentals.
Developers building applications that require email filtering capabilities.

Dataset Name Suggestions

Spam-Ham Email Classifier Dataset
Synthetic Email Spam Detection Dataset
NLP Email Classification Data
Mail Spam Filter Training Set

Attributes

Original Data Source: Spam Mail Classifier Dataset

Listing Stats

Listed: 05/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
