Opendatabay APP

Synthetic NLP Email Classification Dataset

Fraud Detection & Risk Management

Tags and Keywords

  • Synthetic
  • Email
  • Messages
  • Labeled

Free

About

This dataset provides 1,000 synthetic email messages, each labelled as either 'spam' or 'ham' (non-spam). It is intended for developing and evaluating text classification models with basic natural language processing (NLP) techniques, and it lets users build email classifiers without needing real personal email data.

Columns

  • email_text: The full text of the synthetic email message.
  • label: The class of the email, either 'spam' (unwanted) or 'ham' (legitimate).
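
As a minimal sketch of how the two columns might be read and checked, the example below parses a CSV with Python's standard csv module and rejects rows whose label is not 'spam' or 'ham'. The sample rows and the load_emails helper are assumptions made for illustration; the listing does not show the real file's name or contents.

```python
import csv
import io

# Hypothetical sample rows; the real file's name and contents are not shown in the listing.
SAMPLE_CSV = """email_text,label
"Win a free prize now, click here",spam
"Meeting moved to 3pm tomorrow",ham
"Cheap loans approved instantly",spam
"""

EXPECTED_COLUMNS = ["email_text", "label"]
VALID_LABELS = {"spam", "ham"}


def load_emails(handle):
    """Read (email_text, label) pairs and reject anything off-schema."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        # Normalise case and whitespace before checking the label.
        label = (row["label"] or "").strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label {row['label']!r}")
        rows.append((row["email_text"], label))
    return rows


emails = load_emails(io.StringIO(SAMPLE_CSV))
print(len(emails), emails[0])
```

To try this on the downloaded file, pass an open file handle (opened with newline="" as the csv module recommends) instead of the in-memory sample.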

Distribution

The dataset is provided as a CSV file containing 1,000 records. Each row represents a unique email message with its label. The listing does not state the file size.
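
The stated record count and uniqueness can be checked after loading. The following is a small sketch, assuming the rows are already available as (email_text, label) pairs; the summarise helper and the sample rows are hypothetical and only illustrate the checks.

```python
from collections import Counter


def summarise(rows, expected_rows=None):
    """Return basic integrity stats for a list of (email_text, label) pairs."""
    texts = [text for text, _ in rows]
    return {
        "rows": len(rows),
        "row_count_ok": expected_rows is None or len(rows) == expected_rows,
        "duplicate_texts": len(texts) - len(set(texts)),
        "label_counts": dict(Counter(label for _, label in rows)),
    }


# Hypothetical rows standing in for the real file.
rows = [
    ("Win a free prize now", "spam"),
    ("Lunch on Friday?", "ham"),
    ("Win a free prize now", "spam"),  # deliberate duplicate
    ("Quarterly report attached", "ham"),
]
summary = summarise(rows, expected_rows=4)
print(summary)
```

For the real file, expected_rows=1000 would match the record count given above, and a non-zero duplicate_texts value would indicate repeated messages.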

Usage

This dataset suits machine learning and NLP tasks such as:
  • Training a spam filter using algorithms such as Naive Bayes, Support Vector Machines (SVM), or Logistic Regression.
  • Practising essential text preprocessing techniques, including text cleaning, tokenisation, and TF-IDF (Term Frequency-Inverse Document Frequency) vectorisation.
  • Developing and testing email classification models in a controlled environment.
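
The techniques listed above can be sketched in plain Python without external libraries. The example below shows a simple tokeniser, a textbook TF-IDF weighting, and a multinomial Naive Bayes classifier with add-one smoothing. It is an illustration under stated assumptions rather than a reference implementation: the training messages are invented, the names tokenise, tf_idf and NaiveBayes are hypothetical, and the classifier works on raw word counts while the TF-IDF weights are computed separately to show the formula.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Uses raw term frequency times log(N / df), the textbook variant;
    libraries such as scikit-learn apply extra smoothing and normalisation.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total_docs)  # log prior
            for term in doc:
                if term in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[term] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages, not taken from the dataset.
train = [
    ("Win a free prize now", "spam"),
    ("Free cash, claim your prize", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Please review the attached agenda", "ham"),
]
docs = [tokenise(text) for text, _ in train]
labels = [label for _, label in train]

weights = tf_idf(docs)
model = NaiveBayes().fit(docs, labels)
prediction = model.predict(tokenise("Claim your free prize"))
print(prediction)
```

On the full dataset, a held-out test split would be needed to measure accuracy; four training messages are only enough to show the mechanics.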

Coverage

The listing gives the dataset's region as global, so it is not tied to any specific region. Although the data is synthetic, it aims to represent the general range of email content relevant to spam detection. The listing rates its data quality at 5 / 5. The dataset was listed on 5 June 2025.

License

CCO-BY-SA

Who Can Use It

This dataset is intended for users working in data science, machine learning, and natural language processing, including:
  • Data scientists and machine learning engineers developing spam detection solutions.
  • Students and researchers learning about text classification and NLP fundamentals.
  • Developers building applications that require email filtering capabilities.

Dataset Name Suggestions

  • Spam-Ham Email Classifier Dataset
  • Synthetic Email Spam Detection Dataset
  • NLP Email Classification Data
  • Mail Spam Filter Training Set

Attributes

Original Data Source: Spam Mail Classifier Dataset

Listing Stats

  • Views: 2
  • Downloads: 2
  • Listed: 05/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
