Synthetic NLP Email Classification Dataset
Fraud Detection & Risk Management
Free
About
This dataset provides 1,000 synthetic email messages, each clearly labelled as either 'spam' or 'ham' (non-spam). Its primary purpose is to assist users in developing and evaluating text classification models, leveraging fundamental natural language processing (NLP) techniques. It offers a practical resource for those looking to build robust email classification systems without the need for real personal email data.
Columns
- email_text: This column contains the full text content of the synthetic email message.
- label: This column indicates the classification of the email, with possible values being 'spam' for unwanted emails and 'ham' for legitimate emails.
Distribution
The dataset is typically provided as a CSV file. It comprises 1,000 records, with each row holding one email message and its corresponding label. The file size is not specified.
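As a rough illustration of that layout, the sketch below reads a CSV with the two documented columns and checks that every label is 'spam' or 'ham'. The sample rows and the `load_emails` helper are hypothetical, written only for this example; the real file name is not given in the listing, so the code reads from an in-memory string instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real file. With the actual
# dataset you would pass open(<path>, newline="", encoding="utf-8") instead.
SAMPLE_CSV = """email_text,label
"Win a free prize now, click here",spam
"Meeting moved to 3pm, see agenda attached",ham
"Cheap loans approved instantly",spam
"""

EXPECTED_COLUMNS = {"email_text", "label"}
VALID_LABELS = {"spam", "ham"}


def load_emails(handle):
    """Read (email_text, label) pairs from a CSV handle, validating both columns."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    # Record numbers start at 2 because the header is record 1.
    for record_no, row in enumerate(reader, start=2):
        label = row["label"].strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"record {record_no}: unexpected label {label!r}")
        rows.append((row["email_text"], label))
    return rows


rows = load_emails(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

Counting labels up front is a cheap way to see the class balance before training; the listing does not state the spam-to-ham ratio.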
Usage
This dataset is ideally suited for a variety of machine learning and NLP applications, including:
- Training a spam filter using algorithms such as Naive Bayes, Support Vector Machines (SVM), or Logistic Regression.
- Practising essential text preprocessing techniques, including text cleaning, tokenisation, and TF-IDF (Term Frequency-Inverse Document Frequency) vectorisation.
- Developing and testing email classification models in a controlled environment.
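The first two uses above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: it tokenises text with a simple regular expression and trains a multinomial Naive Bayes classifier on raw word counts with add-one smoothing, rather than on TF-IDF weights. The training sentences and all function names are made up for the example; in practice a library such as scikit-learn would usually handle both the vectorisation and the model.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training examples, not taken from the dataset.
TRAIN = [
    ("Win a free prize now", "spam"),
    ("Free cash offer, claim your prize", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Please review the project report before the meeting", "ham"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, "claim your free prize"))
print(predict(model, "project meeting on Monday"))
```

With the full 1,000 rows, a held-out test split would be needed to judge accuracy; predictions on a four-sentence toy set say nothing about real performance.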
Coverage
The dataset's coverage is global, with no specific regional limitations. Although the data is synthetic, it aims to represent the general range of email content relevant to spam detection. The listing rates the data quality as high. The dataset was listed on 5 June 2025.
License
CCO-BY-SA
Who Can Use It
This dataset is intended for a range of users interested in data science, machine learning, and natural language processing. Ideal users include:
- Data scientists and machine learning engineers developing spam detection solutions.
- Students and researchers learning about text classification and NLP fundamentals.
- Developers building applications that require email filtering capabilities.
Dataset Name Suggestions
- Spam-Ham Email Classifier Dataset
- Synthetic Email Spam Detection Dataset
- NLP Email Classification Data
- Mail Spam Filter Training Set
Attributes
Original Data Source: Spam Mail Classifier Dataset