Opendatabay APP

Email Threat Detection Dataset

Data Science and Analytics

Tags and Keywords

Phishing, Email, Detection, Security, NLP

Email Threat Detection Dataset on the Opendatabay data marketplace

Free

About

This dataset was assembled for the purpose of studying phishing email tactics. It brings together emails from various origins to create a valuable resource for analysis. The dataset allows researchers to examine the content of phishing emails and the context in which they are sent, with the aim of improving detection methods.

Columns

  • sender: This column contains the email address of the sender.
  • receiver: This column holds the email address of the recipient.
  • date: This column provides the date and time the email was sent.
  • subject: This column includes the subject line of the email.
  • body: This column contains the main text content of the email.
  • label: This column indicates the classification of the email, typically as 'spam' (phishing) or 'legitimate'.
  • urls: This column lists any URLs found within the email content.
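As a minimal sketch of how the columns above could be read, the following uses Python's standard csv module and checks the header before loading any rows. The load_emails helper and the two-row sample are hypothetical, and the 'spam' / 'legitimate' label strings are taken from the column description; the actual file may encode labels differently.

```python
import csv
import io

# Column names as listed in the Columns section.
EXPECTED_COLUMNS = ["sender", "receiver", "date", "subject", "body", "label", "urls"]


def load_emails(fileobj):
    """Read email records from a CSV file object, checking the header first.

    Hypothetical helper: returns one dict per row, keyed by column name.
    """
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical two-row sample standing in for the real file.
SAMPLE = io.StringIO(
    "sender,receiver,date,subject,body,label,urls\n"
    'a@example.com,b@example.org,"Tue, 05 Aug 2008 16:31:02 -0700",'
    "Verify your account,Click here now,spam,http://example.com/login\n"
    'c@example.org,d@example.org,"Wed, 06 Aug 2008 09:12:45 +0000",'
    "Meeting notes,See attached notes,legitimate,\n"
)

emails = load_emails(SAMPLE)
for row in emails:
    print(row["label"], "|", row["subject"])
```

For the real file, the same function could be given a file opened with open(path, newline="", encoding=...), where the right encoding is something to verify against the download.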

Distribution

The dataset is typically provided in CSV format, for example as CEAS_08.csv, a file of approximately 67.9 MB. It is a combined resource that draws on several earlier datasets: Enron, Ling, CEAS, Nazario, Nigerian Fraud, and SpamAssassin. The final dataset contains approximately 82,500 emails: 42,891 spam and 39,595 legitimate, which sum to 82,486. Each record includes core content (subject line, body text, and classification label) as well as broader context (sender, recipient, and date).
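The class counts quoted above can be checked with a little arithmetic. This sketch computes the total and the share of each class. The class_shares helper is hypothetical, and the calculation assumes the two classes cover every record.

```python
from collections import Counter


def class_shares(counts):
    """Return each class's fraction of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts as reported in the listing.
reported = Counter({"spam": 42_891, "legitimate": 39_595})

total = sum(reported.values())
shares = class_shares(reported)

print(total)  # 82486
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The classes are close to balanced, so a classifier that always predicts the majority class would score about 52% accuracy on this data. That is a useful floor when judging a trained model.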

Usage

This dataset is ideal for applications aimed at understanding and combating phishing. It can be used to:
  • Study evolving phishing email tactics.
  • Develop and enhance automated phishing email detection methods.
  • Conduct analysis on email content and the contextual factors surrounding email transmission.
  • Train models for natural language processing (NLP), deep learning, and artificial intelligence, particularly for binary classification tasks.
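To illustrate the binary classification use case, here is a small bag-of-words Naive Bayes baseline written with only the standard library. It is a sketch, not the method used by the dataset's authors: the tokenizer, the NaiveBayes class, and the four toy training texts are all hypothetical, and a real experiment would train on the subject and body columns with a held-out test split.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            n_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][token]
                    score += math.log((count + 1) / (n_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy examples standing in for the subject/body text of real rows.
train_texts = [
    "verify your account now click the link",
    "urgent your password expires click here",
    "meeting notes attached for tomorrow",
    "lunch tomorrow with the project team",
]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("click to verify your password"))
print(model.predict("notes for the project meeting"))
```

A baseline like this gives a reference point before moving to the deep learning models mentioned above.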

Coverage

Email dates range from 1980-01-04 to 2100-05-28, with most falling between 2004 and 2010. Dates in the future, such as 2100-05-28, cannot be genuine send times and most likely come from malformed or forged headers, so the extremes of this range should be treated with caution. No specific geographic or demographic scope is stated, as the dataset is compiled from a variety of sources, including general email collections and specific fraud datasets.
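Given the wide date range above, time-based analysis may benefit from filtering out implausible dates first. The sketch below assumes the date column holds RFC 2822 style header strings, which is not confirmed by the listing. The plausible_year helper and its default 2004 to 2010 window are hypothetical choices based on the most frequent dates mentioned.

```python
from email.utils import parsedate_to_datetime


def plausible_year(date_header, first=2004, last=2010):
    """Return True if the header parses and its year lies in [first, last].

    Assumes RFC 2822 style date strings; unparseable values count as
    implausible rather than raising.
    """
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False
    return first <= sent.year <= last


# Hypothetical date values illustrating the three cases.
examples = [
    "Tue, 05 Aug 2008 16:31:02 -0700",  # inside the common range
    "28 May 2100 10:00:00 +0000",       # far in the future
    "not a date",                       # unparseable
]
for value in examples:
    print(repr(value), plausible_year(value))
```

Whether to drop such rows or only exclude them from time-based features depends on the analysis, since the email text may still be usable.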

License

CC BY-SA 4.0

Who Can Use It

This dataset is primarily intended for researchers and data scientists, including:
  • Cybersecurity Researchers: Analysing and defending against email-based threats.
  • Machine Learning Engineers: Developing and training machine learning models for email classification.
  • Natural Language Processing Specialists: Extracting insights and patterns from email text.
  • Academics: Conducting studies on phishing tactics and detection methodologies.

Dataset Name Suggestions

  • Phishing Email Dataset
  • Phish No More Email Collection
  • Email Threat Detection Dataset
  • Multi-Source Phishing Corpus

Attributes

Original Data Source: Email Threat Detection Dataset

Listing Stats

VIEWS

1

DOWNLOADS

1

LISTED

29/07/2025

REGION

GLOBAL

Universal Data Quality Score (UDQS)

5 / 5

VERSION

1.0
