Email Threat Detection Dataset
Data Science and Analytics
Free
About
This dataset was assembled for the purpose of studying phishing email tactics. It brings together emails from various origins to create a valuable resource for analysis. The dataset allows researchers to examine the content of phishing emails and the context in which they are sent, with the aim of improving detection methods.
Columns
- sender: This column contains the email address of the sender.
- receiver: This column holds the email address of the recipient.
- date: This column provides the date and time the email was sent.
- subject: This column includes the subject line of the email.
- body: This column contains the main text content of the email.
- label: This column indicates the classification of the email, typically as 'spam' (phishing) or 'legitimate'.
- urls: This column lists any URLs found within the email content.
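A minimal sketch of reading records with these columns using Python's standard csv module. The two sample rows and the literal label strings are invented for illustration. The actual file may encode label and urls differently (for example as integers), so it is worth inspecting the header and a few rows first.

```python
import csv
import io

# Column names as listed above.
EXPECTED_COLUMNS = ["sender", "receiver", "date", "subject", "body", "label", "urls"]

# Hypothetical sample rows standing in for the real CSV file.
SAMPLE_CSV = """sender,receiver,date,subject,body,label,urls
alice@example.com,bob@example.org,2008-08-05 16:31:02,Meeting notes,See the attached agenda.,legitimate,
promo@example.net,bob@example.org,2008-08-06 02:14:55,You won!,Claim your prize at http://example.net/win,spam,http://example.net/win
"""

def read_emails(handle):
    """Yield one dict per email after checking that all expected columns exist."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader

emails = list(read_emails(io.StringIO(SAMPLE_CSV)))
print(len(emails), emails[1]["label"])
```

To read a real file, the in-memory handle could be swapped for something like open("CEAS_08.csv", newline="", encoding="utf-8"). The encoding is an assumption, and very long email bodies may require raising the limit with csv.field_size_limit.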
Distribution
The dataset is typically provided in CSV format, such as CEAS_08.csv, which has a file size of approximately 67.9 MB. It is a combined resource, drawing information from initial datasets like Enron, Ling, CEAS, Nazario, Nigerian Fraud, and SpamAssassin. The final dataset contains approximately 82,500 emails, comprising 42,891 spam emails and 39,595 legitimate emails. Each email record includes core content like subject lines, email body text, and classification labels, as well as broader context such as sender, recipient, and date information.
Usage
This dataset is ideal for applications aimed at understanding and combating phishing. It can be used to:
- Study evolving phishing email tactics.
- Develop and enhance automated phishing email detection methods.
- Conduct analysis on email content and the contextual factors surrounding email transmission.
- Train models for natural language processing (NLP), deep learning, and artificial intelligence, particularly for binary classification tasks.
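As an illustration of the binary classification use case, here is a small baseline sketch: a multinomial Naive Bayes classifier written with only the Python standard library. It is not a method prescribed by the dataset, the training sentences are invented toy examples, and the tokenizer is deliberately simplistic. A real experiment would train on the subject and body columns and evaluate on a held-out split.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of token counts
        self.doc_counts = Counter(labels)  # label -> number of training emails
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples, not rows from the dataset.
train_texts = [
    "verify your account password now",
    "urgent click this link to claim your prize",
    "meeting agenda for monday attached",
    "please review the quarterly report draft",
]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("click to verify your password")
print(prediction)
```

Because the spam and legitimate classes are roughly balanced in this dataset, plain accuracy is a reasonable first metric for a baseline like this, though precision and recall on the spam class are usually more informative for detection work.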
Coverage
The email dates range from 1980-01-04 to 2100-05-28, though most fall between 2004 and 2010. Dates far outside that period, including those in the future, are likely artifacts of malformed or forged headers rather than real send times. No specific geographic or demographic scope is stated, as the dataset is compiled from a variety of sources, including general email collections and specific fraud datasets.
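Given the wide date range, it may help to restrict analysis to a plausible window. The sketch below assumes the date column holds ISO-style strings such as "2008-08-05 16:31:02", which may not match the actual file. The 2004 to 2010 window mirrors the period where most dates fall, and the sample values are invented.

```python
from datetime import datetime

def parse_date(value):
    """Return a naive datetime for an ISO-style string, or None if unparseable."""
    try:
        # Any timezone offset is dropped so comparisons stay between naive values.
        return datetime.fromisoformat(value.strip()).replace(tzinfo=None)
    except ValueError:
        return None

def in_window(value, start, end):
    """True when the parsed date falls within [start, end]."""
    parsed = parse_date(value)
    return parsed is not None and start <= parsed <= end

# Window based on the period where most dates are reported to fall.
START = datetime(2004, 1, 1)
END = datetime(2010, 12, 31, 23, 59, 59)

# Hypothetical date strings, including out-of-range and malformed values.
sample_dates = [
    "1980-01-04 00:00:00",
    "2008-08-05 16:31:02",
    "2100-05-28 10:00:00",
    "not a date",
]
kept = [d for d in sample_dates if in_window(d, START, END)]
print(kept)
```

If the real column uses email-header date strings instead, email.utils.parsedate_to_datetime from the standard library is one possible replacement for the parsing step.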
License
CC BY-SA 4.0
Who Can Use It
This dataset is primarily intended for:
- Cybersecurity Researchers: Analysing and defending against email-based threats.
- Machine Learning Engineers: Developing and training machine learning models for email classification.
- Natural Language Processing Specialists: Extracting insights and patterns from email text.
- Academics: Conducting studies on phishing tactics and detection methodologies.
Dataset Name Suggestions
- Phishing Email Dataset
- Phish No More Email Collection
- Email Threat Detection Dataset
- Multi-Source Phishing Corpus
Attributes
Original Data Source: Email Threat Detection Dataset