Newsletter Spam URL Classification
Data Science and Analytics
About
This dataset contains approximately 87,500 URLs, each labeled as spam or not spam, which makes it well suited to developing binary classification models. About one-third of the URLs are labeled as spam. The links come from more than 100 newsletters, which are parsed every half hour. A link is flagged as spam programmatically if it appears three or more times within a single newsletter or if it looks like a subscribe/unsubscribe URL. The dataset was created by The Pudding.
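The labeling rule described above can be sketched in a few lines of Python. This is an illustration only, not the code used to build the dataset: the function name `flag_spam`, the constant names, and the substring test for a "likely" subscribe/unsubscribe URL are all assumptions.

```python
from collections import Counter

# Assumption: a URL "likely" points at a subscribe/unsubscribe page if it
# contains this substring ("unsubscribe" also contains "subscribe").
SUBSCRIBE_HINT = "subscribe"
# From the description: three or more occurrences within one newsletter.
REPEAT_THRESHOLD = 3


def flag_spam(newsletter_links):
    """Map each distinct URL in a single newsletter to a spam flag."""
    counts = Counter(newsletter_links)
    return {
        url: counts[url] >= REPEAT_THRESHOLD or SUBSCRIBE_HINT in url.lower()
        for url in counts
    }


# Hypothetical links taken from one newsletter issue.
links = [
    "https://example.com/sale",
    "https://example.com/sale",
    "https://example.com/sale",
    "https://example.com/article",
    "https://example.com/unsubscribe?id=1",
]
labels = flag_spam(links)
```

Because the flag is derived from simple heuristics, it is best read as "repeated or subscription-related link" rather than as a manually verified spam label.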
Columns
- url: The specific URL string.
- is_spam: A boolean value indicating whether the URL is classified as spam (true) or not (false).
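As a minimal sketch, the two columns can be read with Python's standard `csv` module. It assumes the file has a header row naming `url` and `is_spam`, and that the boolean is stored as the text `true`/`false` in any letter case; the helper name `load_rows` and the sample rows are hypothetical.

```python
import csv
import io


def load_rows(fileobj):
    """Read (url, is_spam) pairs from an open CSV file object."""
    rows = []
    for record in csv.DictReader(fileobj):
        # Assumption: the label is serialized as text such as "True" or "false".
        label = record["is_spam"].strip().lower() == "true"
        rows.append((record["url"], label))
    return rows


# Small in-memory sample standing in for the real file.
sample = io.StringIO(
    "url,is_spam\n"
    "https://example.com/unsubscribe,True\n"
    "https://example.com/article,False\n"
)
rows = load_rows(sample)

# Share of rows labeled as spam in the sample.
spam_share = sum(label for _, label in rows) / len(rows)
```

For the real data, the same function could be called on a file opened with `open("url_spam_classification.csv", newline="", encoding="utf-8")`; the encoding is an assumption.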
Distribution
The dataset is provided as a single CSV file (url_spam_classification.csv) with a size of 11.58 MB. It contains two columns and approximately 148,000 records.
Usage
Ideal applications for this dataset include training and evaluating machine learning models for spam detection and content filtering, as well as cybersecurity research. It can be used to build a binary classification model that automatically identifies and flags unwanted URLs.
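A binary classifier needs numeric or boolean inputs, so a first step is usually to turn each URL string into features. The sketch below shows one possible feature set using only the standard library; the function name `url_features` and the choice of features are assumptions, not part of the dataset.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Derive a few simple features from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_host_dots": host.count("."),
        # Number of non-empty path segments, e.g. /a/b -> 2.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "has_query": bool(parts.query),
        # Mirrors the subscribe/unsubscribe labeling rule (see note below).
        "mentions_subscribe": "subscribe" in url.lower(),
    }


# Hypothetical example URL.
features = url_features("https://news.example.com/account/unsubscribe?id=42")
```

Since part of the spam label comes from a subscribe/unsubscribe rule, a feature such as `mentions_subscribe` may largely reproduce that rule; evaluating a model with and without it can help show what the model has actually learned. With roughly one-third of URLs labeled spam, precision and recall are likely more informative than accuracy alone.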
Coverage
The dataset consists of URLs collected from a wide variety of internet newsletters without specific geographical or demographic limitations. The data represents a snapshot of links appearing in these newsletters over a period of time, and it is not expected to be updated.
License
CC0: Public Domain
Who Can Use It
- Data Scientists and Machine Learning Engineers: Can use this dataset to build, train, and validate spam URL classification models.
- Cybersecurity Analysts: Can leverage this data for research into malicious link patterns and to enhance security protocols.
- Software Developers: Can integrate models trained on this data into applications to filter spam content and protect users.
Dataset Name Suggestions
- Newsletter Spam URL Classification
- Spam vs. Ham URL Links
- URL Spam Detection Dataset
- Binary Classification of Web Links
- Spam URL Collection
Attributes
Original Data Source: Newsletter Spam URL Classification