Malicious URL Classifier Dataset
NLP / Natural Language Processing
Free
About
This dataset addresses the critical cybersecurity threat posed by malicious URLs or websites. These URLs host unwanted content such as spam, phishing attempts, and drive-by downloads, which can ensnare unsuspecting users. Victims may suffer monetary losses, theft of private information, and malware installations, leading to billions of dollars in losses annually. The dataset has been curated to provide a substantial collection of examples of malicious URLs, facilitating the development of machine learning models capable of identifying and stopping these threats proactively before they infect computer systems or spread across the internet.
Columns
- url: This column contains the actual URL string. There are 641,119 unique URL values out of 651,191 total URLs in the dataset, so about 98.5% of the entries are distinct and roughly 10,000 rows repeat a URL that appears elsewhere. No entries are missing or invalid.
- type: This column categorises the class of URL. There are four unique types of URLs represented.
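The column summary above can be reproduced with a short script. The following is a minimal sketch using only the Python standard library; the function name `summarize_urls` and the sample rows are illustrative assumptions, not part of the dataset, and it assumes the CSV header uses the column names `url` and `type` as listed above.

```python
import csv
import io


def summarize_urls(csv_file):
    """Count total rows, distinct URLs, and distinct labels in a url/type CSV."""
    reader = csv.DictReader(csv_file)
    total = 0
    unique_urls = set()
    types = set()
    for row in reader:
        total += 1
        unique_urls.add(row["url"])
        types.add(row["type"])
    return {"total": total, "unique_urls": len(unique_urls), "types": sorted(types)}


# Made-up sample rows for illustration only (not taken from the dataset).
sample = io.StringIO(
    "url,type\n"
    "example.org/index.html,benign\n"
    "example.org/index.html,benign\n"
    "http://login.example.net/verify,phishing\n"
    "http://files.example.net/setup.exe,malware\n"
    "http://www.example.com/index.php?option=x,defacement\n"
)
summary = summarize_urls(sample)
print(summary)
```

To run the same check on the real file, the in-memory sample could be replaced with `open("malicious_phish.csv", newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.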
Distribution
The dataset is provided as a single CSV file, malicious_phish.csv, with a file size of 45.66 MB. It contains a total of 651,191 URLs. The distribution of URL types within the dataset is as follows:
- Benign (safe) URLs: 428,103 entries, accounting for 66% of the dataset.
- Defacement URLs: 96,457 entries, representing 15% of the dataset.
- Phishing URLs: 94,111 entries, about 14% of the dataset.
- Malware URLs: 32,520 entries, about 5% of the dataset.
Together, the phishing and malware URLs make up the remaining 19% of the dataset (126,631 entries).
The dataset was assembled from five distinct sources: the ISCX-URL-2016 dataset for various URL types, the Malware Domain Blacklist dataset to add phishing and malware URLs, a Faizan Git repository for additional benign URLs, and the Phishtank and PhishStorm datasets for more phishing URLs. URLs were initially collected into separate data frames before being merged, retaining only the URL and its classification.
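The merge step and the class shares described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' actual pipeline: the helper names `merge_sources` and `class_shares`, the toy source rows, and the lowercase label strings are all hypothetical. Only the four published counts come from the description above.

```python
from collections import Counter


def merge_sources(*sources):
    """Concatenate row lists, keeping only the url and type fields."""
    return [{"url": r["url"], "type": r["type"]} for src in sources for r in src]


def class_shares(counts):
    """Return each label's share of the total, as a percentage with one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy sources with extra columns that the merge discards (made-up rows).
source_a = [{"url": "example.org/home", "type": "benign", "rank": 1}]
source_b = [
    {"url": "http://login.example.net/verify", "type": "phishing", "reported": "n/a"},
    {"url": "http://files.example.net/setup.exe", "type": "malware", "reported": "n/a"},
]
merged = merge_sources(source_a, source_b)
toy_counts = Counter(row["type"] for row in merged)

# Counts as published in the distribution above.
published_counts = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}
published_total = sum(published_counts.values())
published_shares = class_shares(published_counts)
print(published_total, published_shares)
```

The four published counts sum to the stated 651,191 rows, and the computed shares round to the percentages given in the list. Because benign URLs outnumber malware URLs by roughly 13 to 1, a model trained on this data may benefit from class weighting or stratified sampling.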
Usage
This dataset is ideal for developing machine learning-based models aimed at identifying malicious URLs. It can be used to train algorithms that predict and classify URLs to prevent cyber-attacks, including spam, phishing scams, and malware distribution. The dataset supports research and development in cybersecurity for proactive threat detection and mitigation.
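Since the dataset contains only the raw URL string and its label, a classifier typically needs features derived from the string itself. The sketch below shows one possible set of lexical features using the Python standard library. The function name `lexical_features` and the choice of features are assumptions made for illustration; the dataset description does not prescribe any feature set or model.

```python
import re
from urllib.parse import urlsplit

SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def lexical_features(url):
    """Derive simple numeric and boolean features from a URL string."""
    # Raw URL strings may lack a scheme; prefixing "//" makes urlsplit
    # treat the first segment as the host rather than as part of the path.
    candidate = url if SCHEME_RE.match(url) else "//" + url
    try:
        parts = urlsplit(candidate)
        scheme, host, path = parts.scheme, parts.hostname or "", parts.path
    except ValueError:
        # Malformed input (for example, unbalanced IPv6 brackets).
        scheme, host, path = "", "", ""
    return {
        "length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": bool(IPV4_RE.match(host)),
        "uses_https": scheme.lower() == "https",
        "path_depth": len([seg for seg in path.split("/") if seg]),
    }


# Made-up example URLs for illustration only.
ip_url = lexical_features("http://192.168.0.1/login/verify.php?id=1")
bare_url = lexical_features("example.com/a/b/c")
print(ip_url)
print(bare_url)
```

The resulting dictionaries could be fed to any tabular classifier. Which features actually separate the four classes in this dataset would need to be checked empirically.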
Coverage
The geographic scope, exact time range, and demographic coverage of the URLs included in the dataset are not specified.
License
CC0: Public Domain
Who Can Use It
This dataset is suitable for:
- Machine learning engineers and data scientists developing and training URL classification models.
- Cybersecurity researchers studying malicious URL patterns and defensive strategies.
- Academic institutions for educational purposes in computer science and internet security.
- Security analysts looking to understand and mitigate online threats.
Dataset Name Suggestions
- Malicious URL Classifier Dataset
- Cyber Threat URL Collection
- Phishing and Malware URL Data
- Secure URL Detection Dataset
- Web Security URL Identifier
Attributes
Original Data Source: Malicious URL Classifier Dataset