Malicious URL Classifier Dataset

NLP / Natural Language Processing

Tags and Keywords

Malicious, URLs, Phishing, Malware, Cybersecurity

Malicious URL Classifier Dataset on Opendatabay data marketplace

Free

About

This dataset addresses the critical cybersecurity threat posed by malicious URLs or websites. These URLs host unwanted content such as spam, phishing attempts, and drive-by downloads, which can ensnare unsuspecting users. Victims may suffer monetary losses, theft of private information, and malware installations, leading to billions of dollars in losses annually. The dataset has been curated to provide a substantial collection of examples of malicious URLs, facilitating the development of machine learning models capable of identifying and stopping these threats proactively before they infect computer systems or spread across the internet.

Columns

  • url: The URL string itself. The dataset contains 651,191 URLs, of which 641,119 are unique (about 98.5%), so roughly 10,000 entries are duplicates. All 651,191 entries are valid.
  • type: The class of the URL. There are four classes: benign, defacement, phishing, and malware.
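The column statistics above can be checked with a short script. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with the column names url and type, and the sample rows and label spellings are invented for illustration rather than taken from the file.

```python
import csv
import io
from collections import Counter


def summarize_urls(csv_file):
    """Count total rows, distinct URLs, and rows per class in a url,type CSV."""
    reader = csv.DictReader(csv_file)
    total = 0
    unique_urls = set()
    type_counts = Counter()
    for row in reader:
        total += 1
        unique_urls.add(row["url"])
        type_counts[row["type"]] += 1
    return {"total": total, "unique": len(unique_urls), "types": dict(type_counts)}


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = io.StringIO(
    "url,type\n"
    "example.com/index.html,benign\n"
    "example.com/index.html,benign\n"
    "login-example.test/verify,phishing\n"
)
summary = summarize_urls(sample)
print(summary)
```

To run it on the real file, pass an open file handle instead, for example open("malicious_phish.csv", newline="", encoding="utf-8"). The encoding is an assumption and may need adjusting.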

Distribution

The dataset is provided in a CSV file format, specifically malicious_phish.csv, with a file size of 45.66 MB. It contains a total of 651,191 URLs. The distribution of URL types within the dataset is as follows:
  • Benign (safe) URLs: 428,103 entries, accounting for 66% of the dataset.
  • Defacement URLs: 96,457 entries, representing 15% of the dataset.
  • Phishing URLs: 94,111 entries, representing 14% of the dataset.
  • Malware URLs: 32,520 entries, representing 5% of the dataset.

Together, the phishing and malware URLs make up the remaining 19% of the dataset (126,631 entries).

The dataset was assembled from five distinct sources: the ISCX-URL-2016 dataset for various URL types, the Malware Domain Blacklist dataset to add phishing and malware URLs, a Faizan Git repository for additional benign URLs, and the Phishtank and PhishStorm datasets for more phishing URLs. URLs were first collected into separate data frames and then merged, retaining only the URL and its classification.
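The class shares follow directly from the entry counts listed above. As a quick arithmetic check, the sketch below recomputes them. The dictionary keys are illustrative labels, not necessarily the exact strings stored in the type column.

```python
# Entry counts as stated in the listing; the key names are illustrative.
counts = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}

total = sum(counts.values())

# Share of each class as a percentage of all entries, to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Phishing and malware combined, as an entry count and as a percentage.
threat_entries = counts["phishing"] + counts["malware"]
threat_share = round(100 * threat_entries / total, 1)

print(total, shares, threat_entries, threat_share)
```

The counts sum to 651,191, and the shares round to 66%, 15%, 14%, and 5%. Benign URLs outnumber malware URLs by roughly 13 to 1, so a model trained on this data may benefit from stratified splitting or class weighting.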

Usage

This dataset is ideal for developing machine learning-based models aimed at identifying malicious URLs. It can be used to train algorithms that predict and classify URLs to prevent cyber-attacks, including spam, phishing scams, and malware distribution. The dataset supports research and development in cybersecurity for proactive threat detection and mitigation.
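Because the dataset holds only the URL string and its class, a model typically needs features derived from the string itself. The sketch below shows one possible set of lexical features, using only the Python standard library. The function name lexical_features and the chosen features are assumptions for illustration, not a method described by the dataset's authors.

```python
import re
from urllib.parse import urlsplit

# Matches hosts written as a dotted IPv4 address, e.g. 192.168.0.1.
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
# Matches a leading scheme such as "http://" or "ftp://".
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def lexical_features(url):
    """Return simple string-based features for one URL (hypothetical feature set)."""
    # Entries without a scheme get "//" prepended so urlsplit finds the host.
    candidate = url if SCHEME_RE.match(url) else "//" + url
    try:
        parts = urlsplit(candidate)
        host, scheme, path = parts.hostname or "", parts.scheme, parts.path
    except ValueError:
        # urlsplit rejects some malformed inputs; fall back to empty parts.
        host, scheme, path = "", "", ""
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(url.count(ch) for ch in "@-_?=&%"),
        "has_ip_host": bool(IPV4_RE.match(host)),
        "uses_https": scheme == "https",
        "path_depth": len([seg for seg in path.split("/") if seg]),
    }


features = lexical_features("http://192.168.0.1/login.php?id=1")
print(features)
```

Each feature dictionary can be turned into a numeric row and passed, together with the type label, to any standard classifier. Which features are actually useful is an empirical question that this sketch does not answer.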

Coverage

The geographic scope, exact time range, and demographic coverage of the URLs in the dataset are not specified.

License

CC0: Public Domain

Who Can Use It

This dataset is suitable for:
  • Machine learning engineers and data scientists developing and training URL classification models.
  • Cybersecurity researchers studying malicious URL patterns and defensive strategies.
  • Academic institutions for educational purposes in computer science and internet security.
  • Security analysts looking to understand and mitigate online threats.

Dataset Name Suggestions

  • Malicious URL Classifier Dataset
  • Cyber Threat URL Collection
  • Phishing and Malware URL Data
  • Secure URL Detection Dataset
  • Web Security URL Identifier

Attributes

Listing Stats

LISTED

14/07/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
