Global Phishing URL Detection Dataset
Data Science and Analytics
Free
About
A collection of 235,795 URLs (134,850 legitimate and 100,945 phishing) built for training machine learning models for cybersecurity. The URLs were the most recent available when the 2024 study was conducted, and the features are extracted from each URL and from the source code of the corresponding webpage. Alongside the raw data, the dataset provides derived metrics such as character probability and title match scores, and it serves as the foundation for the PhiUSIIL phishing detection framework published in Computers & Security.
Columns
- URL: The specific web address string.
- URLLength: The character length of the URL.
- Domain: The domain name associated with the URL.
- DomainLength: The character length of the domain.
- IsDomainIP: Binary indicator classifying if the domain is an IP address.
- TLD: Top-Level Domain (e.g., com, org).
- URLSimilarityIndex: A derived score measuring how closely the URL resembles known legitimate URLs.
- CharContinuationRate: A derived ratio describing how much of the URL is made up of its longest unbroken runs of letters, digits, and special characters.
- TLDLegitimateProb: The probability score of the TLD being legitimate.
- URLTitleMatchScore: A score derived from matching the URL to the page title.
- URLCharProb: Character probability metrics derived from the URL.
- Label: Classification tag where 1 corresponds to a legitimate URL and 0 to a phishing URL.
- FILENAME: A system column which can be ignored during analysis.
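Several of the simpler columns above can be recomputed from the URL string alone. The following is a minimal sketch using only the Python standard library; the function name `basic_url_features` is hypothetical, and the exact definitions used to build the dataset (for example, whether the scheme counts toward URLLength) are not stated in this listing, so the values may not match the CSV exactly.

```python
import ipaddress
from urllib.parse import urlsplit


def basic_url_features(url: str) -> dict:
    """Derive a few lexical features from a URL string (illustrative only)."""
    host = urlsplit(url).hostname or ""
    try:
        # ip_address() raises ValueError for anything that is not an IP literal.
        ipaddress.ip_address(host)
        is_ip = 1
    except ValueError:
        is_ip = 0
    # Assumption: the TLD is the last dot-separated label; an IP host has none.
    tld = "" if is_ip else host.rsplit(".", 1)[-1]
    return {
        "URLLength": len(url),
        "Domain": host,
        "DomainLength": len(host),
        "IsDomainIP": is_ip,
        "TLD": tld,
    }


features = basic_url_features("https://www.example.com/login")
print(features)
```

Derived scores such as URLSimilarityIndex or TLDLegitimateProb depend on reference data that is not part of this listing, so they are not reproduced here.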
Distribution
- Format: CSV
- Size: 56.85 MB
- Structure: 56 columns (a selection is described under Columns)
- Records: 235,795 rows (134,850 legitimate and 100,945 phishing)
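A quick sanity check after downloading is to count rows per class and compare against the figures above (134,850 + 100,945 = 235,795). The sketch below uses the standard `csv` module on a tiny made-up sample so that it runs on its own; the helper name `class_counts` and the sample rows are hypothetical, and the exact spelling of the label column in the file is an assumption, so the lookup is case-insensitive.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file) -> Counter:
    """Count rows per class in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    # Assumption: the label column may be spelled "Label" or "label".
    label_col = next(c for c in reader.fieldnames if c.lower() == "label")
    # Per the column description: 1 is legitimate, 0 is phishing.
    names = {"1": "legitimate", "0": "phishing"}
    return Counter(names.get(row[label_col].strip(), "unknown") for row in reader)


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "FILENAME,URL,Label\n"
    "a.txt,https://www.example.com,1\n"
    "b.txt,http://192.0.2.10/login,0\n"
    "c.txt,https://www.example.org,1\n"
)
counts = class_counts(sample)
print(counts["legitimate"], counts["phishing"])
```

For the real file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.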
Usage
- Training incremental learning models for phishing detection.
- Analysing URL structures to identify malicious patterns in social networks and email.
- Cybersecurity education and simulation of attack vectors.
- Evaluating feature extraction techniques from web source code.
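To illustrate the first use case, the sketch below shows what "incremental" training means in practice: the model is updated one record at a time rather than refit on the whole file. It is a plain online logistic regression written for illustration, not the method used by the PhiUSIIL framework; the class name, the two chosen features, the scaling of URLLength, and the sample rows are all assumptions.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        """Return the estimated probability that the label is 1 (legitimate)."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Take a single gradient step on the log loss for one example."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Made-up rows: (IsDomainIP, URLLength / 100) -> Label (1 legitimate, 0 phishing)
stream = [((0, 0.25), 1), ((1, 0.80), 0), ((0, 0.30), 1), ((1, 0.95), 0)]
model = OnlineLogisticRegression(n_features=2)
# The tiny stream is replayed to simulate a longer one.
for _ in range(200):
    for x, y in stream:
        model.learn_one(x, y)

print(model.predict_proba((0, 0.25)), model.predict_proba((1, 0.80)))
```

With the real dataset, rows would be read from the CSV in order and passed to `learn_one` as they arrive, so the full file never has to be held in memory.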
Coverage
- Geographic/Scope: Global internet URLs.
- Demographic: Covers various sectors including Social Networks, Email, Messaging, Mobile, and Wireless.
- Time Range: Contains the latest URLs analysed during the construction of the 2024 PhiUSIIL framework.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Cybersecurity Researchers
- Machine Learning Engineers
- Data Scientists specialising in fraud detection
- Network Security Analysts
Dataset Name Suggestions
- PhiUSIIL Phishing and Legitimate URL Repository
- Malicious Webpage Feature Collection
- Global Phishing URL Detection Dataset
Attributes
Original Data Source: Global Phishing URL Detection Dataset
