Malicious URL Classification Data
Data Science and Analytics
Free
About
This dataset supports the detection and analysis of phishing domains embedded within URLs. It provides a set of features extracted from URLs that capture attributes frequently associated with malicious activity, which helps in uncovering potential phishing attempts. It was built by consolidating other datasets, with additional features added for completeness.
Columns
The data contains 13 columns describing URL structure and content. Key columns include:
- Phising (Label): Indicates whether a URL is classified as phishing (1) or not (0).
- NumDots: The count of dot symbols (.) found in the URL.
- UrlLength: The total length of the URL string.
- AtSymbol: Registers the presence of the "@" symbol in the URL.
- NumDash: The count of dash marks (-) present in the URL.
- NumPercent: The count of percent marks (%) found in the URL.
- NumQueryComponents: The count of question marks (?) in the URL, used to determine the number of query sections.
- IpAddress: A binary indicator (1/0) noting if the URL uses a direct IP address.
- HttpsInHostname: Notes the presence of 'https' within the hostname portion of the URL.
- PathLevel: Defines the depth of the directory hierarchy in the path of a URL.
- PathLength: Represents the total number of segments in the URL path.
- NumNumericChars: The count of numeric characters (0-9) within the URL.
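The column descriptions above can be approximated for a single URL with the Python standard library. This is a minimal sketch, not the procedure used to build the dataset: the function name `extract_url_features` is hypothetical, the exact counting rules of the original extraction are assumed from the descriptions, and PathLength is left out because its description overlaps with PathLevel and its precise definition is unclear.

```python
import ipaddress
import string
from urllib.parse import urlsplit


def extract_url_features(url: str) -> dict:
    """Approximate the listed URL features for one URL (illustrative only)."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""

    # Non-empty path segments, e.g. "/a/b/c.html" -> ["a", "b", "c.html"].
    segments = [s for s in parts.path.split("/") if s]

    # Assumption: "direct IP address" means the hostname parses as IPv4/IPv6.
    try:
        ipaddress.ip_address(hostname)
        uses_ip = 1
    except ValueError:
        uses_ip = 0

    return {
        "NumDots": url.count("."),
        "UrlLength": len(url),
        "AtSymbol": int("@" in url),
        "NumDash": url.count("-"),
        "NumPercent": url.count("%"),
        "NumQueryComponents": url.count("?"),
        "IpAddress": uses_ip,
        "HttpsInHostname": int("https" in hostname),
        "PathLevel": len(segments),
        "NumNumericChars": sum(ch in string.digits for ch in url),
    }


if __name__ == "__main__":
    print(extract_url_features("http://192.168.0.1/a-b/c%20d.html?x=1"))
```

Values computed this way may differ from those in the file wherever the original extraction used different rules, so they are best treated as a starting point for scoring new URLs rather than a reproduction of the dataset.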
Distribution
The data is provided in a file named Phising_dataset_predict.csv, with a size of 23.11 MB. It consists of 13 columns and contains approximately 663,000 records (rows). Note that 5% of the values in the primary "Phising" label column are missing, leaving approximately 630,000 labelled records. The dataset is static and has an expected update frequency of 'Never'.
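Because some labels are missing, unlabelled rows usually need to be separated out before training. The sketch below shows one way to do this with the standard `csv` module. It assumes that missing labels appear as empty fields (or the text "nan") and that labels may be written as "1" or "1.0"; the helper name `load_labelled_rows` and the three sample rows are made up for illustration and are not taken from the file.

```python
import csv
import io


def load_labelled_rows(csv_file, label_column="Phising"):
    """Return (rows_with_label, total_row_count) from an open CSV file."""
    reader = csv.DictReader(csv_file)
    labelled, total = [], 0
    for row in reader:
        total += 1
        raw = (row.get(label_column) or "").strip()
        # Assumption: a missing label is an empty field or the text "nan".
        if raw == "" or raw.lower() == "nan":
            continue
        # Accept both "1" and "1.0" style labels.
        row[label_column] = int(float(raw))
        labelled.append(row)
    return labelled, total


# Illustrative sample only; the real file has 13 columns.
sample = io.StringIO(
    "NumDots,UrlLength,Phising\n"
    "3,72,1\n"
    "2,40,\n"
    "1,25,0.0\n"
)
rows, total = load_labelled_rows(sample)
print(f"{len(rows)} of {total} rows have a label")
```

For the real file, the same function can be called on `open("Phising_dataset_predict.csv", newline="")`, provided the header uses the column name "Phising" as listed above.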
Usage
This data is ideally suited for developing and evaluating models designed for cyber security. Specific use cases include training machine learning algorithms (such as classification models) to detect malicious URLs, enabling automated phishing prevention systems, and facilitating research into the structural indicators of web attacks.
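When evaluating a phishing classifier trained on this data, precision and recall on the phishing class (label 1) are often more informative than plain accuracy. The following is a small, library-free sketch of that calculation; the function name `precision_recall` and the example labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (phishing) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In a phishing filter, low recall means malicious URLs slip through, while low precision means legitimate URLs are blocked, so both numbers are worth tracking.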
Coverage
The scope is focused on features derived from the structure and composition of URLs. It does not contain geographic, time range, or demographic information, concentrating solely on attributes relevant to identifying phishing attempts based on URL metrics like length, character counts, and symbol presence.
License
CC0: Public Domain
Who Can Use It
Intended users include cyber security researchers who need labelled data for attack pattern analysis; machine learning engineers developing fraud detection or filtering software; data scientists building URL reputation scoring systems; and students engaged in academic projects concerning network security.
Dataset Name Suggestions
- Phishing URL Detection Features
- Web Attack Feature Set
- Malicious URL Classification Data
- URL Phishing Indicator Metrics
Attributes
Original Data Source: Malicious URL Classification Data
