Binary Feature URL Classifier Data
Data Science and Analytics
Free
About
This dataset consists of derived features for developing machine learning models that detect phishing and malicious Uniform Resource Locators (URLs). It was compiled from features associated with known phishing characteristics, including a novel feature that uses Optical Character Recognition (OCR) to look for suspicious keywords in website images. Features are presented as binary-style flags, typically encoded as 1, -1, or 0, which makes the data well suited to classification tasks.
Columns
The dataset includes derived feature data from a collection of legitimate and fraudulent URLs, primarily sourced from phishtank.com. The key attributes included are:
- having_ip_address: Identifies if the URL uses an IP address instead of a domain name.
- length_of_url: Indicates suspicious activity if the URL exceeds a typical length threshold.
- shortening_services: Flags the use of third-party URL shortening services, which is often a suspicious indicator.
- having_at_symbol: Notes the presence of the '@' symbol, a characteristic often used in deceptive URLs.
- double-slash_redirection: Detects the presence of double slashes (//) within the URL path.
- prefix and suffix: Scans for suspicious structural components like prefixes and suffixes.
- sub_domains: Assesses if the URL contains an excessive number of subdomains.
- ssl_state: Analyses suspicious behaviours related to the Secure Sockets Layer (SSL) certificate status.
- domain_registered: Examines registration details for suspicious elements.
- favicons: Evaluates the URL for any suspicious use of favicons.
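As an illustration of how URL-only flags like those listed above might be derived, here is a minimal Python sketch. It is not the dataset author's method: the thresholds (`MAX_URL_LENGTH`, `MAX_SUBDOMAIN_DOTS`), the `SHORTENERS` list, and the convention that 1 means suspicious and -1 means not suspicious are all assumptions made for the example. Features that need network lookups (ssl_state, domain_registered, favicons) are left out.

```python
import ipaddress
from typing import Dict
from urllib.parse import urlsplit

# Assumed values for illustration only; the dataset's real cut-offs are not documented here.
MAX_URL_LENGTH = 75
MAX_SUBDOMAIN_DOTS = 3
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}

# Assumed sign convention: 1 = suspicious, -1 = not suspicious.
SUSPICIOUS, NOT_SUSPICIOUS = 1, -1


def _flag(condition: bool) -> int:
    """Map a boolean check onto the assumed 1 / -1 encoding."""
    return SUSPICIOUS if condition else NOT_SUSPICIOUS


def _host_is_ip(host: str) -> bool:
    """Return True when the host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> Dict[str, int]:
    """Derive a few lexical flags from the URL string alone."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return {
        "having_ip_address": _flag(_host_is_ip(host)),
        "length_of_url": _flag(len(url) > MAX_URL_LENGTH),
        "shortening_services": _flag(host in SHORTENERS),
        "having_at_symbol": _flag("@" in url),
        # Only the path is checked, so the "//" after the scheme is ignored.
        "double-slash_redirection": _flag("//" in parts.path),
        "sub_domains": _flag(host.count(".") > MAX_SUBDOMAIN_DOTS),
    }
```

Calling `extract_features("http://192.168.0.1/login//verify")` flags both the IP-address host and the double slash in the path under these assumptions.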
Distribution
The expected data file format is CSV, with a sample file available on the platform. The dataset currently contains approximately 14.1 thousand valid records. The structure is tabular, with each row representing a processed URL and columns holding the associated binary derived feature values.
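Because every feature is expected to be 1, -1, or 0, a quick validation pass on load can catch malformed rows early. The sketch below uses only the Python standard library. `SAMPLE_CSV` is a made-up two-row stand-in for the real file, and the helper names `load_feature_rows` and `value_counts` are hypothetical; the column names are taken from the Columns list.

```python
import csv
import io
from collections import Counter

ALLOWED_VALUES = {"1", "-1", "0"}

# Made-up sample standing in for the real CSV file.
SAMPLE_CSV = """having_ip_address,length_of_url,having_at_symbol
-1,1,0
1,-1,-1
"""


def load_feature_rows(handle):
    """Read rows as dicts of ints, rejecting any value other than 1, -1, or 0."""
    rows = []
    reader = csv.DictReader(handle)
    for line_no, record in enumerate(reader, start=2):  # line 1 is the header
        row = {}
        for column, raw in record.items():
            value = raw.strip() if isinstance(raw, str) else ""
            if value not in ALLOWED_VALUES:
                raise ValueError(
                    f"line {line_no}, column {column!r}: unexpected value {raw!r}"
                )
            row[column] = int(value)
        rows.append(row)
    return rows


def value_counts(rows, column):
    """Count how often each encoded value appears in one column."""
    return Counter(row[column] for row in rows)


rows = load_feature_rows(io.StringIO(SAMPLE_CSV))
```

For the real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, newline="")`.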
Usage
This dataset is suited to training and evaluating supervised machine learning algorithms for binary classification, such as Random Forest. Typical applications are URL analysis systems and cyber security tooling that identifies and mitigates phishing threats.
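A minimal sketch of such a workflow with scikit-learn's `RandomForestClassifier` follows. The data here is synthetic: the feature matrix is random values drawn from -1, 0, and 1, and the label rule is invented so the example runs without the real file. The listing does not name the label column, so swapping in the real features and labels is left to the reader.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 400 rows, 10 features in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(400, 10))
# Invented label rule so the classifier has something learnable.
y = (X[:, 0] + X[:, 3] + X[:, 7] > 0).astype(int)


def train_and_score(X, y, seed=0):
    """Fit a Random Forest on a stratified split and return it with test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


model, accuracy = train_and_score(X, y)
print(f"held-out accuracy: {accuracy:.3f}")
```

The stratified split keeps the class ratio the same in the training and test sets, which matters if the real data is imbalanced between legitimate and fraudulent URLs.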
Coverage
The underlying URL data was taken primarily from phishtank.com, which provides a large volume of URLs from diverse sources. The features derived from these URLs are designed to capture general patterns of phishing behaviour.
License
CC0: Public Domain
Who Can Use It
The dataset is intended for developers, researchers, and students interested in cyber security, machine learning, and network protection. It is suitable for users across beginner, intermediate, and advanced skill levels looking to apply algorithms for classification problems.
Dataset Name Suggestions
- Phishing URL Feature Set for ML
- Binary Feature URL Classifier Data
- Cyber Security Phishing Detection Features
- Suspicious URL Characteristics Dataset
Attributes
Original Data Source: Binary Feature URL Classifier Data
