Opendatabay APP

Binary Feature URL Classifier Data

Data Science and Analytics

Tags and Keywords

Phishing

Url

Detection

Security

Binary

Free

About

This dataset consists of derived features tailored for developing machine learning models that detect phishing and malicious Uniform Resource Locators (URLs). It was created by compiling features related to known phishing characteristics, including a novel feature that uses Optical Character Recognition (OCR) to assess suspicious keywords found within website images. Each feature is encoded as a discrete flag, typically 1, -1, or 0, which makes the data well suited to classification tasks.

Columns

The dataset includes derived feature data from a collection of legitimate and fraudulent URLs, primarily sourced from phishtank.com. The key attributes included are:
  • having_ip_address: Identifies if the URL uses an IP address instead of a domain name.
  • length_of_url: Indicates suspicious activity if the URL exceeds a typical length threshold.
  • shortening_services: Flags the use of third-party URL shortening services, which is often a suspicious indicator.
  • having_at_symbol: Notes the presence of the '@' symbol, a characteristic often used in deceptive URLs.
  • double-slash_redirection: Detects the presence of double slashes (//) within the URL path.
  • prefix and suffix: Scans for suspicious structural components like prefixes and suffixes.
  • sub_domains: Assesses if the URL contains an excessive number of subdomains.
  • ssl_state: Analyses suspicious behaviours related to the Secure Sockets Layer (SSL) certificate status.
  • domain_registered: Examines registration details for suspicious elements.
  • favicons: Evaluates the URL for any suspicious use of favicons.
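
A few of the URL-level attributes above can be derived from the URL string alone. The following is a minimal sketch of how that might look, not the method used to build this dataset. The encoding (1 = suspicious, -1 = not suspicious), the length threshold, the shortener list, and the helper names are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed encoding: 1 = suspicious, -1 = not suspicious. The listing does not
# state the dataset's actual polarity or thresholds.
SUSPICIOUS, BENIGN = 1, -1
MAX_URL_LENGTH = 75  # hypothetical length threshold
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}  # illustrative only


def _flag(condition: bool) -> int:
    """Map a boolean check onto the assumed 1 / -1 encoding."""
    return SUSPICIOUS if condition else BENIGN


def _host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute a handful of string-based features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "having_ip_address": _flag(_host_is_ip(host)),
        "length_of_url": _flag(len(url) > MAX_URL_LENGTH),
        "shortening_services": _flag(host in SHORTENERS),
        "having_at_symbol": _flag("@" in url),
        # Only the path is checked, so the "//" after the scheme is ignored.
        "double-slash_redirection": _flag("//" in parsed.path),
    }


print(extract_features("http://192.168.0.1/login//secure@evil"))
```

Attributes such as ssl_state, domain_registered, and favicons need network lookups or page content, so they are left out of this sketch.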

Distribution

The data is expected in CSV format, and a sample file is available on the platform. The dataset currently contains approximately 14,100 valid records. The structure is tabular: each row represents one processed URL, and each column holds the value of one derived feature.
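
As a rough illustration of the tabular layout described above, the sketch below parses a CSV of integer-coded features and rejects any value outside 1, -1, and 0. The three-row sample is made up, the "label" column name is an assumption, and the real file may use different headers.

```python
import csv
import io

ALLOWED_VALUES = {-1, 0, 1}

# Hypothetical sample standing in for the real CSV file. Column names follow
# the listing; "label" is an assumed name for the target column.
SAMPLE_CSV = """having_ip_address,length_of_url,having_at_symbol,label
-1,1,-1,1
1,0,1,-1
-1,-1,-1,1
"""


def load_rows(handle):
    """Parse a CSV of integer-coded features into a list of dicts."""
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, record in enumerate(csv.DictReader(handle), start=2):
        row = {}
        for name, raw in record.items():
            value = int(raw)
            if value not in ALLOWED_VALUES:
                raise ValueError(
                    f"line {line_no}: {name}={value} is not in {sorted(ALLOWED_VALUES)}"
                )
            row[name] = value
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows,", len(rows[0]), "columns")
```

To read a downloaded file instead, pass an open file handle, for example `open(path, newline="")`, in place of the `io.StringIO` object.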

Usage

This dataset is suited to training and evaluating supervised machine learning algorithms for binary classification, such as Random Forest. Typical applications are URL analysis systems and cyber security tools that identify and mitigate phishing threats.
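
The train-and-evaluate workflow can be sketched with the standard library alone. The example below fits a single-feature decision stump, a deliberately simpler stand-in for Random Forest, on synthetic rows that are not drawn from this dataset. The label encoding (1 = phishing, -1 = legitimate), the 90% agreement rate, and the function names are assumptions for illustration.

```python
import random

# Synthetic stand-in data, NOT taken from the dataset. Each item is
# (features, label) with label 1 = phishing and -1 = legitimate (assumed).
random.seed(0)
FEATURES = ["having_ip_address", "having_at_symbol", "sub_domains"]


def make_row():
    label = random.choice([1, -1])
    # having_ip_address agrees with the label 90% of the time; the other
    # two features are pure noise.
    ip_flag = label if random.random() < 0.9 else -label
    features = {
        "having_ip_address": ip_flag,
        "having_at_symbol": random.choice([1, -1]),
        "sub_domains": random.choice([1, 0, -1]),
    }
    return features, label


data = [make_row() for _ in range(400)]
train, test = data[:300], data[300:]  # simple hold-out split


def accuracy(rows, feature, polarity):
    """Share of rows where polarity * feature value equals the label."""
    hits = sum(1 for x, y in rows if polarity * x[feature] == y)
    return hits / len(rows)


def fit_stump(rows):
    """Pick the (feature, polarity) pair with the best training accuracy."""
    candidates = [(f, p) for f in FEATURES for p in (1, -1)]
    return max(candidates, key=lambda c: accuracy(rows, *c))


feature, polarity = fit_stump(train)
test_accuracy = accuracy(test, feature, polarity)
print(feature, polarity, round(test_accuracy, 2))
```

A Random Forest combines many deeper trees trained on resampled data, so a library implementation would normally replace `fit_stump` in practice; the split-fit-score structure stays the same.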

Coverage

The underlying URL data was primarily taken from phishtank.com, which provides a large volume of URL contents from diverse sources. The scope covers the feature characteristics derived from these URLs, designed to capture general patterns of phishing behaviour.

License

CC0: Public Domain

Who Can Use It

The dataset is intended for developers, researchers, and students interested in cyber security, machine learning, and network protection. It is suitable for users across beginner, intermediate, and advanced skill levels looking to apply algorithms for classification problems.

Dataset Name Suggestions

  • Phishing URL Feature Set for ML
  • Binary Feature URL Classifier Data
  • Cyber Security Phishing Detection Features
  • Suspicious URL Characteristics Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 12/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
