Opendatabay APP

Website Categorisation Dataset

Website Analytics & User Experience

Tags and Keywords

Beginner

Data

Classification

NLP

Multiclass

Category

Website

Free

About

This dataset provides a collection of website URLs and their corresponding cleaned text content, which have been categorised into various topics. It is designed to facilitate website classification tasks, offering valuable insights for web analytics and user experience analysis. The data was created by extracting and cleaning text from different websites, then assigning categories based on this content.

Columns

  • index: An identifier for each row in the dataset.
  • website_url: The URL link of the website.
  • cleaned_website_text: The cleaned text content extracted from the website URL.
  • Category: The assigned category of the URL.
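A minimal sketch of reading the CSV and checking that the four columns listed above are present, using only the Python standard library. The header names are assumed to match the listing exactly, which may not hold for the real file (the index column, for instance, could be unnamed), and the inline sample rows are made up for illustration.

```python
import csv
import io

# Column names as given in the listing; the real header row may differ
# (for example, the index column may be unnamed in the CSV).
EXPECTED_COLUMNS = ["index", "website_url", "cleaned_website_text", "Category"]


def load_rows(handle):
    """Read rows from an open CSV file and check the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "index,website_url,cleaned_website_text,Category\n"
    "0,https://example.com,online courses for students,Education\n"
    "1,https://example.org,corporate consulting services,Business/Corporate\n"
)
rows = load_rows(sample)
print(len(rows), rows[0]["Category"])
```

For the downloaded file, the same function can be passed a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.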

Distribution

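The dataset comprises 1,408 rows and is typically distributed as a CSV file. The two most frequent categories are 'Education' (8%) and 'Business/Corporate' (8%); the remaining 84% of rows, shown as 'Other' in the listing summary, appear to be spread across the remaining categories. There are 1,375 unique website URLs, so some URLs occur more than once. The listing also reports 1,407 unique categories, which cannot be right for 1,408 rows given the shares above and probably refers to a different column, so check the category count against the file itself.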
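The figures in this section can be checked directly against the file. The sketch below counts category frequencies and percentage shares with `collections.Counter`; the function name `category_shares` and the sample labels are illustrative, not taken from the dataset.

```python
from collections import Counter


def category_shares(categories):
    """Return (category, count, percent) tuples, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]


# Made-up labels standing in for the real `Category` column.
labels = ["Education", "Education", "Business/Corporate", "Travel"]
shares = category_shares(labels)
for cat, n, pct in shares:
    print(f"{cat}: {n} rows ({pct:.1f}%)")
print("unique categories:", len(shares))
```

Running the same function over the `Category` values of the real file gives the actual number of distinct categories and the share of each.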

Usage

This dataset is ideal for various applications, including:
  • Website classification: Training models to automatically assign categories to new websites.
  • Website analytics: Understanding the topical distribution of websites.
  • User experience studies: Analysing website content for improved user engagement.
  • Data visualisation: Creating visual representations of website categories.
  • Natural Language Processing (NLP) tasks: Developing and testing NLP models for text extraction and categorisation.
  • Multiclass classification problems: Serving as a foundation for building complex classification algorithms.
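As an illustration of the website classification use case, here is a small multinomial Naive Bayes baseline written with the standard library only. This is a sketch of one common starting point, not the method used to label the dataset; the class name `NaiveBayes`, the tokenizer, and the training texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up examples standing in for `cleaned_website_text` and `Category`.
texts = [
    "online courses and university degrees for students",
    "corporate consulting and business services",
    "school lessons and student exams",
    "enterprise business solutions for companies",
]
labels = ["Education", "Business/Corporate", "Education", "Business/Corporate"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("university courses for students"))
```

With the real data, the class imbalance described under Distribution means a held-out split and per-category metrics would give a fairer picture than overall accuracy.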

Coverage

The dataset offers global coverage, encompassing websites from various regions.

License

CC0

Who Can Use It

This dataset is suitable for:
  • Beginner data scientists and analysts looking to practice classification, NLP, and data visualisation.
  • Machine learning engineers developing and testing multiclass classification models.
  • Researchers interested in web content analysis and automatic categorisation.
  • Developers building applications that require website categorisation capabilities.

Dataset Name Suggestions

  • Website Categorisation Dataset
  • Web Content Classification
  • URL Classification Data
  • Cleaned Website Text Categories
  • Web Page Classification Repository

Attributes

Original Data Source: Website Classification

Listing Stats

  • Views: 0
  • Downloads: 5
  • Listed: 08/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
