Website Categorisation Dataset
Website Analytics & User Experience
Free
About
This dataset pairs website URLs with the cleaned text extracted from each site and a topic category assigned on the basis of that text. It is intended for website classification tasks and can also support web analytics and user experience analysis. The data was created by extracting and cleaning text from different websites, then assigning categories based on this content.
Columns
- index: An identifier for each row in the dataset.
- website_url: The URL link of the website.
- cleaned_website_text: The cleaned text content extracted from the website URL.
- Category: The assigned category of the URL.
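As a minimal sketch of reading the file with these columns, the snippet below uses Python's standard `csv` module and checks that the four listed columns are present. The exact header spelling in the CSV, the helper name `load_rows`, and the inline sample rows are assumptions made for illustration; the sample is not taken from the dataset.

```python
import csv
import io

# Column names as listed above; the exact header spelling in the CSV is an assumption.
EXPECTED_COLUMNS = ["index", "website_url", "cleaned_website_text", "Category"]


def load_rows(handle):
    """Read rows from an open CSV file object and check the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny made-up sample standing in for the real file (not actual dataset rows).
SAMPLE_CSV = """index,website_url,cleaned_website_text,Category
0,https://example.edu,university courses admissions students,Education
1,https://example.com,company services consulting clients,Business/Corporate
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["Category"])
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.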
Distribution
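The dataset comprises 1408 rows and is typically available as a CSV file. The listed category shares are 'Education' (8%), 'Business/Corporate' (8%), and 'Other' (84%); the large 'Other' share suggests that most rows fall into further categories not broken out here. There are 1375 unique website URLs, so a small number of URLs repeat. The source also reports 1407 unique categories, a figure that is hard to reconcile with the shares above and more plausibly counts unique cleaned text values.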
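The figures quoted here can be rechecked once the file is loaded. The sketch below shows one way to compute per-category percentages and unique-value counts with the standard library; the function names and the sample values are hypothetical and chosen only to illustrate the calculation.

```python
from collections import Counter


def category_shares(categories):
    """Return {category: percentage of rows}, ordered from most to least common."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


def unique_count(values):
    """Number of distinct values in a column."""
    return len(set(values))


# Made-up labels and URLs for illustration only (not taken from the dataset).
labels = ["Education", "Education", "Business/Corporate", "Travel"]
urls = ["https://a.example", "https://b.example", "https://a.example", "https://c.example"]

print(category_shares(labels))
print(unique_count(urls))
```

Running `unique_count` on each column of the real data would show which column the reported unique counts actually describe.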
Usage
This dataset is ideal for various applications, including:
- Website classification: Training models to automatically assign categories to new websites.
- Website analytics: Understanding the topical distribution of websites.
- User experience studies: Analysing website content for improved user engagement.
- Data visualisation: Creating visual representations of website categories.
- Natural Language Processing (NLP) tasks: Developing and testing NLP models for text extraction and categorisation.
- Multiclass classification problems: Serving as a foundation for building complex classification algorithms.
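To make the classification use case concrete, here is a small multinomial Naive Bayes baseline written with the standard library only. It is a sketch of one common starting point, not the method used to label this dataset, and the training texts below are invented stand-ins for `cleaned_website_text` values. In practice a machine learning library would normally replace this hand-rolled version.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenised texts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples for illustration (not actual dataset rows).
train_texts = [
    "university courses admissions students",
    "school students learning courses",
    "company services consulting clients",
    "business clients services solutions",
]
train_labels = ["Education", "Education", "Business/Corporate", "Business/Corporate"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "online courses for students"))
print(predict(model, "consulting services"))
```

With the real data, a held-out split would be needed to judge accuracy, since the category shares are uneven.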
Coverage
The dataset offers global coverage, encompassing websites from various regions.
License
CC0
Who Can Use It
This dataset is suitable for:
- Beginner data scientists and analysts looking to practice classification, NLP, and data visualisation.
- Machine learning engineers developing and testing multiclass classification models.
- Researchers interested in web content analysis and automatic categorisation.
- Developers building applications that require website categorisation capabilities.
Dataset Name Suggestions
- Website Categorisation Dataset
- Web Content Classification
- URL Classification Data
- Cleaned Website Text Categories
- Web Page Classification Repository
Attributes
Original Data Source: Website Classification