YouTube Content Classification Dataset
Social Media and Networking
Free
About
This dataset provides YouTube video metadata, suitable for practising text classification using Natural Language Processing (NLP) techniques. It includes video IDs, titles, descriptions, and categories, making it a valuable resource for those looking to apply and refine their NLP skills. The dataset was generated by scraping YouTube, offering a real-world scenario for data cleaning and analysis, including challenges such as missing values and class imbalance.
Columns
- Video ID: A unique identifier for each YouTube video. Note that this column contains some missing data.
- title: The title of the YouTube video.
- description: The textual description associated with the YouTube video.
- category: The category under which the video was classified when scraped.
- link: A direct URL to the YouTube video.
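A first look at the columns above might start by counting how many values are missing in each one. The sketch below uses only the Python standard library. The header spellings follow the column list, but the exact headers in the real file are an assumption, and the two sample rows are hypothetical stand-ins for the actual CSV.

```python
import csv
import io
from collections import Counter

# Column names as listed above; the exact header spelling in the
# real file is an assumption.
COLUMNS = ["Video ID", "title", "description", "category", "link"]


def count_missing(rows, columns=COLUMNS):
    """Return a Counter of empty or absent values per column."""
    missing = Counter({col: 0 for col in columns})
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or not value.strip():
                missing[col] += 1
    return missing


# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = (
    "Video ID,title,description,category,link\n"
    ",Street food tour,Trying noodles,Food,https://example.com/a\n"
    "abc123,Castle history,,History,https://example.com/b\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
for column in COLUMNS:
    print(f"{column}: {missing[column]} missing")
```

To run this against the downloaded file, replace the `io.StringIO(SAMPLE_CSV)` argument with an opened file handle, for example `open(path, newline="", encoding="utf-8")`.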
Distribution
The dataset is provided as a CSV file. It contains approximately 3,400 video records, derived from an initial scrape of 3,600 videos. The data is untidy: it has missing values and imbalanced classes across its categories, which makes it suitable for data cleaning and preprocessing exercises.
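One possible cleaning pass, sketched under assumptions: drop rows that lack a category or any text, remove exact repeats, then compute inverse-frequency class weights to quantify the imbalance. The listing does not say how the published records were filtered, so these rules are illustrative only, and the sample records are hypothetical.

```python
from collections import Counter


def clean_records(records):
    """Keep rows with a category and some text; drop exact repeats."""
    seen = set()
    cleaned = []
    for rec in records:
        title = (rec.get("title") or "").strip()
        description = (rec.get("description") or "").strip()
        category = (rec.get("category") or "").strip()
        if not category or not (title or description):
            continue  # unusable for classification
        key = (title, description, category)
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(
            {"title": title, "description": description, "category": category}
        )
    return cleaned


def class_weights(records):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(rec["category"] for rec in records)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Hypothetical records standing in for rows read from the CSV file.
records = [
    {"title": "Noodles", "description": "Street stall", "category": "Food"},
    {"title": "Noodles", "description": "Street stall", "category": "Food"},
    {"title": "Pizza", "description": "", "category": "Food"},
    {"title": "Curry", "description": "Home recipe", "category": "Food"},
    {"title": "Castle", "description": "Medieval fort", "category": "History"},
    {"title": "Unknown", "description": "No label", "category": ""},
]

cleaned = clean_records(records)
weights = class_weights(cleaned)
print(len(cleaned), weights)
```

Rare categories receive weights above 1 and common ones below 1, so the weights can be passed to a model that supports per-class weighting or used to guide resampling.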
Usage
This dataset is ideally suited for:
- Practising basic text classification using various NLP techniques.
- Learning how to handle common data issues such as missing values and imbalanced classes.
- Developing and applying data cleaning and preprocessing methods.
- Experimenting with different machine learning algorithms for text analysis.
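As a starting point for the exercises above, the following is a minimal sketch of a text classifier: a multinomial Naive Bayes model with add-one smoothing, written with the standard library only. The choice of model is an illustration rather than something the dataset prescribes, and the training texts are hypothetical examples, not rows from the data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods of the known tokens.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples (title and description joined as one string).
texts = [
    "Street food tour tasting noodles and dumplings",
    "Baking bread recipe at home",
    "Medieval castle history documentary",
    "Ancient Rome empire history explained",
]
labels = ["Food", "Food", "History", "History"]

model = NaiveBayesText().fit(texts, labels)
print(model.predict("Easy noodles recipe"))
print(model.predict("history of the roman empire"))
```

On the real data, a held-out test split and per-class metrics such as recall would give a fairer picture than overall accuracy, since the categories are imbalanced.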
Coverage
The dataset is global in scope, since it draws on YouTube videos that are accessible worldwide. It was listed on 08/06/2025. Videos were collected by querying four main categories: Travel Vlogs, Food, Art and Music, and History. The data contains missing values and is imbalanced across these categories.
License
CC0
Who Can Use It
This dataset is intended for individuals and researchers, particularly those at an intermediate skill level, who wish to practise and improve their text classification and NLP capabilities. It is also highly beneficial for anyone looking to gain practical experience in data cleaning, handling missing data, and addressing class imbalance in real-world datasets.
Dataset Name Suggestions
- YouTube Video Classification Data
- NLP YouTube Metadata Dataset
- YouTube Content Classification Dataset
- Video Description Text Analysis Dataset
Attributes
Original Data Source: Youtube Videos Dataset (~3400 videos)