YouTuber Subtitle Data
Entertainment & Media Consumption
Free
About
This dataset provides a tabular collection of over 2,500 YouTube videos and their corresponding subtitles. Its primary purpose is to enable analysis of video content and to explore whether computers can differentiate video genres based on titles and subtitles. The dataset includes subtitles from over 91 different YouTubers, covering a wide array of categories. Auto-generated subtitles are generally good but vary in reliability, so the CC column indicates whether each subtitle is manual or machine-generated. The figures for channel subscribers and video views reflect the state at the time of data collection. The most recent version of this dataset was generated on 05-February-2022.
Columns
The dataset is described as containing 11 columns, 10 of which are documented here:
- Id: A unique identifier for each video (e.g., dQw4w9WgXcQ).
- Channel: The name of the YouTube channel that published the video.
- Subscribers: The number of subscribers the channel had when the dataset was created.
- Title: The title of the YouTube video.
- CC: An indicator of subtitle origin: 1 signifies manual subtitles, while 0 means auto-generated subtitles, which might be less reliable.
- URL: The direct URL to the YouTube video.
- Released: The date when the video was released.
- Views: The total number of views the video had at the time the dataset was collected.
- Category: The category of the YouTube channel.
- Transcript: The subtitle text for the respective video.
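As a rough illustration of how these columns might be read, the sketch below loads rows with Python's standard csv module and separates manual from auto-generated subtitles using the CC flag. It assumes the data is available as a CSV file whose header matches the column names listed above and that Subscribers and Views are plain integers; the sample rows, example.com URLs, date format, and function names are made up for the example.

```python
import csv
import io

# Column names as listed above; the exact header spelling in the real file is an assumption.
EXPECTED_COLUMNS = ["Id", "Channel", "Subscribers", "Title", "CC",
                    "URL", "Released", "Views", "Category", "Transcript"]


def load_videos(handle):
    """Read rows from an open CSV file into dicts, converting numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        row["CC"] = int(row["CC"])  # 1 = manual, 0 = auto-generated
        row["Subscribers"] = int(row["Subscribers"])
        row["Views"] = int(row["Views"])
        rows.append(row)
    return rows


def split_by_subtitle_origin(rows):
    """Return (manual, auto_generated) lists based on the CC flag."""
    manual = [r for r in rows if r["CC"] == 1]
    auto = [r for r in rows if r["CC"] == 0]
    return manual, auto


# Two made-up rows standing in for the real file.
sample = io.StringIO(
    "Id,Channel,Subscribers,Title,CC,URL,Released,Views,Category,Transcript\n"
    "abc123,Example Science,1000,Why the sky is blue,1,https://example.com/a,"
    "2021-01-01,500,Science,light scatters in air\n"
    "def456,Example Food,2000,Easy pasta,0,https://example.com/b,"
    "2021-02-01,900,Food,boil the pasta then add sauce\n"
)
videos = load_videos(sample)
manual, auto = split_by_subtitle_origin(videos)
print(len(manual), len(auto))
```

With the real file, the same functions could be applied to an open file handle, for example one opened with newline="" and encoding="utf-8" as the csv module recommends.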
Distribution
This is a tabular dataset of 2,515 unique videos and their subtitles, described as having 11 columns (10 are documented in the Columns section). The data was collected and cleaned, and the most recent version was generated on 05-February-2022. No record count beyond the 2,515 unique videos is given, and all records are understood to share the same structure.
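Because the listing states that every video is unique and that records share one structure, a quick check after loading may be worthwhile. The following is a minimal sketch, assuming rows have already been read into dictionaries keyed by column name; the function name and the toy rows are hypothetical.

```python
from collections import Counter


def check_structure(rows, expected_columns):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    # Count how often each video Id appears; more than once means a duplicate.
    id_counts = Counter(row["Id"] for row in rows if "Id" in row)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate Ids: {duplicates}")
    # Report any row that lacks one of the expected columns.
    for index, row in enumerate(rows):
        missing = [c for c in expected_columns if c not in row]
        if missing:
            problems.append(f"row {index} is missing columns: {missing}")
    return problems


columns = ["Id", "Title", "Transcript"]  # shortened list for the illustration
rows = [
    {"Id": "abc123", "Title": "First", "Transcript": "hello"},
    {"Id": "abc123", "Title": "Repeat", "Transcript": "hello again"},
    {"Id": "def456", "Title": "No transcript"},
]
for problem in check_structure(rows, columns):
    print(problem)
```

An empty result on the full dataset would be consistent with the stated 2,515 unique videos, though it would not confirm the count by itself.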
Usage
This dataset is well suited to use cases such as:
- Natural Language Processing (NLP) tasks, such as text analysis, sentiment analysis, and topic modelling of video transcripts.
- Machine learning model training for video genre classification or content categorisation based on textual data.
- Research into audience engagement and content trends on YouTube.
- Developing recommendation systems or content filtering tools.
- Linguistic studies on spoken language in online video content.
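To make the genre classification use case concrete, here is a small sketch of a multinomial naive Bayes classifier built on word counts, using only the Python standard library. It is one possible baseline, not the method used by the dataset's author; the class name, tokenizer, and toy training texts are all hypothetical, and in practice the input text might be the Title and Transcript columns joined together, with Category as the label.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesGenreClassifier:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of videos
        self.vocabulary = set()

    def fit(self, texts, categories):
        for text, category in zip(texts, categories):
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Toy training data; real inputs would come from the dataset's text columns.
texts = [
    "Why the sky is blue light scatters in the atmosphere",
    "How black holes bend light and time",
    "Easy pasta boil the pasta then add tomato sauce",
    "Baking bread knead the dough and bake it",
]
categories = ["Science", "Science", "Food", "Food"]

model = NaiveBayesGenreClassifier().fit(texts, categories)
print(model.predict("add sauce to the pasta and bake"))  # prints: Food
```

On the real data, holding out a portion of the videos for evaluation, and comparing results on manual versus auto-generated subtitles via the CC column, would give a fairer picture than this toy example.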
Coverage
The dataset's regional coverage is global. It includes subtitles from over 91 YouTubers across a wide range of channel categories, such as Science, Food, Parks and Recreation, and A&E, along with many others. Subscriber and view counts reflect the dataset generation date, 05-February-2022, and the videos were released at various points up to that time. Because both manual and auto-generated subtitles are present and flagged, transcript reliability can be compared across the two groups.
License
CC0
Who Can Use It
This dataset is suited to a variety of users and applications:
- Data Scientists and NLP Engineers: For training and evaluating language models, performing text mining, and developing new NLP applications on video content.
- Researchers: For studying media consumption, digital culture, and linguistic patterns in online videos.
- Content Analysts and Marketers: For understanding trends, identifying popular topics, and analysing audience preferences within the YouTube ecosystem.
- Academics: For studies in areas such as computational linguistics, media studies, and artificial intelligence.
Dataset Name Suggestions
- YouTube Video Transcripts
- YouTuber Subtitle Data
- Video Content Transcripts
- YouTube Channel Dialogue
- Digital Video Subtitles
Attributes
Original Data Source: YouTubers saying things