YouTuber Subtitle Data

Entertainment & Media Consumption

Tags and Keywords

Arts

Entertainment

Internet

Text

NLP

Popular

Culture

Free

About

This tabular dataset pairs 2,515 YouTube videos with their subtitles. Its primary purpose is to support analysis of video content, in particular whether a computer can tell video genres apart from titles and subtitles alone. The subtitles come from over 91 YouTubers across a wide range of categories. Auto-generated subtitles are generally good but can be less reliable than manual ones, so the CC column records whether each subtitle is manual or machine-generated. Channel subscriber and video view counts reflect the state at the time of data collection. The most recent version of the dataset was generated on 5 February 2022.

Columns

The dataset contains 11 columns. This listing describes the following 10:
  • Id: A unique identifier for each video (e.g., dQw4w9WgXcQ).
  • Channel: The name of the YouTube channel that published the video.
  • Subscribers: The number of subscribers the channel had when the dataset was created.
  • Title: The title of the YouTube video.
  • CC: An indicator of subtitle origin: 1 signifies manual subtitles, while 0 means auto-generated subtitles, which might be less reliable.
  • URL: The direct URL to the YouTube video.
  • Released: The date when the video was released.
  • Views: The total number of views the video had at the time the dataset was collected.
  • Category: The category of the YouTube channel.
  • Transcript: The subtitle text for the respective video.
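
As a minimal sketch of how these columns might be read, the snippet below parses the CSV with Python's standard csv module, converts the numeric columns, and separates manual from auto-generated subtitles using CC. The header names are assumed to match the column names listed above, the number format is an assumption (thousands separators are stripped just in case), and the two sample rows are made up purely for illustration.

```python
import csv
import io


def load_rows(fileobj):
    """Read the CSV into a list of dicts, converting the numeric columns.

    Assumes the header names match the column list in this listing.
    """
    rows = []
    for row in csv.DictReader(fileobj):
        row["CC"] = int(row["CC"])
        # Strip thousands separators in case counts are stored as "1,000".
        row["Subscribers"] = int(row["Subscribers"].replace(",", ""))
        row["Views"] = int(row["Views"].replace(",", ""))
        rows.append(row)
    return rows


def split_by_subtitle_origin(rows):
    """Return (manual, auto): CC == 1 is manual, CC == 0 is auto-generated."""
    manual = [r for r in rows if r["CC"] == 1]
    auto = [r for r in rows if r["CC"] == 0]
    return manual, auto


# Made-up sample rows standing in for the real file.
SAMPLE = """Id,Channel,Subscribers,Title,CC,URL,Released,Views,Category,Transcript
abc123,Example Science,1000,Why the sky is blue,1,https://example.com/1,2021-01-01,500,Science,light scatters in the atmosphere
def456,Example Food,2000,Easy pasta,0,https://example.com/2,2021-02-01,900,Food,boil the pasta and add sauce
"""

rows = load_rows(io.StringIO(SAMPLE))
manual, auto = split_by_subtitle_origin(rows)
print(len(manual), len(auto))  # prints "1 1"
```

With the real file, the same functions would be called on an opened file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the download was saved. Released is left as a string because the listing does not state its date format.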

Distribution

This is a tabular dataset of 2,515 unique videos and their subtitles, organised into 11 columns. The data was collected and cleaned, and the most recent version was generated on 5 February 2022. The listing gives no record counts beyond the number of unique videos; the structure is described as consistent throughout.
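
The figures above can be checked against a downloaded copy. The following is a rough sketch, assuming the header names match the column names in this listing; check_structure is a hypothetical helper, and the small sample (with a deliberately repeated Id) is made up to show what the report looks like.

```python
import csv
import io
from collections import Counter

# Column names as described in this listing (assumed to match the CSV header).
DESCRIBED_COLUMNS = [
    "Id", "Channel", "Subscribers", "Title", "CC",
    "URL", "Released", "Views", "Category", "Transcript",
]


def check_structure(fileobj):
    """Report column count, missing described columns, and Id uniqueness."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    id_counts = Counter(row.get("Id") for row in reader)
    return {
        "n_columns": len(header),
        "missing_described": [c for c in DESCRIBED_COLUMNS if c not in header],
        "n_rows": sum(id_counts.values()),
        "n_unique_ids": len(id_counts),
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
    }


# Made-up three-column sample with one repeated Id.
SAMPLE = """Id,Title,CC
abc123,First video,1
def456,Second video,0
abc123,First video again,1
"""

report = check_structure(io.StringIO(SAMPLE))
print(report["n_rows"], report["n_unique_ids"], report["duplicate_ids"])
```

On the real file, n_columns would be expected to be 11 and n_unique_ids to be 2,515 if the listing is accurate. Transcripts can be long, so the csv module's default field size limit may need raising with csv.field_size_limit before reading.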

Usage

This dataset is ideal for applications and use cases involving:
  • Natural Language Processing (NLP) tasks, such as text analysis, sentiment analysis, and topic modelling of video transcripts.
  • Machine learning model training for video genre classification or content categorisation based on textual data.
  • Research into audience engagement and content trends on YouTube.
  • Developing recommendation systems or content filtering tools.
  • Linguistic studies on spoken language in online video content.
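
To make the genre classification use case concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier written with the standard library only. It is one simple baseline among many, not a method the dataset authors prescribe. The functions tokenize, train and predict are hypothetical helpers, and the four training texts are invented stand-ins for Title plus Transcript values paired with Category labels.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in examples:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Invented toy examples; real input would be Title + Transcript with Category.
toy_examples = [
    ("boil the pasta and add tomato sauce", "Food"),
    ("bake the bread in a hot oven", "Food"),
    ("light scatters in the atmosphere", "Science"),
    ("gravity bends the path of light", "Science"),
]

model = train(toy_examples)
print(predict(model, "add sauce to the pasta"))  # prints "Food"
print(predict(model, "light and gravity"))       # prints "Science"
```

Because auto-generated subtitles may be noisier, it could be worth evaluating such a model separately on CC = 1 and CC = 0 videos, and holding out whole channels rather than individual videos so the classifier is not just learning to recognise a channel.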

Coverage

The dataset has global coverage. It includes subtitles from over 91 YouTubers across a wide range of channel categories, such as Science, Food, Parks and Recreation, and A&E, among many others. Subscriber and view counts are as of the dataset generation date, 5 February 2022, and the videos were released at various points up to that date. Because both manual and auto-generated subtitles are present and flagged, users can assess how reliable the transcript is for each video.
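
One way to inspect this coverage is to count videos and channels per category and the share of manual subtitles in each. The sketch below assumes rows are dicts keyed by the column names in this listing (for example, as produced by csv.DictReader); coverage_by_category is a hypothetical helper and the three rows are made up.

```python
from collections import Counter, defaultdict


def coverage_by_category(rows):
    """Summarise videos, distinct channels, and manual-subtitle share per category."""
    videos = Counter()
    manual = Counter()
    channels = defaultdict(set)
    for row in rows:
        category = row["Category"]
        videos[category] += 1
        channels[category].add(row["Channel"])
        if int(row["CC"]) == 1:  # 1 = manual subtitles, 0 = auto-generated
            manual[category] += 1
    return {
        category: {
            "videos": videos[category],
            "channels": len(channels[category]),
            "manual_share": manual[category] / videos[category],
        }
        for category in videos
    }


# Made-up rows for illustration only.
sample_rows = [
    {"Category": "Science", "Channel": "Example Science", "CC": "1"},
    {"Category": "Science", "Channel": "Another Science", "CC": "0"},
    {"Category": "Food", "Channel": "Example Food", "CC": "0"},
]

summary = coverage_by_category(sample_rows)
for category, stats in summary.items():
    print(category, stats)
```

A summary like this also shows how balanced the categories are, which matters when the data is used to train a classifier.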

License

CC0

Who Can Use It

This dataset is suited for a variety of users and their specific applications:
  • Data Scientists and NLP Engineers: For training and evaluating language models, performing text mining, and developing new NLP applications on video content.
  • Researchers: Those studying media consumption, digital culture, and linguistic patterns in online videos.
  • Content Analysts and Marketers: To understand trends, identify popular topics, and analyse audience preferences within the YouTube ecosystem.
  • Academics: For academic studies in areas such as computational linguistics, media studies, and artificial intelligence.

Dataset Name Suggestions

  • YouTube Video Transcripts
  • YouTuber Subtitle Data
  • Video Content Transcripts
  • YouTube Channel Dialogue
  • Digital Video Subtitles

Attributes

Original Data Source: YouTubers saying things

Listing Stats

VIEWS

0

DOWNLOADS

1

LISTED

24/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0

Download Dataset in CSV Format