YouTube Comment Spam Identification Dataset
About
This dataset collects comments and video details from four YouTube channels, with the primary purpose of identifying spam comments. The intended approach is unsupervised: cluster the comments so that a probable group of spam messages separates from the rest. The channels featured are Cleo Abram, Physics Girl, Jet Lag: The Game, and Neo. The selection is currently limited by API restrictions, and there are plans to expand it.
Columns
- User: The name of the YouTuber or channel.
- Video Title: The title of the YouTube video.
- Video Description: The descriptive text accompanying the YouTube video.
- Video ID: The unique identifier for the YouTube video.
- Comment (Displayed): The comment text as it appears on YouTube.
- Comment (Actual): The actual, unformatted comment text.
- Comment Author: The name of the user who posted the comment.
- Comment Author Channel ID: The unique channel ID of the user who posted the comment.
- Comment Time: The date and time when the comment was posted.
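The column list above can be turned into a quick header check when reading the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV header strings match the column names exactly as listed here, which the listing does not confirm. The helper name `read_comments` and the sample row are hypothetical and serve only as illustration.

```python
import csv
import io

# Column names as listed above; the exact header strings in the CSV are an assumption.
EXPECTED_COLUMNS = [
    "User", "Video Title", "Video Description", "Video ID",
    "Comment (Displayed)", "Comment (Actual)", "Comment Author",
    "Comment Author Channel ID", "Comment Time",
]

def read_comments(handle):
    """Yield one dict per row after checking that the header has the expected columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        yield row

# Tiny in-memory sample standing in for the real file (made-up row, not real data).
sample = io.StringIO(
    ",".join(f'"{c}"' for c in EXPECTED_COLUMNS) + "\n"
    + '"Example Channel","A title","A description","abc123",'
    + '"Hi <b>there</b>","Hi there","someone","UC_example","2023-01-01T00:00:00Z"\n'
)
rows = list(read_comments(sample))
```

With the real file, the same function could be called on `open("YT_Videos_Comments.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption. Because rows are yielded one at a time, the file does not need to fit in memory.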
Distribution
The data is provided as a single CSV file named YT_Videos_Comments.csv with a size of 616.45 MB. It is a tabular dataset containing approximately 380,000 valid records across nine columns.
Usage
This dataset is well-suited for unsupervised learning tasks, particularly for spam detection. Ideal applications include building clustering models to identify spam groups and developing zero-shot text classification systems. It serves as a practical resource for text analysis projects.
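As a first unsupervised pass, before any real clustering model, it can help to look for identical messages posted from several accounts. The sketch below is a dependency-free stand-in for clustering, not the dataset author's method: it groups comments by normalized text and keeps the groups shared by more than one author channel. The function names, the `min_authors` threshold, and the demo rows are hypothetical. Treating repeated text as a spam signal is a heuristic assumption, not something the dataset labels.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def duplicate_groups(rows, min_authors=2):
    """Group rows by normalized comment text; keep groups shared by several authors."""
    groups = defaultdict(set)
    for row in rows:
        key = normalize(row["Comment (Actual)"])
        if key:  # skip comments that are empty after normalization
            groups[key].add(row["Comment Author Channel ID"])
    return {k: v for k, v in groups.items() if len(v) >= min_authors}

# Made-up rows for illustration only.
demo = [
    {"Comment (Actual)": "Check my channel!!!", "Comment Author Channel ID": "UC_a"},
    {"Comment (Actual)": "check MY channel", "Comment Author Channel ID": "UC_b"},
    {"Comment (Actual)": "Great explanation of orbits", "Comment Author Channel ID": "UC_c"},
]
flagged = duplicate_groups(demo)
```

Exact matching after normalization only catches verbatim copies. A fuller approach would vectorize the text and apply a clustering algorithm, so that paraphrased spam lands in the same group.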
Coverage
The dataset's temporal coverage spans from January 2016 to March 2023. It includes comment data from videos published by four specific English-language YouTubers.
License
CC0: Public Domain
Who Can Use It
- Data Scientists: For developing and testing spam detection models using clustering and text classification algorithms.
- Researchers: To analyse online communication patterns and the characteristics of spam on social media platforms.
- Students and Beginners: As a real-world dataset for learning about natural language processing and unsupervised machine learning techniques.
Dataset Name Suggestions
- YouTube Comment Spam Identification Dataset
- Multichannel YouTube Comment Corpus for Spam Analysis
- YouTube Video Comments for Unsupervised Spam Detection
- Spam Analysis Dataset from YouTube Comments
Attributes
Original Data Source: YouTube Comment Spam Identification Dataset