YouTube Comment Spam Predictor

Social Media and Posts

Tags and Keywords: Spam, YouTube, Comments, Classification, Text

YouTube Comment Spam Predictor Dataset on Opendatabay data marketplace

Free

About

This public dataset supports research on predicting spam in YouTube comments. It contains 1,956 real comments extracted from five of the most viewed YouTube videos during the collection period, giving a realistic basis for training and evaluating text classification models for spam detection.

Columns

  • comment_id: A unique identifier for each comment. It is considered irrelevant for predictive modelling.
  • Author: The name of the user who posted the comment.
  • Date: The date on which the comment was posted.
  • Content: The text of the comment, which is the primary feature for training text classification models.
  • video_name: The name of the YouTube video under which the comment was posted.
  • class: The target variable, a binary label: '1' for spam, '0' for not spam.
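
As a minimal sketch of how these columns might be read, the snippet below uses Python's standard csv module to pull out the Content text and the class label. The column names are taken from the list above, and their exact casing in the real file is an assumption. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as given in the listing; casing in the real file may differ.
EXPECTED_COLUMNS = ["comment_id", "Author", "Date", "Content", "video_name", "class"]


def load_comments(handle):
    """Read rows from an open CSV handle and return (texts, labels)."""
    reader = csv.DictReader(handle)
    absent = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if absent:
        raise ValueError(f"missing columns: {absent}")
    texts, labels = [], []
    for row in reader:
        texts.append(row["Content"])
        labels.append(int(row["class"]))  # 1 = spam, 0 = not spam
    return texts, labels


# Invented sample rows for illustration only (not real dataset records).
SAMPLE_CSV = """comment_id,Author,Date,Content,video_name,class
id1,user_a,2014-01-01T10:00:00,Check out my channel for free gifts,Video A,1
id2,user_b,,Love this song,Video A,0
"""

texts, labels = load_comments(io.StringIO(SAMPLE_CSV))
print(texts)
print(labels)
```

With the downloaded file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"), where the path and encoding are assumptions about the local copy.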

Distribution

The dataset is provided as a CSV (comma-separated values) file of approximately 412.26 kB. It has 6 columns and 1,956 records. Most columns are fully populated; the 'Date' column has 245 missing values.
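
The figures above can be checked against a downloaded copy. The sketch below counts rows, columns, and empty cells per column using only the standard library. It treats an empty or whitespace-only cell as missing, which is an assumption about how gaps are encoded in the file. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def summarize(handle):
    """Return (row_count, column_count, {column: missing_count})."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    n_rows = 0
    missing = Counter()
    for row in reader:
        n_rows += 1
        for col in columns:
            # Assumption: a missing value appears as an empty cell.
            if not (row[col] or "").strip():
                missing[col] += 1
    return n_rows, len(columns), dict(missing)


# Invented sample rows for illustration only (not real dataset records).
SAMPLE_CSV = """comment_id,Author,Date,Content,video_name,class
id1,user_a,2014-01-01T10:00:00,Free gift cards here,Video A,1
id2,user_b,,Love this song,Video A,0
id3,user_c,2014-02-03T09:30:00,Great video,Video B,0
"""

summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

If the listing's figures hold, running this on the full file should report 1,956 rows, 6 columns, and 245 missing entries under 'Date'.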

Usage

This dataset suits machine learning and natural language processing work, including:
  • Developing and evaluating text classification models to identify spam.
  • Training machine learning algorithms for automated comment moderation systems.
  • Conducting natural language processing (NLP) research on online communication patterns and malicious content.
  • Exploring techniques for spam detection and filtering in user-generated content platforms.
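
To illustrate the first use case, here is a small baseline classifier written with the standard library only. The listing does not prescribe a model; multinomial Naive Bayes with add-one smoothing is chosen here simply as a common starting point for text classification. The tokenizer, the class name NaiveBayes, and the training comments are all hypothetical and invented for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training comments for illustration only (1 = spam, 0 = not spam).
train_texts = [
    "subscribe to my channel for free gift cards",
    "check out my channel and win free money",
    "this song is amazing",
    "love the video great song",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free gift cards on my channel"))
print(model.predict("great song love it"))
```

On the real data, the Content column would supply the texts and the class column the labels, with part of the records held out for evaluation.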

Coverage

The comments were collected between 13 July 2013 and 6 June 2015 from five YouTube videos. Geographic and demographic scope is not specified. Approximately 13% of 'Date' values (245 of 1,956) are missing.
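
The date range and missing share can be recomputed from the 'Date' column. The sketch below assumes the values are ISO 8601 timestamps and that missing dates are empty strings; the listing confirms neither, so the parsing step may need adjusting for the real file. The sample values are invented for illustration.

```python
from datetime import datetime


def date_coverage(raw_dates):
    """Return (earliest, latest, missing_share) for a list of raw date strings."""
    parsed = []
    missing = 0
    for raw in raw_dates:
        value = (raw or "").strip()
        if not value:
            missing += 1  # empty cell counted as a missing date
            continue
        # Assumption: dates are ISO 8601 strings such as 2014-11-02T21:15:00.
        parsed.append(datetime.fromisoformat(value))
    earliest = min(parsed) if parsed else None
    latest = max(parsed) if parsed else None
    share = missing / len(raw_dates) if raw_dates else 0.0
    return earliest, latest, share


# Invented sample values for illustration only (not real dataset records).
sample_dates = [
    "2013-07-13T08:00:00",
    "",
    "2015-06-06T12:30:00",
    "2014-11-02T21:15:00",
]

earliest, latest, share = date_coverage(sample_dates)
print(earliest, latest, share)
```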

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

This dataset is useful to several groups:
  • Data Scientists and Machine Learning Engineers: To build, test, and refine models for identifying and preventing comment spam.
  • Researchers in NLP and AI: For academic studies on text analysis, content moderation, and online behaviour.
  • Platform Developers and Administrators: To implement or enhance automated systems for filtering undesirable content on social media or video-sharing platforms.
  • Students: As a practical resource for learning about text classification and data analysis.

Dataset Name Suggestions

  • YouTube Comment Spam Predictor
  • Video Comment Spam Detection Dataset
  • Social Media Text Spam Corpus
  • YouTube Spam Classification Data
  • Online Comment Moderation Dataset

Attributes

Original Data Source: YouTube Comment Spam Predictor

Listing Stats

Views: 0
Downloads: 0
Listed: 26/08/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
