YouTube Comment Spam Predictor

FREE DATASET LIBRARY

Verified Data Provider

£0

Social Media and Posts

Tags and Keywords: Spam, YouTube, Comments, Classification, Text

About

This dataset supports research into predicting YouTube comment spam and is intended for developing text classification models. It is a public collection of 1,956 real comments extracted from five of the most viewed YouTube videos during the collection period, which gives a realistic basis for training and evaluating spam detection algorithms.

Columns

comment_id: A unique identifier assigned to each comment, though it is considered irrelevant for the purpose of predictive modelling.
Author: Represents the name of the individual who posted the comment.
Date: Indicates the specific date on which the comment was originally made.
Content: Contains the actual text of the comment itself, which is the primary feature used for training text classification models.
video_name: States the name of the YouTube video under which the comment was posted.
class: The target variable, a binary label marking each comment as spam ('1') or not spam ('0').
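
As a minimal sketch of how these columns might be read for modelling, the snippet below keeps only 'Content' and 'class' and ignores the rest. The sample rows are invented for illustration and are not taken from the dataset; the header uses the column names listed above, which is an assumption about the exact spelling in the real file.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration only. The header
# follows the column names given in the listing (assumed spelling).
SAMPLE_CSV = """comment_id,Author,Date,Content,video_name,class
c1,alice,2014-01-19T10:00:00,Check out my channel for free gifts,Video A,1
c2,bob,,Love this song,Video A,0
c3,carol,2015-05-02T08:30:00,Subscribe to my channel please,Video B,1
"""


def load_comments(handle):
    """Return (texts, labels) from a CSV file object, ignoring other columns."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["Content"])
        labels.append(int(row["class"]))  # 1 = spam, 0 = not spam
    return texts, labels


texts, labels = load_comments(io.StringIO(SAMPLE_CSV))
print(len(texts), labels)
```

To try this on the downloaded file, the `io.StringIO(...)` argument could be replaced with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.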

Distribution

The dataset is provided as a CSV (comma-separated values) file of approximately 412.26 kB. It has 6 columns and 1,956 records. Most columns are fully populated; the 'Date' column has 245 missing values, about 12.5% of records.
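
The figures above can be checked with a short script. This is a sketch under two assumptions: that missing values appear as empty cells in the CSV, and that the header matches the column names in the listing. The sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration only.
SAMPLE_CSV = """comment_id,Author,Date,Content,video_name,class
c1,alice,2014-01-19T10:00:00,Check out my channel for free gifts,Video A,1
c2,bob,,Love this song,Video A,0
c3,carol,2015-05-02T08:30:00,Subscribe to my channel please,Video B,1
"""


def missing_counts(handle):
    """Count empty cells per column and return (row_count, counts)."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            # Assumption: a missing value is an empty or whitespace-only cell.
            if not (row[name] or "").strip():
                counts[name] += 1
    return rows, counts


rows, counts = missing_counts(io.StringIO(SAMPLE_CSV))

# Share of missing dates implied by the listing: 245 of 1,956 records.
listing_share = 245 / 1956
print(rows, counts["Date"], f"{listing_share:.1%}")
```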

Usage

This dataset is suited to machine learning and natural language processing work. It can be used for:

Developing and evaluating text classification models to identify spam.
Training machine learning algorithms for automated comment moderation systems.
Conducting natural language processing (NLP) research on online communication patterns and malicious content.
Exploring techniques for spam detection and filtering in user-generated content platforms.
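
One way to approach the first use case is a multinomial naive Bayes classifier over word counts. The sketch below is a plain-Python illustration with add-one smoothing, not a method prescribed by the dataset provider. The training comments are invented, and the `NaiveBayes` class and `tokenize` helper are hypothetical names introduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[c][token] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Invented training comments for illustration; 1 = spam, 0 = not spam.
train_texts = [
    "Check out my channel for free gifts",
    "Subscribe to my channel please",
    "Love this song",
    "This video is so funny",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("please check out my channel"))
print(model.predict("love this song so much"))
```

With the real data, the 'Content' column would supply the texts and the 'class' column the labels, and a held-out portion of the 1,956 records would be needed to measure accuracy.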

Coverage

The comments were collected between 13 July 2013 and 6 June 2015 from five YouTube videos. Geographic and demographic scope is not specified. About 13% of the values in the 'Date' column are missing.
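
The stated time range could be verified along these lines. The listing does not give the format of the 'Date' column, so ISO 8601 timestamps are assumed here, and the values shown are invented for illustration.

```python
from datetime import datetime

# Hypothetical 'Date' values; the ISO 8601 format is an assumption, and
# the empty string stands in for a missing date.
dates = [
    "2013-07-13T12:00:00",
    "",
    "2015-06-06T09:15:00",
    "2014-11-02T20:45:10",
]


def date_range(values):
    """Return (earliest, latest, missing_count) for ISO 8601 date strings."""
    parsed = [datetime.fromisoformat(v) for v in values if v.strip()]
    missing = len(values) - len(parsed)
    return min(parsed), max(parsed), missing


earliest, latest, missing = date_range(dates)
print(earliest.date(), latest.date(), missing)
```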

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

This dataset is beneficial for a wide range of professionals and researchers:

Data Scientists and Machine Learning Engineers: To build, test, and refine models for identifying and preventing comment spam.
Researchers in NLP and AI: For academic studies on text analysis, content moderation, and online behaviour.
Platform Developers and Administrators: To implement or enhance automated systems for filtering undesirable content on social media or video-sharing platforms.
Students: As a practical resource for learning about text classification and data analysis.

Dataset Name Suggestions

YouTube Comment Spam Predictor
Video Comment Spam Detection Dataset
Social Media Text Spam Corpus
YouTube Spam Classification Data
Online Comment Moderation Dataset

Attributes

Original Data Source: YouTube Comment Spam Predictor

Listing Stats

Listed: 26/08/2025
Region: Global
Quality: 5 / 5
Version: 1.0
