YouTube Comment Spam Predictor
Social Media and Posts
Free
About
This dataset supports research on predicting YouTube comment spam. It is a public collection of 1,956 real comments extracted from five of the most viewed YouTube videos during the collection period, giving a realistic basis for training and evaluating text classification models for spam detection.
Columns
- comment_id: A unique identifier for each comment. It is not considered relevant for predictive modelling.
- Author: The name of the user who posted the comment.
- Date: The date on which the comment was posted.
- Content: The text of the comment, which is the primary feature for training text classification models.
- video_name: The name of the YouTube video under which the comment was posted.
- class: The target variable, a binary label marking a comment as spam ('1') or not spam ('0').
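As a minimal sketch of how these columns might be consumed, the snippet below reads the CSV with Python's standard csv module and separates the 'Content' text from the 'class' label. It assumes the header row uses the column names exactly as listed above; the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io


def load_comments(source):
    """Read comment rows from a CSV file object and return (texts, labels)."""
    texts, labels = [], []
    for row in csv.DictReader(source):
        texts.append(row["Content"])      # comment text, the main feature
        labels.append(int(row["class"]))  # 1 = spam, 0 = not spam
    return texts, labels


# Tiny in-memory sample (invented rows, same header as described above).
sample = io.StringIO(
    "comment_id,Author,Date,Content,video_name,class\n"
    "c1,alice,2014-01-01T10:00:00,Check out my channel!,Video A,1\n"
    "c2,bob,,Great song,Video A,0\n"
)
texts, labels = load_comments(sample)
```

For the real file, the same function could be called on a file opened with open(path, newline="", encoding="utf-8"), where the path is whatever name the download was saved under.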
Distribution
The dataset is typically provided as a CSV (comma-separated values) file of approximately 412.26 kB. It has 6 columns and 1,956 records, one per comment. Most columns are fully populated; the 'Date' column has 245 missing values.
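The record count and missing values can be checked with a short script. The sketch below counts rows and empty cells per column using only the standard library. It assumes that missing values appear as empty cells in the CSV, which the source does not state; the sample rows are invented. If the figures above hold, the full file should report 1,956 rows and 245 empty 'Date' cells (about 12.5%).

```python
import csv
import io
from collections import Counter


def missing_counts(source):
    """Return (row_count, Counter of empty cells per column) for a CSV file object."""
    missing = Counter()
    n_rows = 0
    for row in csv.DictReader(source):
        n_rows += 1
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header are ignored
            if value is None or value.strip() == "":
                missing[column] += 1
    return n_rows, missing


# Invented sample with one empty 'Date' cell.
sample = io.StringIO(
    "comment_id,Author,Date,Content,video_name,class\n"
    "c1,alice,2014-01-01T10:00:00,Check out my channel!,Video A,1\n"
    "c2,bob,,Great song,Video A,0\n"
    "c3,carol,2014-02-03T12:30:00,Nice video,Video B,0\n"
)
n_rows, missing = missing_counts(sample)
```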
Usage
This dataset is suited to applications in machine learning and natural language processing, including:
- Developing and evaluating text classification models to identify spam.
- Training machine learning algorithms for automated comment moderation systems.
- Conducting natural language processing (NLP) research on online communication patterns and malicious content.
- Exploring techniques for spam detection and filtering in user-generated content platforms.
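As one illustration of the first use case, the sketch below implements a small bag-of-words multinomial naive Bayes classifier with add-one smoothing, using only the standard library. The source does not prescribe any model; this technique, the class name NaiveBayesSpamFilter, the tokenizer, and the toy training comments are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.word_counts[c][token] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Toy training data, invented for illustration (1 = spam, 0 = not spam).
train_texts = [
    "check out my channel and subscribe",
    "subscribe to my channel for free gifts",
    "I love this song so much",
    "this song brings back memories",
]
train_labels = [1, 1, 0, 0]
model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
prediction = model.predict("please subscribe to my channel")
```

With the real data, the 'Content' column would supply the texts and the 'class' column the labels, with part of the data held out to evaluate the model.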
Coverage
The comments were collected between 13th July 2013 and 6th June 2015 from five specific YouTube videos. Geographic and demographic scope is not specified. Approximately 13% of the values in the 'Date' column are missing.
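The time range can be checked against the 'Date' column while skipping the missing entries. The sketch below assumes the dates are stored as ISO 8601 strings, which the source does not confirm; a different format would need datetime.strptime with a matching pattern. The sample values are invented.

```python
from datetime import datetime


def date_range(date_strings):
    """Return (earliest, latest) datetimes, ignoring empty or missing entries."""
    parsed = [
        datetime.fromisoformat(s.strip())
        for s in date_strings
        if s and s.strip()
    ]
    if not parsed:
        return None
    return min(parsed), max(parsed)


# Invented sample values, including one missing date.
sample_dates = ["2014-03-02T09:15:00", "", "2013-08-21T18:40:05", "2015-01-10T07:00:00"]
earliest, latest = date_range(sample_dates)
```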
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
This dataset is beneficial for a wide range of professionals and researchers:
- Data Scientists and Machine Learning Engineers: To build, test, and refine models for identifying and preventing comment spam.
- Researchers in NLP and AI: For academic studies on text analysis, content moderation, and online behaviour.
- Platform Developers and Administrators: To implement or enhance automated systems for filtering undesirable content on social media or video-sharing platforms.
- Students: As a practical resource for learning about text classification and data analysis.
Dataset Name Suggestions
- YouTube Comment Spam Predictor
- Video Comment Spam Detection Dataset
- Social Media Text Spam Corpus
- YouTube Spam Classification Data
- Online Comment Moderation Dataset
Attributes
Original Data Source: YouTube Comment Spam Predictor