Opendatabay APP

Online Comment Toxicity Labels Dataset

Social Media and Networking

Tags and Keywords

Health

NLP

Multiclass

Text

Free

About

This dataset contains hand-labelled toxicity data for 1,000 comments crawled from YouTube videos related to the Ferguson unrest in 2014. It is designed to support the categorisation of online comment toxicity, with labels for several subclassifications that form a hierarchical structure. Each comment can carry one or more of these labels.

Columns

  • CommentId: A unique identifier for each comment.
  • VideoId: The YouTube video identifier from which the comment originated.
  • Text: The full text of the comment.
  • IsToxic: A boolean indicating whether the comment is considered toxic.
  • IsAbusive: A boolean indicating if the comment is abusive.
  • IsThreat: A boolean indicating if the comment contains a threat.
  • IsProvocative: A boolean indicating if the comment is provocative.
  • IsObscene: A boolean indicating if the comment is obscene.
  • IsHatespeech: A boolean indicating if the comment contains hate speech.
  • IsRacist: A boolean indicating if the comment is racist.
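As an illustration of the column layout above, the following is a minimal sketch of reading the CSV with Python's standard library and converting the label columns to booleans. The listing does not say how booleans are serialised, so the `True`/`False` strings are an assumption, and `load_comments`, `parse_bool`, and the two-row sample are hypothetical, not part of the dataset.

```python
import csv
import io

# Column names as listed above.
ID_COLUMNS = ["CommentId", "VideoId", "Text"]
LABEL_COLUMNS = [
    "IsToxic", "IsAbusive", "IsThreat", "IsProvocative",
    "IsObscene", "IsHatespeech", "IsRacist",
]


def parse_bool(value):
    """Convert a CSV cell to bool (assumes 'True'/'False' or '1'/'0' strings)."""
    text = value.strip().lower()
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"unexpected boolean value: {value!r}")


def load_comments(fileobj):
    """Read rows from an open CSV file and convert the label columns to bool."""
    reader = csv.DictReader(fileobj)
    missing = set(ID_COLUMNS + LABEL_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        row = {name: raw[name] for name in ID_COLUMNS}
        for name in LABEL_COLUMNS:
            row[name] = parse_bool(raw[name])
        rows.append(row)
    return rows


# Synthetic two-row sample for illustration; not taken from the dataset.
SAMPLE_CSV = (
    "CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,"
    "IsObscene,IsHatespeech,IsRacist\n"
    "c1,v1,An example of a neutral comment,"
    "False,False,False,False,False,False,False\n"
    "c2,v1,An example of an abusive comment,"
    "True,True,False,False,False,False,False\n"
)

rows = load_comments(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[1]["IsToxic"], rows[1]["IsAbusive"])
```

For the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where the path and encoding are again assumptions.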

Distribution

The dataset comprises 1,000 unique comments and is typically provided as a CSV file. The share of comments labelled true for each toxicity label is:
  • IsToxic: 46% of comments are labelled as true.
  • IsAbusive: 35% of comments are labelled as true.
  • IsThreat: 2% of comments are labelled as true.
  • IsProvocative: 16% of comments are labelled as true.
  • IsObscene: 10% of comments are labelled as true.
  • IsHatespeech: 14% of comments are labelled as true.
  • IsRacist: 13% of comments are labelled as true.
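To make the class imbalance concrete, the sketch below turns the listed percentages into approximate comment counts and shows how the same shares could be recomputed from loaded rows. The percentages are rounded in the listing, so the counts are estimates; `approximate_counts` and `label_shares` are hypothetical helper names.

```python
TOTAL_COMMENTS = 1000

# Percentages as listed above (rounded, so derived counts are approximate).
PERCENT_TRUE = {
    "IsToxic": 46,
    "IsAbusive": 35,
    "IsThreat": 2,
    "IsProvocative": 16,
    "IsObscene": 10,
    "IsHatespeech": 14,
    "IsRacist": 13,
}


def approximate_counts(percent_true, total):
    """Estimate how many comments carry each label from its percentage."""
    return {label: round(total * pct / 100) for label, pct in percent_true.items()}


def label_shares(rows, labels):
    """Percentage of rows in which each label is True."""
    if not rows:
        raise ValueError("rows must not be empty")
    return {
        label: 100 * sum(1 for row in rows if row[label]) / len(rows)
        for label in labels
    }


counts = approximate_counts(PERCENT_TRUE, TOTAL_COMMENTS)
for label, count in counts.items():
    print(f"{label}: about {count} of {TOTAL_COMMENTS} comments")
```

At roughly 20 positive examples, IsThreat is by far the rarest label, which is worth keeping in mind when splitting the data for training and evaluation.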

Usage

This dataset is ideal for a variety of applications, including:
  • Developing and evaluating machine learning models for natural language processing (NLP).
  • Training systems for multiclass classification of text data or, since a comment can carry several labels at once, multi-label classification.
  • Performing text mining operations to identify patterns in online discourse.
  • Building tools for automated content moderation and detection of abusive language or hate speech.
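For the model evaluation use case, one possible approach is to score each label separately, because the labels differ widely in frequency. The sketch below computes per-label precision, recall, and F1 from true and predicted label dictionaries. It is a generic illustration rather than a method prescribed by the dataset; `per_label_scores` is a hypothetical helper and the example values are synthetic.

```python
def per_label_scores(y_true, y_pred, labels):
    """Compute precision, recall, and F1 independently for each label.

    y_true and y_pred are equal-length lists of dicts mapping label -> bool.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for label in labels:
        pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t and p)
        fp = sum(1 for t, p in pairs if not t and p)
        fn = sum(1 for t, p in pairs if t and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Synthetic labels and predictions for three comments.
y_true = [
    {"IsToxic": True, "IsThreat": False},
    {"IsToxic": True, "IsThreat": True},
    {"IsToxic": False, "IsThreat": False},
]
y_pred = [
    {"IsToxic": True, "IsThreat": False},
    {"IsToxic": False, "IsThreat": False},
    {"IsToxic": False, "IsThreat": False},
]

scores = per_label_scores(y_true, y_pred, ["IsToxic", "IsThreat"])
for label, result in scores.items():
    print(label, result)
```

Reporting scores per label, rather than a single accuracy figure, keeps a rare label such as IsThreat from being hidden by the more common ones.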

Coverage

The dataset covers YouTube comments from videos related to the Ferguson unrest in 2014. Its region scope is global, and its content is limited to comments on this specific period and event.

License

CC-BY

Who Can Use It

  • Researchers: For academic studies on online social behaviour, hate speech, and natural language processing.
  • Data Scientists and Machine Learning Engineers: For building and refining models for content moderation, sentiment analysis, and toxicity detection.
  • Developers: To integrate toxicity analysis features into social media platforms or other applications.
  • Organisations/Companies: For enhancing platform safety and managing user-generated content.

Dataset Name Suggestions

  • YouTube Toxic Comments Dataset
  • Online Comment Toxicity Labels
  • Ferguson Unrest YouTube Comments
  • Hand-Labelled Toxicity Data

Attributes

Original Data Source: YouTube toxic comments

Listing Stats

VIEWS

4

DOWNLOADS

0

LISTED

08/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
