Online Comment Toxicity Labels Dataset
Social Media and Networking
Free
About
This dataset contains 1000 hand-labelled comments crawled from YouTube videos related to the Ferguson unrest in 2014. It is designed to support the categorisation of online comment toxicity. Alongside an overall toxicity flag, it provides labels for several subclassifications that form a hierarchical structure, and a single comment can carry more than one of these labels.
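The hierarchical structure suggests a simple consistency check: a comment flagged with a sub-label should also carry the label above it. The sketch below illustrates the idea using the label names listed under Columns. The parent map is an assumption made for illustration (it treats IsToxic as the parent of every other label); the description here does not spell out the actual hierarchy, so the map should be adjusted to match the real one.

```python
# Hypothetical hierarchy: every sub-label is assumed to sit under IsToxic.
# The dataset description only says the labels are hierarchical, so this
# map is an illustration and should be replaced with the real structure.
ASSUMED_PARENT = {
    "IsAbusive": "IsToxic",
    "IsThreat": "IsToxic",
    "IsProvocative": "IsToxic",
    "IsObscene": "IsToxic",
    "IsHatespeech": "IsToxic",
    "IsRacist": "IsToxic",
}


def hierarchy_violations(labels, parent=ASSUMED_PARENT):
    """Return (child, parent) pairs where the child is true but the parent is false."""
    return [
        (child, par)
        for child, par in parent.items()
        if labels.get(child, False) and not labels.get(par, False)
    ]


# Made-up label sets for illustration, not rows taken from the dataset.
consistent = {"IsToxic": True, "IsAbusive": True, "IsRacist": False}
inconsistent = {"IsToxic": False, "IsThreat": True}

print(hierarchy_violations(consistent))    # []
print(hierarchy_violations(inconsistent))  # [('IsThreat', 'IsToxic')]
```

Rows that fail such a check are not necessarily labelling errors; they may simply indicate that the assumed parent map differs from the one the annotators used.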
Columns
- CommentId: A unique identifier for each comment.
- VideoId: The YouTube video identifier from which the comment originated.
- Text: The full text of the comment.
- IsToxic: A boolean indicating whether the comment is considered toxic.
- IsAbusive: A boolean indicating if the comment is abusive.
- IsThreat: A boolean indicating if the comment contains a threat.
- IsProvocative: A boolean indicating if the comment is provocative.
- IsObscene: A boolean indicating if the comment is obscene.
- IsHatespeech: A boolean indicating if the comment contains hate speech.
- IsRacist: A boolean indicating if the comment is racist.
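As a rough sketch of how these columns might be read, the example below parses CSV text with Python's standard `csv` module and converts the seven label columns to booleans. The textual encoding of the booleans (`True`/`False`, `1`/`0`, `yes`/`no`) is an assumption, and the two sample rows are invented placeholders rather than comments from the dataset.

```python
import csv
import io

LABEL_COLUMNS = [
    "IsToxic", "IsAbusive", "IsThreat", "IsProvocative",
    "IsObscene", "IsHatespeech", "IsRacist",
]


def parse_bool(value):
    """Interpret common textual boolean encodings (an assumption about the file)."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise ValueError(f"unrecognised boolean value: {value!r}")


def read_comments(handle):
    """Yield one dict per comment, with the label columns converted to bool."""
    for row in csv.DictReader(handle):
        for column in LABEL_COLUMNS:
            row[column] = parse_bool(row[column])
        yield row


# Two invented rows standing in for the real file.
SAMPLE = io.StringIO(
    "CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,"
    "IsObscene,IsHatespeech,IsRacist\n"
    'c1,v1,"Thanks for sharing, very informative.",'
    "False,False,False,False,False,False,False\n"
    "c2,v1,placeholder abusive text,True,True,False,False,False,False,False\n"
)

comments = list(read_comments(SAMPLE))
print(len(comments), comments[1]["IsAbusive"])  # 2 True
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV has been saved.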
Distribution
The dataset comprises 1000 unique comments and is typically provided as a CSV file. The share of comments labelled true for each toxicity label is:
- IsToxic: 46% of comments are labelled as true.
- IsAbusive: 35% of comments are labelled as true.
- IsThreat: 2% of comments are labelled as true.
- IsProvocative: 16% of comments are labelled as true.
- IsObscene: 10% of comments are labelled as true.
- IsHatespeech: 14% of comments are labelled as true.
- IsRacist: 13% of comments are labelled as true.
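The percentages above can be recomputed from the label columns once the rows are loaded. The following is a minimal sketch, assuming each row is a dict that maps the column names to Python booleans; the four rows used here are invented for illustration and do not reproduce the real distribution.

```python
LABEL_COLUMNS = [
    "IsToxic", "IsAbusive", "IsThreat", "IsProvocative",
    "IsObscene", "IsHatespeech", "IsRacist",
]


def label_rates(rows):
    """Return the percentage of rows in which each label column is true."""
    rows = list(rows)
    if not rows:
        return {column: 0.0 for column in LABEL_COLUMNS}
    return {
        column: 100.0 * sum(1 for row in rows if row[column]) / len(rows)
        for column in LABEL_COLUMNS
    }


def make_row(*true_labels):
    """Build an illustrative label row with the named labels set to True."""
    return {column: column in true_labels for column in LABEL_COLUMNS}


# Invented rows for illustration only.
toy_rows = [
    make_row(),
    make_row("IsToxic", "IsAbusive"),
    make_row("IsToxic", "IsHatespeech", "IsRacist"),
    make_row(),
]

rates = label_rates(toy_rows)
for column, rate in rates.items():
    print(f"{column}: {rate:.0f}%")
```

With 1000 comments, each percentage point corresponds to about 10 comments, so a label such as IsThreat at 2% has only around 20 positive examples (the listed figures appear to be rounded). That is worth keeping in mind when splitting the data for training and evaluation.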
Usage
This dataset is ideal for a variety of applications, including:
- Developing and evaluating machine learning models for natural language processing (NLP).
- Training systems for multi-label classification of text data, since each comment can carry several labels at once.
- Performing text mining operations to identify patterns in online discourse.
- Building tools for automated content moderation and detection of abusive language or hate speech.
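For the modelling use cases above, each comment's labels can be encoded as a multi-hot vector and predictions scored one label at a time. The sketch below shows one possible encoding and a per-label F1 score written in plain Python; the function names and the toy vectors are illustrative assumptions, not part of the dataset or of any particular library.

```python
LABEL_COLUMNS = [
    "IsToxic", "IsAbusive", "IsThreat", "IsProvocative",
    "IsObscene", "IsHatespeech", "IsRacist",
]


def to_multi_hot(row):
    """Encode a row's boolean labels as a list of 0/1 values in column order."""
    return [1 if row[column] else 0 for column in LABEL_COLUMNS]


def per_label_f1(y_true, y_pred):
    """Compute the F1 score separately for each label column."""
    scores = {}
    for index, column in enumerate(LABEL_COLUMNS):
        pairs = [(t[index], p[index]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t and p)
        fp = sum(1 for t, p in pairs if not t and p)
        fn = sum(1 for t, p in pairs if t and not p)
        denominator = 2 * tp + fp + fn
        # Convention used here: F1 is 0.0 when a label never occurs and is
        # never predicted, so the division is always defined.
        scores[column] = 2 * tp / denominator if denominator else 0.0
    return scores


# Invented targets and predictions for three comments.
y_true = [
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1],
]
y_pred = [
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0],
]

scores = per_label_f1(y_true, y_pred)
for column, score in scores.items():
    print(f"{column}: {score:.2f}")
```

Scoring each label separately, rather than reporting a single accuracy figure, keeps rare labels visible: a model that never predicts IsThreat would still be right on roughly 98% of comments for that column.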
Coverage
The dataset covers YouTube comments from videos related to the Ferguson unrest in 2014. Its regional scope is global, while its topical and temporal scope is limited to this specific event and period.
License
CC-BY
Who Can Use It
- Researchers: For academic studies on online social behaviour, hate speech, and natural language processing.
- Data Scientists and Machine Learning Engineers: For building and refining models for content moderation, sentiment analysis, and toxicity detection.
- Developers: To integrate toxicity analysis features into social media platforms or other applications.
- Organisations/Companies: For enhancing platform safety and managing user-generated content.
Dataset Name Suggestions
- YouTube Toxic Comments Dataset
- Online Comment Toxicity Labels
- Ferguson Unrest YouTube Comments
- Hand-Labelled Toxicity Data
Attributes
Original Data Source: Youtube toxic comments