Toxic Comment Classification Dataset

Name: Toxic Comment Classification Dataset
Creator: Opendatabay Labs
Published: 2025-01-24T08:28:35.823Z
License: https://docs.opendatabay.com/ai-training-and-model-development-licenses/general-ai-training-and-fine-tuning-data-license

Fraud Detection & Risk Management

Tags and Keywords

Toxic, Natural, Sentiment, Machine, Content, Dataset, Free

About

This dataset comprises comments, each labelled to indicate whether it contains toxic content. The primary purpose is to facilitate the development and evaluation of models aimed at detecting and mitigating online toxicity, thereby promoting healthier online interactions.

Dataset Features

TC_ID: A unique identifier assigned to each comment.
comment_text: The text of the comment, extracted from Wikipedia talk pages.
toxic: A binary label where '1' denotes a toxic comment and '0' indicates a non-toxic comment.
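
As a minimal sketch of how the three columns above might be read and checked, the following uses only Python's standard `csv` module. The column names come from the feature list; the `load_comments` helper, the `TC_1`-style identifiers, and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as documented in the feature list above.
EXPECTED_COLUMNS = ["TC_ID", "comment_text", "toxic"]


def load_comments(handle):
    """Read comment records from a CSV handle and validate the documented schema.

    Hypothetical helper: raises ValueError if a column is missing or if a
    `toxic` value is anything other than '0' or '1'.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for record_no, row in enumerate(reader, start=1):
        label = row["toxic"].strip()
        if label not in ("0", "1"):
            raise ValueError(f"record {record_no}: unexpected toxic value {label!r}")
        rows.append(
            {
                "TC_ID": row["TC_ID"],
                "comment_text": row["comment_text"],
                "toxic": int(label),  # 1 = toxic, 0 = non-toxic
            }
        )
    return rows


# Invented stand-in rows; the real identifiers and comments will differ.
SAMPLE_CSV = (
    "TC_ID,comment_text,toxic\n"
    'TC_1,"Thanks for fixing the citation, much appreciated.",0\n'
    'TC_2,"You are an idiot and should stop editing.",1\n'
)

rows = load_comments(io.StringIO(SAMPLE_CSV))
```

With the downloaded file, the same helper could be called on a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the listing does not state one.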

Distribution

Data Volume: The provided sample contains 70,157 rows and 3 columns.
Format: Tabular, with one column each for the unique identifier, the comment text, and the toxicity label.
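
The stated row and column counts can be checked after download, and the label balance (which the listing does not report) can be counted at the same time. The sketch below assumes the CSV download has a header row containing a `toxic` column; the `summarize` helper and the four sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize(handle):
    """Return row count, column count, and label counts for a CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)                 # first record is the header row
    label_idx = header.index("toxic")     # position of the label column
    n_rows = 0
    labels = Counter()
    for record in reader:
        n_rows += 1
        labels[record[label_idx]] += 1
    return {"rows": n_rows, "columns": len(header), "labels": dict(labels)}


# Invented stand-in data, not taken from the dataset.
SAMPLE_CSV = (
    "TC_ID,comment_text,toxic\n"
    "TC_1,Thanks for the edit,0\n"
    "TC_2,Please add a source,0\n"
    "TC_3,You are a moron,1\n"
    "TC_4,See the talk page,0\n"
)

summary = summarize(io.StringIO(SAMPLE_CSV))
```

Run against the real file, `summary["rows"]` should equal 70157 and `summary["columns"]` should equal 3 if the download matches the figures given above.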

Usage

This dataset suits several applications:

Toxicity Detection: Training machine learning models to identify and filter toxic comments on online platforms.
Sentiment Analysis: Analyzing the sentiment of user interactions to understand community dynamics.
Natural Language Processing (NLP): Developing and testing NLP algorithms focused on content moderation and abusive language detection.
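
To make the toxicity-detection use case concrete, here is a minimal baseline sketch. The listing does not prescribe any model; a multinomial naive Bayes classifier with add-one smoothing over lowercase word tokens is used here purely as a simple, dependency-free starting point. The class name, the tokenizer, and the six training comments are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesToxicity:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Assumes both labels (0 and 1) appear in the training data.
    """

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def log_score(self, text, label):
        """Log prior plus smoothed log likelihood of the known tokens."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        """Return 1 (toxic) if the toxic class scores higher, else 0."""
        return int(self.log_score(text, 1) > self.log_score(text, 0))


# Invented training examples in the dataset's (comment_text, toxic) shape.
train_texts = [
    "thank you for the helpful edit",
    "great sources, thanks for adding them",
    "please see the talk page for details",
    "you are a stupid idiot",
    "shut up you pathetic idiot",
    "what a stupid worthless edit",
]
train_labels = [0, 0, 0, 1, 1, 1]

model = NaiveBayesToxicity().fit(train_texts, train_labels)
toxic_prediction = model.predict("you stupid idiot")
clean_prediction = model.predict("thanks for the helpful sources")
```

On real data, the `comment_text` and `toxic` columns would replace the toy lists, and a held-out split would be needed before drawing any conclusion about accuracy.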

Coverage

Geographic Coverage: Global, encompassing comments from Wikipedia users worldwide.
Time Range: Comments span various periods of Wikipedia's discussion history; no specific date range is given.
Demographics: Covers a wide range of contributors, including editors, administrators, and general users, without specific demographic distinctions.

License

CC0 (Public Domain)

Who Can Use It

Data Scientists: For developing and refining algorithms to detect toxic language.
Researchers: For studying online behavior, discourse analysis, and the effectiveness of moderation strategies.
Businesses: For implementing content moderation systems and enhancing user experience on their platforms.

Listing Stats

Delivery: Instant download
Listed: 24/01/2025
Updated: 15/06/2026
Region: Global
Trust: 5 / 5
Price: Free
Download format: CSV
