Toxic Comment Classification Dataset
Free
About
This dataset comprises comments, each labelled to indicate whether it contains toxic content. The primary purpose is to facilitate the development and evaluation of models aimed at detecting and mitigating online toxicity, thereby promoting healthier online interactions.
Dataset Features
- TC_ID: A unique identifier assigned to each comment.
- comment_text: The actual text of the comment extracted from Wikipedia's talk pages.
- toxic: A binary label where '1' denotes a toxic comment and '0' indicates a non-toxic comment.
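The three columns above can be checked at load time. The following is a minimal sketch using only the Python standard library; it assumes the data is delivered as a CSV file with a header row, which this listing does not confirm. The function name `read_comments` and the sample rows are hypothetical.

```python
import csv
import io

# Column names as documented in the feature list.
EXPECTED_COLUMNS = ["TC_ID", "comment_text", "toxic"]

def read_comments(handle):
    """Parse records from an open CSV handle and validate the documented schema."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record_no, row in enumerate(reader, start=1):
        label = (row["toxic"] or "").strip()
        # The listing documents a binary label: '1' toxic, '0' non-toxic.
        if label not in ("0", "1"):
            raise ValueError(f"record {record_no}: unexpected toxic label {label!r}")
        rows.append({
            "TC_ID": row["TC_ID"],
            "comment_text": row["comment_text"],
            "toxic": int(label),
        })
    return rows

# Made-up rows standing in for the real file, so the sketch runs on its own.
sample_csv = io.StringIO(
    "TC_ID,comment_text,toxic\n"
    'a1,"Thanks for fixing the citation, looks good.",0\n'
    'a2,"You are an idiot, stop editing.",1\n'
)
rows = read_comments(sample_csv)
```

For the actual download, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved; the encoding is an assumption.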
Distribution
- Data Volume: The dataset contains 70,157 rows and 3 columns in the provided sample.
- Format: Structured in a tabular format with columns representing unique identifiers, comment texts, and toxicity labels.
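Before training on the data, it may help to confirm the row count and see how the two labels are balanced, since this listing does not state the share of toxic comments. The sketch below summarises a list of 0/1 labels; the helper name `label_summary` and the example labels are hypothetical.

```python
from collections import Counter

def label_summary(labels):
    """Count rows per class and compute the share of toxic (label 1) rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    toxic = counts.get(1, 0)
    return {
        "rows": total,
        "toxic": toxic,
        "non_toxic": counts.get(0, 0),
        "toxic_share": toxic / total if total else 0.0,
    }

# Made-up labels for illustration only.
summary = label_summary([0, 0, 1, 0])
```

On the provided sample, `summary["rows"]` would be expected to equal 70157 if the file matches the volume stated above.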
Usage
This dataset is ideal for a variety of applications:
- Toxicity Detection: Training machine learning models to identify and filter toxic comments in online platforms.
- Sentiment Analysis: Analyzing the tone of user interactions to understand community dynamics. The labels mark toxicity only, not positive or negative sentiment.
- Natural Language Processing (NLP): Developing and testing NLP algorithms focused on content moderation and abusive language detection.
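As an illustration of the toxicity detection use case, here is a small bag-of-words Naive Bayes baseline written with the standard library only. It is a sketch, not a method recommended by the dataset provider: the tokenizer, the add-one smoothing, and the training sentences are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())

def train_naive_bayes(texts, labels):
    """Collect per-class word and document counts for multinomial Naive Bayes."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in zip(texts, labels):
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def predict(model, text):
    """Return 1 (toxic) or 0 (non-toxic) using add-one smoothed log probabilities."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in (0, 1):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Smoothed class prior.
        score = math.log((model["doc_counts"][label] + 1) / (total_docs + 2))
        for token in tokenize(text):
            if token in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

# Made-up training sentences; real use would pass comment_text and toxic columns.
texts = [
    "thanks for the helpful edit",
    "great source, thank you",
    "you are a stupid idiot",
    "shut up you idiot",
]
labels = [0, 0, 1, 1]
model = train_naive_bayes(texts, labels)
prediction = predict(model, "what an idiot")
```

A baseline like this is mainly useful as a reference point. On the full dataset, a held-out test split and metrics such as precision and recall on the toxic class would give a fairer picture than accuracy alone if the classes turn out to be imbalanced.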
Coverage
- Geographic Coverage: Global, encompassing comments from Wikipedia users worldwide.
- Time Range: The dataset includes comments from various periods, reflecting the diverse history of Wikipedia's discussions.
- Demographics: Covers a wide range of contributors, including editors, administrators, and general users, without specific demographic distinctions.
License
CC0 (Public Domain)
Who Can Use It
- Data Scientists: For developing and refining algorithms to detect toxic language.
- Researchers: For studying online behavior, discourse analysis, and the effectiveness of moderation strategies.
- Businesses: For implementing content moderation systems and enhancing user experience on their platforms.