Social Media Toxicity Analysis Dataset
Data Science and Analytics
Free
About
This dataset, known as Ruddit, comprises English-language Reddit comments annotated with fine-grained, real-valued scores. Scores range from -1 (maximally supportive) to 1 (maximally offensive), so toxicity can be modelled as a matter of degree rather than as a binary label in Natural Language Processing (NLP) tasks. The dataset is intended as a resource for analysing and detecting toxicity in online discourse. The accompanying research paper describes the data sampling, annotation, and analysis procedures in detail.
Columns
- comment_id: A unique identifier for each comment. The dataset contains 5,966 unique comment IDs, one per record.
- body: The text of the Reddit comment. About 4% of bodies are the placeholder '[deleted]', and the share of '[removed]' placeholders is reported as 0%.
- score: A real-valued score representing the comment's toxicity. Scores range from -1 (maximally supportive) to 1 (maximally offensive); the values in this dataset fall between -0.89 and 0.98. The distribution of scores across ranges is as follows:
  - -0.89 to -0.70: 36 comments
  - -0.70 to -0.52: 300 comments
  - -0.52 to -0.33: 742 comments
  - -0.33 to -0.14: 1,326 comments
  - -0.14 to 0.04: 1,433 comments
  - 0.04 to 0.23: 961 comments
  - 0.23 to 0.42: 521 comments
  - 0.42 to 0.61: 329 comments
  - 0.61 to 0.79: 227 comments
  - 0.79 to 0.98: 91 comments
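The columns above can be read and summarised with the Python standard library alone. The following is a minimal sketch, not the original authors' tooling: the function names `load_comments` and `histogram` are hypothetical, and the sample rows are invented for illustration. It assumes the CSV header uses the column names listed above, drops placeholder bodies, and counts scores in equal-width bins. The listed ranges are consistent with ten equal-width bins of roughly 0.187 between -0.89 and 0.98, but that binning is an inference from the numbers, not something the source states.

```python
import csv
import io

# Placeholder bodies named in the column description.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def load_comments(handle):
    """Read rows with comment_id, body, score; skip placeholder bodies."""
    rows = []
    for row in csv.DictReader(handle):
        if row["body"].strip() in PLACEHOLDERS:
            continue
        rows.append(
            {
                "comment_id": row["comment_id"],
                "body": row["body"],
                "score": float(row["score"]),
            }
        )
    return rows


def histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins spanning min..max."""
    lo, hi = min(scores), max(scores)
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    if width == 0:
        # All scores identical: put everything in the first bin.
        counts[0] = len(scores)
        return counts
    for s in scores:
        # The maximum value belongs to the last bin, not a bin past the end.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Invented sample rows standing in for the real file (illustration only).
SAMPLE_CSV = """comment_id,body,score
a1,Thanks for the help!,-0.60
a2,[deleted],0.10
a3,This is a neutral remark.,0.00
a4,Something quite rude.,0.80
"""

rows = load_comments(io.StringIO(SAMPLE_CSV))
counts = histogram([r["score"] for r in rows], n_bins=2)
```

To run this against the real file, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, in place of the `io.StringIO` object; the path depends on where the CSV was saved.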
Distribution
The dataset is tabular and typically provided as a CSV file. It contains 5,966 records, each representing a single Reddit comment and its associated toxicity score, organised by the columns listed above. File size details are not provided.
Usage
This dataset is suited to applications such as:
- Training and evaluating NLP models for sentiment analysis and toxicity detection.
- Research in computational linguistics and social media analysis.
- Developing automated content moderation systems for online platforms.
- Analysing patterns of online discourse and communication.
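Because the scores are real-valued, model evaluation and moderation rules look slightly different from the binary case. The sketch below is an illustration under stated assumptions: it assumes mean absolute error and Pearson correlation are the metrics of interest, and it uses a hypothetical moderation threshold of 0.5. None of these choices come from the dataset description, and the gold and predicted values are invented.

```python
import math


def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def pearson_r(gold, pred):
    """Pearson correlation between gold and predicted scores."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_g * var_p)


def flag_for_review(scores, threshold=0.5):
    """Mark scores at or above a hypothetical moderation threshold."""
    return [s >= threshold for s in scores]


# Invented values for illustration; a real run would use held-out comments.
gold = [-0.5, 0.0, 0.5, 1.0]
pred = [-0.25, 0.25, 0.75, 1.25]

mae = mean_absolute_error(gold, pred)
r = pearson_r(gold, pred)
flags = flag_for_review(gold, threshold=0.5)
```

In this example the predictions are offset from the gold scores by a constant, so the correlation is perfect even though every prediction is wrong by 0.25. That is one reason to report an error metric alongside a correlation.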
Coverage
The dataset consists of English-language comments. Its geographical scope is global, reflecting Reddit's worldwide user base. No specific time range or demographic breakdown of the comment authors is provided. The data focuses on the toxicity scoring of comments.
License
CC0
Who Can Use It
This dataset is suitable for:
- Data Scientists and Analysts working on text data or social media projects.
- Researchers focused on NLP, linguistics, and the study of online communities.
- Developers creating applications that require sentiment analysis or toxicity flagging.
- Individuals at beginner to intermediate levels looking to engage with real-world text data for machine learning tasks.
Dataset Name Suggestions
- Reddit Comment Toxicity Scores
- Ruddit NLP Toxicity Dataset
- English Reddit Toxicity Classifier Data
- Social Media Toxicity Analysis Dataset
- Reddit Sentiment and Offensive Language Data
Attributes
Original Data Source: Reddit's toxicity comments scored for NLP use