Social Media Toxicity Analysis Dataset

Data Science and Analytics

Tags and Keywords

Health

Beginner

Text

Intermediate

NLP

Linguistics

About

This dataset, known as Ruddit, comprises English language comments from Reddit that have been assigned fine-grained, real-valued scores. These scores range from -1 (maximally supportive) to 1 (maximally offensive), making it highly valuable for Natural Language Processing (NLP) tasks. The purpose of this dataset is to provide a resource for analysing and detecting toxicity within online discourse. Detailed procedures for data sampling, annotation, and analysis are discussed in the accompanying research paper.

Columns

comment_id: A unique identifier for each comment. There are 5966 unique comment IDs in the dataset.
body: The main text content of the Reddit comment. Roughly 4% of rows hold the placeholder '[deleted]' in place of the comment text, and about 0% hold '[removed]'.
score: A real-valued score representing the comment's toxicity. Scores range from -1 (maximally supportive) to 1 (maximally offensive); the values in this dataset fall between -0.89 and 0.98. The distribution of scores across ranges is as follows:
- -0.89 to -0.70: 36 comments
- -0.70 to -0.52: 300 comments
- -0.52 to -0.33: 742 comments
- -0.33 to -0.14: 1,326 comments
- -0.14 to 0.04: 1,433 comments
- 0.04 to 0.23: 961 comments
- 0.23 to 0.42: 521 comments
- 0.42 to 0.61: 329 comments
- 0.61 to 0.79: 227 comments
- 0.79 to 0.98: 91 comments
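
As a rough way to reproduce a distribution like the one above, the sketch below optionally drops placeholder bodies and counts scores in equal-width bins. The function names and the sample rows are illustrative assumptions, not part of the dataset, and the bin edges are derived from the observed minimum and maximum rather than taken from the listing.

```python
PLACEHOLDERS = {"[deleted]", "[removed]"}


def drop_placeholders(rows):
    """Keep only rows whose body holds real comment text."""
    return [r for r in rows if r["body"].strip() not in PLACEHOLDERS]


def score_histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins spanning min(scores)..max(scores).

    Returns a list of (low_edge, high_edge, count) tuples.
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all scores are equal
    counts = [0] * n_bins
    for s in scores:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Tiny made-up sample using the listing's column names (not real Ruddit rows).
sample = [
    {"comment_id": "a1", "body": "Thanks, this helped a lot.", "score": -0.6},
    {"comment_id": "a2", "body": "[deleted]", "score": 0.1},
    {"comment_id": "a3", "body": "Not sure I agree.", "score": 0.0},
    {"comment_id": "a4", "body": "An insulting reply.", "score": 0.4},
]
kept = drop_placeholders(sample)
hist = score_histogram([r["score"] for r in kept], n_bins=2)
for low, high, count in hist:
    print(f"{low:.2f} to {high:.2f}: {count} comments")
```

Whether to exclude placeholder rows is a modelling choice: they still carry a score, but the text the score refers to is no longer present.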

Distribution

The dataset is provided as a CSV file containing 5,966 records, each representing a single Reddit comment and its associated toxicity score. The file size is not stated. The structure is tabular, organised by the columns listed above.
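
A minimal loading sketch using only the Python standard library is shown here. It assumes the CSV header uses exactly the column names listed above (comment_id, body, score); the function name and the file name mentioned in the comment are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ("comment_id", "body", "score")


def read_comments(handle):
    """Parse CSV text from an open file handle into a list of dicts.

    Raises ValueError if any of the expected columns is missing.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        {
            "comment_id": row["comment_id"],
            "body": row["body"],
            "score": float(row["score"]),  # scores are stored as text in CSV
        }
        for row in reader
    ]


# Stand-in for the downloaded file. With a real download you would instead pass
# open("ruddit.csv", newline="", encoding="utf-8"); that file name is hypothetical.
demo_csv = io.StringIO(
    "comment_id,body,score\n"
    'c1,"Great point, thanks!",-0.5\n'
    "c2,[deleted],0.25\n"
)
comments = read_comments(demo_csv)
print(len(comments), comments[0]["score"])
```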

Usage

This dataset suits applications such as:

- Training and evaluating NLP models for sentiment analysis and toxicity detection.
- Research in computational linguistics and social media analysis.
- Developing automated content moderation systems for online platforms.
- Analysing patterns of online discourse and communication.
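
Because the target is a real-valued score, model evaluation is often framed as regression. The sketch below computes mean squared error and Pearson correlation between gold scores and predictions. These metrics are a common choice for such targets and are an assumption here, not something the listing prescribes; the function names and the numbers are made up for illustration.

```python
import math


def mean_squared_error(gold, pred):
    """Average squared difference between gold scores and predictions."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def pearson_r(gold, pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(gold)
    mean_g, mean_p = sum(gold) / n, sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_g * var_p)


gold = [-0.5, 0.0, 0.5, 1.0]
pred = [-0.25, 0.25, 0.75, 1.25]  # hypothetical model output: gold shifted by 0.25
mse = mean_squared_error(gold, pred)
r = pearson_r(gold, pred)
print(mse, r)
```

The example also shows why both numbers are worth reporting: a constant offset leaves the correlation perfect while the squared error stays above zero.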

Coverage

The dataset consists of English language comments. Its geographical scope is global, reflecting the worldwide user base of Reddit. The listing gives no time range for the comments and no demographic breakdown of their authors. The data primarily focuses on the toxicity scoring of comments.

License

CC0

Who Can Use It

This dataset is suitable for:

- Data Scientists and Analysts working on text data or social media projects.
- Researchers focused on NLP, linguistics, and the study of online communities.
- Developers creating applications that require sentiment analysis or toxicity flagging.
- Individuals at beginner to intermediate levels looking to engage with real-world text data for machine learning tasks.

Dataset Name Suggestions

- Reddit Comment Toxicity Scores
- Ruddit NLP Toxicity Dataset
- English Reddit Toxicity Classifier Data
- Social Media Toxicity Analysis Dataset
- Reddit Sentiment and Offensive Language Data

Attributes

Original Data Source: Reddit's toxicity comments scored for NLP use

Listing Stats

Listed: 26/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free
