Opendatabay APP

Social Media Toxicity Analysis Dataset

Data Science and Analytics

Tags and Keywords

Health

Beginner

Text

Intermediate

NLP

Linguistics

Free

About

This dataset, known as Ruddit, comprises English-language Reddit comments annotated with fine-grained, real-valued scores ranging from -1 (maximally supportive) to 1 (maximally offensive). It is intended as a resource for analysing and detecting toxicity in online discourse, and the real-valued labels support NLP tasks that need a graded rather than binary notion of offensiveness. The data sampling, annotation, and analysis procedures are described in the accompanying research paper.

Columns

  • comment_id: A unique identifier for each comment. The dataset contains 5,966 unique comment IDs.
  • body: The text of the Reddit comment. About 4% of comments have the body '[deleted]'; the listing reports 0% as '[removed]'.
  • score: A real-valued toxicity score between -1 (maximally supportive) and 1 (maximally offensive). Observed values span roughly -0.89 to 0.98, distributed across ranges as follows:
    • -0.89 to -0.70: 36 comments
    • -0.70 to -0.52: 300 comments
    • -0.52 to -0.33: 742 comments
    • -0.33 to -0.14: 1,326 comments
    • -0.14 to 0.04: 1,433 comments
    • 0.04 to 0.23: 961 comments
    • 0.23 to 0.42: 521 comments
    • 0.42 to 0.61: 329 comments
    • 0.61 to 0.79: 227 comments
    • 0.79 to 0.98: 91 comments
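
The score ranges above look like ten equal-width bins between the smallest and largest observed score. As a minimal sketch of how such a breakdown could be reproduced, the snippet below bins a list of scores and also checks that the listed counts add up to the stated 5,966 records. The function name `bin_scores` and the sample scores are illustrative assumptions, not part of the dataset or its tooling.

```python
# Counts copied from the score distribution listed above.
LISTED_COUNTS = [36, 300, 742, 1326, 1433, 961, 521, 329, 227, 91]


def bin_scores(scores, n_bins=10):
    """Count scores in n_bins equal-width bins spanning min(scores)..max(scores).

    Returns a list of (lower_edge, upper_edge, count) tuples.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: every score is identical, so one bin holds them all.
        return [(lo, hi, len(scores))]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        # Clamp so the maximum score falls in the last bin, not one past it.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Made-up scores, for illustration only.
sample_scores = [-0.8, -0.4, -0.1, 0.0, 0.1, 0.5, 0.9]
bins = bin_scores(sample_scores, n_bins=4)
for lower, upper, count in bins:
    print(f"{lower:.2f} to {upper:.2f}: {count} comments")

# The listed bin counts sum to the stated number of records.
print(sum(LISTED_COUNTS))
```

Whether the listing used exactly this binning rule is not stated; treat the equal-width assumption as a guess based on the bin edges shown.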

Distribution

The dataset is provided as a CSV file containing 5,966 records, one per Reddit comment, each with its associated toxicity score. The file size is not stated; the structure is tabular, organised by the columns listed above.
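
A minimal sketch of reading such a file with Python's standard `csv` module follows. It assumes the CSV has a header row with the column names `comment_id`, `body`, and `score` exactly as listed above, and it skips comments whose body is a '[deleted]' or '[removed]' placeholder. The helper name `load_comments` and the in-memory sample rows are made up for illustration; with the real file you would pass an opened file object instead.

```python
import csv
import io

# Placeholder bodies that carry no usable text.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def load_comments(fileobj):
    """Read comment rows from a CSV file object.

    Returns (kept_rows, skipped_count), where kept_rows is a list of dicts
    with the score parsed as a float and placeholder bodies filtered out.
    """
    kept, skipped = [], 0
    for row in csv.DictReader(fileobj):
        body = row["body"].strip()
        if body in PLACEHOLDERS:
            skipped += 1
            continue
        kept.append(
            {
                "comment_id": row["comment_id"],
                "body": body,
                "score": float(row["score"]),
            }
        )
    return kept, skipped


# Illustrative stand-in for the real file (rows are invented).
sample = io.StringIO(
    "comment_id,body,score\n"
    "a1,Thanks for the help!,-0.5\n"
    "a2,[deleted],0.1\n"
    'a3,"Well, that was rude.",0.4\n'
)
rows, n_skipped = load_comments(sample)
print(len(rows), "kept;", n_skipped, "skipped")
```

Whether to drop placeholder rows is a design choice: they still carry a score, but their text gives a model nothing to learn from.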

Usage

Typical use cases for this dataset include:
  • Training and evaluating NLP models for sentiment analysis and toxicity detection.
  • Research in computational linguistics and social media analysis.
  • Developing automated content moderation systems for online platforms.
  • Analysing patterns of online discourse and communication.
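
Because the scores are real-valued, a model trained on this data can be evaluated as a regressor. The sketch below computes two common measures, mean squared error and Pearson correlation, between gold scores and predictions. This is one plausible evaluation setup, not a method prescribed by the listing; the function names and the sample values are illustrative assumptions.

```python
import math


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Invented values: predictions are the gold scores shifted by 0.1.
gold = [-0.5, 0.0, 0.5]
pred = [-0.4, 0.1, 0.6]
mse = mean_squared_error(gold, pred)
r = pearson_r(gold, pred)
print(f"MSE={mse:.4f}  r={r:.4f}")
```

The two measures answer different questions: a constant offset like the one in this sample leaves the correlation perfect while still producing a non-zero error.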

Coverage

The dataset consists of English-language comments. Its geographical scope is global, reflecting Reddit's worldwide user base. No time range or demographic breakdown of the comment authors is provided; the data focuses on the toxicity scoring of comments.

License

CC0

Who Can Use It

This dataset is suitable for:
  • Data Scientists and Analysts working on text data or social media projects.
  • Researchers focused on NLP, linguistics, and the study of online communities.
  • Developers creating applications that require sentiment analysis or toxicity flagging.
  • Individuals at beginner to intermediate levels looking to engage with real-world text data for machine learning tasks.

Dataset Name Suggestions

  • Reddit Comment Toxicity Scores
  • Ruddit NLP Toxicity Dataset
  • English Reddit Toxicity Classifier Data
  • Social Media Toxicity Analysis Dataset
  • Reddit Sentiment and Offensive Language Data

Listing Stats

Views: 0
Downloads: 0
Listed: 26/06/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
