1M NLP Social Comment Dataset

Data Science and Analytics

Tags and Keywords

Reddit

NLP

Comments

Social

Controversy

Free

About

This collection provides 1 million anonymised Reddit comments drawn from 40 highly frequented subreddits, all posted in May 2019. It is designed primarily as an NLP dataset and is balanced by construction: 25,000 comments were sampled from each of the 40 selected subreddits (40 × 25,000 = 1,000,000). The columns were chosen for variety of type: the comment body, a binary controversiality flag aggregated by Reddit, and the comment's net score (upvotes minus downvotes). The dataset is suitable for building simple models and exploring patterns in online discourse.

Columns

  • subreddit (Categorical): Identifies the specific subreddit on which the comment was posted. The dataset contains 40 unique subreddit names.
  • body (String): The complete content of the comment.
  • controversiality (Binary): A metric aggregated by Reddit indicating how controversial the comment was (labels include 0 and 1).
  • score (Scalar): The total number of upvotes minus downvotes received by the comment.
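
The column types above can be made explicit when reading the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with exactly the four column names listed, that the file is UTF-8 encoded, and that score is stored as an integer; none of these details are confirmed by the listing. The names Comment and read_comments are illustrative.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Comment(NamedTuple):
    subreddit: str         # one of the 40 subreddit names
    body: str              # full comment text
    controversiality: int  # 0 or 1, as aggregated by Reddit
    score: int             # upvotes minus downvotes (assumed integer)


def read_comments(lines: Iterable[str]) -> Iterator[Comment]:
    """Yield typed rows from an open CSV file or any iterable of lines."""
    # DictReader uses the header row for column names (assumed present).
    for row in csv.DictReader(lines):
        yield Comment(
            subreddit=row["subreddit"],
            body=row["body"],
            controversiality=int(row["controversiality"]),
            score=int(row["score"]),
        )
```

To use it on the real file, open it with open("kaggle_RC_2019-05.csv", newline="", encoding="utf-8") and pass the file object to read_comments. Passing newline="" lets the csv module handle comment bodies that contain line breaks.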

Distribution

The data is delivered as a single CSV file, kaggle_RC_2019-05.csv, of approximately 185.93 MB. It contains exactly 1 million records, distributed uniformly with 25,000 comments from each of the 40 distinct subreddits. Comments that were removed, comments whose authors were deleted, and comments containing fewer than 4 tokens are excluded.
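
The distribution claims above are easy to check after download. The sketch below counts comments per subreddit and tests the minimum-length rule. The listing does not say which tokeniser defined "4 tokens", so splitting on whitespace is an assumption and may disagree with the original filter on some comments. All function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Mapping


def subreddit_counts(subreddits: Iterable[str]) -> Counter:
    """Count how many comments each subreddit contributes."""
    return Counter(subreddits)


def is_balanced(
    counts: Mapping[str, int],
    expected_groups: int = 40,
    expected_per_group: int = 25_000,
) -> bool:
    """True if there are exactly the expected groups, each of the expected size."""
    return len(counts) == expected_groups and all(
        n == expected_per_group for n in counts.values()
    )


def has_min_tokens(body: str, min_tokens: int = 4) -> bool:
    """Approximate the length filter with whitespace tokens (an assumption)."""
    return len(body.split()) >= min_tokens
```

Running is_balanced(subreddit_counts(...)) over the subreddit column should return True if the file matches the description.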

Usage

This data suits a range of analytical and machine learning tasks. It is particularly suited to Natural Language Processing applications, including supervised tasks such as predicting a comment's controversiality or score from its content. Researchers can also use the uniform structure to perform categorical analysis on the subreddit feature, comparing language and engagement metrics across different online communities.
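
As one illustration of the categorical analysis described above, the sketch below groups comments by subreddit and reports the share flagged as controversial and the mean score. It is a simple aggregation written with the standard library rather than a recommended methodology. The function name and the output keys are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarise_by_subreddit(
    rows: Iterable[Tuple[str, int, int]],
) -> Dict[str, Dict[str, float]]:
    """Aggregate (subreddit, controversiality, score) rows per subreddit."""
    # Per subreddit: [comment count, controversial count, score sum]
    totals = defaultdict(lambda: [0, 0, 0])
    for subreddit, controversiality, score in rows:
        entry = totals[subreddit]
        entry[0] += 1
        entry[1] += controversiality
        entry[2] += score
    return {
        name: {
            "comments": n,
            "controversial_share": flagged / n,
            "mean_score": score_sum / n,
        }
        for name, (n, flagged, score_sum) in totals.items()
    }
```

Because every subreddit contributes the same number of comments, the per-group figures can be compared directly without reweighting.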

Coverage

The data exclusively covers comments posted during a specific one-month period: May 2019. The scope includes 40 of the most frequented subreddits active during that period. The data is anonymised, focusing on the content and associated metrics rather than individual author identification.

License

CC0: Public Domain

Who Can Use It

  • NLP Practitioners: To train and evaluate models for sentiment, topic modelling, and text classification using real-world social data.
  • Data Scientists and Analysts: For conducting studies on online community dynamics, quantifying controversial discussions, and exploring engagement metrics.
  • Students and Educators: The balanced structure makes it excellent for teaching principles of categorical data handling and binary classification using large text corpora.
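
For the binary classification use mentioned above, a useful first step is a majority-class baseline: the accuracy obtained by always predicting the most common controversiality label. The listing does not state how the 0 and 1 labels are distributed, so this sketch makes no assumption about the result; it only shows how to compute the reference figure that a trained model should beat. The function name is illustrative.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_baseline(labels: Sequence[int]) -> Tuple[int, float]:
    """Return the most frequent label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)
```

If one label dominates, plain accuracy can look high even for a model that learns nothing, so metrics such as precision and recall on the rarer label are worth reporting alongside it.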

Dataset Name Suggestions

  • Reddit Comments May 2019 - Balanced Subreddit Extract
  • 1M NLP Social Comment Dataset
  • Online Community Discourse Analytics

Attributes

Original Data Source: 1M NLP Social Comment Dataset

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

16/11/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
