Opendatabay APP

SocialGrep Reddit Comment & Sentiment

Data Science and Analytics

Tags and Keywords

Computer

Online

Data

Text

Social

NLP

Free

About

This dataset is a corpus of posts and comments from the Reddit board /r/datasets, covering its entire history up to 1st March 2022. It is intended as a resource for analysts and data scientists exploring online community data. The data was acquired using SocialGrep. To protect user privacy, usernames have been excluded, which preserves anonymity and guards against targeted harassment. Each record includes the comment body text, a sentiment score, and the comment score.

Columns

  • type: Denotes the type of the data point.
  • id: A unique Base-36 identifier for each comment.
  • subreddit.id: A unique Base-36 identifier for the subreddit where the comment was posted.
  • subreddit.name: The human-readable name of the subreddit.
  • subreddit.nsfw: Indicates whether the comment's subreddit is Not Safe For Work (NSFW).
  • created_utc: The timestamp in Coordinated Universal Time (UTC) when the comment was created.
  • permalink: The permanent link to the comment on Reddit.
  • body: The main text content of the comment.
  • sentiment: The analysed sentiment score for the comment's body text.
  • score: The numerical score assigned to the comment.
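
As a rough illustration of how these columns might be read into typed records, the sketch below parses a CSV export using only the Python standard library. It assumes that the file header uses the column names listed above, that created_utc is stored as Unix epoch seconds, that subreddit.nsfw is serialised as 'true' or 'false', and that sentiment may be empty for some rows; the listing confirms none of these details. The Comment class, the read_comments function, and the sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, Optional, TextIO


@dataclass
class Comment:
    """One row of the comments table (hypothetical container)."""
    type: str
    id: str
    subreddit_id: str
    subreddit_name: str
    subreddit_nsfw: bool
    created: datetime
    permalink: str
    body: str
    sentiment: Optional[float]
    score: int


def read_comments(handle: TextIO) -> Iterator[Comment]:
    """Yield typed Comment records from an open CSV file or buffer."""
    for row in csv.DictReader(handle):
        raw_sentiment = row["sentiment"].strip()
        yield Comment(
            type=row["type"],
            id=row["id"],
            subreddit_id=row["subreddit.id"],
            subreddit_name=row["subreddit.name"],
            # Assumption: booleans are written as the strings "true"/"false".
            subreddit_nsfw=row["subreddit.nsfw"].strip().lower() == "true",
            # Assumption: created_utc holds Unix epoch seconds.
            created=datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc),
            permalink=row["permalink"],
            body=row["body"],
            # Assumption: an empty cell means no sentiment was computed.
            sentiment=float(raw_sentiment) if raw_sentiment else None,
            score=int(row["score"]),
        )


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = (
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,body,sentiment,score\n"
    "comment,abc123,zz999,datasets,false,1640995200,"
    "https://example.invalid/c/abc123,Thanks for sharing,0.44,3\n"
    "comment,abc124,zz999,datasets,false,1640995260,"
    "https://example.invalid/c/abc124,[deleted],,1\n"
)

for comment in read_comments(io.StringIO(SAMPLE_CSV)):
    print(comment.id, comment.created.isoformat(), comment.sentiment, comment.score)
```

For a real file, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory buffer.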

Distribution

The dataset is structured as a single table containing all comments, typically distributed as a CSV file. The key columns (id, subreddit.id, created_utc, permalink, body, sentiment, and score) each contain 54,848 values. All 54,848 values in the subreddit.nsfw column are 'false', meaning no NSFW subreddits are included in this count. In the body column, 5% of comments are '[deleted]', 2% are '[removed]', and the remaining 93% contain other text. Sentiment scores range from -1.00 to 1.00, and comment scores range from -65 to 195.
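
The body-column breakdown quoted above could be checked against a downloaded copy along the following lines. This is a minimal sketch: the body_breakdown helper and the sample values are hypothetical, and it assumes that deleted and removed comments appear exactly as the literal strings '[deleted]' and '[removed]'.

```python
from collections import Counter
from typing import Dict, Iterable

# Placeholder strings named in the listing for deleted and removed comments.
PLACEHOLDERS = ("[deleted]", "[removed]")


def body_breakdown(bodies: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of bodies that are '[deleted]', '[removed]', or other."""
    counts = Counter(
        body if body in PLACEHOLDERS else "other" for body in bodies
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in (*PLACEHOLDERS, "other")}


# Made-up bodies for illustration only; they are not taken from the dataset.
sample_bodies = ["[deleted]", "[removed]", "Great dataset!", "Is there a CSV version?"]
print(body_breakdown(sample_bodies))
```

On the full table, the three fractions would be expected to come out near 0.05, 0.02, and 0.93 if the figures in the listing are accurate.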

Usage

This dataset is ideally suited for data science and analytics projects. It can be used for:
  • Natural Language Processing (NLP) tasks, such as text analysis and sentiment classification.
  • Studying the dynamics of online communities and social networks.
  • Analysing user sentiment towards various topics discussed on Reddit.
  • Exploring the factors influencing comment scores and engagement.
  • Developing models for content moderation or recommendation based on Reddit data.
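
As one hedged example of the score-related analysis listed above, the sketch below groups comments into sentiment bands and averages the comment score within each band. The 0.05 neutral threshold, the function names, and the sample pairs are all assumptions made for illustration; the listing does not say how the sentiment values were produced or how they should be banded.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical cut-off: sentiments within +/-0.05 of zero are treated as neutral.
NEUTRAL_BAND = 0.05


def sentiment_label(sentiment: float, band: float = NEUTRAL_BAND) -> str:
    """Map a sentiment value in [-1, 1] to a coarse label."""
    if sentiment > band:
        return "positive"
    if sentiment < -band:
        return "negative"
    return "neutral"


def mean_score_by_sentiment(
    rows: Iterable[Tuple[Optional[float], int]]
) -> Dict[str, float]:
    """Average the comment score per sentiment label, skipping missing sentiments."""
    scores: Dict[str, List[int]] = defaultdict(list)
    for sentiment, score in rows:
        if sentiment is None:
            continue
        scores[sentiment_label(sentiment)].append(score)
    return {label: mean(values) for label, values in scores.items()}


# Made-up (sentiment, score) pairs for illustration only.
sample_rows = [(0.8, 10), (0.6, 20), (-0.7, -2), (0.0, 4), (None, 99)]
print(mean_score_by_sentiment(sample_rows))
```

A difference in mean score between bands would only be descriptive; it would not by itself show that sentiment influences how a comment is scored.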

Coverage

The dataset covers all posts and comments from the inception of the /r/datasets board up to 1st March 2022. Its geographic scope is global, with no regional limitations. Demographically, it reflects the users interacting within the /r/datasets community on Reddit. Usernames are excluded to preserve user anonymity.

License

CC-BY

Who Can Use It

This dataset is valuable for a wide range of users, including:
  • Data scientists and analysts looking for real-world social media data for their projects.
  • Researchers in fields such as computer science, social networks, and linguistics, for studying online behaviour and communication patterns.
  • Developers creating applications that involve text analysis or sentiment prediction.
  • Anyone interested in gaining insights into Reddit communities and their discussions.

Dataset Name Suggestions

  • Reddit /r/datasets Comment Log
  • Analysed Reddit Community Posts
  • SocialGrep Reddit Comment & Sentiment
  • Reddit Data Science Discussions
  • Online Community Text Data

Attributes

Original Data Source: The Reddit Dataset Dataset

Listing Stats

VIEWS

2

DOWNLOADS

0

LISTED

17/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
