SocialGrep Reddit Comment & Sentiment
Data Science and Analytics
Free
About
This dataset is a corpus of posts and comments from the Reddit board /r/datasets, covering its entire history up to 1st March 2022. It is intended to give analysts and data scientists a body of Reddit content for exploring online community data. The data was acquired using SocialGrep. To protect user privacy and prevent targeted harassment, usernames have been excluded from this dataset. Fields include the comment body text, a sentiment score, and the comment score.
Columns
- type: Denotes the type of the data point.
- id: A unique Base-36 identifier for each comment.
- subreddit.id: A unique Base-36 identifier for the subreddit where the comment was posted.
- subreddit.name: The human-readable name of the subreddit.
- subreddit.nsfw: Indicates whether the comment's subreddit is Not Safe For Work (NSFW).
- created_utc: The timestamp in Coordinated Universal Time (UTC) when the comment was created.
- permalink: The permanent link to the comment on Reddit.
- body: The main text content of the comment.
- sentiment: The analysed sentiment score for the comment's body text.
- score: The numerical score assigned to the comment.
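As an illustration of how the columns above might be read into typed records, here is a minimal sketch using only the Python standard library. It assumes the file is a CSV whose header names match the column list exactly, that created_utc holds Unix epoch seconds, and that sentiment can be blank for some rows; none of these details are confirmed by the listing. The `Comment` class, the `parse_comments` helper, and the sample rows are hypothetical and exist only for this example.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class Comment:
    """Hypothetical typed view of one row; field names mirror the column list."""
    id: str
    subreddit_name: str
    nsfw: bool
    created: datetime
    permalink: str
    body: str
    sentiment: Optional[float]
    score: int


def parse_comments(lines: Iterable[str]) -> Iterator[Comment]:
    """Yield Comment records from CSV text that starts with a header row."""
    for row in csv.DictReader(lines):
        raw_sentiment = (row.get("sentiment") or "").strip()
        yield Comment(
            id=row["id"],
            subreddit_name=row["subreddit.name"],
            nsfw=row["subreddit.nsfw"].strip().lower() == "true",
            # Assumption: created_utc is stored as Unix epoch seconds.
            created=datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc),
            permalink=row["permalink"],
            body=row["body"],
            # Assumption: a blank sentiment cell means no score was computed.
            sentiment=float(raw_sentiment) if raw_sentiment else None,
            score=int(row["score"]),
        )


# Made-up sample rows for illustration; these are not taken from the dataset.
SAMPLE = """type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,body,sentiment,score
comment,abc123,aaaaa,datasets,false,1646092800,https://example.invalid/c/abc123,Great dataset!,0.66,3
comment,abc124,aaaaa,datasets,false,1646092860,https://example.invalid/c/abc124,[deleted],,1
"""

comments = list(parse_comments(io.StringIO(SAMPLE)))
for c in comments:
    print(c.id, c.created.isoformat(), c.sentiment, c.score)
```

When reading the real file, passing `open(path, newline="", encoding="utf-8")` to `parse_comments` lets the csv module handle comment bodies that contain line breaks.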
Distribution
The dataset is structured as a table containing all comments, typically distributed as a CSV file. The key columns id, subreddit.id, created_utc, permalink, body, sentiment, and score each contain 54,848 values. In the subreddit.nsfw column, all 54,848 values are 'false', meaning no NSFW subreddits are included in this count. In the body column, 5% of comments are '[deleted]', 2% are '[removed]', and the remaining 93% contain other content. Sentiment scores range from -1.00 to 1.00, and comment scores range from -65 to 195; in both cases the frequencies vary across the range.
Usage
This dataset is ideally suited for data science and analytics projects. It can be used for:
- Natural Language Processing (NLP) tasks, such as text analysis and sentiment classification.
- Studying the dynamics of online communities and social networks.
- Analyzing user sentiment towards various topics discussed on Reddit.
- Exploring the factors influencing comment scores and engagement.
- Developing models for content moderation or recommendation based on Reddit data.
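A first pass at several of these tasks usually starts by setting aside placeholder bodies and then relating sentiment to score. The sketch below shows one way to do that with the standard library. The band thresholds in `sentiment_band`, the helper names, and the sample rows are all hypothetical choices for illustration, not part of the dataset or of SocialGrep's method.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

# Placeholder bodies that the listing reports for deleted and removed comments.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def body_shares(bodies: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of bodies that are '[deleted]', '[removed]', or other text."""
    counts = Counter(b if b in PLACEHOLDERS else "other" for b in bodies)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("[deleted]", "[removed]", "other")}


def sentiment_band(value: float) -> str:
    """Map a score in [-1, 1] to a coarse band; the cut-offs are an arbitrary assumption."""
    if value <= -0.05:
        return "negative"
    if value >= 0.05:
        return "positive"
    return "neutral"


def mean_score_by_band(
    rows: Iterable[Tuple[str, Optional[float], int]]
) -> Dict[str, float]:
    """Average comment score per sentiment band; rows are (body, sentiment, score)."""
    buckets: Dict[str, List[int]] = {}
    for body, sentiment, score in rows:
        # Skip placeholder bodies and rows without a sentiment value.
        if body in PLACEHOLDERS or sentiment is None:
            continue
        buckets.setdefault(sentiment_band(sentiment), []).append(score)
    return {band: mean(scores) for band, scores in buckets.items()}


# Made-up rows for illustration; these are not taken from the dataset.
ROWS = [
    ("Great dataset, thanks!", 0.8, 5),
    ("Useful link", 0.4, 3),
    ("This is broken", -0.6, 1),
    ("[deleted]", None, 1),
    ("[removed]", None, 0),
]

print(body_shares(body for body, _, _ in ROWS))
print(mean_score_by_band(ROWS))
```

Filtering placeholders first matters because '[deleted]' and '[removed]' carry no text of their own, so any sentiment value attached to them would not reflect what the author originally wrote.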
Coverage
The dataset covers all posts and comments from the inception of the /r/datasets board up to 1st March 2022. Its geographic scope is global: the data carries no specific regional limitations. The demographic scope is the set of users interacting within the /r/datasets community on Reddit. Usernames are excluded to preserve user anonymity.
License
CC-BY
Who Can Use It
This dataset is valuable for a wide range of users, including:
- Data scientists and analysts looking for real-world social media data for their projects.
- Researchers in fields such as computer science, social networks, and linguistics, for studying online behaviour and communication patterns.
- Developers creating applications that involve text analysis or sentiment prediction.
- Anyone interested in gaining insights into Reddit communities and their discussions.
Dataset Name Suggestions
- Reddit /r/datasets Comment Log
- Analysed Reddit Community Posts
- SocialGrep Reddit Comment & Sentiment
- Reddit Data Science Discussions
- Online Community Text Data
Attributes
Original Data Source: The Reddit Dataset Dataset