Reddit Data Science Community Posts

Reddit & Forum Data

Tags and Keywords

Datascience

Machinelearning

Nlp

Analytics

Free

About

This collection contains over 500,000 posts dedicated to the field of Data Science, harvested from 19 prominent Data Science subreddits. It provides valuable material for professionals and enthusiasts who wish to study changes in community trends over time, discover popular topics, or engage in natural language processing tasks. The data offers an insightful look into the growth and interests of the online Data Science community.

Columns

The dataset contains 13 named fields, plus a leading row index, offering detailed information about each Reddit post:

row index: A sequential identifier.
created_date: The publication date of the post (in date format).
created_timestamp: The publication date of the post (in timestamp format).
subreddit: The name of the subreddit where the post originated.
title: The title of the post.
id: The unique identifier for the post.
author: The nickname of the post author.
author_created_utc: The registration date of the author's profile.
full_link: The hyperlink directly to the post.
score: The net score of the post (upvotes minus downvotes).
num_comments: The total count of comments on the post.
num_crossposts: The total count of times the post was crossposted.
subreddit_subscribers: The number of subscribers the subreddit had when the post was published.
post: The text body of the post.
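
As a minimal sketch of how these fields might be read, the example below streams rows with Python's standard csv module and converts the four count fields to integers. It assumes the CSV header uses exactly the column names listed above; read_posts, to_int, and the two-row sample are hypothetical and for illustration only.

```python
import csv
import io

# Column names as listed above; the real CSV header is assumed to match.
INT_FIELDS = ("score", "num_comments", "num_crossposts", "subreddit_subscribers")


def to_int(value):
    """Return the cell as an int, or None for empty or non-numeric cells."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def read_posts(handle):
    """Yield one dict per post with the count fields converted to int."""
    for row in csv.DictReader(handle):
        for name in INT_FIELDS:
            row[name] = to_int(row.get(name))
        yield row


# Synthetic two-row sample for illustration only (not real data).
SAMPLE = io.StringIO(
    "id,subreddit,title,score,num_comments,num_crossposts,subreddit_subscribers,post\n"
    "abc1,datascience,First title,12,3,0,1000,Some body text\n"
    "abc2,MachineLearning,Second title,,1,0,2000,\n"
)
posts = list(read_posts(SAMPLE))
```

To run this against the real file, pass an open file handle instead of the sample, for example one opened with newline="" and an explicit encoding (UTF-8 is an assumption, not something the listing states). Because rows are yielded one at a time, the whole file does not need to fit in memory.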

Distribution

The data is provided as a CSV file (for example, reddit_database.csv) of approximately 340.11 MB. It currently includes 545,000 validated records. The dataset is actively maintained, with stated goals of growing to 750,000 posts and ultimately to one million. Updates are expected weekly.

Usage

Ideal applications for this data include:

Trend Analysis: Studying shifts in Data Science topics and community interests across the 14 years covered (2008 to 2022).
Predictive Modelling: Developing models to forecast the potential popularity of new Reddit posts based on characteristics like title and content.
Topic Discovery: Identifying novel or interesting themes within the Data Science domain.
Text Analysis: Applying Natural Language Processing (NLP) techniques to post titles and bodies.
Community Study: Analysing engagement metrics such as scores, comments, and crossposts.
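
As one possible starting point for the community study use case, the sketch below summarises engagement per subreddit using only the standard library. It assumes score and num_comments have already been converted to integers and that rows with missing values were filtered out beforehand; engagement_by_subreddit and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def engagement_by_subreddit(posts):
    """Return {subreddit: {"posts", "mean_score", "median_comments"}}."""
    scores = defaultdict(list)
    comments = defaultdict(list)
    for post in posts:
        name = post["subreddit"]
        scores[name].append(post["score"])
        comments[name].append(post["num_comments"])
    return {
        name: {
            "posts": len(scores[name]),
            "mean_score": mean(scores[name]),
            "median_comments": median(comments[name]),
        }
        for name in scores
    }


# Synthetic rows for illustration only (not real data).
sample_posts = [
    {"subreddit": "datascience", "score": 10, "num_comments": 4},
    {"subreddit": "datascience", "score": 20, "num_comments": 2},
    {"subreddit": "kaggle", "score": 3, "num_comments": 0},
]
summary = engagement_by_subreddit(sample_posts)
```

The median is used for comments because engagement counts on forums tend to be skewed by a few very popular posts; whether that holds for this dataset would need to be checked against the data itself.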

Coverage

The temporal coverage spans just over 14 years, from March 19, 2008 to May 9, 2022. The posts are sourced from 19 specific Data Science subreddits, including r/MachineLearning, r/datascience, r/deeplearning, and r/kaggle. Geographically, the data represents content shared by global Reddit users interested in these technical fields. Note that two fields have significant gaps: 50% of post body text (post) and 83% of author creation dates (author_created_utc) are currently missing.
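
Before relying on the sparse fields, it may help to measure the gaps directly. The sketch below computes the share of empty values per field. It assumes missing values appear as empty cells; the real file may mark them differently (for example with a placeholder string), in which case the check would need adjusting. missing_share and the sample rows are hypothetical.

```python
def missing_share(rows, fields):
    """Return {field: fraction of rows where the field is empty or absent}."""
    counts = dict.fromkeys(fields, 0)
    total = 0
    for row in rows:
        total += 1
        for name in fields:
            value = row.get(name)
            # Assumption: a missing value is None or a blank string.
            if value is None or str(value).strip() == "":
                counts[name] += 1
    return {name: (counts[name] / total if total else 0.0) for name in fields}


# Synthetic rows for illustration only (not real data).
sample_rows = [
    {"post": "Body text", "author_created_utc": ""},
    {"post": "", "author_created_utc": ""},
    {"post": "More text", "author_created_utc": "1500000000"},
    {"post": "", "author_created_utc": ""},
]
shares = missing_share(sample_rows, ["post", "author_created_utc"])
```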

License

CC0: Public Domain

Who Can Use It

This dataset is suitable for:

Data Scientists: Building statistical models, particularly for text classification and popularity prediction.
Market Researchers: Understanding public discourse and enthusiasm around specific technical concepts like AI, deep learning, or analytics.
Software Engineers: Practicing skills related to web data handling and large-scale textual analysis.
Academics: Conducting research into online technical communities and information diffusion.

Dataset Name Suggestions

Reddit Data Science Community Posts
Global Data Science Engagement Log
Social Media ML/AI Discussion Archive
Data Science Trend Insights

Attributes

Original Data Source: Reddit Data Science Community Posts

Listing Stats

Listed: 08/11/2025
Region: Global
Quality: 5 / 5
Version: 1.0
