Reddit Data Science Community Posts
Reddit & Forum Data
Free
About
This collection contains over 500,000 posts dedicated to the field of Data Science, harvested from 19 prominent Data Science subreddits. It provides valuable material for professionals and enthusiasts who wish to study changes in community trends over time, discover popular topics, or engage in natural language processing tasks. The data offers an insightful look into the growth and interests of the online Data Science community.
Columns
The dataset contains 13 named fields, plus a row index, describing each Reddit post:
- row index: A sequential identifier.
- created_date: The publication date of the post (in date format).
- created_timestamp: The publication date of the post (in timestamp format).
- subreddit: The name of the subreddit where the post originated.
- title: The title of the post.
- id: The unique identifier of the post.
- author: The nickname of the post author.
- author_created_utc: The registration date of the author's profile.
- full_link: The hyperlink directly to the post.
- score: The net score of the post (upvotes minus downvotes).
- num_comments: The total count of comments on the post.
- num_crossposts: The total count of times the post was crossposted.
- subreddit_subscribers: The number of subscribers the subreddit had when the post was published.
- post: The text body of the post.
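As a minimal sketch of how these fields might be loaded, the following uses pandas. It assumes the CSV header uses the column names exactly as listed above, which the listing does not confirm; `load_posts` and `NUMERIC_COLUMNS` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Assumption: these names match the CSV header exactly as listed above.
NUMERIC_COLUMNS = ["score", "num_comments", "num_crossposts", "subreddit_subscribers"]


def load_posts(source):
    """Read the posts CSV from a file path or file-like object."""
    df = pd.read_csv(source)
    # Dates that cannot be parsed become NaT instead of raising an error.
    df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
    # Coerce count-like columns to numbers; bad values become NaN.
    for col in NUMERIC_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```

A call such as `load_posts("reddit_database.csv")` would return a DataFrame with one row per post, provided the file matches the assumed layout.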
Distribution
The data is typically provided in a CSV file format, for example, reddit_database.csv, which is approximately 340.11 MB in size. The dataset currently includes 545,000 validated records. The data is actively maintained and expected to increase in size, with stated goals to expand the collection to 750,000 posts and ultimately reach one million posts. Updates are expected on a weekly basis.
Usage
Ideal applications for this data include:
- Trend Analysis: Studying shifts in Data Science topics and community interests across the dataset's 2008 to 2022 coverage period.
- Predictive Modelling: Developing models to forecast the potential popularity of new Reddit posts based on characteristics like title and content.
- Topic Discovery: Identifying novel or interesting themes within the Data Science domain.
- Text Analysis: Applying Natural Language Processing (NLP) techniques to post titles and bodies.
- Community Study: Analysing engagement metrics such as scores, comments, and crossposts.
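To illustrate the trend-analysis use case, here is a rough sketch that computes, per year, the share of posts whose title mentions a given keyword. It assumes a pandas DataFrame with `created_date` and `title` columns as described in the Columns section; `keyword_share_by_year` is a hypothetical helper, not something shipped with the dataset.

```python
import pandas as pd


def keyword_share_by_year(df, keyword):
    """Share of posts per year whose title contains `keyword` (case-insensitive)."""
    dates = pd.to_datetime(df["created_date"], errors="coerce")
    # Missing titles count as non-matches; the keyword is matched literally.
    hits = df["title"].fillna("").str.contains(keyword, case=False, regex=False)
    # Rows with unparseable dates have no year and are dropped by groupby.
    return hits.groupby(dates.dt.year).mean()
```

Simple substring matching will miss synonyms and abbreviations, so a result from `keyword_share_by_year(df, "deep learning")` is best read as a first approximation of topic interest rather than a precise measure.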
Coverage
The temporal coverage spans just over 14 years, from March 19, 2008 to May 9, 2022. The posts are sourced from 19 specific Data Science subreddits, including r/MachineLearning, r/datascience, r/deeplearning, and r/kaggle. Geographically, the data represents content shared by global Reddit users interested in these technical fields. Note that two fields have significant gaps: 50% of the post body values (post) and 83% of the author creation dates (author_created_utc) are currently missing.
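Because so many values are missing from these two fields, it may be worth measuring the gaps before relying on them. The sketch below counts null and blank values in a column; `missing_share` is a hypothetical helper, and it assumes missing entries appear as empty cells. Placeholder strings such as removed or deleted markers, if the data contains any, would not be counted.

```python
import pandas as pd


def missing_share(df, column):
    """Fraction of rows in `column` that are null or blank."""
    values = df[column]
    # Treat NaN/None and whitespace-only strings as missing.
    is_missing = values.isna() | (values.astype(str).str.strip() == "")
    return float(is_missing.mean())
```

Applied to the `post` and `author_created_utc` columns, the results can be compared against the 50% and 83% figures stated above.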
License
CC0: Public Domain
Who Can Use It
This dataset is suitable for:
- Data Scientists: Building statistical models, particularly for text classification and popularity prediction.
- Market Researchers: Understanding public discourse and enthusiasm around specific technical concepts like AI, deep learning, or analytics.
- Software Engineers: Practicing skills related to web data handling and large-scale textual analysis.
- Academics: Conducting research into online technical communities and information diffusion.
Dataset Name Suggestions
- Reddit Data Science Community Posts
- Global Data Science Engagement Log
- Social Media ML/AI Discussion Archive
- Data Science Trend Insights
Attributes
Original Data Source: Reddit Data Science Community Posts
