Machine Learning and Data Science Reddit Dataset
Social Media and Networking
Free
About
This dataset contains a collection of Reddit post submissions from various Machine Learning and Data Science subreddits. It offers valuable insight into the discussions, popular topics, and engagement dynamics within these online communities. The dataset can be used to understand trends, analyse community behaviour, and support research in the fields of artificial intelligence, machine learning, and data science.
Columns
- title: The heading or subject line of the Reddit post.
- id: A unique identifier assigned to each post.
- redditor: The username of the Reddit member who created the post.
- num_upvotes: The total number of upvotes the post has received.
- subreddit: The name of the specific subreddit (e.g., learnpython, LanguageTechnology) where the post was published.
- url: The direct web address to the Reddit post.
- num_comments: The count of comments posted in response to the submission.
- created_on: The date and time when the post was originally submitted.
- body: The main text content of the post.
- upvote_ratio: The ratio of positive votes (upvotes) to the total votes cast for the post.
- over_18: A boolean indicator specifying if the post is marked as adult content.
- link_flair_text: The text of any flair applied to the post's link.
- edited: A boolean indicator showing if the post has been edited after its initial submission.
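As a minimal sketch of how the columns above might be read into typed values, the example below uses only Python's standard library. The sample rows are made up for illustration, and the timestamp layout of created_on and the "True"/"False" encoding of the boolean columns are assumptions, since the listing does not specify them. The helper names parse_bool, parse_row, and load_posts are hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; the values are made up for illustration only.
SAMPLE_CSV = """title,id,redditor,num_upvotes,subreddit,url,num_comments,created_on,body,upvote_ratio,over_18,link_flair_text,edited
Example post A,abc123,user_a,12,learnpython,https://example.com/a,3,2021-05-01 10:15:00,Some text,0.93,False,Help,False
Example post B,def456,user_b,40,LanguageTechnology,https://example.com/b,7,2021-06-12 08:00:00,More text,0.88,False,,True
"""


def parse_bool(value):
    """Interpret 'True'/'False' style strings (an assumed encoding) as booleans."""
    return value.strip().lower() == "true"


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed Python values."""
    typed = dict(row)
    typed["num_upvotes"] = int(row["num_upvotes"])
    typed["num_comments"] = int(row["num_comments"])
    typed["upvote_ratio"] = float(row["upvote_ratio"])
    typed["over_18"] = parse_bool(row["over_18"])
    typed["edited"] = parse_bool(row["edited"])
    # Assumed timestamp layout; adjust the format string to match the real file.
    typed["created_on"] = datetime.strptime(row["created_on"], "%Y-%m-%d %H:%M:%S")
    return typed


def load_posts(file_obj):
    """Read posts from an open CSV file object and return a list of typed rows."""
    return [parse_row(row) for row in csv.DictReader(file_obj)]


posts = load_posts(io.StringIO(SAMPLE_CSV))
print(posts[0]["subreddit"], posts[0]["num_upvotes"], posts[0]["created_on"].year)
```

To use it on the real file, pass an open file handle (opened with newline="" and a suitable encoding) to load_posts instead of the in-memory sample.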
Distribution
The dataset is typically supplied in CSV format. The exact total record count is not specified, but the value distributions for num_upvotes and num_comments each show more than 30,000 posts in their lower ranges. The posts span a significant period, from 25th February 2009 to 4th January 2022.
Usage
- Analyse trends: Identify emerging themes and popular discussions in Machine Learning and Data Science.
- Study community engagement: Understand how users interact with content and each other within specialised subreddits.
- Develop NLP applications: Utilise post titles and bodies for text classification, topic modelling, and sentiment analysis.
- Inform content strategies: Pinpoint effective content types and discussion starters for technical communities.
- Research online behaviour: Examine patterns of upvotes, comments, and post creation over time.
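Two of the uses above, trend analysis and engagement study, can be sketched with simple aggregations. The example below is illustrative only: the records are made up, they are assumed to be already typed (created_on as a datetime, counts as integers), and the function names posts_per_month and mean_engagement are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical, made-up records standing in for typed rows of the dataset.
posts = [
    {"subreddit": "learnpython", "created_on": datetime(2021, 5, 1),
     "num_upvotes": 12, "num_comments": 3},
    {"subreddit": "learnpython", "created_on": datetime(2021, 5, 20),
     "num_upvotes": 4, "num_comments": 1},
    {"subreddit": "LanguageTechnology", "created_on": datetime(2021, 6, 12),
     "num_upvotes": 40, "num_comments": 7},
]


def posts_per_month(records):
    """Count posts per (year-month, subreddit) pair to expose activity trends."""
    counts = Counter()
    for rec in records:
        month = rec["created_on"].strftime("%Y-%m")
        counts[(month, rec["subreddit"])] += 1
    return counts


def mean_engagement(records):
    """Return average (upvotes, comments) per post for each subreddit."""
    totals = defaultdict(lambda: [0, 0, 0])  # upvotes, comments, post count
    for rec in records:
        entry = totals[rec["subreddit"]]
        entry[0] += rec["num_upvotes"]
        entry[1] += rec["num_comments"]
        entry[2] += 1
    return {name: (up / n, com / n) for name, (up, com, n) in totals.items()}


monthly = posts_per_month(posts)
engagement = mean_engagement(posts)
print(monthly)
print(engagement)
```

For a file of this size, a dataframe library would likely be more convenient; the standard-library version is shown only to keep the sketch dependency-free.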
Coverage
- Geographic Scope: Global.
- Time Range: Posts created between 25th February 2009 and 4th January 2022.
- Demographic Scope: Focuses on content and discussions from Reddit users interested in Machine Learning and Data Science.
License
CC-BY-SA
Who Can Use It
- Data Scientists: For exploring social media data, training models, and gaining industry insights.
- Machine Learning Engineers: To understand real-world text data for model development and evaluation.
- Academics and Researchers: For studies on online communities, information propagation, and technical communication.
- Social Media Analysts: To monitor and understand discussions in specific technical niches.
- Students: As a practical dataset for learning data analysis, NLP, and social computing.
Dataset Name Suggestions
- Reddit Machine Learning & Data Science Posts
- ML & Data Science Subreddit Activity
- Reddit ML/DS Community Discussions
- Machine Learning and Data Science Reddit Dataset
- Social Media Data for AI & Data Science
Attributes
Original Data Source: Reddit - Machine Learning and Data Science