Machine Learning and Data Science Reddit Dataset

Social Media and Networking

Tags and Keywords

  • Computer Science
  • Education
  • Online Communities
  • Beginner
  • Social Networks
  • NLP

Free

About

This dataset contains a collection of Reddit post submissions from various Machine Learning and Data Science subreddits. It offers valuable insight into the discussions, popular topics, and engagement dynamics within these online communities. The dataset can be used to understand trends, analyse community behaviour, and support research in the fields of artificial intelligence, machine learning, and data science.

Columns

  • title: The heading or subject line of the Reddit post.
  • id: A unique identifier assigned to each post.
  • redditor: The username of the Reddit member who created the post.
  • num_upvotes: The total number of upvotes the post has received.
  • subreddit: The name of the specific subreddit (e.g., learnpython, LanguageTechnology) where the post was published.
  • url: The direct web address to the Reddit post.
  • num_comments: The count of comments posted in response to the submission.
  • created_on: The date and time when the post was originally submitted.
  • body: The main text content of the post.
  • upvote_ratio: The ratio of positive votes (upvotes) to the total votes cast for the post.
  • over_18: A boolean indicator specifying if the post is marked as adult content.
  • link_flair_text: The text of any flair applied to the post's link.
  • edited: A boolean indicator showing if the post has been edited after its initial submission.
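
The column list above can be checked when the file is loaded. The following is a minimal sketch using pandas, not an official loader: the function name `load_posts` and the two sample rows are made up for illustration, and it assumes `created_on` is stored as a parseable date string.

```python
import io

import pandas as pd

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "title", "id", "redditor", "num_upvotes", "subreddit", "url",
    "num_comments", "created_on", "body", "upvote_ratio", "over_18",
    "link_flair_text", "edited",
]


def load_posts(source):
    """Read the CSV and check that the documented columns are present."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    # Assumption: created_on is a parseable date string; bad values become NaT.
    df["created_on"] = pd.to_datetime(df["created_on"], errors="coerce")
    return df


# Synthetic two-row sample (not real dataset rows), used only to exercise the loader.
SAMPLE_CSV = """title,id,redditor,num_upvotes,subreddit,url,num_comments,created_on,body,upvote_ratio,over_18,link_flair_text,edited
How do I start with NLP?,abc123,example_user,12,LanguageTechnology,https://example.com/abc123,4,2021-03-01 12:00:00,Looking for resources.,0.93,False,Question,False
Pandas tips,def456,another_user,250,learnpython,https://example.com/def456,31,2021-03-02 08:30:00,Some tips here.,0.98,False,,True
"""

posts = load_posts(io.StringIO(SAMPLE_CSV))
print(posts[["subreddit", "num_upvotes", "created_on"]])
```

To use it on the real download, pass the path of the CSV file to `load_posts` in place of the in-memory sample.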

Distribution

The dataset is typically supplied in CSV format. The exact total record count is not specified, but the column summaries indicate a large number of entries: more than 30,000 posts fall into the lower num_upvotes ranges, and num_comments shows similar figures. The posts span a long period, from 25th February 2009 to 4th January 2022.
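
Because the listing does not give a total record count, it may be worth computing the figures directly after download. The sketch below is one possible way to do that with pandas; the function name `summarise_distribution`, the bin edges, and the toy rows are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd


def summarise_distribution(df, bins=(0, 10, 100, 1000, float("inf"))):
    """Return the record count, date range, and binned upvote counts."""
    created = pd.to_datetime(df["created_on"], errors="coerce")
    # Half-open bins [0, 10), [10, 100), ...; the edges are illustrative only.
    upvote_bins = pd.cut(df["num_upvotes"], bins=list(bins), right=False)
    return {
        "n_records": len(df),
        "first_post": created.min(),
        "last_post": created.max(),
        "upvote_counts": upvote_bins.value_counts().sort_index(),
    }


# Toy rows (synthetic) standing in for the real CSV.
toy = pd.DataFrame(
    {
        "num_upvotes": [1, 5, 50, 2000],
        "created_on": ["2010-01-01", "2015-06-15", "2019-11-30", "2021-12-31"],
    }
)

summary = summarise_distribution(toy)
print(summary["n_records"], summary["first_post"], summary["last_post"])
print(summary["upvote_counts"])
```

Run against the full file, `first_post` and `last_post` can be compared with the stated range of 25th February 2009 to 4th January 2022.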

Usage

  • Analyse trends: Identify emerging themes and popular discussions in Machine Learning and Data Science.
  • Study community engagement: Understand how users interact with content and each other within specialised subreddits.
  • Develop NLP applications: Utilise post titles and bodies for text classification, topic modelling, and sentiment analysis.
  • Inform content strategies: Pinpoint effective content types and discussion starters for technical communities.
  • Research online behaviour: Examine patterns of upvotes, comments, and post creation over time.
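
Two of the uses above, tracking activity over time and a first pass at the text, can be sketched in a few lines. This is an illustrative example under stated assumptions: the helper names `yearly_activity` and `top_title_terms` are hypothetical, the toy rows are synthetic, and the word counting is a deliberately simple stand-in for proper NLP preprocessing.

```python
import re
from collections import Counter

import pandas as pd


def yearly_activity(df):
    """Posts per subreddit per year, with mean upvotes and comments."""
    created = pd.to_datetime(df["created_on"], errors="coerce")
    return (
        df.assign(year=created.dt.year)
        .groupby(["subreddit", "year"])
        .agg(
            posts=("id", "count"),
            mean_upvotes=("num_upvotes", "mean"),
            mean_comments=("num_comments", "mean"),
        )
        .reset_index()
    )


def top_title_terms(df, n=10, min_len=3):
    """Most frequent lowercase alphabetic tokens in post titles."""
    counts = Counter()
    for title in df["title"].dropna():
        tokens = re.findall(r"[a-z]+", str(title).lower())
        counts.update(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Toy rows (synthetic) using the documented column names.
toy = pd.DataFrame(
    {
        "id": ["a1", "a2", "a3"],
        "title": ["Best NLP course", "NLP for beginners", "Pandas groupby help"],
        "subreddit": ["LanguageTechnology", "learnpython", "learnpython"],
        "num_upvotes": [10, 20, 40],
        "num_comments": [2, 4, 8],
        "created_on": ["2021-05-01", "2020-03-10", "2020-07-22"],
    }
)

print(yearly_activity(toy))
print(top_title_terms(toy, n=3))
```

For topic modelling or sentiment analysis, the `title` and `body` columns would normally go through a dedicated tokenizer and stop-word filter rather than the regular expression used here.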

Coverage

  • Geographic Scope: Global.
  • Time Range: Posts created between 25th February 2009 and 4th January 2022.
  • Demographic Scope: Focuses on content and discussions from Reddit users interested in Machine Learning and Data Science.

License

CC-BY-SA

Who Can Use It

  • Data Scientists: For exploring social media data, training models, and gaining industry insights.
  • Machine Learning Engineers: To understand real-world text data for model development and evaluation.
  • Academics and Researchers: For studies on online communities, information propagation, and technical communication.
  • Social Media Analysts: To monitor and understand discussions in specific technical niches.
  • Students: As a practical dataset for learning data analysis, NLP, and social computing.

Dataset Name Suggestions

  • Reddit Machine Learning & Data Science Posts
  • ML & Data Science Subreddit Activity
  • Reddit ML/DS Community Discussions
  • Machine Learning and Data Science Reddit Dataset
  • Social Media Data for AI & Data Science

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0