Opendatabay APP

Reddit Question Score Data

Reddit & Forum Data

Tags and Keywords

Reddit, Questions, NLP, AskReddit, Text


Free

About

This corpus contains one million questions posted to the online community /r/AskReddit, collected using the SocialGrep tool. It was curated to support work on a long-standing problem in Natural Language Processing (NLP): automated question answering. Each entry includes metadata such as the creation timestamp and the score the Reddit community gave the post.

Columns

This collection consists of twelve distinct fields:
  • type: Denotes the content type; every entry in this specific dataset is labelled as a 'post'.
  • id: The unique base36 identifier assigned to the post.
  • subreddit.id: The base36 identifier for the originating subreddit.
  • subreddit.name: The human-readable name of the subreddit, which is consistently 'askreddit'.
  • subreddit.nsfw: A boolean value indicating whether the subreddit is Not Safe For Work (NSFW); it is recorded as 'false' for every post.
  • created_utc: The precise Coordinated Universal Time (UTC) stamp when the post was created.
  • permalink: The permanent link that directs users to the original post's page.
  • domain: The post's originating domain, which is invariably 'self.askreddit'.
  • url: The URL associated with the post; approximately 48% of these values are missing.
  • selftext: Any accompanying self-text from the post; roughly 52% of the values are missing. Note that 34% of the existing unique values are recorded as [removed].
  • title: The title of the post, which constitutes the question itself.
  • score: The numerical score assigned to the post by Reddit users, with an average (mean) score of 19.8.
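
The missing-value figures quoted for url and selftext can be checked with a short script. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses exactly the twelve column names listed above and that a missing value appears as an empty cell; the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# The twelve column names as described in the listing.
COLUMNS = [
    "type", "id", "subreddit.id", "subreddit.name", "subreddit.nsfw",
    "created_utc", "permalink", "domain", "url", "selftext", "title", "score",
]

def missing_rates(csv_file):
    """Return the fraction of empty cells per column for an open CSV file."""
    reader = csv.DictReader(csv_file)
    empty = Counter()
    total = 0
    for row in reader:
        total += 1
        for name in COLUMNS:
            # ASSUMPTION: a missing value is stored as an empty cell.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return {name: (empty[name] / total if total else 0.0) for name in COLUMNS}

# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,domain,url,selftext,title,score\n"
    "post,abc1,xyz,askreddit,false,1630000000,/r/x/1,self.askreddit,,"
    "[removed],What is your favourite book?,12\n"
    "post,abc2,xyz,askreddit,false,1630000100,/r/x/2,self.askreddit,"
    "https://example.com,,Why is the sky blue?,3\n"
)
rates = missing_rates(sample)
print(rates["url"], rates["selftext"])
```

To run this against the real file, the in-memory sample could be replaced with `open("one-million-reddit-questions.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.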

Distribution

The dataset is packaged as a single CSV file, one-million-reddit-questions.csv, with a size of 294.02 MB. It contains exactly one million records.
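
Because the file is fairly large, the record count can be verified by streaming it rather than loading it into memory. The sketch below is one possible approach, not a prescribed one: it uses the standard `csv` parser instead of counting lines, since a text field may contain line breaks. The helper name `count_records` and the temporary demo file are illustrative only.

```python
import csv
import os
import tempfile

def count_records(path):
    """Count data rows (excluding the header) by streaming the file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# Demonstration on a small temporary file; for the real dataset, pass the
# path to one-million-reddit-questions.csv instead.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "title", "score"])
    # A quoted field containing a newline still counts as one record.
    writer.writerow(["abc1", "A question with\na line break?", "5"])
    writer.writerow(["abc2", "Another question?", "0"])
    demo_path = tmp.name

n_rows = count_records(demo_path)
os.remove(demo_path)
print(n_rows)
```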

Usage

This data is well suited to research in Natural Language Processing (NLP), especially developing and testing automated question answering systems. It can also support analytical projects that examine what makes a question popular or effective on Reddit. It is also suitable for regression modelling, for example relating features of a post's title to its score.
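
As a rough illustration of the regression use case, the sketch below derives a few simple features from a title and fits a one-variable least-squares line of score against word count. The feature set, the helper names `title_features` and `fit_line`, and the (title, score) pairs are all invented for the example; they say nothing about the real relationship in the data.

```python
def title_features(title):
    """Derive a few simple, illustrative features from a post title."""
    words = title.split()
    return {
        "n_words": len(words),
        "n_chars": len(title),
        "ends_with_question_mark": title.rstrip().endswith("?"),
    }

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up (title, score) pairs, chosen only to exercise the code.
posts = [
    ("Why?", 1),
    ("What is your favourite book?", 9),
    ("What is the best advice you have ever received?", 17),
]
xs = [title_features(title)["n_words"] for title, _ in posts]
ys = [score for _, score in posts]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

In practice a library such as scikit-learn or statsmodels would be a more natural choice for multi-feature models; the hand-written fit is used here only to keep the example dependency-free.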

Coverage

This data is confined to posts from the /r/AskReddit community. The collected posts go back from September 2021. No updates are scheduled for this version of the dataset: the expected update frequency is listed as "Never".
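
The date range can be confirmed from the created_utc column. The listing does not state how that column is encoded, so the sketch below assumes Unix epoch seconds, which would need to be checked against the actual file. The two timestamps are illustrative values, not rows from the dataset.

```python
from datetime import datetime, timezone

def to_utc_datetime(created_utc):
    """Convert a created_utc value to a timezone-aware UTC datetime."""
    # ASSUMPTION: created_utc holds Unix epoch seconds (not confirmed
    # by the listing).
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)

# Illustrative timestamps only.
stamps = ["1630454400", "1609459200"]
dates = [to_utc_datetime(stamp) for stamp in stamps]
earliest, latest = min(dates), max(dates)
print(earliest.date(), latest.date())
```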

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

  • NLP Researchers: To train machine learning models for processing conversational data and generating effective responses.
  • Data Scientists: To perform text analysis, sentiment scoring, and regression analysis correlating title features with post scores.
  • Students/Beginners: To practise large-scale text processing or business analysis projects.

Dataset Name Suggestions

  • AskReddit 1M Question Corpus
  • One Million Social Media Questions
  • Reddit Question Score Data
  • NLP Question Source

Attributes

Original Data Source: Reddit Question Score Data

Listing Stats

Views: 1
Downloads: 0
Listed: 31/10/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
