Reddit Question Score Data

FREE DATASET LIBRARY

Verified Data Provider

£0

Reddit & Forum Data

Tags and Keywords

Reddit, Questions, NLP, AskReddit, Text

About

This corpus contains one million questions sourced from the popular online community /r/AskReddit, collected using the SocialGrep tool. It was curated to help with long-standing problems in Natural Language Processing (NLP), specifically automated question answering. Each entry includes metadata such as the date of creation and the score the Reddit community assigned to the post.

Columns

This collection consists of twelve distinct fields:

type: Denotes the content type; every entry in this specific dataset is labelled as a 'post'.
id: The unique base36 identifier assigned to the post.
subreddit.id: The base36 identifier for the originating subreddit.
subreddit.name: The human-readable name of the subreddit, which is consistently 'askreddit'.
subreddit.nsfw: A boolean indicating whether the subreddit is Not Safe For Work (NSFW); it is recorded as 'false' for every post.
created_utc: The precise Coordinated Universal Time (UTC) stamp when the post was created.
permalink: The permanent link that directs users to the original post's page.
domain: The post's originating domain, which is invariably 'self.askreddit'.
url: The URL associated with the post; approximately 48% of these values are missing.
selftext: Any accompanying self-text from the post; roughly 52% of the values are missing. Note that 34% of the existing unique values are recorded as [removed].
title: The title of the post, which constitutes the question itself.
score: The numerical score assigned to the post by Reddit users, with an average (mean) score of 19.8.
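
The column list above can be turned into a small loading routine. The following is a minimal sketch using only the Python standard library; the helper names (clean_row, read_questions) and the two sample rows are hypothetical and for illustration only. Treating '[deleted]' as missing text and storing created_utc as an epoch value are assumptions not stated in the listing.

```python
import csv
import io

# Column names as listed in the dataset description.
COLUMNS = [
    "type", "id", "subreddit.id", "subreddit.name", "subreddit.nsfw",
    "created_utc", "permalink", "domain", "url", "selftext", "title", "score",
]

# Values treated as "no body text". "[removed]" is mentioned in the
# description; "[deleted]" is an assumption.
MISSING_TEXT = {"", "[removed]", "[deleted]"}


def clean_row(row):
    """Return a copy of a raw CSV row with an integer score and normalised selftext."""
    cleaned = dict(row)
    cleaned["score"] = int(row["score"])
    text = (row.get("selftext") or "").strip()
    cleaned["selftext"] = None if text in MISSING_TEXT else text
    return cleaned


def read_questions(handle):
    """Yield cleaned rows from any file-like object that contains the CSV."""
    for row in csv.DictReader(handle):
        yield clean_row(row)


# Two made-up rows for illustration only (not taken from the dataset).
SAMPLE = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "post,abc123,2qh1i,askreddit,false,1630454400,https://example.invalid/1,"
    "self.askreddit,,[removed],What is your favourite book?,12\n"
    "post,abc124,2qh1i,askreddit,false,1630454460,https://example.invalid/2,"
    "self.askreddit,,Curious to hear.,Why is the sky blue?,3\n"
)

rows = list(read_questions(SAMPLE))
print(rows[0]["selftext"], rows[1]["score"])
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example open("one-million-reddit-questions.csv", newline="", encoding="utf-8"); the encoding is an assumption.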

Distribution

The dataset is packaged as a single CSV file, one-million-reddit-questions.csv, with a size of 294.02 MB. It contains exactly one million records.

Usage

This data is well suited to research in Natural Language Processing (NLP), especially for developing and testing automated question answering systems. It also supports analytical projects that examine what makes a question popular or effective on Reddit. Because each post carries a numeric score, it is suitable for regression modelling as well.
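
As one illustration of the regression use case, the sketch below derives a few simple features from a question title and fits a one-variable least-squares line of score against word count. It is a toy example under stated assumptions, not a recommended model: the function names (title_features, fit_line) are hypothetical, and the titles and scores are made up rather than taken from the dataset.

```python
def title_features(title):
    """Compute a few simple, hypothetical features from a question title."""
    words = title.split()
    return {
        "n_words": len(words),
        "n_chars": len(title),
        "ends_with_question_mark": title.rstrip().endswith("?"),
    }


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up titles and scores, for illustration only.
titles = [
    "Why is the sky blue?",
    "What is the best advice you have ever received?",
    "Thoughts?",
]
scores = [3, 40, 1]

word_counts = [title_features(t)["n_words"] for t in titles]
slope, intercept = fit_line(word_counts, scores)
print(f"score ~ {slope:.2f} * n_words + {intercept:.2f}")
```

On the full dataset the score distribution is likely to be heavily skewed, so a transformation such as log(1 + score) may be worth trying before fitting; that is a general modelling suggestion, not something the listing states.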

Coverage

This data is confined exclusively to posts from the /r/AskReddit community. The posts were collected going back in time from September 2021. This version of the dataset has no scheduled updates; the expected update frequency is listed as "Never".
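
To check the time coverage, posts can be grouped by month using the created_utc column. The sketch below assumes created_utc holds Unix epoch seconds; the listing does not state the storage format, so this should be verified against the file. The function names and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_created_utc(value):
    """Convert an epoch-seconds value (assumed format) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def year_month(value):
    """Return the 'YYYY-MM' bucket for a created_utc value."""
    return parse_created_utc(value).strftime("%Y-%m")


def posts_per_month(values):
    """Count posts per 'YYYY-MM' bucket."""
    return Counter(year_month(v) for v in values)


# Made-up timestamps for illustration only.
sample_timestamps = ["1630454400", "1630454399", "1630540800"]
print(posts_per_month(sample_timestamps))
```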

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

NLP Researchers: To train machine learning models for processing conversational data and generating effective responses.
Data Scientists: To perform text analysis, sentiment scoring, and regression analysis correlating title features with post scores.
Students/Beginners: To practise large-scale text processing or business analysis projects.

Dataset Name Suggestions

AskReddit 1M Question Corpus
One Million Social Media Questions
Reddit Question Score Data
NLP Question Source

Attributes

Original Data Source: Reddit Question Score Data

Listing Stats

Listed: 31/10/2025
Region: Global
Quality: 5 / 5
Version: 1.0
