Reddit Comedy Text Generation Repository

Reddit & Forum Data

Tags and Keywords

Jokes
Reddit
NLP
Humour
Sentiment

Reddit Comedy Text Generation Repository Dataset on Opendatabay data marketplace

Free

About

Humorous content sourced from the popular r/jokes subreddit provides a vast repository for linguistic and sentiment research. By aggregating one million unique entries, the material allows for the exploration of comedic structures, audience engagement patterns through Reddit scores, and the evolution of digital humour. Each entry is annotated with its popularity metric, making it a significant resource for training machine learning models to understand context, irony, and the nuances of short-form social text.

Columns

type: Categorises the entry as either a 'post' or a 'comment'.
id: A unique base-36 identifier assigned to each specific data point on the platform.
subreddit.id: The unique base-36 identifier for the specific subreddit community.
subreddit.name: The human-readable name of the subreddit, which is consistently "jokes".
subreddit.nsfw: A boolean flag indicating whether the content is marked as Not Safe For Work.
created_utc: A UTC timestamp recording the exact moment the post was created.
permalink: A direct URL serving as a reference link to the original post on Reddit.
score: A numerical value reflecting the popularity or unpopularity of the post based on user votes.
domain: The internet domain associated with the data point, often "self.jokes".
url: The specific web address linked within the post, where applicable.
selftext: The primary body text of the joke, containing the setup or punchline.
title: The headline or opening line of the Reddit post.
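The column list above can be turned into a small typed reader. The following is a minimal sketch, not the provider's own loader. It assumes that created_utc is stored as Unix epoch seconds, that subreddit.nsfw is serialised as "true"/"false", and that score is always an integer; none of these encodings is confirmed by the listing. The sample row is invented for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Column names as documented in the listing.
EXPECTED_COLUMNS = [
    "type", "id", "subreddit.id", "subreddit.name", "subreddit.nsfw",
    "created_utc", "permalink", "score", "domain", "url", "selftext", "title",
]


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed Python values."""
    parsed = dict(row)
    # Assumption: booleans are serialised as "true"/"false" (any casing).
    parsed["subreddit.nsfw"] = row["subreddit.nsfw"].strip().lower() == "true"
    # Assumption: created_utc holds Unix epoch seconds.
    parsed["created_utc"] = datetime.fromtimestamp(
        int(row["created_utc"]), tz=timezone.utc
    )
    # Assumption: score is never empty.
    parsed["score"] = int(row["score"])
    return parsed


def read_jokes(handle):
    """Yield typed rows from an open CSV handle, checking the header first."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield parse_row(row)


# Invented sample data with the documented header, for illustration only.
SAMPLE = (
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,score,domain,url,selftext,title\n"
    "post,abc123,2qh72,jokes,false,1577836800,"
    "https://example.invalid/abc123,12,self.jokes,,A punchline.,A setup\n"
)

rows = list(read_jokes(io.StringIO(SAMPLE)))
```

For the real file, the same function could be fed a handle opened with open("one-million-reddit-jokes.csv", newline="", encoding="utf-8"); the UTF-8 encoding is likewise an assumption.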

Distribution

The data is delivered as a single CSV file, one-million-reddit-jokes.csv (299.96 MB), containing exactly 1,000,000 records. Core fields such as id, type, and subreddit.name have a 100% validity rate. This is a static release; no updates are expected.
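The record count and the validity of the core fields can be checked in one streaming pass, without loading the whole file into memory. This is a rough sketch: the listing does not define "validity", so the rules below (id is a non-empty lowercase base-36 string, type is 'post' or 'comment', subreddit.name is "jokes") are assumptions derived from the column descriptions. The two sample rows are invented.

```python
import csv
import io
import string

# Characters allowed in a lowercase base-36 identifier.
BASE36 = set(string.digits + string.ascii_lowercase)


def is_valid_core(row):
    """Apply the assumed validity rules to the three core fields."""
    id_ok = bool(row["id"]) and set(row["id"]) <= BASE36
    type_ok = row["type"] in {"post", "comment"}
    name_ok = row["subreddit.name"] == "jokes"
    return id_ok and type_ok and name_ok


def validity_rate(handle):
    """Return (record count, fraction of records passing is_valid_core)."""
    total = 0
    valid = 0
    for row in csv.DictReader(handle):
        total += 1
        if is_valid_core(row):
            valid += 1
    return total, (valid / total if total else 0.0)


# Invented sample: the second row has an empty id, so it fails the check.
SAMPLE = (
    "type,id,subreddit.name\n"
    "post,abc123,jokes\n"
    "post,,jokes\n"
)

total, rate = validity_rate(io.StringIO(SAMPLE))
```

On the full file, a result of 1,000,000 records with a rate of 1.0 would be consistent with the figures quoted above, provided the assumed rules match the provider's definition.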

Usage

This resource is ideal for natural language processing tasks, including text generation, sentiment analysis, and joke classification. It can be used to study the correlation between linguistic patterns and post popularity, or to develop algorithms capable of identifying humorous intent. Additionally, the data supports text segmentation and classification projects focused on distinguishing between safe and "NSFW" content in social media environments.
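As one illustration of relating a linguistic pattern to popularity, the sketch below computes the Pearson correlation between joke length in words (title plus selftext) and score. Word count is only one possible feature, chosen here for simplicity, and the helper names are hypothetical. The three rows are invented so that the example is self-contained.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(rows):
    """Correlate word count of title + selftext with the score column."""
    lengths = [len((r["title"] + " " + r["selftext"]).split()) for r in rows]
    scores = [int(r["score"]) for r in rows]
    return pearson(lengths, scores)


# Invented rows: word counts 2, 4, 6 paired with scores 1, 2, 3.
SAMPLE_ROWS = [
    {"title": "a", "selftext": "b", "score": "1"},
    {"title": "a b", "selftext": "c d", "score": "2"},
    {"title": "a b c", "selftext": "d e f", "score": "3"},
]

r = length_score_correlation(SAMPLE_ROWS)
```

Reddit scores tend to be heavily skewed, so on real data a rank-based measure or a log transform of the score may be more informative than a raw Pearson coefficient.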

Coverage

The scope is digital and platform-specific, focusing on the /r/jokes community on Reddit. The content is primarily in English and spans a multi-year period, with timestamps ranging approximately from 2015 to 2020. While the geographic location of users is not restricted, the demographic reach reflects the global, English-speaking user base of the Reddit platform during this timeframe.

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

Academic researchers in the fields of linguistics and social science can utilise this corpus for large-scale studies on digital humour. Data scientists and AI engineers can employ the records to train and refine generative text models. Furthermore, social media analysts can use the scores and timestamps to investigate trends in viral content and the mechanics of online engagement.

Dataset Name Suggestions

One Million Reddit Jokes Corpus
r/jokes Linguistic and Sentiment Analysis Dataset
Social Media Humour and Popularity Index
Reddit Comedy Text Generation Repository

Attributes

Original Data Source: Reddit Comedy Text Generation Repository

Listing Stats

Listed: 19/12/2025
Region: Global
Quality: 5 / 5
Version: 1.0
