Social Media Comedy Data Set

Reddit & Forum Data

Tags and Keywords

Jokes, Reddit, NLP, Humour, Score

Social Media Comedy Data Set, a dataset listed on the Opendatabay data marketplace


Free

About

This collection comprises one million joke posts scraped from the /r/jokes subreddit, providing a large volume of social media text focused on humour. Its primary purpose is to enable analysis of textual humour and audience reception: every post is annotated with its community popularity score. The dataset is aimed at researchers and developers seeking to uncover the mechanics of comedy, addressing the challenge of defining what makes content funny in a crowd-sourced environment.

Columns

  • type: Indicates the nature of the data, which is consistently 'post' across all records.
  • id: The unique Base36 identification tag assigned to the individual post.
  • subreddit.id: The Base36 identification tag for the subreddit from which the post originated.
  • subreddit.name: The human-readable name of the source subreddit, which is uniformly 'jokes'.
  • subreddit.nsfw: A boolean field indicating if the subreddit is Not Safe For Work, which is always recorded as 'false'.
  • created_utc: The UTC timestamp recording the exact time the post was created.
  • permalink: The direct link to the specific post on Reddit.
  • domain: The domain associated with the post's link, primarily recording 'self.jokes'.
  • url: The URL link of the post, if one exists (a high percentage of values are missing).
  • selftext: The body text of the post itself, sometimes appearing as '[removed]' or '[deleted]'.
  • title: The primary title text of the post.
  • score: The rating or popularity score awarded to the post by the community.
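
The schema above can be exercised with a short sketch. The two rows below are invented for illustration; only the column names and the '[removed]'/'[deleted]' sentinel values come from the dataset description.

```python
import csv
import io
from datetime import datetime, timezone

# Two invented rows following the 12-column schema described above;
# only the column names and the '[removed]' sentinel come from the listing.
sample = io.StringIO(
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,domain,url,selftext,title,score\n"
    "post,abc123,2qh72,jokes,false,1585699200,"
    "/r/Jokes/comments/abc123/,self.jokes,,Because it was two-tired.,"
    "Why did the bicycle fall over?,412\n"
    "post,def456,2qh72,jokes,false,1585612800,"
    "/r/Jokes/comments/def456/,self.jokes,,[removed],A removed joke,3\n"
)

rows = list(csv.DictReader(sample))

# Posts whose body was removed or deleted carry a sentinel in selftext.
usable = [r for r in rows if r["selftext"] not in ("[removed]", "[deleted]")]

# created_utc is a Unix timestamp; convert it to an aware UTC datetime.
posted = datetime.fromtimestamp(int(usable[0]["created_utc"]), tz=timezone.utc)
print(len(usable), posted.date())  # → 1 2020-04-01
```

Filtering the sentinel values before modelling matters here, since a '[removed]' body leaves only the title as usable text.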

Distribution

The data is delivered as a single CSV file named one-million-reddit-jokes.csv, approximately 299.96 MB in size. It contains exactly one million unique records structured across 12 distinct columns. The dataset is static; no updates are expected.
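
At roughly 300 MB, the file fits in memory on most machines, but a streaming pass is a safer default when derived structures are built alongside it. A minimal stdlib sketch of chunked iteration (the in-memory stream with a toy two-column schema stands in for the real file):

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, size):
    """Yield successive lists of up to `size` rows from a CSV reader."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

# Stand-in stream with a toy schema; in practice you would pass a file
# handle opened on one-million-reddit-jokes.csv instead.
data = io.StringIO("title,score\n" + "".join(f"joke {i},{i}\n" for i in range(10)))

sizes = [len(chunk) for chunk in iter_chunks(csv.DictReader(data), 4)]
print(sizes)  # → [4, 4, 2]
```

The same pattern maps directly onto pandas' `read_csv(..., chunksize=...)` if a DataFrame per chunk is preferred.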

Usage

This dataset is ideally suited for Natural Language Processing (NLP) tasks, particularly in areas like sentiment analysis, text generation, and linguistic pattern recognition related to humour. It can be utilised for training models to classify or create humorous content, and for performing cultural studies on social media text based on popularity scores.
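
For classification-style experiments, the raw score column must first be turned into labels. A median split is one common choice; the threshold and the scores below are illustrative assumptions, not something the dataset prescribes.

```python
from statistics import median

# Hypothetical preprocessing: turn raw popularity scores into binary
# "popular" labels via a median split. The threshold choice is an
# assumption, not part of the dataset.
scores = [3, 1, 412, 57, 0, 8]                 # invented example scores
threshold = median(scores)
labels = [1 if s > threshold else 0 for s in scores]
print(threshold, labels)  # → 5.5 [0, 0, 1, 1, 0, 1]
```

Because Reddit scores are heavily skewed, a median split (or a log transform for regression) tends to be more robust than a fixed cutoff.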

Coverage

The dataset covers posts created up to and including April 1, 2020. The content is sourced globally from the English-language social network Reddit, specifically the /r/jokes community, and reflects what users posted in that subreddit during this timeframe.

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

  • NLP Engineers and Data Scientists: For developing models that understand, classify, or generate humour based on public reception scores.
  • Academic Researchers: To conduct studies on social media dynamics, linguistic evolution, and cultural trends in popular comedy.
  • Developers: For creating tools that analyse or recommend humorous social media content.

Dataset Name Suggestions

  • Reddit Humour Post Corpus
  • One Million Reddit Jokes Archive
  • Social Media Comedy Data Set
  • Joke Score Prediction Data

Attributes

Original Data Source: Social Media Comedy Data Set

Listing Stats

  • Views: 2
  • Downloads: 0
  • Listed: 19/10/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
