Social Media Comedy Data Set
Free
About
This collection comprises one million joke posts scraped from the /r/jokes subreddit, providing a significant volume of social media text focused on humour. Its primary purpose is to enable analysis of textual humour and audience reception, as every post is annotated with its popularity score. The dataset offers researchers and developers a resource for investigating the mechanics of comedy, addressing the challenge of defining what makes content funny in a crowd-sourced environment.
Columns
- type: Indicates the nature of the data, which is consistently 'post' across all records.
- id: The unique Base36 identification tag assigned to the individual post.
- subreddit.id: The Base36 identification tag for the subreddit from which the post originated.
- subreddit.name: The human-readable name of the source subreddit, which is uniformly 'jokes'.
- subreddit.nsfw: A boolean field indicating if the subreddit is Not Safe For Work, which is always recorded as 'false'.
- created_utc: The UTC timestamp recording the exact time the post was created.
- permalink: The direct link to the specific post on Reddit.
- domain: The domain associated with the post's link, primarily recording 'self.jokes'.
- url: The URL link of the post, if one exists (a high percentage of values are missing).
- selftext: The body text of the post itself, sometimes appearing as '[removed]' or '[deleted]'.
- title: The primary title text of the post.
- score: The rating or popularity score awarded to the post by the community.
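To illustrate the schema above, the sketch below loads a sample record with pandas and converts the epoch timestamp. The sample row is invented for demonstration, not drawn from the dataset, and the pandas-based workflow is an assumption rather than an official loader.

```python
import io
import pandas as pd

# Minimal in-memory sample mimicking the documented 12-column schema
# (the real file is one-million-reddit-jokes.csv; this row is illustrative).
sample_csv = io.StringIO(
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,domain,url,selftext,title,score\n"
    "post,abc123,2qh72,jokes,false,1585699200,"
    "/r/Jokes/comments/abc123/,self.jokes,,Because it was two-tired.,"
    "Why did the bicycle fall over?,42\n"
)

df = pd.read_csv(sample_csv)

# created_utc is a Unix epoch timestamp; convert it to a timezone-aware datetime.
df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)

print(df.shape)               # (1, 12) for this one-row sample
print(df.loc[0, "title"])     # the post's title text
```

Note that the `url` field is empty here, matching the documentation's point that most `url` values are missing for self posts.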
Distribution
The data is delivered as a single CSV file named one-million-reddit-jokes.csv, approximately 299.96 MB in size. It contains exactly one million unique records structured across 12 distinct columns. The data is static, with an expected update frequency of never.
Usage
This dataset is ideally suited for Natural Language Processing (NLP) tasks, particularly in areas like sentiment analysis, text generation, and linguistic pattern recognition related to humour. It can be utilised for training models to classify or create humorous content, and for performing cultural studies on social media text based on popularity scores.
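One way to use the popularity scores for classification, as described above, is to derive a binary "popular" label. The sketch below uses a median split on a tiny stand-in frame; the threshold choice and the example titles are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Stand-in frame with the dataset's title/score fields; rows are invented.
df = pd.DataFrame({
    "title": [
        "Why did the chicken cross the road?",
        "I told my wife she was drawing her eyebrows too high.",
        "What do you call a fish with no eyes?",
        "My dog used to chase people on a bike a lot.",
    ],
    "score": [5, 5000, 120, 30],
})

# Median split: label posts scoring above the median as 'popular' (1).
threshold = df["score"].median()
df["popular"] = (df["score"] > threshold).astype(int)

print(df[["score", "popular"]])
```

The resulting `popular` column could serve as the target for a humour-reception classifier trained on `title` and `selftext`.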
Coverage
The dataset covers posts created on or before April 1, 2020. The content is sourced globally from the English-language social network Reddit, specifically the /r/jokes community. Coverage reflects the content posted by users within this specific subreddit during that timeframe.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- NLP Engineers and Data Scientists: For developing models that understand, classify, or generate humour based on public reception scores.
- Academic Researchers: To conduct studies on social media dynamics, linguistic evolution, and cultural trends in popular comedy.
- Developers: For creating tools that analyse or recommend humorous social media content.
Dataset Name Suggestions
- Reddit Humour Post Corpus
- One Million Reddit Jokes Archive
- Social Media Comedy Data Set
- Joke Score Prediction Data
Attributes
Original Data Source: Social Media Comedy Data Set