Reddit Comedy Text Generation Repository
Free
About
Humorous content sourced from the popular r/jokes subreddit provides a vast repository for linguistic and sentiment research. By aggregating one million unique entries, the material allows for the exploration of comedic structures, audience engagement patterns through Reddit scores, and the evolution of digital humour. Each entry is annotated with its popularity metric, making it a significant resource for training machine learning models to understand context, irony, and the nuances of short-form social text.
Columns
- type: Categorises the entry as either a 'post' or a 'comment'.
- id: A unique base-36 identifier assigned to each specific data point on the platform.
- subreddit.id: The unique base-36 identifier for the specific subreddit community.
- subreddit.name: The human-readable name of the subreddit, which is consistently "jokes".
- subreddit.nsfw: A boolean flag indicating whether the content is marked as Not Safe For Work.
- created_utc: A UTC timestamp recording the exact moment the entry was created.
- permalink: A direct URL serving as a reference link to the original entry on Reddit.
- score: A numerical value reflecting the popularity or unpopularity of the entry based on user votes.
- domain: The internet domain associated with the data point, often "self.jokes".
- url: The specific web address linked within the post, where applicable.
- selftext: The primary body text of the joke, containing the setup or punchline.
- title: The headline or opening line of the Reddit post.
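As a rough illustration of how the columns above might be read, here is a minimal sketch using only the Python standard library. It assumes that created_utc is stored as integer epoch seconds and that subreddit.nsfw is serialised as the strings "true" or "false"; neither detail is stated in this listing. The helper names parse_row and read_jokes and the sample row are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed Python values."""
    return {
        "type": row["type"],
        "id": row["id"],
        # Assumption: the boolean flag is written as "true" / "false".
        "nsfw": row["subreddit.nsfw"].strip().lower() == "true",
        # Assumption: created_utc holds integer seconds since the Unix epoch.
        "created": datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc),
        "score": int(row["score"]),
        "title": row["title"],
        "selftext": row["selftext"],
    }


def read_jokes(handle):
    """Yield typed records one at a time so the whole file never sits in memory."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Made-up sample row using the column names listed above (not real data).
SAMPLE = (
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,"
    "permalink,score,domain,url,selftext,title\n"
    "post,abc123,xyz,jokes,false,1577836800,"
    "https://example.invalid/abc123,42,self.jokes,,Sample punchline,Sample setup\n"
)

jokes = list(read_jokes(io.StringIO(SAMPLE)))
print(jokes[0]["score"], jokes[0]["created"].year)
```

For the real file, the same generator could be fed a handle opened with `open("one-million-reddit-jokes.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.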
Distribution
The information is delivered in a single CSV file titled one-million-reddit-jokes.csv, with a total file size of 299.96 MB. The collection contains exactly 1,000,000 records. Data integrity is high, with a 100% validity rate across core fields such as ID, type, and subreddit name. The material is provided as a static release and is not expected to be updated.
Usage
This resource is ideal for natural language processing tasks, including text generation, sentiment analysis, and joke classification. It can be used to study the correlation between linguistic patterns and post popularity, or to develop algorithms capable of identifying humorous intent. Additionally, the data supports text segmentation and classification projects focused on distinguishing between safe and "NSFW" content in social media environments.
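One simple way to begin studying the link between linguistic patterns and popularity is to correlate a text feature with the score. The sketch below uses word count as a stand-in feature and a hand-written Pearson coefficient. The function names and the three sample records are hypothetical, and word count is only one of many features that could be tried.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def full_text(title, selftext):
    """Join the title (often the setup) and the body into one string."""
    return f"{title}\n{selftext}".strip()


def length_score_correlation(records):
    """Correlate whitespace-delimited word count with the score column."""
    lengths = [len(full_text(r["title"], r["selftext"]).split()) for r in records]
    scores = [r["score"] for r in records]
    return pearson(lengths, scores)


# Made-up records for illustration only (word counts 2, 4, 6).
records = [
    {"title": "a b", "selftext": "", "score": 1},
    {"title": "a b", "selftext": "c d", "score": 2},
    {"title": "a b c", "selftext": "d e f", "score": 3},
]
print(length_score_correlation(records))
```

Reddit scores tend to be heavily skewed, so a rank-based measure or a log transform of the score may be more informative than a plain Pearson coefficient on real data.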
Coverage
The scope is digital and platform-specific, focusing on the /r/jokes community on Reddit. The content is primarily in English and spans a multi-year period, with timestamps ranging approximately from 2015 to 2020. While the geographic location of users is not restricted, the demographic reach reflects the global, English-speaking user base of the Reddit platform during this timeframe.
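The stated 2015 to 2020 range can be checked by counting entries per calendar year. This is a minimal sketch that assumes created_utc values are integer epoch seconds; the function name entries_per_year and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def entries_per_year(timestamps):
    """Count entries per UTC calendar year, returned in ascending year order."""
    years = (datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in timestamps)
    return dict(sorted(Counter(years).items()))


# Made-up timestamps: one at the start of 2015 and two at the start of 2020 (UTC).
sample_timestamps = [1420070400, 1577836800, 1577836801]
print(entries_per_year(sample_timestamps))
```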
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
Academic researchers in the fields of linguistics and social science can utilise this corpus for large-scale studies on digital humour. Data scientists and AI engineers can employ the records to train and refine generative text models. Furthermore, social media analysts can use the scores and timestamps to investigate trends in viral content and the mechanics of online engagement.
Dataset Name Suggestions
- One Million Reddit Jokes Corpus
- r/jokes Linguistic and Sentiment Analysis Dataset
- Social Media Humour and Popularity Index
- Reddit Comedy Text Generation Repository
Attributes
Original Data Source: Reddit Comedy Text Generation Repository
