1M NLP Social Comment Dataset
Data Science and Analytics
Free
About
This collection provides 1 million anonymised comments drawn from 40 highly frequented subreddits, all posted in May 2019. It is designed primarily as an NLP dataset and is balanced by construction: 25,000 comments were sampled from each of the 40 selected subreddits. The features were chosen for variety of type: the comment body (text), a binary controversiality metric aggregated by Reddit, and the net score (upvotes minus downvotes) of the comment. The dataset is suitable for building simple models and exploring online discourse.
Columns
- subreddit (Categorical): Identifies the specific subreddit on which the comment was posted. The dataset contains 40 unique subreddit names.
- body (String): The complete content of the comment.
- controversiality (Binary): A metric aggregated by Reddit indicating whether the comment was controversial (the labels are 0 and 1).
- score (Scalar): The total number of upvotes minus downvotes received by the comment.
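The column list above can be turned into a typed record when reading the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row whose names match the four columns listed here; the `Comment` class, the `read_comments` helper, and the sample rows are hypothetical and serve only to illustrate the schema.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass(frozen=True)
class Comment:
    subreddit: str         # categorical: one of the 40 subreddit names
    body: str              # full comment text
    controversiality: int  # binary label: 0 or 1
    score: int             # upvotes minus downvotes


def read_comments(handle: TextIO) -> Iterator[Comment]:
    """Yield typed Comment records from an open CSV handle.

    Assumes a header row with the column names used in this listing.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        flag = int(row["controversiality"])
        if flag not in (0, 1):
            raise ValueError(f"unexpected controversiality value: {flag}")
        yield Comment(
            subreddit=row["subreddit"],
            body=row["body"],
            controversiality=flag,
            score=int(row["score"]),
        )


# Tiny in-memory sample standing in for the real file (rows are made up).
SAMPLE_CSV = (
    "subreddit,body,controversiality,score\n"
    'example_sub,"A made-up comment, with a comma",0,12\n'
    'other_sub,"Another made-up comment\nspanning two lines",1,-3\n'
)

sample_comments = list(read_comments(io.StringIO(SAMPLE_CSV)))
for comment in sample_comments:
    print(comment.subreddit, comment.controversiality, comment.score)
```

To read the actual file, the same function could be called on `open("kaggle_RC_2019-05.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption, and `newline=""` lets the `csv` module handle comment bodies that contain line breaks.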
Distribution
The data is delivered in CSV format as kaggle_RC_2019-05.csv, with a file size of approximately 185.93 MB. It contains exactly 1 million records, distributed uniformly with 25,000 comments from each of the 40 distinct subreddits. The dataset excludes comments that were removed, comments whose authors were deleted, and comments containing fewer than 4 tokens.
Usage
This data suits a range of analytical and machine learning tasks. It is particularly suited to Natural Language Processing applications such as classification, for example predicting a comment's controversiality or score from its content. Researchers can also use the uniform structure to perform categorical analysis on the subreddit feature, comparing language and engagement metrics across different online communities.
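As one way to approach the per-community comparison, the sketch below aggregates the comment count, mean score, and share of controversial comments for each subreddit. It uses only the standard library. The `SubredditStats` type, the `summarise_by_subreddit` helper, and the input rows are hypothetical and invented for illustration; they are not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class SubredditStats(NamedTuple):
    count: int                 # number of comments seen for the subreddit
    mean_score: float          # average of (upvotes minus downvotes)
    controversial_rate: float  # share of comments with controversiality == 1


def summarise_by_subreddit(
    rows: Iterable[Tuple[str, int, int]],
) -> Dict[str, SubredditStats]:
    """Aggregate (subreddit, controversiality, score) triples per subreddit."""
    counts = defaultdict(int)
    score_sums = defaultdict(int)
    flagged = defaultdict(int)
    for subreddit, controversiality, score in rows:
        counts[subreddit] += 1
        score_sums[subreddit] += score
        flagged[subreddit] += controversiality
    return {
        name: SubredditStats(n, score_sums[name] / n, flagged[name] / n)
        for name, n in counts.items()
    }


# Made-up rows for illustration only.
rows = [
    ("sub_a", 0, 10),
    ("sub_a", 1, -2),
    ("sub_b", 0, 4),
    ("sub_b", 0, 6),
]
stats = summarise_by_subreddit(rows)
for name, s in sorted(stats.items()):
    print(f"{name}: n={s.count} mean_score={s.mean_score:.2f} "
          f"controversial={s.controversial_rate:.0%}")
```

Run over the full file, the `count` field also gives a quick check of the stated balance: each of the 40 subreddits should report 25,000 comments.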
Coverage
The data exclusively covers comments posted during a specific one-month period: May 2019. The scope includes 40 of the most frequented subreddits active during that period. The data is anonymised, focusing on the content and associated metrics rather than individual author identification.
License
CC0: Public Domain
Who Can Use It
- NLP Practitioners: To train and evaluate models for sentiment analysis, topic modelling, and text classification using real-world social data.
- Data Scientists and Analysts: For conducting studies on online community dynamics, quantifying controversial discussions, and exploring engagement metrics.
- Students and Educators: The balanced structure makes it excellent for teaching principles of categorical data handling and binary classification using large text corpora.
Dataset Name Suggestions
- Reddit Comments May 2019 - Balanced Subreddit Extract
- 1M NLP Social Comment Dataset
- Online Community Discourse Analytics
Attributes
Original Data Source: 1M NLP Social Comment Dataset
