A Year on r/India Sentiment Data
About
This collection provides raw textual data alongside critical metadata for advanced social analysis. It allows users to study the evolution of discussions, identify key topics, and measure emotional responses within one of India’s most significant online forums. It is designed to support detailed natural language processing tasks and quantitative sociology studies.
Columns
- type: Denotes the type of data point, typically 'comment'.
- id: The unique Base-36 identifier assigned to the comment.
- subreddit.id: The unique Base-36 identifier for the subreddit the comment belongs to.
- subreddit.name: The human-readable name of the subreddit, which is consistently /r/India.
- subreddit.nsfw: A boolean value indicating whether the comment originated from a Not Safe For Work subreddit.
- created_utc: The UTC timestamp detailing when the comment was created.
- permalink: The permanent URL link to the comment on Reddit.
- body: The main text content of the comment, which includes markers like [removed] or [deleted] for comments taken down.
- sentiment: The analysed sentiment score for the comment, ranging from -1 (negative) to 1 (positive).
- score: The comment's Reddit score (upvotes minus downvotes).
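The column list above can be turned into a loading routine. The following is a minimal sketch using pandas, not a documented loader for this dataset: the function name load_comments and the dtype choices are illustrative, and it assumes that created_utc is stored as Unix epoch seconds, which the description does not confirm. If the file stores ISO-formatted timestamps instead, the conversion line would need to change.

```python
import pandas as pd

# Column names come from the list above. The dtypes are assumptions made for
# this sketch; subreddit.nsfw is left to pandas' own type inference.
ASSUMED_DTYPES = {
    "type": "string",
    "id": "string",
    "subreddit.id": "string",
    "subreddit.name": "string",
    "permalink": "string",
    "body": "string",
    "sentiment": "float64",  # missing values are read as NaN
    "score": "Int64",        # nullable integer
}


def load_comments(source):
    """Read the comments CSV from a path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source, dtype=ASSUMED_DTYPES)
    # Assumption: created_utc holds Unix epoch seconds.
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)
    return df
```

Calling load_comments("one-year-of-r-india-comments.csv") would read the whole file into memory; for a file of this size, passing usecols or chunksize to pd.read_csv may be preferable on constrained machines.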
Distribution
The data is contained within a single CSV file, one-year-of-r-india-comments.csv, which has a size of 435.47 MB. It features approximately 1.39 million records across 10 columns. Most columns are 100% valid; the exception is the sentiment field, where approximately 26% of values are missing. The dataset is a static snapshot and will not be updated.
Usage
This data is highly suitable for training Natural Language Processing (NLP) models, specifically for tasks like text classification, topic modelling, and sentiment analysis focused on regional dialects and cultural nuances. It can be used by researchers to monitor social shifts, track trends in Indian popular culture, and analyse the performance and impact of online community moderation.
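Before training a text classifier on this data, the placeholder bodies and the rows without a sentiment value usually need to be filtered out. The sketch below shows one way to do that and to derive coarse labels from the sentiment score. The function name prepare_for_classification and the neutral_band threshold of 0.05 are assumptions chosen for illustration; they are not part of the dataset.

```python
import pandas as pd

# Markers the body column uses for comments that were taken down.
PLACEHOLDERS = ["[removed]", "[deleted]"]


def prepare_for_classification(df, neutral_band=0.05):
    """Drop unusable rows and bin sentiment into three labels (illustrative)."""
    body = df["body"].fillna("").str.strip()
    keep = (
        ~body.isin(PLACEHOLDERS)      # comment text is still available
        & (body != "")                # body is not empty
        & df["sentiment"].notna()     # a sentiment score is present
    )
    out = df.loc[keep, ["body", "sentiment"]].copy()
    # Hypothetical binning: scores within +/- neutral_band count as neutral.
    out["label"] = "neutral"
    out.loc[out["sentiment"] > neutral_band, "label"] = "positive"
    out.loc[out["sentiment"] < -neutral_band, "label"] = "negative"
    return out.reset_index(drop=True)
```

Because roughly a quarter of the sentiment values are missing, this filter removes a substantial share of rows; whether that is acceptable depends on the task, and topic modelling, for example, does not need the sentiment column at all.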
Coverage
Geographic coverage is limited to online interactions related to India, as captured in the dedicated /r/India subreddit. The temporal coverage spans one year, starting on 30 September 2020 and concluding on 30 September 2021. No updates are anticipated for this historical snapshot.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Data Scientists and ML Engineers: For developing and evaluating models for sentiment analysis and text generation based on real-world community input.
- Cultural Researchers and Sociologists: To perform quantitative studies on contemporary Indian public opinion and popular discourse.
- Academics: For research into online community dynamics and machine learning applications related to text data.
Dataset Name Suggestions
- Indian Reddit Discourse 2020-2021
- A Year on r/India Sentiment Data
- Online Community Pulse: India
- SocialGrep Indian Reddit Collection
Attributes
Original Data Source: A Year on r/India Sentiment Data
