WSB User Sentiment Analysis Data
Social Media and Posts
Free
About
This dataset provides a record of posts and comments collected from the Reddit community /r/WallStreetBets for the full month of August 2021. The community became globally recognized earlier that year for coordinated actions such as the GameStop squeeze, which demonstrated how social media can significantly influence financial markets. The resource was compiled specifically to help data scientists investigate this intersection of public opinion and finance. It includes content, timestamps, and sentiment analysis results for both posts and comments. Usernames were omitted during collection to maintain trader anonymity and prevent potential harassment.
Columns
The dataset contains 10 attributes detailing user activity and content analysis, totalling 1,001,160 records:
- type: Categorizes the entry as either 'post' or 'comment'. The most common entry is 'comment'. This field is 100% valid.
- id: The unique Base36 ID for the entry. This field contains 1,001,160 unique values and is 100% valid.
- subreddit.id: The Subreddit's unique Base36 ID. This field is 100% valid and holds one unique value.
- subreddit.name: The Subreddit's readable name, which is consistently 'wallstreetbets'. This field is 100% valid.
- subreddit.nsfw: A Boolean indicator for whether the subreddit is Not Safe For Work. All 1,001,160 records are marked as false, and the column is 100% valid.
- created_utc: The timestamp of the comment's creation. This field is 100% valid.
- permalink: The direct link to the comment on Reddit. This field is 100% valid and contains 1,001,160 unique values.
- body: The text body of the comment. This field is 99.999% valid, with 1 missing value. The body content frequently includes the tags '[removed]' (14%) and '[deleted]' (4%).
- sentiment: The result of a sentiment analysis pipeline applied to the text body. This field is 70% valid, with 295k missing values. The mean sentiment score is 0.07, with a standard deviation of 0.43.
- score: The score of the comment. Values range widely, from a minimum of -391 up to a maximum of 21.1k. The mean score is 5.2, and the standard deviation is 70.6. This field is 100% valid.
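As a quick sanity check on the column statistics above, the file could be streamed once to recount entry types, placeholder bodies, and missing sentiment values. The following is a minimal sketch using only the Python standard library. The helper names `summarize` and `summarize_file` are hypothetical, and it assumes that missing values appear as empty CSV fields, which this listing does not confirm.

```python
import csv
from collections import Counter

# Placeholder bodies mentioned in the column description for 'body'.
PLACEHOLDERS = {"[removed]", "[deleted]"}


def summarize(rows):
    """Count entry types, placeholder bodies, and missing sentiment values.

    `rows` is any iterable of dicts keyed by the dataset's column names.
    Assumption: a missing value shows up as an empty (or absent) field.
    """
    summary = Counter()
    for row in rows:
        summary["total"] += 1
        summary["type:" + (row.get("type") or "unknown")] += 1
        if (row.get("body") or "").strip() in PLACEHOLDERS:
            summary["placeholder_body"] += 1
        if not (row.get("sentiment") or "").strip():
            summary["missing_sentiment"] += 1
    return summary


def summarize_file(path):
    """Stream the CSV from disk so the whole file is never held in memory."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize(csv.DictReader(handle))
```

Calling `summarize_file("wsb-aug-2021-comments.csv")` should then yield counts that can be compared against the figures quoted above, such as the roughly 295k missing sentiment values.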
Distribution
The material is distributed in a single CSV file named wsb-aug-2021-comments.csv, which is 248.74 MB in size. The dataset includes 1,001,160 records. Data quality is generally strong across identifier fields (100% valid), but the calculated 'sentiment' field is only 70% valid. The expected update frequency for this resource is Never.
Usage
This resource is ideally suited for data scientists seeking to understand the relationship between social media dialogue and financial markets. It can be utilized to predict stock trends based on Reddit activity, or to analyse how public perception of a stock shifts in response to news events. Furthermore, researchers can seek other potential insights from the board that influenced the Short Squeeze of 2021.
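A common first step toward the analyses described above is to aggregate sentiment per day, so that it can later be lined up against price data or news events. The sketch below is one possible way to do this and is not part of the dataset itself. The function name `daily_mean_sentiment` is hypothetical, and the code assumes that `created_utc` holds Unix epoch seconds (the usual Reddit convention, but not confirmed by this listing) and that missing sentiment values are empty fields.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_mean_sentiment(rows):
    """Return {ISO date: mean sentiment} for rows that have a sentiment value.

    Assumptions (not confirmed by the dataset description):
    - 'created_utc' is a Unix timestamp in seconds;
    - a missing sentiment is an empty field and is skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
    for row in rows:
        raw = (row.get("sentiment") or "").strip()
        if not raw:
            continue  # entries without a sentiment score are excluded
        stamp = int(float(row["created_utc"]))
        day = datetime.fromtimestamp(stamp, tz=timezone.utc).date().isoformat()
        totals[day][0] += float(raw)
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}
```

The function accepts any iterable of row dictionaries, so it can be fed directly from `csv.DictReader` over the distributed file. Skipping rows without a score keeps the daily mean from being biased by the large share of missing sentiment values.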
Coverage
The scope covers the full range of posts and comments made on the Reddit /r/WallStreetBets board during a single month, from August 1 to August 31 of 2021. The content includes raw text, timestamps, and analytical results such as sentiment scores. The data compilation focuses on activity related to market and financial discussion.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
The dataset is intended for data scientists, academics, and researchers interested in economics, market volatility, natural language processing (NLP), and the study of online communities' influence on popular culture and finance. It holds a maximum usability rating of 10.00.
Dataset Name Suggestions
- WallStreetBets Reddit Posts and Comments August 2021
- Social Media Market Influence Data
- WSB User Sentiment Analysis Data
Attributes
Original Data Source: WSB User Sentiment Analysis Data
