Capitol Protest Social Media Dataset
Social Media and Networking
Free
About
This dataset contains over 80,000 tweets posted on 6th January 2021, the day of the Capitol Hill riot. It was collected using the Twitter Developer API and Tweepy. Although smaller than the Parler data dumps, it is well suited to Natural Language Processing (NLP) tasks. Mentions, hyperlinks, emojis, and punctuation have been removed from the tweet text, and all text has been converted to lowercase for consistency. Some tweets include geographical coordinates where the user had geotagging enabled. Information on verified users is included via their usernames. User location is taken from self-reported profile details and is left blank for locations outside the US states or DC.
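The pre-processing described above can be approximated with a short sketch like the one below. This is not the script used to build the dataset; the function name `clean_tweet`, the regular expressions, and the emoji ranges are illustrative assumptions, and the emoji pattern covers only the most common Unicode blocks.

```python
import re
import string

# Assumed patterns; the original cleaning script is not published with the listing.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage: pictographs, misc symbols/dingbats, regional indicators,
# plus the variation selector and zero-width joiner used in emoji sequences.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, emojis and punctuation, then lowercase."""
    text = URL_RE.sub(" ", text)      # URLs first, before their punctuation is stripped
    text = MENTION_RE.sub(" ", text)  # mentions next, while the "@" is still present
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.lower().split())  # lowercase and collapse whitespace
```

Applying the same function to any new text that is compared against the dataset (for example, held-out tweets fed to a trained model) keeps the vocabulary consistent. Note that stripping punctuation removes the "#" from hashtags but keeps the tag word itself.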
Columns
- id: A unique identifier for each tweet.
- text: The content of the tweet.
- query: The search query used to retrieve the tweet.
- usr_id: A unique identifier for the user who posted the tweet.
- username: The username associated with the tweet's author.
- followers: The number of followers the user has.
- tweet count: The total number of tweets posted by the user.
- number of likes: The total count of likes the tweet received.
- number of retweets: The total count of retweets the tweet received.
- location: The user's self-reported location from their profile.
Distribution
The dataset is provided as a CSV file and contains over 80,000 individual tweet records. Each record is structured according to the columns listed above, offering a clear tabular format for data manipulation and analysis.
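As a minimal sketch of reading the file, the snippet below uses Python's standard `csv` module. It assumes the CSV header matches the column names listed above exactly (including the spaces in `tweet count`, `number of likes`, and `number of retweets`); the helper name `load_tweets` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Columns assumed to hold numbers, per the column list above.
NUMERIC_COLUMNS = ("followers", "tweet count", "number of likes", "number of retweets")


def load_tweets(handle):
    """Read tweet records from an open CSV file object into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        for col in NUMERIC_COLUMNS:
            value = (row.get(col) or "").strip()
            # Tolerate values such as "250.0"; blank cells become None.
            row[col] = int(float(value)) if value else None
        rows.append(row)
    return rows


# Illustrative, made-up rows standing in for the real file.
sample = io.StringIO(
    "id,text,query,usr_id,username,followers,tweet count,"
    "number of likes,number of retweets,location\n"
    "1,example tweet text,capitol,10,example_user,250,1200,3,1,Ohio\n"
    "2,another example,capitol,11,other_user,40,90,0,0,\n"
)
tweets = load_tweets(sample)
```

For the real file, the same function can be called with a handle opened as `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.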
Usage
This dataset is ideal for a range of analytical applications, particularly for those with NLP experience. Potential use cases include:
- Sentiment analysis of public opinion surrounding the Capitol Riot.
- Topic modelling to identify key themes and narratives in the social discourse.
- Trend analysis of how discussions evolved throughout the day.
- Social network analysis focusing on user behaviour, verified accounts, and location-based insights related to the event.
- Academic research into political events and social media's role.
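As a starting point for the theme and location-based analyses listed above, the following sketch counts frequent terms and tweets per location. It is a simple frequency baseline, not a topic model; the function names and the small stopword set are assumptions chosen for illustration, and it expects records as dictionaries keyed by the column names above.

```python
from collections import Counter

# A deliberately tiny stopword set for illustration; a real analysis would use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at", "on", "for", "this"}


def top_terms(texts, n=10):
    """Return the n most frequent non-stopword terms across the given texts."""
    counts = Counter()
    for text in texts:
        # The dataset text is already lowercase and punctuation-free,
        # so splitting on whitespace is enough to tokenise it.
        counts.update(word for word in text.split() if word not in STOPWORDS)
    return counts.most_common(n)


def tweets_per_location(rows):
    """Count tweets per self-reported location, skipping blank locations."""
    return Counter(row["location"] for row in rows if row.get("location"))
```

Because locations outside the US states or DC are blank, `tweets_per_location` reflects only the subset of users with a recognised US location.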
Coverage
The dataset covers a single day, 6th January 2021. Geographic coverage is based mainly on user-reported locations within the US (including DC); some tweets also carry precise coordinates where geotagging was active. Information on verified users is included. The tweet text has been pre-processed: mentions, hyperlinks, emojis, and punctuation have been removed, and all text is in lowercase.
License
CC0
Who Can Use It
This dataset is suitable for:
- NLP practitioners and data scientists seeking real-world social media data for model training and linguistic analysis.
- Researchers and academics in fields such as political science, sociology, and media studies, for investigating public discourse during significant events.
- Journalists and analysts interested in understanding the social media landscape of the Capitol Riot.
Dataset Name Suggestions
- Capitol Riot Tweets 2021
- January 6th Twitter Data
- US Capitol Protest Social Media
- 2021 Capitol Hill Tweets Dataset
Attributes
Original Data Source: Capitol Riot Tweets