Olympic Sentiment Analysis Dataset
Social Media and Networking
Free
About
This dataset is a collection of tweets about the Tokyo 2020 Olympics. The tweets were gathered continuously from the Twitter API with the Tweepy Python package, using the '#Tokyo2020' hashtag as the search target. A script running on a Google Cloud Jupyter instance merges newly collected tweets with the previously collected data and saves the result in CSV format; the accumulated dataset is then uploaded to Kaggle at regular intervals. The data supports a range of analyses: studying the subjects of public discourse, Natural Language Processing tasks such as topic modelling and sentiment analysis, identifying tweets about specific sports, countries, or athletes, and tracking news trends during the Olympic Games.
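The merge step in that collection process can be sketched with the Python standard library alone. This is a minimal illustration, not the original collection script: the function names `merge_tweets` and `write_csv` are hypothetical, the sample rows are made up, and de-duplicating on the `id` column is an assumption about how overlapping pulls are reconciled.

```python
import csv
import io


def merge_tweets(existing_rows, new_rows, key="id"):
    """Combine two lists of tweet dicts, keeping one row per tweet id."""
    merged = {row[key]: row for row in existing_rows}
    for row in new_rows:
        # Assumption: on a duplicate id, the earlier copy is kept.
        merged.setdefault(row[key], row)
    return list(merged.values())


def write_csv(rows, fieldnames, handle):
    """Write rows to an open text handle in CSV format with a header."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows standing in for a previous pull and a fresh pull.
existing = [{"id": "1", "user_name": "a"}, {"id": "2", "user_name": "b"}]
new = [{"id": "2", "user_name": "b"}, {"id": "3", "user_name": "c"}]

merged = merge_tweets(existing, new)

# An in-memory buffer is used here; a real run would open a file instead.
buffer = io.StringIO()
write_csv(merged, ["id", "user_name"], buffer)
```

Keying the merge on `id` means a tweet returned by two consecutive pulls appears only once in the saved file.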
Columns
- id: A distinct identifier assigned to each tweet.
- user_name: The name displayed for the user who posted the tweet.
- user_location: The geographical location optionally provided by the user.
- user_description: A brief description provided by the user on their profile.
- user_created: The timestamp indicating when the user's Twitter account was established.
- user_followers: The numerical count of individuals following the user.
- user_friends: The numerical count of accounts the user is following.
- user_favourites: The total number of tweets the user has liked or marked as a favourite.
- user_verified: A boolean flag indicating whether the user's account is verified by Twitter.
- date: The date and time when the tweet was published.
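As a rough sketch of how these columns might be typed after reading the CSV, the snippet below converts one row of strings into Python values. Several details are assumptions rather than facts about the file: the timestamp layout in `TS_FORMAT`, the `True`/`False` text encoding of `user_verified`, and the sample values, which are invented. `parse_row` and `parse_bool` are hypothetical helper names.

```python
from datetime import datetime

# Assumed timestamp layout; adjust if the CSV uses a different one.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_bool(value):
    """Interpret 'True'/'False' text (any casing) as a boolean."""
    return value.strip().lower() == "true"


def parse_row(row):
    """Convert the string fields of one CSV row into typed values."""
    return {
        "id": row["id"],  # kept as text to avoid precision issues
        "user_name": row["user_name"],
        # Empty optional fields become None.
        "user_location": row["user_location"].strip() or None,
        "user_description": row["user_description"].strip() or None,
        "user_created": datetime.strptime(row["user_created"], TS_FORMAT),
        "user_followers": int(row["user_followers"]),
        "user_friends": int(row["user_friends"]),
        "user_favourites": int(row["user_favourites"]),
        "user_verified": parse_bool(row["user_verified"]),
        "date": datetime.strptime(row["date"], TS_FORMAT),
    }


# Invented example row, shaped like the columns listed above.
sample = {
    "id": "1",
    "user_name": "example_user",
    "user_location": "",
    "user_description": "Sports fan",
    "user_created": "2010-05-01 12:00:00",
    "user_followers": "120",
    "user_friends": "80",
    "user_favourites": "340",
    "user_verified": "False",
    "date": "2021-07-24 09:30:00",
}

parsed = parse_row(sample)
```

In practice the rows would come from `csv.DictReader` over the downloaded file rather than from a hand-written dictionary.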
Distribution
The dataset is provided as a CSV file. Its tweets date predominantly from 24th July 2021 to 27th July 2021. It contains approximately 160,545 tweets. Broken down by user verification status, 27,354 records come from verified accounts and 133,172 from unverified accounts (160,526 in total, slightly below the headline count). User account creation dates range from 4th September 2006 to 27th July 2021. A notable portion of user locations and descriptions are unspecified.
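A breakdown like the one above can be reproduced with a short tally. The sketch below assumes rows have already been loaded with `user_verified` as a boolean and with empty text fields stored as empty strings or `None`; the four sample rows are invented, and `verification_counts` and `missing_share` are hypothetical helper names. The last lines simply add the two verification figures quoted in this listing.

```python
from collections import Counter


def verification_counts(rows):
    """Tally rows by the user_verified flag (expects booleans)."""
    return Counter(bool(row["user_verified"]) for row in rows)


def missing_share(rows, field):
    """Fraction of rows where the given field is empty or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not row.get(field))
    return missing / len(rows)


# Invented rows for illustration only.
rows = [
    {"user_verified": True, "user_location": "Tokyo"},
    {"user_verified": False, "user_location": ""},
    {"user_verified": False, "user_location": None},
    {"user_verified": False, "user_location": "Delhi"},
]

counts = verification_counts(rows)
location_gap = missing_share(rows, "user_location")

# Figures quoted in the listing: verified and unverified record counts.
reported_verified, reported_unverified = 27_354, 133_172
reported_sum = reported_verified + reported_unverified
```

Comparing `reported_sum` with the row count of the loaded file is a quick way to check whether some records lack a verification flag.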
Usage
This dataset is well-suited for:
- Analysing prevailing subjects and conversations within tweets concerning the Tokyo Olympics.
- Executing Natural Language Processing (NLP) techniques such as topic modelling, to discover underlying themes, and sentiment analysis, to gauge public mood.
- Pinpointing tweets related to specific sports, participating countries, or individual athletes.
- Monitoring and tracking developing trends in news and public opinion during the Olympic event.
- Conducting in-depth sentiment analysis across the entire tweet corpus, or segmenting it by specific criteria such as sports or nationalities.
- Examining the patterns of distribution and frequency of hashtags used in the tweets.
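Two of the tasks above, hashtag frequency and keyword-based filtering, can be sketched as follows. Note the main assumption: the column list in this listing does not name a tweet-text field, so the code works on a plain list of strings and leaves it to the reader to supply the text from whichever column holds it. The example tweets are invented, and `hashtag_counts` and `mentioning` are hypothetical helper names.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(texts):
    """Count hashtags across tweet texts, ignoring letter case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def mentioning(texts, keyword):
    """Return the texts that contain the keyword, ignoring letter case."""
    needle = keyword.lower()
    return [text for text in texts if needle in text.lower()]


# Invented tweet texts for illustration only.
texts = [
    "What a final! #Tokyo2020 #Swimming",
    "Opening ceremony was stunning #tokyo2020",
    "Gold in the pool again #Swimming #Tokyo2020",
]

top = hashtag_counts(texts).most_common(2)
pool_tweets = mentioning(texts, "pool")
```

Substring matching is deliberately simple; matching athletes or countries reliably would likely need a curated list of names and aliases.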
Coverage
- Geographic: Locations are those supplied by users on their profiles. Some users report India as their location, a larger share report other locations, and a significant portion leave the location unspecified. Overall, the collection has a global reach.
- Time Range: The tweets fall mainly within a concentrated period from 24th July 2021 to 27th July 2021. The user accounts that posted them are much older on the whole, with creation dates ranging from 4th September 2006 to 27th July 2021.
- Demographic: User profiles offer insights into demographics through elements like user descriptions, follower counts, and verification status, which can inform studies on user engagement and characteristics.
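Because `user_location` is free text, a geographic breakdown like the one described above needs some bucketing rule. The sketch below is one possible rule, not the method used to produce the listing's observations: it sorts each location into a named country, "other", or "unspecified" using a whole-word, case-insensitive match. `location_bucket` is a hypothetical helper and the sample locations are invented.

```python
import re
from collections import Counter


def location_bucket(location, country="India"):
    """Classify a free-text location as the country, 'other', or 'unspecified'."""
    if location is None or not location.strip():
        return "unspecified"
    # Whole-word match so that, for example, 'Indiana' is not counted as India.
    pattern = r"\b" + re.escape(country) + r"\b"
    if re.search(pattern, location, flags=re.IGNORECASE):
        return country
    return "other"


# Invented locations for illustration only.
locations = ["Mumbai, India", "", "Tokyo", None, "Indiana, USA", "INDIA"]

shares = Counter(location_bucket(loc) for loc in locations)
```

A rule like this undercounts users who give only a city name, so the resulting shares are best read as lower bounds for the named country.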
License
CC0
Who Can Use It
This dataset is particularly beneficial for:
- Data Scientists and Natural Language Processing (NLP) Researchers: For building and refining models that classify text, analyse sentiment, and extract key information.
- Social Media Analysts: To comprehend public sentiment, track engagement metrics, and identify influential voices linked to major international events.
- Market Researchers: To assess public perception and brand visibility during significant sporting spectacles.
- Journalists and Media Organisations: For data-driven reporting that reflects public reactions and emerging narratives.
- Sports Analysts: To delve into fan discussions and reactions concerning specific Olympic events or competitors.
Dataset Name Suggestions
- Tokyo Olympics 2020 Tweets
- Olympics 2020 Social Media Activity
- Tokyo Games Twitter Public Discourse
- Olympic Sentiment Analysis Dataset
Attributes
Original Data Source: Tokyo Olympics 2020 Tweets