Global Covid-19 Tweets with Sentiment Analysis
Data Science and Analytics
About
This dataset captures Twitter activity related to Covid-19, focusing on the initial phase of the pandemic from April to June 2020 [1, 2]. It comprises 235,240 worldwide tweets in English, streamed live at a rate of approximately 10,000 tweets per day after the World Health Organisation declared Covid-19 a pandemic [1, 2]. The tweets were collected using relevant hashtags such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, and #staysafe [1, 2].
The data has been pre-processed: all tweets were converted to lowercase, and extra white space, numbers, special characters, non-ASCII characters, URLs, punctuation, and stopwords were removed [2]. Additionally, all instances of 'covid' were converted to 'covid19', and stemming was applied to reduce inflected words to their root forms [2]. Sentiment analysis has been performed on each cleaned tweet using an NLTK-based Sentiment Analyser, providing sentiment scores for positive, negative, and neutral categories, and a compound sentiment score [2]. Tweets are classified as Positive, Negative, or Neutral based on these scores [2].
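The cleaning and labelling steps described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the pipeline actually used to build the dataset: the function names `clean_tweet` and `label_from_compound`, the tiny stopword set, the order of the steps, and the ±0.05 cut-off on the compound score are all assumptions (the cut-off is a common convention for VADER-style compound scores, and the source does not state which thresholds were used). Stemming is left out because it would need a third-party stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Illustrative stopword subset (assumption); a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "to", "of", "and", "for"}


def clean_tweet(text: str) -> str:
    """Approximate the cleaning steps listed in the dataset description."""
    text = text.lower()                                    # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # drop URLs
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"\d+", " ", text)                       # drop numbers
    text = re.sub(r"[^a-z\s]", " ", text)                  # drop punctuation and special characters
    text = re.sub(r"\bcovid\b", "covid19", text)           # normalise 'covid' to 'covid19'
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)                                # split/join also collapses extra white space


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound sentiment score to a label (threshold is an assumption)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


if __name__ == "__main__":
    print(clean_tweet("Stay HOME and stay safe! #Covid-19 cases at 500 https://t.co/abc"))
    print(label_from_compound(0.6))
```

Because numbers are stripped before the 'covid' substitution, variants such as "covid-19" and "covid19" both end up as the single token "covid19" in this sketch.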
Columns
- id (appears as Tweet ID in [2]): Unique identifier for the tweet [1, 2].
- created_at (appears as Creation Date & Time in [2]): The date and time when the tweet was created [1, 2].
- source (appears as Source Link in [2]): The source link from which the tweet was posted [1, 2].
- original_text (appears as Original Tweet in [2]): The full text of the original tweet [1, 2].
- lang: The language of the tweet [1].
- favorite_count (appears as Favorite Count in [2]): The number of times the tweet was favourited [1, 2].
- retweet_count (appears as Retweet Count in [2]): The number of times the tweet was retweeted [1, 2].
- original_author (appears as Original Author in [2]): The original author of the tweet [2, 3].
- hashtags (appears as Hashtags in [2]): Hashtags included in the tweet [2, 3].
- user_mentions (appears as User Mentions in [2]): User mentions within the tweet [2, 3].
- Place: Location associated with the tweet [2].
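Since most fields are documented under two naming styles, it can help to normalise headers when reading a file. The sketch below uses only Python's standard `csv` module; the `ALIASES` mapping follows the column pairs listed above, while the function name `read_tweets`, the lowercase `place` key, and the in-memory sample rows are illustrative assumptions and not part of the dataset.

```python
import csv
import io

# Display-style header -> snake_case header, following the column list above.
# Mapping "Place" to "place" is an assumption; no snake_case name is documented for it.
ALIASES = {
    "Tweet ID": "id",
    "Creation Date & Time": "created_at",
    "Source Link": "source",
    "Original Tweet": "original_text",
    "Favorite Count": "favorite_count",
    "Retweet Count": "retweet_count",
    "Original Author": "original_author",
    "Hashtags": "hashtags",
    "User Mentions": "user_mentions",
    "Place": "place",
}


def read_tweets(handle):
    """Yield one dict per CSV row, with headers mapped to snake_case names."""
    for row in csv.DictReader(handle):
        yield {ALIASES.get(key, key): value for key, value in row.items()}


# Made-up sample standing in for a real data file.
sample = io.StringIO("Tweet ID,Original Tweet,Retweet Count\n1,stay home,3\n")
rows = list(read_tweets(sample))
print(rows)
```

With a real file, the same function would take an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at whichever CSV was downloaded.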
Distribution
The dataset consists of 235,240 tweets from the first phase of collection [1, 2]. Data files are typically provided in CSV format [4]. The tweets were collected from 19th April to 20th June 2020 [1].
Usage
This dataset is ideal for various data science and analytics applications, including Natural Language Processing (NLP), Deep Learning, Text Classification, and Ensembling [2]. Its pre-processed nature and included sentiment scores make it particularly useful for sentiment analysis research related to public opinion during the Covid-19 pandemic [2].
Coverage
The dataset covers a time range from 19th April to 20th June 2020 [1]. It includes worldwide tweets [2] and is limited to English language content [2]. Tweet sources are primarily Twitter for Android (31%) and Twitter for iPhone (28%), with 41% originating from other sources [5].
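A source breakdown like the one quoted above could be recomputed from the source column along these lines. This is a hedged sketch: `source_shares` is a hypothetical helper, the toy list is invented for illustration, and real source values may need cleaning first (for instance if they are stored as links and not as plain client names).

```python
from collections import Counter


def source_shares(sources, top_n=2):
    """Return percentage shares for the top_n sources, folding the rest into 'Other'."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    top = counts.most_common(top_n)
    shares = {name: round(100 * n / total, 1) for name, n in top}
    remainder = total - sum(n for _, n in top)
    if remainder:
        shares["Other"] = round(100 * remainder / total, 1)
    return shares


# Invented toy data, not taken from the dataset.
toy = (
    ["Twitter for Android"] * 4
    + ["Twitter for iPhone"] * 3
    + ["Twitter Web App"] * 2
    + ["TweetDeck"]
)
shares = source_shares(toy)
print(shares)
```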
License
CC-BY-SA
Who Can Use It
- Data Scientists and Analysts: For conducting social media analysis, trend identification, and public sentiment tracking during the pandemic [2].
- Researchers in NLP and Machine Learning: To train and evaluate text classification models, conduct deep learning experiments, and explore ensembling techniques [2].
- Public Health Researchers: To understand public response, concerns, and sentiment towards Covid-19, lockdowns, and vaccines [2].
- Academics and Students: For academic projects, dissertations, and learning about real-world social media data analysis and sentiment classification [2].
Dataset Name Suggestions
- COVID-19 Twitter Sentiment (Apr-Jun 2020)
- Pandemic Twitter Activity Dataset (Phase 1)
- Global Covid-19 Tweets with Sentiment Analysis
- Social Media Response to Covid-19: April-June 2020
- Twitter Covid-19 Discourse (Early Pandemic)
Attributes
Original Data Source: Covid-19 Twitter Dataset