Opendatabay APP

Global Covid-19 Tweets with Sentiment Analysis

Data Science and Analytics

Tags and Keywords

NLP

Deep

Coronavirus

Text

Ensembling

Free

About

This dataset captures Twitter activity related to Covid-19, focusing on the initial phase of the pandemic from April to June 2020 [1, 2]. It comprises 235,240 worldwide tweets in English, streamed live at a rate of approximately 10,000 tweets per day after the World Health Organisation declared Covid-19 a pandemic [1, 2]. The tweets were collected using relevant hashtags such as #covid-19, #coronavirus, #covid, #covaccine, #lockdown, #homequarantine, #quarantinecenter, #socialdistancing, #stayhome, and #staysafe [1, 2].
The data has been pre-processed: all tweets were converted to lowercase, and extra white space, numbers, special characters, ASCII characters, URLs, punctuation, and stopwords were removed [2]. In addition, every instance of 'covid' was converted to 'covid19', and stemming was applied to reduce inflected words to their root forms [2]. Each cleaned tweet was then scored with an NLTK-based Sentiment Analyser, which yields positive, negative, and neutral sentiment scores together with a compound sentiment score [2]. Based on these scores, each tweet is classified as Positive, Negative, or Neutral [2].
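The cleaning and labelling steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used to build the dataset: the stopword list is a tiny hypothetical stand-in, stemming is left out, and the ±0.05 cut-off on the compound score is an assumed convention rather than a value stated in the listing. The function names `clean_tweet` and `label_sentiment` are likewise made up for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption: the list
# actually used for the dataset is not documented in this listing).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "at", "for"}


def clean_tweet(text: str) -> str:
    """Approximate the described cleaning steps (stemming is omitted)."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # drop numbers, punctuation, special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]  # split() also collapses extra white space
    tokens = ["covid19" if t == "covid" else t for t in tokens]  # 'covid' -> 'covid19'
    return " ".join(tokens)


def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a class; the threshold is an assumption."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


cleaned = clean_tweet("Stay HOME! Covid-19 cases are rising in 2020 https://t.co/abc #StaySafe")
print(cleaned)
print(label_sentiment(0.6), label_sentiment(-0.4), label_sentiment(0.0))
```

Replacing 'covid' only after digits have been stripped keeps the '19' suffix from being removed again, which is one plausible way to reconcile the two steps in the description.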

Columns

  • id (also listed as Tweet ID): Unique identifier for the tweet [1, 2].
  • created_at (also listed as Creation Date & Time): The date and time when the tweet was created [1, 2].
  • source (also listed as Source Link): The source link from which the tweet was posted [1, 2].
  • original_text (also listed as Original Tweet): The full text of the original tweet [1, 2].
  • lang: The language of the tweet [1].
  • favorite_count (also listed as Favorite Count): The number of times the tweet was favourited [1, 2].
  • retweet_count (also listed as Retweet Count): The number of times the tweet was retweeted [1, 2].
  • original_author (also listed as Original Author): The original author of the tweet [2, 3].
  • hashtags (also listed as Hashtags): Hashtags included in the tweet [2, 3].
  • user_mentions (also listed as User Mentions): User mentions within the tweet [2, 3].
  • Place: Location associated with the tweet [2].
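Because the column list mixes two naming styles for what appear to be the same fields, it can help to normalise headers when reading a file. The sketch below assumes the display-style names map one-to-one onto the snake_case names as the list suggests; the `ALIASES` table, the `normalise_row` helper, the lowercase `place` name, and the inline sample rows are all hypothetical and only illustrate the idea.

```python
import csv
import io

# Display-style header -> snake_case header, following the column list.
# "place" is an assumed name; the listing only gives "Place".
ALIASES = {
    "Tweet ID": "id",
    "Creation Date & Time": "created_at",
    "Source Link": "source",
    "Original Tweet": "original_text",
    "Favorite Count": "favorite_count",
    "Retweet Count": "retweet_count",
    "Original Author": "original_author",
    "Hashtags": "hashtags",
    "User Mentions": "user_mentions",
    "Place": "place",
}


def normalise_row(row: dict) -> dict:
    """Rename known display-style keys; leave every other key untouched."""
    return {ALIASES.get(key, key): value for key, value in row.items()}


# Made-up sample standing in for a real CSV file from the dataset.
sample = io.StringIO("Tweet ID,Original Tweet,Retweet Count\n1,stay safe,3\n")
rows = [normalise_row(r) for r in csv.DictReader(sample)]
print(rows)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`.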

Distribution

The dataset consists of 235,240 tweets from the first phase of collection [1, 2]. Data files are typically provided in CSV format [4]. The tweets were collected from 19th April to 20th June 2020 [1].

Usage

This dataset is ideal for various data science and analytics applications, including Natural Language Processing (NLP), Deep Learning, Text Classification, and Ensembling [2]. Its pre-processed nature and included sentiment scores make it particularly useful for sentiment analysis research related to public opinion during the Covid-19 pandemic [2].

Coverage

The dataset covers a time range from 19th April to 20th June 2020 [1]. It includes worldwide tweets [2] and is limited to English language content [2]. Tweet sources are primarily Twitter for Android (31%) and Twitter for iPhone (28%), with 41% originating from other sources [5].
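As a rough way to check the stated coverage against a downloaded copy, one could test whether each tweet's date falls inside the collection window and tally the share of each source. The sketch below uses synthetic values chosen to mirror the percentages quoted above; the helper names are hypothetical, and it assumes dates have already been parsed into `datetime.date` objects, since the listing does not document the format of created_at.

```python
from collections import Counter
from datetime import date

# Collection window as stated in the listing.
COLLECTION_START = date(2020, 4, 19)
COLLECTION_END = date(2020, 6, 20)


def in_collection_window(day: date) -> bool:
    """True if the date lies within the stated collection period (inclusive)."""
    return COLLECTION_START <= day <= COLLECTION_END


def source_shares(sources: list) -> dict:
    """Percentage of tweets per source, rounded to whole numbers."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total) for name, n in counts.items()}


# Synthetic sample built to match the reported split, for illustration only.
sample_sources = (
    ["Twitter for Android"] * 31
    + ["Twitter for iPhone"] * 28
    + ["Other"] * 41
)
shares = source_shares(sample_sources)
print(shares)
print(in_collection_window(date(2020, 5, 1)), in_collection_window(date(2020, 6, 21)))
```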

License

CC-BY-SA

Who Can Use It

  • Data Scientists and Analysts: For conducting social media analysis, trend identification, and public sentiment tracking during the pandemic [2].
  • Researchers in NLP and Machine Learning: To train and evaluate text classification models, conduct deep learning experiments, and explore ensembling techniques [2].
  • Public Health Researchers: To understand public response, concerns, and sentiment towards Covid-19, lockdowns, and vaccines [2].
  • Academics and Students: For academic projects, dissertations, and learning about real-world social media data analysis and sentiment classification [2].

Dataset Name Suggestions

  • COVID-19 Twitter Sentiment (Apr-Jun 2020)
  • Pandemic Twitter Activity Dataset (Phase 1)
  • Global Covid-19 Tweets with Sentiment Analysis
  • Social Media Response to Covid-19: April-June 2020
  • Twitter Covid-19 Discourse (Early Pandemic)

Attributes

Original Data Source: Covid-19 Twitter Dataset

Listing Stats

VIEWS

4

DOWNLOADS

0

LISTED

05/06/2025

REGION

GLOBAL

UDQS QUALITY

5 / 5

VERSION

1.0
