Twitter-Based Influenza Activity Data
Social Media and Networking
Free
About
This dataset supports forecasting the spatiotemporal patterns of influenza outbreaks across locations and dates. It is built from influenza-related tweets, primarily originating from the United States. For each week and state, the input is a set of keyword counts from tweets, and the target is whether an influenza outbreak occurs in that state during the following week. An outbreak is recorded when the activity level, as defined by the CDC Flu Activity Map, reaches the high level.
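The labelling rule described above can be illustrated with a minimal sketch. The level names other than "high" and the function name `next_week_outbreak_labels` are assumptions made for illustration; the dataset itself ships only the resulting 0/1 labels.

```python
def next_week_outbreak_labels(weekly_levels):
    """Return one 0/1 label per week, describing the FOLLOWING week.

    weekly_levels: activity levels for one state, in week order.
    The final week has no successor, so it receives no label.
    """
    return [1 if nxt == "high" else 0 for nxt in weekly_levels[1:]]


# Hypothetical activity levels for a single state over five weeks.
levels = ["minimal", "low", "high", "high", "moderate"]
print(next_week_outbreak_labels(levels))  # [0, 1, 1, 0]
```

Each label is aligned with the week whose keyword counts would be used as input, which is why the output is one element shorter than the input.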
Columns
- ID: An identifier for entries.
- flu_X_tr: Input data for training, representing keyword counts for tweets from various locations and weeks.
- flu_Y_tr: Output data for training, indicating the occurrence of an influenza outbreak (0 for no event, 1 for an event) for specific states in the next week.
- flu_X_te: Input data for testing, similar to flu_X_tr.
- flu_Y_te: Output data for testing, similar to flu_Y_tr.
- flu_locs: A list detailing the states covered by the data.
- flu_keywords: A list of 525 specified keywords used for analysis.
- Label Count: A binned summary of value counts, for instance 53 entries in the range 0.00 to 52.40 and 53 entries in the range 471.60 to 524.00.
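Before fitting a model on the training and testing columns, it can help to compute a majority-class floor, since outbreak weeks may be much rarer than non-outbreak weeks. The sketch below assumes the labels from flu_Y_tr and flu_Y_te have already been flattened into plain lists of 0s and 1s; the actual layout in the file may differ, and `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(y_train, y_test):
    """Predict the most common training label for every test entry.

    Returns (majority_label, accuracy_on_test). Ties go to label 0.
    """
    ones = sum(y_train)
    majority = 1 if ones * 2 > len(y_train) else 0
    correct = sum(1 for y in y_test if y == majority)
    return majority, correct / len(y_test)


# Toy labels standing in for flattened flu_Y_tr and flu_Y_te.
y_tr = [0, 0, 0, 1, 0, 1]
y_te = [0, 1, 0, 0]
print(majority_baseline_accuracy(y_tr, y_te))  # (0, 0.75)
```

A model trained on flu_X_tr and flu_Y_tr is only informative if it beats this floor on flu_X_te and flu_Y_te, ideally on a metric such as recall for the outbreak class rather than accuracy alone.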
Distribution
The dataset is typically provided in CSV format and covers 525 distinct keywords. The input data (flu_X_tr and flu_X_te) consists of keyword counts for all tweets within a state over a week. The output data (flu_Y_tr and flu_Y_te) marks whether an influenza outbreak occurs in that state in the subsequent week, encoded as zero (no event) or one (event). The dataset contains 524 unique values across its various segments.
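As a rough illustration of how per-state, per-week keyword counts of this kind could be produced from raw tweets, the sketch below tallies keyword occurrences for each (state, week) pair. The tweet tuples, the three-word keyword list, and the simple lowercase tokenisation are all assumptions; the dataset's actual preprocessing is not documented here.

```python
import re
from collections import defaultdict


def keyword_count_vectors(tweets, keywords):
    """Count keyword occurrences per (state, week).

    tweets: iterable of (state, week, text) tuples.
    keywords: ordered list of keywords; position i of each vector
              holds the count for keywords[i].
    """
    index = {kw: i for i, kw in enumerate(keywords)}
    vectors = defaultdict(lambda: [0] * len(keywords))
    for state, week, text in tweets:
        vec = vectors[(state, week)]  # create the row even if nothing matches
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in index:
                vec[index[token]] += 1
    return dict(vectors)


# Hypothetical keywords and tweets, for illustration only.
keywords = ["flu", "fever", "cough"]
tweets = [
    ("TX", 1, "Flu season again, fever and cough"),
    ("TX", 1, "this flu is awful"),
    ("NY", 1, "no cough today"),
]
counts = keyword_count_vectors(tweets, keywords)
print(counts[("TX", 1)])  # [2, 1, 1]
print(counts[("NY", 1)])  # [0, 0, 1]
```

With the full list of 525 keywords, each (state, week) pair would yield a 525-element vector in the same way.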
Usage
This dataset is ideal for developing predictive models to forecast influenza outbreak events. It can be utilised for research into spatiotemporal disease patterns, enabling the creation of early warning systems for public health initiatives. Additionally, it supports applications focused on identifying and analysing influenza-related social media discussions.
Coverage
The dataset is geographically limited to the United States and covers various states. The data spans multiple weeks, and the prediction task targets whether an influenza outbreak occurs in the following week. Influenza activity levels are categorised from minimal to high, and an outbreak is indicated only when the activity level is high according to the CDC Flu Activity Map.
License
CC-BY
Who Can Use It
This dataset is suitable for data scientists and machine learning engineers interested in building predictive models for disease surveillance. Public health researchers and epidemiologists can use it for studying influenza spread patterns and developing intervention strategies. It is also relevant for social media analysts and natural language processing (NLP) practitioners focused on health-related text data.
Dataset Name Suggestions
- Influenza Outbreak Event Prediction via Twitter
- US Flu Outbreak Forecasting Dataset
- Twitter-Based Influenza Activity Data
- Spatiotemporal Flu Prediction Dataset
Attributes
Original Data Source: Influenza outbreak event prediction via Twitter