Disaster Tweets Classification Dataset
Social Media and Networking
Free
About
This dataset contains over 11,000 tweets collected by searching for keywords associated with disaster events, such as "crash", "quarantine", and "bush fires" [1, 2]. Each tweet is accompanied by its location and the keyword that matched its text [1, 2]. Every tweet has been manually labelled to indicate whether it refers to a real disaster event or only mentions the keyword incidentally, for example in a joke or a movie review [1, 2]. The dataset is a resource for developing and testing binary classification models that separate authentic disaster reports from incidental mentions on social media [1].
Columns
- id: A unique identifier assigned to each individual tweet [2].
- keyword: The particular keyword from the tweet that led to its inclusion in the dataset [2].
- location: The geographical location from which the tweet was sent, though this field may be blank for some entries [3].
- text: The full textual content of the tweet itself [3].
- target: A binary label indicating whether the tweet is about a real disaster (1) or not (0) [3].
Distribution
The dataset is typically provided in CSV (comma-separated values) format [4]. It comprises approximately 11,369 records, each representing a single tweet [1, 2, 5]. The data is tabular, with one column per field described above [6].
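As a minimal sketch of reading this layout, the snippet below parses a CSV with the five columns listed above and checks that every target value is 0 or 1. The sample rows are made up for illustration, and the helper name load_tweets is hypothetical; only the column names come from the description.

```python
import csv
import io

# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """id,keyword,location,text,target
0,crash,,Two cars involved in a crash on the highway this morning,1
1,quarantine,London,This movie felt like a two hour quarantine,0
2,bush fires,Sydney,Bush fires are spreading north of the city,1
"""

EXPECTED_COLUMNS = ["id", "keyword", "location", "text", "target"]


def load_tweets(handle):
    """Read tweet rows from an open CSV handle and validate the schema."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        target = int(row["target"])
        if target not in (0, 1):
            raise ValueError(f"unexpected target value: {row['target']!r}")
        row["target"] = target  # store the label as an integer
        rows.append(row)
    return rows


tweets = load_tweets(io.StringIO(SAMPLE_CSV))
print(len(tweets), "rows loaded")
```

To read an actual download, the same function can be given a file opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved. The encoding is an assumption and may need adjusting.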
Usage
This dataset suits several applications, including:
- Natural Language Processing (NLP) tasks, particularly text classification and binary classification [1].
- Training and evaluating machine learning models to detect and categorise real disaster events from social media streams [1].
- Social media monitoring for crisis management and real-time event analysis.
- Developing algorithms to filter out irrelevant or non-disaster-related content from large volumes of tweets.
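To make the binary classification use case concrete, here is a small baseline sketch: a multinomial Naive Bayes classifier written with the Python standard library only. It is one possible starting point, not a method prescribed by the dataset. The training sentences are made up for illustration, and in practice the (text, target) pairs would come from the text and target columns.

```python
import math
import re
from collections import Counter

# Made-up training examples (text, target); not taken from the dataset.
TRAIN = [
    ("Bush fires are spreading north of the city", 1),
    ("Train crash leaves several people injured", 1),
    ("Flood warning issued as the river rises", 1),
    ("That movie was a total car crash lol", 0),
    ("My diet is a disaster, I love pizza", 0),
    ("This playlist is fire", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(examples):
    """Count word frequencies per class for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class (0 or 1) with the higher log posterior."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict(model, "Bush fires near the city"))
print(predict(model, "I love this movie"))
```

A word-count baseline like this cannot tell a literal "crash" from a figurative one unless the surrounding words differ. That limitation is why the manually assigned labels matter for training stronger models.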
Coverage
The dataset's geographic scope is global [7]. Location data is included, but approximately 30% of the tweets have a blank location field [3, 5]. About 1% of tweets list the United States as their location, and the remaining 69% fall under 'Other' when grouped by unique value [5]. The time range over which the tweets were collected is not stated in the available information.
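The location breakdown can be recomputed from the location column with a short tally. The sketch below assumes a blank field is read as an empty string. The sample values are made up and do not reproduce the dataset's proportions, and the function name location_shares is hypothetical.

```python
from collections import Counter

# Made-up location values; an empty string stands for a blank field.
locations = [
    "", "United States", "London", "", "Sydney",
    "Lagos", "", "Toronto", "Mumbai", "Berlin",
]


def location_shares(values, keep=("United States",)):
    """Return the share of blank, kept, and 'Other' locations as fractions."""
    counts = Counter()
    for value in values:
        cleaned = value.strip()
        if not cleaned:
            counts["blank"] += 1
        elif cleaned in keep:
            counts[cleaned] += 1
        else:
            counts["Other"] += 1
    total = len(values)
    return {name: count / total for name, count in counts.items()}


shares = location_shares(locations)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Each share is computed over all rows, blanks included, so the three categories add up to 100%.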
License
CC0
Who Can Use It
This dataset is suitable for a broad range of users:
- Data scientists and machine learning engineers: For building, training, and refining models that classify textual data.
- Researchers: In fields such as natural language processing, social computing, and disaster informatics.
- Organisations involved in disaster response: To develop tools for real-time social media intelligence.
- Students: Undertaking projects related to text mining, classification, and big data analysis.
Dataset Name Suggestions
- Disaster Tweets Classification Dataset
- Social Media Disaster Event Classifier
- Real vs. Fake Disaster Tweets
- Crisis Tweet Text Data
Attribution
Original Data Source: COVID-19 All Vaccines Tweets