English Tweet Hate Speech Classifier Data
Data Science and Analytics
Free
About
This dataset, hate_speech_offensive, is a collection of annotated tweets for detecting hate speech and offensive language. It consists primarily of English tweets and can be used to train and evaluate machine learning models that identify and classify hateful or offensive content. The data is provided as a single CSV file, 'train.csv', with per-tweet annotation counts and a class label.
Columns
- count: The total number of annotations provided for each individual tweet. (Integer)
- hate_speech_count: The number of annotations that classified a particular tweet as hate speech. (Integer)
- offensive_language_count: The number of annotations that categorised a tweet as containing offensive language. (Integer)
- neither_count: The number of annotations that identified a tweet as neither hate speech nor offensive language. (Integer)
- class: The classification label for the tweet.
- tweet: The actual tweet content.
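The relationship between the count columns can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' method: it assumes each annotator picks exactly one of the three categories, so the three per-category counts add up to `count`. The helper names and the sample row are hypothetical.

```python
# Column names follow the listing above; the sample row is invented for illustration.
COUNT_COLUMNS = ("hate_speech_count", "offensive_language_count", "neither_count")


def vote_shares(row):
    """Return the fraction of annotators that chose each category for one tweet."""
    total = int(row["count"])
    votes = {name: int(row[name]) for name in COUNT_COLUMNS}
    # Assumption: every annotator picks exactly one category,
    # so the three counts should sum to `count`.
    if sum(votes.values()) != total:
        raise ValueError(f"vote counts {votes} do not sum to count={total}")
    return {name: n / total for name, n in votes.items()}


def majority_category(row):
    """Return the count column with the most votes (ties go to the first listed)."""
    return max(COUNT_COLUMNS, key=lambda name: int(row[name]))


# Hypothetical row: three annotators, two of whom chose "offensive language".
sample_row = {
    "count": "3",
    "hate_speech_count": "0",
    "offensive_language_count": "2",
    "neither_count": "1",
}
shares = vote_shares(sample_row)
print(majority_category(sample_row))
```

Whether the `class` column always matches the majority vote is not stated in this listing, so it is worth checking against the real file before relying on it.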
Distribution
The dataset is provided as a single CSV file, 'train.csv', in which each row represents one tweet together with its annotations. There is currently only a training split, containing approximately 24,783 unique tweets.
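A minimal sketch of reading the file with Python's standard `csv` module is shown below. The in-memory sample stands in for 'train.csv' so the example runs without the real file; its rows and `class` values are invented placeholders, and UTF-8 encoding is an assumption.

```python
import csv
import io

INT_COLUMNS = ("count", "hate_speech_count", "offensive_language_count", "neither_count")


def read_tweets(handle):
    """Parse an open CSV file into a list of dicts, converting count columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


def summarize(rows):
    """Return the row count and the number of distinct tweet texts."""
    return {"rows": len(rows), "unique_tweets": len({row["tweet"] for row in rows})}


# Invented stand-in for train.csv (placeholder tweets and class values).
SAMPLE_CSV = """count,hate_speech_count,offensive_language_count,neither_count,class,tweet
3,0,3,0,1,"first placeholder, with a comma"
3,0,0,3,2,second placeholder
3,0,0,3,2,second placeholder
"""

rows = read_tweets(io.StringIO(SAMPLE_CSV))
print(summarize(rows))  # {'rows': 3, 'unique_tweets': 2}

# With the real file (tweets may contain line breaks, hence newline=""):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     rows = read_tweets(f)
```

Running `summarize` on the real file is a quick way to confirm the row count and check for duplicate tweets.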
Usage
This dataset is ideal for various applications and use cases, including:
- Training machine learning models or algorithms for automated hate speech and offensive language detection.
- Conducting sentiment analysis on Twitter data to identify patterns of negative or offensive language.
- Developing and evaluating hate speech detection systems that can identify and flag hate speech in real time.
- Improving content moderation systems for social media platforms by automatically detecting and removing offensive or hateful content.
- Performing Exploratory Data Analysis (EDA) to gain insights into the distribution of tweet classifications, identify common words associated with each class, and analyse co-occurrences of hate speech and offensive language.
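As a starting point for the EDA use case, the sketch below counts tweets per class and lists the most frequent words in each class, using only the standard library. The tokenizer is a deliberately crude assumption (lowercase letters and apostrophes only), and the rows and class labels "A" and "B" are invented placeholders rather than the dataset's actual label values.

```python
import re
from collections import Counter, defaultdict

# Crude tokenizer (an assumption): runs of lowercase letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def class_distribution(rows):
    """Count how many tweets carry each class label."""
    return Counter(row["class"] for row in rows)


def top_words_by_class(rows, n=3):
    """Return the n most frequent tokens for each class label."""
    counters = defaultdict(Counter)
    for row in rows:
        tokens = TOKEN_PATTERN.findall(row["tweet"].lower())
        counters[row["class"]].update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}


# Invented rows; "A" and "B" are placeholder labels.
rows = [
    {"class": "A", "tweet": "Good morning good people"},
    {"class": "A", "tweet": "good vibes only"},
    {"class": "B", "tweet": "rainy rainy day"},
]
print(class_distribution(rows))
print(top_words_by_class(rows, n=1))
```

For real tweets, handling of mentions, URLs, hashtags and stop words would probably need to be added before the word counts become informative.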
Coverage
The dataset consists primarily of English tweets of a general nature. Its intended application is not tied to a specific region, and no time range for data collection is provided.
License
CC0
Who Can Use It
This dataset is intended for:
- Researchers and developers seeking to create and improve machine learning models for detecting hate speech and offensive language on social media platforms like Twitter.
- Data scientists and analysts interested in understanding patterns of online discourse and sentiment.
- Social media platforms and their moderation teams aiming to enhance automated content moderation systems.
Dataset Name Suggestions
- Twitter Hate Speech and Offensive Language Dataset
- Annotated Tweet Toxicity Data
- Social Media Content Moderation Tweets
- English Tweet Hate Speech Classifier Data
- Online Language Offensiveness Dataset
Attributes
Original Data Source: Hate Speech and Offensive Language Detection