Opendatabay APP

Hate Speech and Offensive Language Detection

Data Science and Analytics

Tags and Keywords

Computer Science

Social Issues and Advocacy

Text Mining

Gender

NLP

Trusted By
Trusted by company1Trusted by company2Trusted by company3
Hate Speech and Offensive Language Detection Dataset on Opendatabay data marketplace

"No reviews yet"

Free

About

This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.
The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.
For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes
How to use the dataset Introduction:
Dataset Overview:
The dataset is presented in a CSV file format named 'train.csv'. It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither. Each row represents a tweet along with the corresponding annotations provided by multiple annotators. The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language). Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.
Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.
Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removing URLs, usernames/handles, special characters/punctuation marks, stop words removal, tokenization, stemming/lemmatization etc., depending on your analysis requirements.
Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:
Distribution of tweet counts per classification category (hate speech, offensive language, neither). Most common words/phrases associated with each class. Co-occurrence analysis to identify correlations between hate speech and offensive language. Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
Research Ideas Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language. Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive. Content Moderation: Social media platforms can use this dataset to improve their content moderation systems. By using machine learning algorithms trained on this data, it becomes easier to automatically detect and remove offensive or hateful content from the platform, reducing the burden on human moderators and improving user experience by keeping online spaces free from toxic behavior Acknowledgements If you use this dataset in your research, please credit the original authors. Data Source
License License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns File: train.csv
Column name Description count The total number of annotations for each tweet. (Integer) hate_speech_count The number of annotations classifying a tweet as hate speech. (Integer) offensive_language_count The number of annotations classifying a tweet as offensive language. (Integer) neither_count The number of annotations classifying a tweet as neither hate speech nor offensive language. (Integer) Acknowledgements If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit hate_speech_offensive (From Huggingface).

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

05/06/2025

REGION

GLOBAL

Universal Data Quality Score Logo UDQSQUALITY

5 / 5

VERSION

1.0

Free