Opendatabay APP

English Tweet Hate Speech Classifier Data

Data Science and Analytics

Tags and Keywords

Computer

Social

Text

Gender

Nlp

Free

About

This dataset, named hate_speech_offensive, is a collection of annotated tweets for detecting hate speech and offensive language. It consists primarily of English tweets and is intended for training and evaluating machine learning models in this domain. Researchers and developers can use it to build systems that identify and classify hateful or offensive content. The data is provided as a single CSV file, 'train.csv', and records for each tweet how many annotators assigned it to each category.

Columns

  • count: The total number of annotations provided for each individual tweet. (Integer)
  • hate_speech_count: The number of annotations that classified a particular tweet as hate speech. (Integer)
  • offensive_language_count: The number of annotations that categorised a tweet as containing offensive language. (Integer)
  • neither_count: The number of annotations that identified a tweet as neither hate speech nor offensive language. (Integer)
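  • class: The classification label for the tweet. In the original hate_speech_offensive release the values are 0 (hate speech), 1 (offensive language) and 2 (neither). (Integer)
  • tweet: The text of the tweet. (String)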
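
The relationship between the count columns can be checked with a short script. The following is a minimal sketch, not part of the listing: it assumes each annotation casts exactly one vote, so the three vote columns sum to count. The sample rows, the constant VOTE_COLUMNS and the helper names are hypothetical, and the tweet text is placeholder content.

```python
import csv
import io

# Hypothetical sample in the documented column layout; tweet text is placeholder content.
SAMPLE_CSV = """count,hate_speech_count,offensive_language_count,neither_count,class,tweet
3,0,0,3,2,placeholder tweet one
3,0,3,0,1,placeholder tweet two
6,1,5,0,1,placeholder tweet three
"""

VOTE_COLUMNS = ("hate_speech_count", "offensive_language_count", "neither_count")


def parse_row(row):
    """Convert the documented integer columns of one CSV row from strings to ints."""
    parsed = {name: int(row[name]) for name in ("count",) + VOTE_COLUMNS}
    parsed["class"] = row["class"]  # kept exactly as read from the file
    parsed["tweet"] = row["tweet"]
    return parsed


def votes_are_consistent(row):
    """Assumption: one vote per annotation, so the vote counts sum to `count`."""
    return sum(row[name] for name in VOTE_COLUMNS) == row["count"]


def agreement(row):
    """Fraction of annotations that fall in the most popular category."""
    return max(row[name] for name in VOTE_COLUMNS) / row["count"]


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for row in rows:
    print(row["tweet"], votes_are_consistent(row), round(agreement(row), 2))
```

Rows that fail the consistency check, or that have low agreement, may be worth inspecting before they are used as training labels.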

Distribution

The dataset is distributed as a single CSV file, 'train.csv', in which each row represents one tweet together with its annotations. Only a training split is provided. The file contains approximately 24,783 unique tweets.
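
The row and unique-tweet figures can be verified after download. The sketch below is illustrative only: the function name summarise_csv is hypothetical, UTF-8 encoding is an assumption, and the demo writes a tiny stand-in file with placeholder rows rather than reading the real 'train.csv'.

```python
import csv
import os
import tempfile


def summarise_csv(path):
    """Return (row count, unique tweet count, column names) for a CSV in the documented layout."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        tweets = [row["tweet"] for row in reader]
        columns = reader.fieldnames
    return len(tweets), len(set(tweets)), columns


# Demo on a stand-in file; pass the path of the downloaded 'train.csv' to check the real figures.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["count", "hate_speech_count", "offensive_language_count",
                         "neither_count", "class", "tweet"])
        writer.writerow([3, 0, 0, 3, 2, "placeholder tweet one"])
        writer.writerow([3, 0, 3, 0, 1, "placeholder, with a comma"])
        writer.writerow([3, 0, 0, 3, 2, "placeholder tweet one"])  # deliberate duplicate
    total_rows, unique_tweets, columns = summarise_csv(demo_path)

print(total_rows, unique_tweets, columns)
```

Using csv.DictReader rather than splitting on commas matters here, because tweet text can itself contain commas and quotes.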

Usage

This dataset is ideal for various applications and use cases, including:
  • Training machine learning models or algorithms for automated hate speech and offensive language detection.
  • Conducting sentiment analysis on Twitter data to identify patterns of negative or offensive language.
  • Developing and evaluating hate speech detection systems that can flag hate speech in real time.
  • Improving content moderation systems for social media platforms by automatically detecting and removing offensive or hateful content.
  • Performing Exploratory Data Analysis (EDA) to gain insights into the distribution of tweet classifications, identify common words associated with each class, and analyse co-occurrences of hate speech and offensive language.
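
As an illustration of the exploratory analysis mentioned above, the following minimal sketch counts tweets per class and lists the most frequent words in each class using only the standard library. The sample pairs, the label strings, the simple regex tokenizer and the function names are all assumptions made for the example; real tweets would need more careful handling of mentions, URLs and emoji.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (class, tweet) pairs with placeholder text; labels are kept as raw strings.
SAMPLE = [
    ("2", "the weather is lovely today"),
    ("2", "lovely game today"),
    ("1", "that referee is rubbish"),
    ("1", "rubbish game rubbish call"),
]

TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return TOKEN_PATTERN.findall(text.lower())


def class_distribution(pairs):
    """Count how many tweets carry each class label."""
    return Counter(label for label, _ in pairs)


def top_words_per_class(pairs, n=3):
    """Return the n most frequent tokens for each class label."""
    counts = defaultdict(Counter)
    for label, tweet in pairs:
        counts[label].update(tokenize(tweet))
    return {label: counter.most_common(n) for label, counter in counts.items()}


print(class_distribution(SAMPLE))
print(top_words_per_class(SAMPLE, n=2))
```

The same functions can be fed (class, tweet) pairs read from 'train.csv'. Filtering common stop words before counting usually makes the per-class word lists more informative.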

Coverage

The dataset consists primarily of English tweets. It is not restricted to a particular geographic region and is relevant to work on social issues and advocacy around online discourse. No time range for data collection is provided.

License

CC0

Who Can Use It

This dataset is intended for:
  • Researchers and developers seeking to create and improve machine learning models for detecting hate speech and offensive language on social media platforms like Twitter.
  • Data scientists and analysts interested in understanding patterns of online discourse and sentiment.
  • Social media platforms and their moderation teams aiming to enhance automated content moderation systems.

Dataset Name Suggestions

  • Twitter Hate Speech and Offensive Language Dataset
  • Annotated Tweet Toxicity Data
  • Social Media Content Moderation Tweets
  • English Tweet Hate Speech Classifier Data
  • Online Language Offensiveness Dataset

Attributes

Listing Stats

  • Views: 1
  • Downloads: 2
  • Listed: 05/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
