Modern Hate Speech Dataset for NLP

Data Science and Analytics

Tags and Keywords

Hate, Speech, Social, NLP, Text

Free

About

This dataset is curated for hate speech detection on social media text, responding to the widespread circulation of hateful textual content on online platforms [1]. It is designed to capture current trends in hate speech and retains elements such as emoticons, emojis, hashtags, slang, and contractions [1, 2]. Entries fall into two classes, hateful and non-hateful, which makes the dataset suitable for training machine learning models to identify and categorise hate speech [1, 2]. It can help social media managers, administrators, and companies develop automatic systems that filter out hateful content, promoting safer online environments [2]. It also serves as a neutralised benchmark dataset: it excludes entities and names that could cause cyber harm or affect users, making it suitable for a wide range of research projects [3].

Columns

Content: Represents the input text from social media. It contains approximately 418,000 unique text entries [4].
Label: The class label for each text entry. Labels are binary: '0' for non-hateful and '1' for hateful content. There are roughly 361,594 non-hateful entries and 79,305 hateful entries [3, 4].
Content_int: An integer representation related to the content, with approximately 418,000 unique values [5].
id: An identifier column, also with approximately 418,000 unique values [5].
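
As a rough illustration of how the Label column might be inspected, the sketch below counts label values with Python's standard csv module. It runs on a small in-memory placeholder rather than the real file, and the header names (Content, Label) are assumed from the column list above; the actual file header may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for HateSpeechDataset.csv. The header names are
# assumed from the column list and the rows are invented placeholders.
SAMPLE_CSV = """Content,Label
have a great day :),0
placeholder hateful post,1
love this #sunset,0
"""

def count_labels(csv_file):
    """Count how many rows carry each Label value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Label"] for row in reader)

counts = count_labels(io.StringIO(SAMPLE_CSV))
hateful_share = counts["1"] / sum(counts.values())
print(counts, round(hateful_share, 3))
```

To run the same function on the downloaded file, pass an open file handle instead, for example `open("HateSpeechDataset.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption. With the counts quoted above, hateful entries would make up roughly 18% of the labelled rows.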

Distribution

The dataset is provided as a CSV file named HateSpeechDataset.csv, with a size of 201.55 MB [4]. It consists of three primary columns (Content, Label, Content_int/id) [4]. The data is text-based and has been annotated, analysed, and filtered [2]. It comprises approximately 441,000 records in total, all of which are valid, with no mismatched or missing values in the primary content and label columns [4]. The textual content retains emojis, emoticons, and contractions, and each entry is assigned to either the hateful or the non-hateful class [2].
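
The claim that no content or label values are missing or mismatched can be spot-checked after download. The following is a minimal sketch of such a check, not the provider's own validation; the header names Content and Label are assumed, and the sample rows are invented to show what a failure would look like.

```python
import csv
import io

VALID_LABELS = {"0", "1"}

# Hypothetical sample with two deliberately broken rows.
SAMPLE_CSV = """Content,Label
good morning :),0
,1
see you l8r,2
"""

def find_invalid_rows(csv_file):
    """Return (line_number, reason) pairs for rows that fail basic checks."""
    problems = []
    reader = csv.DictReader(csv_file)
    for row in reader:
        content = (row.get("Content") or "").strip()
        label = (row.get("Label") or "").strip()
        if not content:
            problems.append((reader.line_num, "missing Content"))
        elif label not in VALID_LABELS:
            problems.append((reader.line_num, f"unexpected Label {label!r}"))
    return problems

problems = find_invalid_rows(io.StringIO(SAMPLE_CSV))
print(problems)
```

An empty list returned for the real file would be consistent with the description above.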

Usage

This dataset is ideally suited for training machine learning models aimed at identifying hate speech on social media [2]. It can be utilised by Deep Learning (DL) and Natural Language Processing (NLP) practitioners for various detection techniques [3]. Social media managers, administrators, or companies can leverage it to develop automated systems for filtering out hateful content, improving content moderation efforts [2]. Additionally, it serves as a benchmark dataset for new research and advancements in hate speech detection [3]. Researchers can benefit from its pre-processed nature and adherence to policy guidelines for their projects [3].
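
Because non-hateful entries outnumber hateful ones by roughly 4.6 to 1 in the counts quoted in the Columns section, a model trained on the raw data may favour the majority class. One common mitigation, shown here as a sketch and not as something the dataset authors prescribe, is to weight each class inversely to its frequency. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(label_counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {
        label: total / (n_classes * count)
        for label, count in label_counts.items()
    }

# Approximate counts quoted in the listing: 0 = non-hateful, 1 = hateful.
weights = balanced_class_weights({0: 361_594, 1: 79_305})
print(weights)
```

The resulting dictionary gives the minority (hateful) class the larger weight and can be supplied to any training routine that accepts per-class weights.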

Coverage

The dataset focuses on hate speech in English text collected from social media platforms, reflecting current trends in online communication [1, 2]. No specific geographic scope is given, but its origin in social media implies a broad, potentially global, context. No time range is stated beyond "current trends," and the dataset is not expected to be updated frequently [4]. Its design is neutralised, meaning it avoids specific entities and names, so it can be used widely without raising privacy or harm concerns for the people who wrote the original posts [3].

License

Creative Commons Attribution 4.0 International (CC BY 4.0)

Who Can Use It

Social media companies and administrators: To create automatic systems for filtering hateful content [2].
Deep Learning and Natural Language Processing practitioners: For detecting hateful speech using advanced AI techniques [3].
Researchers: To leverage the pre-processed dataset for their projects and as a benchmark for hate speech detection [3].
Anyone: Due to its neutralised nature, it is broadly accessible and safe for various applications [3].

Dataset Name Suggestions

Hate Speech Detection: Curated Social Media Dataset
Social Media Hate Speech Text Classification Dataset
Modern Hate Speech Dataset for NLP
Annotated Social Media Hate Speech

Attributes

Original Data Source: Modern Hate Speech Dataset for NLP

Listing Stats

Listed: 26/07/2025
Region: Global
Quality: 5 / 5
Version: 1.0
