Modern Hate Speech Dataset for NLP
Data Science and Analytics
Free
About
This dataset is curated for hate speech detection on social media text, in response to the widespread circulation of hateful textual content on online platforms [1]. It is designed to capture current trends in hate speech and therefore retains emoticons, emojis, hashtags, slang, and contractions [1, 2]. Every entry belongs to one of two classes, hateful or non-hateful, which makes the dataset suitable for training machine learning models to identify and categorise hate speech [1, 2]. It can help social media managers, administrators, and companies build automatic systems that filter out hateful content, thereby promoting safer online environments [2]. It also serves as a neutralised benchmark dataset: it does not include entities or names that could cause cyber harm or impact users, so it can be used in a wide range of research projects [3].
Columns
- Content: Represents the input text from social media. It contains approximately 418,000 unique text entries [4].
- Label: Denotes the label for each text entry, indicating whether it is hateful or non-hateful. Labels are binary: '0' for non-hateful and '1' for hateful content. There are 361,594 non-hateful entries and 79,305 hateful entries, or 440,899 labelled entries in total [3, 4].
- Content_int: An integer representation related to the content, with approximately 418,000 unique values [5].
- id: An identifier column, also with approximately 418,000 unique values [5].
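The column layout above can be checked with a short script. The following is a minimal sketch, assuming the CSV header uses exactly the column names listed (Content, Label, Content_int, id); the sample rows and the `summarise` helper are hypothetical placeholders, not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical four-row sample in the layout described above. The texts and
# the Content_int/id values are invented placeholders, not real dataset rows.
SAMPLE_CSV = """Content,Label,Content_int,id
have a great day :),0,101,1
nice game tonight #win,0,102,2
<hateful example text>,1,103,3
have a great day :),0,101,4
"""

def summarise(handle):
    """Count rows, unique Content values, and rows per Label value."""
    reader = csv.DictReader(handle)
    label_counts = Counter()
    unique_texts = set()
    total = 0
    for row in reader:
        total += 1
        unique_texts.add(row["Content"])
        label_counts[row["Label"]] += 1
    return total, len(unique_texts), dict(label_counts)

total, unique, labels = summarise(io.StringIO(SAMPLE_CSV))
print(total, unique, labels)
```

To run the same summary on the real file, pass an open file handle instead of the in-memory sample, for example `open("HateSpeechDataset.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need adjusting.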
Distribution
The dataset is provided as a CSV file named HateSpeechDataset.csv, with a size of 201.55 MB [4]. It consists of three primary columns (Content, Label, Content_int/id) [4]. The data itself is text-based and is in an annotated, analysed, and filtered format [2]. It comprises approximately 441,000 records in total, all of which are valid and free from mismatched or missing values in the primary content and label columns [4]. The structure bundles emojis, emoticons, and contractions within the textual content, categorised into either hateful or non-hateful classes [2].
Usage
This dataset is ideally suited for training machine learning models aimed at identifying hate speech on social media [2]. It can be utilised by Deep Learning (DL) and Natural Language Processing (NLP) practitioners for various detection techniques [3]. Social media managers, administrators, or companies can leverage it to develop automated systems for filtering out hateful content, improving content moderation efforts [2]. Additionally, it serves as a benchmark dataset for new research and advancements in hate speech detection [3]. Researchers can benefit from its pre-processed nature and adherence to policy guidelines for their projects [3].
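When training a classifier on this data, the class imbalance in the Label column is worth accounting for. The sketch below uses only the label counts reported in the Columns section and computes inverse-frequency class weights; the helper names are hypothetical, and weighting is one common option rather than a method prescribed by the dataset authors.

```python
# Label counts as reported in the Columns section: 0 = non-hateful, 1 = hateful.
COUNTS = {0: 361_594, 1: 79_305}

def class_shares(counts):
    """Fraction of all rows that fall into each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def balanced_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

shares = class_shares(COUNTS)
weights = balanced_weights(COUNTS)

print(f"hateful share: {shares[1]:.1%}")
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

The hateful class makes up roughly 18% of the labelled entries, so an unweighted model could reach high accuracy while missing much of the hateful content. The weight formula matches the "balanced" heuristic used by scikit-learn's `class_weight` option; metrics such as per-class precision and recall are usually more informative than accuracy on data with this skew.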
Coverage
The dataset focuses on hate speech in English text collected from social media platforms, reflecting current trends in online communication [1, 2]. While a specific geographic scope is not detailed, its origin from social media implies a broad, potentially global, context. There is no specific time range mentioned beyond "current trends," and the dataset is not expected to be updated frequently [4]. Its design is neutralised, meaning it avoids including specific entities or names, ensuring it can be used widely without privacy or harm concerns related to content generators [3].
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Social media companies and administrators: To create automatic systems for filtering hateful content [2].
- Deep Learning and Natural Language Processing practitioners: For detecting hateful speech using advanced AI techniques [3].
- Researchers: To leverage the pre-processed dataset for their projects and as a benchmark for hate speech detection [3].
- Anyone: Due to its neutralised nature, it is broadly accessible and safe for various applications [3].
Dataset Name Suggestions
- Hate Speech Detection: Curated Social Media Dataset
- Social Media Hate Speech Text Classification Dataset
- Modern Hate Speech Dataset for NLP
- Annotated Social Media Hate Speech
Attributes
Original Data Source: Modern Hate Speech Dataset for NLP