Opendatabay APP

Hate Speech Detection Sinhala

Entertainment & Media Consumption

Tags and Keywords

  • Arts
  • Online
  • Classification
  • NLP
  • Deep


Free

About

This dataset targets hate speech on social media platforms. It provides alphabetically ordered Facebook comments in Sinhala Unicode, each labelled as hate speech or not. Its primary purpose is to support the prediction of hateful, abusive, or insulting comments written in Sinhala. Many English datasets exist for hate speech detection, but comparable resources for Sinhala, the language spoken by the majority of Sri Lanka's population, are scarce, which makes this dataset particularly useful.

Columns

  • id: Unique row identifier for each record, assigned incrementally.
  • comment: The text of the comment posted by a user on Facebook.
  • label: Binary value indicating whether the comment is classified as hate speech (1) or not (0).
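As an illustration of the column layout above, here is a minimal sketch of parsing the rows into typed records with Python's standard library. It assumes the CSV has a header row with the names id, comment, and label; the listing names the columns but does not confirm the header. The `Comment` class, the `load_comments` function, and the sample rows are hypothetical, and the labels on the sample rows are placeholders, not values taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    id: int
    comment: str
    label: int  # 1 = hate speech, 0 = not hate speech


def load_comments(fileobj):
    """Parse rows with id, comment, label columns into Comment records."""
    records = []
    for row in csv.DictReader(fileobj):
        label = int(row["label"])
        if label not in (0, 1):
            # The listing describes label as strictly binary.
            raise ValueError(f"unexpected label {label!r} in row {row['id']}")
        records.append(Comment(int(row["id"]), row["comment"], label))
    return records


# Tiny in-memory sample (placeholder rows, NOT taken from the dataset).
SAMPLE = (
    "id,comment,label\n"
    "1,ආයුබෝවන්,0\n"
    "2,ස්තුතියි,0\n"
    "3,placeholder comment,1\n"
)
sample_records = load_comments(io.StringIO(SAMPLE))
```

For the downloaded file, the same function can be given a handle opened with `encoding="utf-8"` and `newline=""` so that the Sinhala text and any quoted line breaks are read correctly.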

Distribution

The dataset is provided as a single CSV file containing 6,345 records: 2,890 comments labelled as not hate speech (0) and 3,455 labelled as hate speech (1). The file size is not stated.
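The label counts above imply a mild class imbalance. The short sketch below works through the arithmetic using only the figures from this listing; the variable names are illustrative. The majority-class share is a useful floor: a model that always predicts the more common label already reaches that accuracy.

```python
# Label counts as stated in the listing (0 = not hate speech, 1 = hate speech).
counts = {0: 2890, 1: 3455}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common label.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Roughly 54.5% of the comments are labelled as hate speech, so any trained classifier should be compared against that figure rather than against 50%.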

Usage

This dataset is suited to developing and training machine learning models that detect hateful, abusive, or insulting comments on social media. Applications include natural language processing (NLP) tasks, text classification, and deep learning work on content moderation and online safety, particularly for content written in Sinhala.
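To make the classification use case concrete, here is a minimal baseline sketch: a multinomial naive Bayes classifier over character trigrams with add-one smoothing, written with the standard library only. This is one simple technique chosen for illustration, not a method described by the dataset's authors. Character n-grams are used because they need no language-specific tokenizer, which is an assumption about what suits Sinhala text rather than a tested result. The class and function names are hypothetical, and the training strings are English placeholders standing in for real comments.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams; works on any Unicode string."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for c, n_docs in self.class_counts.items():
            total = sum(self.feature_counts[c].values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                # Add-one (Laplace) smoothed log likelihood.
                count = self.feature_counts[c][gram]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Placeholder training data (NOT from the dataset): 1 = hate speech, 0 = not.
train_texts = ["good good day", "nice good morning",
               "bad awful thing", "awful bad stuff"]
train_labels = [0, 0, 1, 1]

model = NaiveBayes().fit(train_texts, train_labels)
predicted_clean = model.predict("good morning")
predicted_abusive = model.predict("awful stuff")
```

In practice the comment and label columns would replace the placeholder lists, and part of the data would be held out so that accuracy can be compared with the majority-class baseline.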

Coverage

The listing marks the dataset's region as global. The dataset covers comments written in Sinhala Unicode. No time range or demographic details are provided beyond the language of the comments.

License

CC BY 4.0

Who Can Use It

This dataset is intended for researchers and developers in artificial intelligence and machine learning, especially those working on natural language processing and deep learning for text classification. It is particularly valuable for individuals or organisations building or improving automated systems that detect hate speech in Sinhala content.

Dataset Name Suggestions

  • Sinhala Unicode Hate Speech Classification
  • Facebook Sinhala Comment Hate Speech
  • Sinhala Social Media Abusive Content Dataset
  • Hate Speech Detection Sinhala
  • Sinhala Language Offensive Comment Dataset

Attributes

Original Data Source: Sinhala Unicode Hate Speech

Listing Stats

  • Views: 8
  • Downloads: 4
  • Listed: 17/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
