Indonesian Social Media Harmful Content

Social Media and Networking

Tags and Keywords

Computer

Software

Internet

Text

NLP

Free

About

This dataset contains Indonesian Twitter text for multi-label detection of hate speech and abusive language. It is a resource for research and development in natural language processing (NLP), particularly for identifying and classifying harmful online content. The original author preprocessed the data to make it easier to use in machine learning tasks. It has been used in undergraduate projects that replicate these preprocessing steps and build detection models.

Columns

No original data sample showing column names is provided. Given the dataset's purpose (multi-label hate speech and abusive language detection), it is expected to contain at least two primary columns:
  • Text: This column contains the raw Indonesian Twitter text, which may include abusive or hate speech content.
  • Labels: This column (or multiple columns) indicates the classification of the text, likely categorising it as 'Abusive' or 'Hate Speech'. The specific format for multi-labels (e.g., boolean flags, a single categorical string) is not detailed.

The data likely also includes a vocabulary of abusive and insulting terms, such as 'alay', 'ampas', 'buta', 'keparat', 'anjing', 'anjir', 'babi', 'bacot', 'bajingan', 'banci', 'bandot', 'buaya', 'bangkai', 'bangsat', 'bego', 'bejat', and 'bencong'.
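As an illustration of how such a vocabulary might be used, here is a minimal Python sketch of a lexicon lookup. The term set is copied from the examples in this listing and is probably incomplete. The function name `find_abusive_terms` and the letters-only tokenisation are assumptions made for this example, not part of the dataset.

```python
import re

# Terms copied from the example vocabulary in this listing. The real
# lexicon, if the dataset ships one, may be larger or differ.
ABUSIVE_TERMS = {
    "alay", "ampas", "buta", "keparat", "anjing", "anjir", "babi", "bacot",
    "bajingan", "banci", "bandot", "buaya", "bangkai", "bangsat", "bego",
    "bejat", "bencong",
}

# Assumption: tokens are runs of ASCII letters after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")


def find_abusive_terms(text):
    """Return the listed terms found in `text`, in order of first appearance."""
    found = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in ABUSIVE_TERMS and token not in found:
            found.append(token)
    return found
```

A lexicon match is only a weak signal. Several of these words also have ordinary meanings ('anjing' is dog, 'babi' is pig, 'buaya' is crocodile, 'buta' is blind), so a lookup like this suits feature extraction or a baseline better than final labelling.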

Distribution

The dataset is typically provided as a CSV file. Row and record counts are not available. The structure is tabular, with each record likely representing one Twitter post and its corresponding labels.
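A minimal Python sketch of reading such a file follows. It assumes one text column and one 0/1 flag column per label. The column names `text`, `abusive`, and `hate_speech` are hypothetical, since the listing does not document the real header, and the sample rows are made-up placeholders rather than dataset content.

```python
import csv
import io

# Hypothetical column names: the listing does not document the real header.
TEXT_COLUMN = "text"
LABEL_COLUMNS = ["abusive", "hate_speech"]


def load_records(handle):
    """Read (text, {label: bool}) pairs from an open CSV file handle."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: each label is stored as the string "1" or "0".
        labels = {name: row[name].strip() == "1" for name in LABEL_COLUMNS}
        records.append((row[TEXT_COLUMN], labels))
    return records


# Made-up placeholder rows, used only to exercise the parser.
SAMPLE_CSV = (
    "text,abusive,hate_speech\n"
    "contoh tweet pertama,0,0\n"
    "contoh tweet kedua,1,0\n"
)
sample_records = load_records(io.StringIO(SAMPLE_CSV))
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`. Adjust the column names once the actual header is known.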

Usage

This dataset is ideal for a variety of applications and use cases, including:
  • Developing and training machine learning models for hate speech and abusive language detection in Indonesian.
  • Conducting academic research in social media analysis, computational linguistics, and natural language processing.
  • Replicating and exploring data preprocessing techniques for text classification.
  • Building content moderation systems for social media platforms.
  • Educational purposes, such as undergraduate projects focusing on text analytics.
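
For the preprocessing use case, a generic tweet normalisation step might look like the Python sketch below. The listing does not describe the original author's preprocessing, so these rules (lowercasing, dropping URLs and @-mentions, keeping only ASCII letters and digits) are assumptions chosen for illustration, and `normalize_tweet` is a hypothetical helper name.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def normalize_tweet(text):
    """Lowercase a tweet and strip URLs, @-mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = NON_ALNUM_RE.sub(" ", text)  # keep ASCII letters, digits, spaces
    return SPACE_RE.sub(" ", text).strip()
```

If the data has already been preprocessed, some of these steps may have no effect, so it is worth checking a few rows before and after applying them.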

Coverage

The geographic scope of the data is global, although the content is specifically in Indonesian. Details regarding the exact time range or specific demographic scope of the Twitter data are not available.

License

CC-BY-SA

Who Can Use It

This dataset is primarily intended for:
  • Researchers: For studies in NLP, social media analytics, and hate speech detection.
  • Students: Especially for undergraduate and postgraduate projects related to text classification and abusive language identification.
  • Developers: Those looking to build or enhance content moderation tools or social media monitoring applications.
  • Data Scientists: For exploring text data and applying machine learning techniques to real-world problems.

Dataset Name Suggestions

  • Indonesian Twitter Abusive Language
  • Hate Speech Detection Indonesian
  • Indonesian Social Media Harmful Content
  • Abusive Language ID Twitter
  • Indonesian NLP Hate Speech Dataset

Attributes

Listing Stats

VIEWS: 1

DOWNLOADS: 0

LISTED: 08/06/2025

REGION: GLOBAL

UDQS QUALITY (Universal Data Quality Score): 5 / 5

VERSION: 1.0
