Indonesian Social Media Harmful Content
Social Media and Networking
Free
About
This dataset contains Indonesian Twitter text intended for multi-label detection of hate speech and abusive language. It supports research and development in natural language processing (NLP), particularly the identification and classification of harmful online content. The original author has already preprocessed the data to make it easier to use in machine learning tasks. It has been used in undergraduate projects that replicate these preprocessing steps and develop detection models.
Columns
No data sample showing the column names is provided. Given the dataset's purpose (multi-label hate speech and abusive language detection), it is expected to contain at least two primary columns:
Text: This column contains the raw Indonesian Twitter text, which may include abusive or hate speech content.
Labels: This column (or multiple columns) indicates the classification of the text, likely categorising it as 'Abusive' or 'Hate Speech'. The specific format for multi-labels (e.g., boolean flags, a single categorical string) is not detailed. The data likely includes a vocabulary of abusive and insulting terms, such as 'alay', 'ampas', 'buta', 'keparat', 'anjing', 'anjir', 'babi', 'bacot', 'bajingan', 'banci', 'bandot', 'buaya', 'bangkai', 'bangsat', 'bego', 'bejat', and 'bencong'.
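Because the label format is not documented, a loader may need to handle either layout mentioned above. The following is a minimal sketch, not the author's method. The column names `Abusive` and `Hate Speech`, the `0`/`1` flag encoding, and the `;` separator are all assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical label names: the real column names are not documented.
LABEL_NAMES = ("Abusive", "Hate Speech")


def labels_from_flags(row: Dict[str, str]) -> Set[str]:
    """Layout A (assumed): one 0/1 column per label."""
    return {name for name in LABEL_NAMES if row.get(name, "0").strip() == "1"}


def labels_from_string(value: str, sep: str = ";") -> Set[str]:
    """Layout B (assumed): one delimited string, e.g. 'Abusive;Hate Speech'."""
    return {part.strip() for part in value.split(sep) if part.strip()}
```

Both helpers return a set of label names, so downstream code does not need to know which layout the file actually uses.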
Distribution
The dataset is typically provided as a CSV file. The number of rows is not stated. The structure is tabular, with each record likely representing one Twitter post and its corresponding labels.
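Since neither the column names nor the row count is published, a first step is usually to read the file and inspect both. The sketch below uses Python's standard `csv` module on an in-memory placeholder. The header `Text,Abusive,Hate Speech` and the two sample rows are invented stand-ins, not real records from the dataset.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_records(handle: TextIO) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (column names, rows) from a CSV file with a header row."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Placeholder content standing in for the real file (hypothetical layout).
SAMPLE = (
    "Text,Abusive,Hate Speech\n"
    "placeholder tweet one,0,0\n"
    "placeholder tweet two,1,0\n"
)

columns, rows = read_records(io.StringIO(SAMPLE))
print("columns:", columns)
print("row count:", len(rows))
```

For the real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. UTF-8 is an assumption here and may need adjusting.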
Usage
This dataset suits a range of applications, including:
- Developing and training machine learning models for hate speech and abusive language detection in Indonesian.
- Conducting academic research in social media analysis, computational linguistics, and natural language processing.
- Replicating and exploring data preprocessing techniques for text classification.
- Building content moderation systems for social media platforms.
- Educational purposes, such as undergraduate projects focusing on text analytics.
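As a starting point for the detection and preprocessing use cases above, a simple lexicon lookup can serve as a baseline before any model is trained. The sketch below is an illustration only. The cleaning steps (lowercasing, removing URLs and @mentions) are generic assumptions and are not the original author's documented preprocessing. The term list is the one quoted in the Columns section.

```python
import re
from typing import Set

# Terms taken from the vocabulary listed in this dataset description.
ABUSIVE_TERMS = frozenset({
    "alay", "ampas", "buta", "keparat", "anjing", "anjir", "babi", "bacot",
    "bajingan", "banci", "bandot", "buaya", "bangkai", "bangsat", "bego",
    "bejat", "bencong",
})

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z]+")


def normalise(text: str) -> str:
    """Lowercase and drop URLs and @mentions (assumed cleaning steps)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    return MENTION_RE.sub(" ", text)


def matched_terms(text: str) -> Set[str]:
    """Return the lexicon terms that occur as whole tokens in the text."""
    return set(TOKEN_RE.findall(normalise(text))) & ABUSIVE_TERMS
```

A lexicon match is a weak signal on its own: several listed terms also have neutral everyday meanings (for example 'buta' means blind and 'buaya' means crocodile), so the output is better treated as a feature or a sanity check than as a final label.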
Coverage
The geographic scope of the data is global, although the content is specifically in Indonesian. Details regarding the exact time range or specific demographic scope of the Twitter data are not available.
License
CC-BY-SA
Who Can Use It
This dataset is primarily intended for:
- Researchers: For studies in NLP, social media analytics, and hate speech detection.
- Students: Especially for undergraduate and postgraduate projects related to text classification and abusive language identification.
- Developers: Those looking to build or enhance content moderation tools or social media monitoring applications.
- Data Scientists: For exploring text data and applying machine learning techniques to real-world problems.
Dataset Name Suggestions
- Indonesian Twitter Abusive Language
- Hate Speech Detection Indonesian
- Indonesian Social Media Harmful Content
- Abusive Language ID Twitter
- Indonesian NLP Hate Speech Dataset
Attributes
Original Data Source: Indonesian Abusive and Hate Speech Twitter Text