Opendatabay APP

Cyberbullying Tweets Classification Dataset

Data Science and Analytics

Tags and Keywords

Text, Social, NLP, People, Cyberbullying, Tweets, Detection

Free

About

This dataset is intended to help combat the rise of cyberbullying, which has been exacerbated by increased social media usage and by global events such as the COVID-19 pandemic. Its primary purpose is to support models that automatically flag potentially harmful tweets and help identify patterns of online hatred. Social media is now an essential medium for communication across all age groups, so cyberbullying can affect individuals anywhere and at any time, and the relative anonymity of the internet makes it particularly difficult to stop.

The dataset contains over 47,000 tweets, each labelled with one of six classes: Age, Ethnicity, Gender, Religion, Other type of cyberbullying, and Not cyberbullying. The classes are balanced at approximately 8,000 entries each, making the data suitable for model training. A substantial percentage of students have experienced or observed cyberbullying, with effects ranging from decreased academic performance to suicidal thoughts.

Columns

  • tweet_text: The full text content of the tweet.
  • cyberbullying_type: The category of cyberbullying identified in the tweet.
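
As a minimal sketch of reading the two columns listed above, the following uses only Python's standard library. The inline sample stands in for the downloaded CSV: its rows are invented placeholders, and the label strings shown (`not_cyberbullying`, `age`) are assumptions about how the classes are spelled in the file.

```python
import csv
import io

# Hypothetical stand-in for the downloaded CSV. These rows are invented
# placeholders, not records from the dataset, and the label spellings
# are assumptions.
SAMPLE_CSV = """tweet_text,cyberbullying_type
"placeholder tweet one, with a comma",not_cyberbullying
placeholder tweet two,age
"""

def load_rows(handle):
    """Read (tweet_text, cyberbullying_type) pairs from an open CSV handle."""
    reader = csv.DictReader(handle)
    expected = {"tweet_text", "cyberbullying_type"}
    missing = expected - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row["tweet_text"], row["cyberbullying_type"]) for row in reader]

rows = load_rows(io.StringIO(SAMPLE_CSV))
```

For a real file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the download was saved.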

Distribution

The dataset comprises more than 47,000 tweets, balanced to approximately 8,000 entries per category. The data is typically provided as a CSV file, and a sample file may be available separately on the platform. The structure suits multi-class classification: each tweet is assigned to one type of cyberbullying or marked as not cyberbullying.
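
The stated class balance can be checked after loading. The sketch below counts labels and tests whether every class share lies close to an even split. The function names, the `tolerance` parameter, and the toy labels are all hypothetical and chosen for illustration.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, share of total)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

def is_balanced(labels, tolerance=0.05):
    """True if every class share is within `tolerance` of an even split."""
    balance = class_balance(labels)
    if not balance:
        raise ValueError("no labels given")
    even_share = 1 / len(balance)
    return all(abs(share - even_share) <= tolerance
               for _, share in balance.values())

# Invented toy labels, not dataset records.
toy_labels = ["age"] * 8 + ["gender"] * 8 + ["religion"] * 7
```

With six classes of roughly 8,000 tweets each, every share should sit near one sixth; a class that falls outside the tolerance would point to a different file version or a loading problem.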

Usage

This dataset is suited to applications including:
  • Developing multi-class classification models that predict the specific type of cyberbullying present in a tweet.
  • Building binary classification models that flag potentially harmful tweets.
  • Conducting exploratory data analysis to identify the words, phrases, and linguistic patterns associated with each type of cyberbullying.
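
Two of the uses above can be sketched with the standard library alone: collapsing the six classes into a binary target, and counting the most frequent tokens per class as a first step in exploratory analysis. The negative-class string `not_cyberbullying` is an assumption about the file's label spelling, the tokenizer is deliberately crude, and the toy rows are invented.

```python
import re
from collections import Counter, defaultdict

# Assumed label string for the negative class; check it against the file.
NOT_BULLYING = "not_cyberbullying"

def to_binary(label):
    """Collapse the six classes into 1 (any cyberbullying) or 0 (none)."""
    return 0 if label == NOT_BULLYING else 1

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def top_tokens_by_class(rows, k=3):
    """Map each label to its k most frequent tokens."""
    counts = defaultdict(Counter)
    for text, label in rows:
        counts[label].update(tokenize(text))
    return {label: [tok for tok, _ in c.most_common(k)]
            for label, c in counts.items()}

# Invented placeholder rows, not dataset records.
toy_rows = [
    ("school school bus", "age"),
    ("school lunch", "age"),
    ("nice weather today", "not_cyberbullying"),
]
binary_targets = [to_binary(label) for _, label in toy_rows]
```

On real data, removing stop words before counting would likely give more informative per-class vocabularies; that step is omitted here to keep the sketch short.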

Coverage

The dataset's scope is global, reflecting the worldwide reach of social media. No specific time range is given, but the original research paper dates from late 2020 and frames the data in terms of the increased risk of cyberbullying during the COVID-19 pandemic, which suggests the tweets come from around that period. The data covers general discourse on Twitter and is relevant to anyone using social media platforms, though the statistics highlight the impact on middle and high school students.

License

CC BY

Who Can Use It

This dataset is particularly valuable for:
  • Data scientists and machine learning engineers looking to train and evaluate models for natural language processing (NLP) tasks related to content moderation and sentiment analysis.
  • Academics and researchers focusing on social issues, digital humanities, and the impact of online communication.
  • Organisations and developers aiming to build tools or systems for identifying and combating cyberbullying on social media platforms.

Dataset Name Suggestions

  • Cyberbullying Tweets Classification
  • Social Media Cyberbullying Detector
  • Twitter Bullying Classification
  • Fine-Grained Cyberbullying Dataset

Attributes

Original Data Source: Cyberbullying Classification

Listing Stats

  • Views: 4
  • Downloads: 0
  • Listed: 05/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
