Social Media Gender Prediction Dataset

Data Science and Analytics

Tags and Keywords

  • Earth and Nature
  • Online Communities
  • Text
  • NLP
  • Classification
  • Pre-processing

Free

About

This dataset was originally compiled to train a CrowdFlower artificial intelligence (AI) gender predictor. It comprises data gathered from Twitter profiles, where contributors evaluated each user as either male, female, or a non-individual entity (e.g., a brand). The dataset is valuable for machine learning tasks, particularly for developing and testing gender classification models based on social media profile information.

Columns

  • unit_id: A unique identifier for each entry or unit within the dataset.
  • golden: A boolean indicator specifying if the record is a 'golden' record, often used for quality control or ground truth in annotation tasks.
  • unit_state: The current state of the data unit, such as 'finalized' or 'golden'.
  • trusted_judgments: The count of trusted judgments received for a particular unit.
  • last_judgment_at: The timestamp indicating when the last judgment was made on the unit.
  • gender: The classified or predicted gender, which can be 'male', 'female', or 'brand'.
  • gender_confidence: A numerical score representing the confidence level of the gender classification.
  • profile_yn: A boolean indicator showing whether a profile was available for evaluation.
  • profile_yn_confidence: The confidence score associated with the availability of the profile.
  • created: The timestamp when the data record was created.

The dataset also contains details related to Twitter profiles, such as user names, random tweets, account profile and image information, location data, and even link and sidebar colours, which are associated with these judgment metrics.
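The judgment columns above lend themselves to a simple quality filter before modelling. The following is a minimal sketch, not part of the original listing: it assumes the CSV header uses the column names exactly as written here, that gender_confidence parses as a float, and that profile_yn is encoded as a yes/no or true/false string. The sample rows and the 0.9 threshold are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the real file's headers and value encodings may differ.
SAMPLE_CSV = """unit_id,gender,gender_confidence,profile_yn
1,male,1.0,yes
2,female,0.67,yes
3,brand,1.0,yes
4,female,1.0,no
"""


def confident_rows(csv_file, min_confidence=0.9):
    """Keep rows with an available profile and a sufficiently confident gender label."""
    kept = []
    for row in csv.DictReader(csv_file):
        # Assumption: profile availability is stored as a yes/true-like string.
        if row["profile_yn"].strip().lower() not in {"yes", "true", "1"}:
            continue
        try:
            confidence = float(row["gender_confidence"])
        except ValueError:
            # Skip rows whose confidence is blank or not numeric.
            continue
        if confidence >= min_confidence:
            kept.append(row)
    return kept


rows = confident_rows(io.StringIO(SAMPLE_CSV))
print([(r["unit_id"], r["gender"]) for r in rows])
```

Lowering min_confidence keeps more rows at the cost of noisier labels; which threshold is appropriate depends on the task.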

Distribution

The dataset contains 20,000 rows and is typically supplied as a tabular CSV (comma-separated values) file with the columns listed above. The file size is not specified.
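Because the file size and exact header are not given, it can help to check a downloaded copy against this description before using it. The sketch below counts data rows and reports which of the listed columns are absent. It is an illustration only: the expected names are taken from the Columns section, and the two-row sample is hypothetical.

```python
import csv
import io

# Column names as listed in this description; the real file's header may differ.
EXPECTED_COLUMNS = [
    "unit_id", "golden", "unit_state", "trusted_judgments", "last_judgment_at",
    "gender", "gender_confidence", "profile_yn", "profile_yn_confidence", "created",
]


def summarise_csv(csv_file, expected=EXPECTED_COLUMNS):
    """Return (data_row_count, missing_column_names) for an open CSV file."""
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader, [])]
    missing = [name for name in expected if name not in header]
    # Count non-empty data rows after the header.
    row_count = sum(1 for row in reader if row)
    return row_count, missing


# Hypothetical two-row sample that lacks the 'created' column.
sample = io.StringIO(
    "unit_id,golden,unit_state,trusted_judgments,last_judgment_at,"
    "gender,gender_confidence,profile_yn,profile_yn_confidence\n"
    "1,FALSE,golden,3,,male,1.0,yes,1.0\n"
    "2,FALSE,golden,3,,brand,0.8,yes,1.0\n"
)
row_count, missing = summarise_csv(sample)
print(row_count, missing)
```

For a downloaded file, the same function can be called on a handle opened with newline=""; the file name and encoding are not stated in this listing.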

Usage

This dataset is ideal for training and evaluating machine learning models focused on gender prediction. It can be used in natural language processing (NLP) research, social media analytics, and for projects requiring demographic inference from online profiles. Researchers and developers can leverage it to understand patterns in social media data and build predictive algorithms.
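As one illustration of the kind of model this usage describes, the sketch below implements a small multinomial Naive Bayes text classifier with add-one smoothing, using only the standard library. It is not the CrowdFlower predictor and not a recommended production approach. The training texts are invented placeholders that carry no real signal, and the labels simply reuse the three values named in this description.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples; real training would use the dataset's text fields.
train_texts = [
    "official store new deals every week",
    "shop our summer sale free shipping",
    "my weekend hike photos and coffee",
    "watching the match with friends tonight",
]
train_labels = ["brand", "brand", "female", "male"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free shipping on all deals"))
```

With real data, the labels would come from the gender column and the texts from the tweet or profile fields, ideally after holding out a test split to measure accuracy.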

Coverage

The dataset's coverage is global. The period for which data statistics are available runs from August 2006 to October 2015; whether this matches the actual collection window is not stated. The demographic scope is limited to classifying Twitter users as male, female, or non-individual entities (brands).
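The stated date range can be checked against the timestamp columns of a downloaded copy. The sketch below finds the earliest and latest parseable values in a list of timestamps. The format string is an assumption, since this listing does not say how timestamps are encoded, and the sample values are hypothetical.

```python
from datetime import datetime


def created_span(values, fmt="%m/%d/%y %H:%M"):
    """Return (earliest, latest) datetimes among parseable values, or None if none parse."""
    parsed = []
    for value in values:
        value = value.strip()
        if not value:
            continue
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            # Ignore values that do not match the assumed format.
            continue
    if not parsed:
        return None
    return min(parsed), max(parsed)


# Hypothetical timestamp strings in the assumed month/day/year format.
sample_created = ["12/5/13 1:48", "8/5/06 9:30", "", "10/26/15 12:40", "not a date"]
earliest, latest = created_span(sample_created)
print(earliest.date(), latest.date())
```

If the real file uses a different timestamp layout, only the fmt argument needs to change.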

License

CC0

Who Can Use It

This dataset is suitable for:
  • Data Scientists and Machine Learning Engineers: For training and validating AI models, especially gender classifiers.
  • Researchers: To conduct studies on social media behaviour, demographics, and online identity.
  • Academics: For educational purposes, demonstrating data analysis and predictive modelling techniques.
  • Developers: To integrate gender prediction capabilities into applications.

Dataset Name Suggestions

  • Twitter Gender Classification Data
  • Social Media Gender Prediction Dataset
  • CrowdFlower Gender Classifier Data
  • Online Profile Gender Dataset

Attributes

Original Data Source: Gender Classifier Data

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 22/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
