Social Media Gender Prediction Dataset
Data Science and Analytics
Free
About
This dataset was originally compiled to train a CrowdFlower artificial intelligence (AI) gender predictor. It comprises data gathered from Twitter profiles, where contributors classified each user as male, female, or a non-individual entity (e.g., a brand). The dataset is useful for machine learning tasks, particularly for developing and testing gender classification models based on social media profile information.
Columns
- unit_id: A unique identifier for each entry or unit within the dataset.
- golden: A boolean indicator specifying if the record is a 'golden' record, often used for quality control or ground truth in annotation tasks.
- unit_state: The current state of the data unit, such as 'finalised' or 'golden'.
- trusted_judgments: The count of trusted judgments received for a particular unit.
- last_judgment_at: The timestamp indicating when the last judgment was made on the unit.
- gender: The classified or predicted gender, which can be 'male', 'female', or 'brand'.
- gender_confidence: A numerical score representing the confidence level of the gender classification.
- profile_yn: A boolean indicator showing whether a profile was available for evaluation.
- profile_yn_confidence: The confidence score associated with the availability of the profile.
- created: The timestamp when the data record was created.
The dataset also contains details related to Twitter profiles, such as user names, random tweets, account profile and image information, location data, and even link and sidebar colours, which are associated with these judgment metrics.
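As an illustration of how the judgment columns above might be used together, the following minimal sketch keeps only rows whose gender label meets a confidence threshold and sets aside 'golden' records. The sample rows, the TRUE/FALSE text encoding of booleans, the 0.9 threshold, and the helper names are all assumptions made for this example, not properties confirmed by the dataset.

```python
import csv
import io

# Hypothetical sample rows; values are illustrative, not taken from the dataset.
SAMPLE_CSV = """unit_id,golden,unit_state,trusted_judgments,gender,gender_confidence
1,FALSE,finalised,3,male,1.0
2,FALSE,finalised,3,female,0.66
3,TRUE,golden,10,brand,1.0
4,FALSE,finalised,3,female,0.95
"""

def parse_bool(value):
    # Assumption: booleans are stored as text such as TRUE/FALSE.
    return value.strip().lower() in {"true", "1", "yes"}

def high_confidence_rows(rows, threshold=0.9):
    """Keep non-golden rows whose gender label meets the confidence threshold."""
    kept = []
    for row in rows:
        if parse_bool(row["golden"]):
            continue  # set aside quality-control records
        if float(row["gender_confidence"]) >= threshold:
            kept.append(row)
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = high_confidence_rows(rows)
print([(r["unit_id"], r["gender"]) for r in selected])
# [('1', 'male'), ('4', 'female')]
```

Whether to drop golden records or low-confidence labels depends on the task; some projects may prefer to keep golden records as trusted examples instead.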
Distribution
The dataset contains 20,000 rows and is typically supplied as a CSV (comma-separated values) file. The file size is not specified. The data is a single table with the columns listed above.
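A short sketch of reading such a CSV with Python's standard library and checking it against the column list in this description is shown below. The column names are taken from this page and may differ in the actual file; the two inline rows are a hypothetical stand-in for the real file, and `load_rows` is an illustrative helper name.

```python
import csv
import io

# Column names as listed in this description; the real file may use different
# names or include additional profile columns.
EXPECTED_COLUMNS = [
    "unit_id", "golden", "unit_state", "trusted_judgments", "last_judgment_at",
    "gender", "gender_confidence", "profile_yn", "profile_yn_confidence", "created",
]

def load_rows(handle):
    """Read CSV rows as dicts and report any expected columns that are missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    rows = list(reader)
    return rows, missing

# Hypothetical two-row stand-in for the real file, which is not bundled here.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "1,FALSE,finalised,3,,male,1.0,yes,1.0,\n"
    "2,FALSE,finalised,3,,brand,0.7,yes,1.0,\n"
)
rows, missing = load_rows(sample)
print(len(rows), missing)
# 2 []
```

For the real file, the same function could be given a handle from `open(path, newline="")`, where the path and any required encoding are not stated on this page and would need to be checked. A full copy would be expected to yield 20,000 rows.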
Usage
This dataset is ideal for training and evaluating machine learning models focused on gender prediction. It can be used in natural language processing (NLP) research, social media analytics, and for projects requiring demographic inference from online profiles. Researchers and developers can leverage it to understand patterns in social media data and build predictive algorithms.
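When evaluating a gender prediction model on this data, a common first step is to hold out a test set and compare against a trivial baseline. The sketch below shows one way to do a stratified split and compute a majority-class baseline using only the standard library. The label counts are hypothetical stand-ins for the gender column, and the function names, 20% test fraction, and seed are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each label split in similar proportion."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    hits = sum(1 for i in test_idx if labels[i] == majority)
    return hits / len(test_idx)

# Hypothetical labels standing in for the gender column.
labels = ["female"] * 50 + ["male"] * 30 + ["brand"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
# 80 20
print(majority_baseline_accuracy(labels, train_idx, test_idx))
# 0.5
```

Any trained classifier would be expected to beat this baseline on the held-out rows before its predictions are treated as meaningful.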
Coverage
The dataset's coverage is global. The time range, at least as reflected in the available data statistics, runs from August 2006 to October 2015. The demographic scope is limited to classifying Twitter users by gender (male, female) or as non-individual entities (brands).
License
CC0
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers: For training and validating AI models, especially gender classifiers.
- Researchers: To conduct studies on social media behaviour, demographics, and online identity.
- Academics: For educational purposes, demonstrating data analysis and predictive modelling techniques.
- Developers: To integrate gender prediction capabilities into applications.
Dataset Name Suggestions
- Twitter Gender Classification Data
- Social Media Gender Prediction Dataset
- CrowdFlower Gender Classifier Data
- Online Profile Gender Dataset
Attributes
Original Data Source: Gender Classifier Data