Gender Prediction from Tweet Typo Data
Social Media and Networking
Free
About
This dataset provides simple Twitter analytics data covering user profiles and tweet content. Its primary purpose is to support gender classification from tweet characteristics, in particular by exploring how likely users of different genders are to make typos in their tweets. It is aimed at newcomers to Natural Language Processing (NLP) who want to apply basic models to real-world social media data. The dataset includes unformatted tweet text, user information, and confidence scores for several attributes.
Columns
The dataset contains the following key columns:
- _unit_id: A unique identifier for the unit.
- Tweet ID: The unique identifier for a tweet.
- _golden: Indicates whether a user is a Golden User.
- _unit_state: The state of the tweet.
- _trusted_judgments: The level of trust associated with the judgment.
- _last_judgment_at: The timestamp of the last judgment.
- gender: The declared or inferred gender of the user.
- gender:confidence: The confidence level associated with the gender classification.
- profile_yn: A boolean indicating whether the user's profile is active or exists.
- profile_yn:confidence: The confidence level for the profile's existence.
- created: The date and time when the user's account was created.
- Label Count: A count related to various labels within the dataset.
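As a rough illustration of how these columns might be read, the sketch below parses a few rows with Python's standard csv module and keeps only rows whose gender label meets a confidence threshold. The inline sample values are made up, and the encodings it assumes (TRUE/FALSE for _golden, yes/no for profile_yn, a decimal between 0 and 1 for the confidence columns) are assumptions that should be checked against the actual file. The function name load_rows and the output keys are hypothetical.

```python
import csv
import io

# Tiny inline sample standing in for the real file; all values are invented.
SAMPLE_CSV = """_unit_id,_golden,gender,gender:confidence,profile_yn,profile_yn:confidence
1,FALSE,male,1.0,yes,1.0
2,FALSE,female,0.66,yes,1.0
3,TRUE,brand,0.34,no,0.9
"""


def load_rows(handle, min_confidence=0.0):
    """Yield typed rows, skipping gender labels below min_confidence."""
    for row in csv.DictReader(handle):
        # Treat a blank confidence cell as 0.0 (an assumption).
        confidence = float(row["gender:confidence"] or 0.0)
        if confidence < min_confidence:
            continue
        yield {
            "unit_id": row["_unit_id"],
            # Assumed encoding: the strings TRUE / FALSE.
            "golden": row["_golden"].strip().upper() == "TRUE",
            "gender": row["gender"],
            "gender_confidence": confidence,
            # Assumed encoding: the strings yes / no.
            "profile_exists": row["profile_yn"].strip().lower() == "yes",
        }


rows = list(load_rows(io.StringIO(SAMPLE_CSV), min_confidence=0.5))
print(len(rows), [r["gender"] for r in rows])
```

For the real file, the io.StringIO object would be replaced by an opened file handle, for example open(path, newline="", encoding="utf-8"), where the path and encoding depend on how the file is distributed.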
Distribution
The dataset is provided as a single data file, typically in CSV format. It comprises approximately 20,000 records. The structure includes various data types, such as IDs, boolean indicators, numerical confidence scores, and datetime stamps.
Usage
This dataset is ideal for:
- Classifying user gender based on tweet content and user profile information.
- Analysing spelling errors or typos in tweets in relation to user demographics.
- Developing and testing Natural Language Processing (NLP) models, particularly for tasks like text classification and sentiment analysis.
- Exploring patterns in social media behaviour and user characteristics on Twitter.
- Educational purposes for those new to applying machine learning techniques to real-world tweet data.
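One simple way to approach the typo analysis mentioned above is to treat any word missing from a reference vocabulary as a possible typo and compare the average rate per gender label. The sketch below is a crude proxy only: the vocabulary is a tiny hypothetical set, the sample tweets are invented, and slang, names, and non-English words would all count as typos under this definition. The function names are illustrative and not part of the dataset.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabulary; a real run would load a full word list.
KNOWN_WORDS = {"i", "love", "this", "song", "the", "game", "was", "great", "so", "much"}


def typo_rate(text, vocabulary=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens not found in the vocabulary."""
    # Drop URLs, @mentions and #hashtags before tokenising.
    cleaned = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    tokens = re.findall(r"[a-z']+", cleaned)
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


def mean_typo_rate_by_gender(records):
    """Average the per-tweet typo rate for each gender label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [sum of rates, tweet count]
    for gender, text in records:
        totals[gender][0] += typo_rate(text)
        totals[gender][1] += 1
    return {gender: total / count for gender, (total, count) in totals.items()}


# Invented (gender, tweet text) pairs for illustration only.
sample = [
    ("female", "I love this song so much"),
    ("male", "the gmae was graet @friend"),
]
rates = mean_typo_rate_by_gender(sample)
print(rates)
```

Any difference found this way would need a proper dictionary and a significance test before being read as a demographic pattern.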
Coverage
The dataset is listed with global geographical coverage. Tweet activity appears to be concentrated around 26th to 27th October 2015, while user account creation dates span a much broader period, from 5th August 2006 to 26th October 2015. The gender distribution is approximately 33% female, 31% male, and 36% categorised as 'Other'.
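The gender shares quoted above can be recomputed from the label column with a simple count. The sketch below uses synthetic labels built to match the stated proportions, then derives the rough record counts those shares would imply for about 20,000 rows. The label strings and the function name shares are assumptions, and the real column may use different values.

```python
from collections import Counter


def shares(labels):
    """Return each label's share of the total as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Synthetic labels built to match the proportions quoted in the text.
labels = ["female"] * 33 + ["male"] * 31 + ["other"] * 36
distribution = shares(labels)

# Rough record counts implied by those shares for about 20,000 rows.
APPROX_RECORDS = 20_000
implied = {
    label: round(APPROX_RECORDS * pct / 100)
    for label, pct in distribution.items()
}
print(distribution)
print(implied)
```

Under these assumptions the three groups come to roughly 6,600, 6,200, and 7,200 records, so the classes are close to balanced.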
License
CC0
Who Can Use It
This dataset is primarily intended for:
- Data scientists and analysts interested in social media analytics and user behaviour.
- Machine learning practitioners, especially those working on classification problems and NLP tasks.
- Students and researchers in fields such as computer science, linguistics, and social sciences.
- NLP enthusiasts who are developing or looking to test basic linear or naive models on real-world text data.
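If "naive models" is read as naive Bayes, a basic text classifier of that kind can be sketched in pure Python as follows. This is a minimal multinomial naive Bayes with add-one smoothing and whitespace tokenisation, not a method prescribed by the dataset. The training texts and the neutral labels class_a and class_b are invented placeholders; with the real data, the labels would come from the gender column and the texts from the tweet text.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes over whitespace tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # additive smoothing strength
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples with placeholder labels.
texts = [
    "buy our new product today",
    "new product sale today",
    "had a great day with friends",
    "great coffee with friends",
]
labels = ["class_a", "class_a", "class_b", "class_b"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("sale on our product"))
print(model.predict("coffee with friends"))
```

A library implementation with proper tokenisation and a held-out test split would be the usual next step once the toy version is understood.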
Dataset Name Suggestions
- Twitter User Profile & Activity Data
- Gender Prediction from Tweet Typo Data
- Social Media Analytics: Twitter User Gender
- Tweet Classification for Gender Studies
Attributes
Original Data Source: Twitter Data