Gender Prediction from Tweet Typo Data

Social Media and Networking

Tags and Keywords: Internet, Online, Social, Classification, Email, NLP, Gender

Free

About

This dataset provides simple Twitter analytics data covering user profiles and tweet content. Its primary purpose is to support gender classification from tweet characteristics, specifically by exploring how likely users of different genders are to make typos in their tweets. It is a useful resource for newcomers to Natural Language Processing (NLP) who want to apply basic models to real-world social media data. The dataset includes unformatted tweet text, user information, and confidence scores for several attributes.

Columns

The dataset contains the following key columns:

_unit_id: A unique identifier for the unit.
Tweet ID: The unique identifier for a tweet.
_golden: Indicates whether a user is a Golden User.
_unit_state: The state of the tweet.
_trusted_judgments: The level of trust associated with the judgment.
_last_judgment_at: The timestamp of the last judgment.
gender: The declared or inferred gender of the user.
gender:confidence: The confidence level associated with the gender classification.
profile_yn: A boolean indicating whether the user's profile is active or exists.
profile_yn:confidence: The confidence level for the profile's existence.
created: The date and time when the user's account was created.
Label Count: A count related to various labels within the dataset.

Distribution

The dataset is provided as a single data file in CSV format and comprises approximately 20,000 records. Its columns cover several data types: IDs, boolean indicators, numerical confidence scores, and datetime stamps.
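
As a rough illustration of handling these data types, the sketch below reads a few rows with Python's standard csv module and converts the boolean, numeric, and datetime fields. The sample rows, the TRUE/FALSE and yes/no encodings, and the timestamp format are assumptions made for illustration, not values confirmed by this listing; they should be adjusted after inspecting the downloaded file.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; the real file's encodings may differ.
SAMPLE_CSV = """_unit_id,_golden,gender,gender:confidence,profile_yn,created
1,FALSE,male,1.0,yes,12/5/13 1:48
2,TRUE,female,0.6625,yes,10/1/12 13:51
3,FALSE,,,no,
"""

DATE_FORMAT = "%m/%d/%y %H:%M"  # assumed timestamp layout


def parse_row(row, date_format=DATE_FORMAT):
    """Convert one raw CSV row (all strings) into typed Python values."""
    confidence = row["gender:confidence"]
    created = row["created"]
    return {
        "unit_id": int(row["_unit_id"]),
        "golden": row["_golden"].strip().upper() == "TRUE",
        "gender": row["gender"] or None,  # empty string becomes None
        "gender_confidence": float(confidence) if confidence else None,
        "profile_exists": row["profile_yn"].strip().lower() == "yes",
        "created": datetime.strptime(created, date_format) if created else None,
    }


def load_rows(handle):
    """Read typed rows from any file-like object containing the CSV."""
    return [parse_row(row) for row in csv.DictReader(handle)]


rows = load_rows(io.StringIO(SAMPLE_CSV))
```

The same `load_rows` function should accept a handle to the downloaded file in place of the in-memory sample, provided the column names and encodings match the assumptions above.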

Usage

This dataset is ideal for:

Classifying user gender based on tweet content and user profile information.
Analysing spelling errors or typos in tweets in relation to user demographics.
Developing and testing Natural Language Processing (NLP) models, particularly for tasks like text classification and sentiment analysis.
Exploring patterns in social media behaviour and user characteristics on Twitter.
Educational purposes for those new to applying machine learning techniques to real-world tweet data.
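
One simple way to approach the typo analysis listed above is to count tokens that do not appear in a reference word list and compare that rate across gender labels. The sketch below assumes a `text` field holding the tweet body (the column list here does not name it) and uses a tiny hypothetical vocabulary and hypothetical tweets. It is a crude proxy: a real analysis would need a full dictionary and separate handling for hashtags, mentions, slang, and URLs, all of which this approach would count as typos.

```python
import re
from collections import defaultdict

# Tiny hypothetical vocabulary; swap in a real word list for actual use.
VOCABULARY = {"i", "love", "this", "song", "so", "much", "the", "game",
              "was", "great", "tonight", "we", "are", "open", "today"}

# Hypothetical records; "text" is an assumed column name.
TWEETS = [
    {"gender": "female", "text": "I love this song so much"},
    {"gender": "female", "text": "I lvoe this sogn"},
    {"gender": "male", "text": "The game was graet tonight"},
    {"gender": "male", "text": "We are open today"},
]

TOKEN_PATTERN = re.compile(r"[a-z']+")


def typo_rate_by_gender(records, vocabulary):
    """Return {gender: share of tokens not found in the vocabulary}."""
    unknown = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        for token in TOKEN_PATTERN.findall(record["text"].lower()):
            total[record["gender"]] += 1
            if token not in vocabulary:
                unknown[record["gender"]] += 1
    return {gender: unknown[gender] / total[gender] for gender in total}


rates = typo_rate_by_gender(TWEETS, VOCABULARY)
```

Because the rate is computed per token rather than per tweet, long tweets weigh more than short ones; averaging per-tweet rates instead would be an equally reasonable design choice.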

Coverage

The listing gives the dataset's region as Global. Tweet activity appears to be concentrated around 26th to 27th October 2015, whereas user account creation dates span a much broader period, from 5th August 2006 to 26th October 2015. The gender distribution is approximately 33% female, 31% male, and 36% categorised as 'Other'.
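
The gender shares quoted above can be recomputed from the `gender` column, and the `gender:confidence` column allows low-confidence labels to be excluded first. A minimal sketch, assuming confidence values between 0 and 1 and using hypothetical rows in place of the real file:

```python
from collections import Counter

# Hypothetical (gender, gender:confidence) pairs standing in for real rows.
LABELS = [
    ("female", 1.0), ("male", 1.0), ("other", 0.66),
    ("female", 0.34), ("male", 0.95), ("other", 1.0),
    ("female", 1.0), ("other", 1.0),
]


def gender_shares(labels, min_confidence=0.0):
    """Return each gender's share among rows meeting the confidence floor."""
    counts = Counter(gender for gender, confidence in labels
                     if confidence >= min_confidence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gender: count / total for gender, count in counts.items()}


all_shares = gender_shares(LABELS)
confident_shares = gender_shares(LABELS, min_confidence=0.9)
```

Comparing `all_shares` with `confident_shares` shows how much the distribution shifts once uncertain labels are dropped; the 0.9 threshold is arbitrary and chosen only for the example.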

License

CC0

Who Can Use It

This dataset is primarily intended for:

Data scientists and analysts interested in social media analytics and user behaviour.
Machine learning practitioners, especially those working on classification problems and NLP tasks.
Students and researchers in fields such as computer science, linguistics, and social sciences.
NLP enthusiasts who are developing or looking to test basic linear or naive models on real-world text data.

Dataset Name Suggestions

Twitter User Profile & Activity Data
Gender Prediction from Tweet Typo Data
Social Media Analytics: Twitter User Gender
Tweet Classification for Gender Studies

Attributes

Original Data Source: Twitter Data

Listing Stats

Listed: 16/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
