Global Name Gender Data
Data Science and Analytics
Free
About
This dataset offers a mapping of first names to genders, providing both the raw counts and calculated probabilities for each gender association. It integrates data on male and female baby names from official government sources across several countries, including the United States, the United Kingdom (specifically England and Wales), Canada (British Columbia), and Australia. The primary purpose is to provide a reliable basis for gender attribution based on first names.
Columns
- Name: A string field that lists first or given names. The column contains 133,910 unique names and has no invalid entries across the approximately 147,000 records. 'James' is the most frequently occurring name.
- Gender: This is a categorical string field that indicates the assigned gender, either 'M' for male or 'F' for female. The data shows a distribution where 61% of names are associated with 'F' and 39% with 'M'. There are only two unique values in this column, and it is 100% valid.
- Count: An integer field representing the total occurrences of a specific name-gender combination. The values range significantly, from 1 up to approximately 5.3 million. The average count is around 2,480.
- Probability: A float field indicating the calculated probability of a given name being associated with a particular gender. The probabilities predominantly range from 0.00 to 0.01.
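As a minimal sketch of how these four columns might be read, the example below parses rows with the standard `csv` module and derives, for each name, the share of its total count that falls under each gender. The sample rows are hypothetical and only mimic the documented layout, and the header names are assumed to match the column names listed above. The per-name share is computed from Count rather than read from Probability, since how the Probability column is normalised is not stated here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the documented column layout; real values differ.
SAMPLE_CSV = """Name,Gender,Count,Probability
James,M,900,0.45
James,F,100,0.05
Mary,F,800,0.40
Mary,M,200,0.10
"""


def gender_shares(csv_file):
    """Return {name: {"M": share, "F": share}} computed from the Count column."""
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for row in csv.DictReader(csv_file):
        gender = row["Gender"]
        if gender not in ("M", "F"):
            raise ValueError(f"unexpected gender value: {gender!r}")
        counts[row["Name"]][gender] += int(row["Count"])

    shares = {}
    for name, by_gender in counts.items():
        # Counts are documented as starting at 1, so total is never zero here.
        total = by_gender["M"] + by_gender["F"]
        shares[name] = {g: c / total for g, c in by_gender.items()}
    return shares


shares = gender_shares(io.StringIO(SAMPLE_CSV))
print(shares["James"])
```

To run this against the real file, the `io.StringIO` object could be replaced with `open("name_gender_dataset.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.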
Distribution
The dataset is provided as a CSV file named name_gender_dataset.csv, with a file size of 3.77 MB. It is structured with 4 distinct columns and contains approximately 147,000 records, with all entries confirmed as valid.
Usage
This dataset is highly suitable for a range of analytical and application development purposes. It can be used for text analysis, classification tasks, and clustering initiatives. Specific use cases include building predictive models for gender identification based on names, conducting in-depth demographic research, supporting market segmentation efforts, and enriching various natural language processing applications.
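One of these use cases, gender identification from a name, can be sketched as a simple lookup with a confidence threshold. The function name `guess_gender`, the 0.8 default threshold, the case-folded lookup keys, and the sample counts are all assumptions made for illustration; they are not part of the dataset.

```python
def guess_gender(name, counts, threshold=0.8):
    """Return "M", "F", or None when the name is unknown or too ambiguous.

    `counts` maps a case-folded name to {"M": count, "F": count}
    (a hypothetical in-memory form of the Name, Gender and Count columns).
    """
    by_gender = counts.get(name.strip().casefold())
    if not by_gender:
        return None
    total = sum(by_gender.values())
    if total == 0:
        return None
    best = max(by_gender, key=by_gender.get)
    # Only commit to a label when the majority share is high enough.
    return best if by_gender[best] / total >= threshold else None


# Hypothetical counts, keyed by case-folded name.
COUNTS = {
    "james": {"M": 900, "F": 100},
    "riley": {"M": 550, "F": 450},
}

print(guess_gender("James", COUNTS))
print(guess_gender("Riley", COUNTS))
```

Returning `None` for unknown or closely split names keeps ambiguous cases out of downstream segmentation or modelling steps; the threshold can be tuned to trade coverage against accuracy.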
Coverage
The data spans several key geographic regions and timeframes:
- United States: Information is sourced from Baby Names from Social Security Card Applications, covering the period from 1880 to 2019.
- United Kingdom: Data from Baby names in England and Wales Statistical bulletins, covering 2011 to 2018.
- Canada: British Columbia's 100 Years of Popular Baby names, from 1918 to 2018.
- Australia: Popular Baby Names from the Attorney-General's Department, covering 1944 to 2019.
The dataset's scope is limited to the first/given names of male and female babies born within these periods.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
This dataset is an ideal resource for:
- Data Scientists and Machine Learning Engineers: For developing and refining models that predict gender based on textual name data.
- Researchers and Academics: Engaging in studies related to demographics, social trends, and linguistic patterns concerning names.
- Marketers and Business Analysts: For segmenting audiences and personalising communication strategies by inferring gender from names.
- Software Developers: For integrating name-gender attribution functionalities into diverse applications and services.
Dataset Name Suggestions
- Global Name Gender Data
- First Name Gender Probabilities
- Gender By Name Attributes
- Multinational Baby Names Gender
- Name Gender Classifier Data
Attributes
Original Data Source: Global Name Gender Data