Opendatabay APP

Global Employer Dataset (Wikidata)

E-commerce & Online Transactions

Tags and Keywords

Business

Computer

NLP



Free

About

This dataset provides a curated and labeled subset of employer entries derived from Wikidata, with the goal of improving the quality and usability of employer data. While Wikidata is an invaluable open resource, direct use often necessitates cleaning. This dataset addresses that need by offering metadata, statistics, and labels to help users identify and utilise valid employer information. An employer is generally defined here as a company or entity that provides employment paying wages or a salary. The dataset specifically screens out entries that do not represent true employers, such as individuals or plurals. It is particularly useful for tasks involving data cleaning, entity recognition, and understanding employment nomenclature.

Columns

  • item_id: The unique Wikidata item identifier (QCode without the 'Q' prefix).
  • employer_count: The number of Wikidata entries associated with this specific employer reference.
  • employer: The text label of the employer's name, sourced from Kensho's English labels.
  • description: The accompanying description of the Wikidata employer entry, also from Kensho.
  • in_google_news: A binary indicator (0 for no, 1 for yes) showing whether the employer name exists within the GoogleNews embedding.
  • language_detected: A three-letter language code, identified using FastText language detection.
  • source: Indicates the origin of the information, such as Wikidata or Wikipedia.
  • label: A binary label (0 for invalid employer, 1 for valid employer) indicating the data's quality.
  • labeled_by: Specifies the method used for labeling, including human, classifier_gnew, classifier_bert, or cleanlab.
  • label_error_reason: Provides the specific reason if a label is deemed an error, such as 'domain' or 'plural'.
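
As a minimal sketch of how these columns might be used together, the following Python snippet reads rows with the standard csv module, keeps those whose label is 1, and rebuilds the full Wikidata identifier by restoring the 'Q' prefix. It assumes the CSV header uses exactly the column names listed above and that label is stored as 0 or 1 (possibly as 0.0 or 1.0); the function names are illustrative and not part of the dataset.

```python
import csv
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple


def read_rows(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield each CSV row as a dict keyed by the header names."""
    yield from csv.DictReader(handle)


def is_valid(row: Dict[str, str]) -> bool:
    """True when the label column marks the row as a valid employer (1).

    Assumption: label is numeric text such as "1" or "1.0"; anything
    empty or non-numeric is treated as not valid.
    """
    try:
        return float(row.get("label") or "") == 1.0
    except ValueError:
        return False


def qid(row: Dict[str, str]) -> str:
    """Rebuild the Wikidata identifier from the prefix-less item_id."""
    return "Q" + row["item_id"].strip()


def valid_employers(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (QID, employer name) pairs for rows labeled as valid."""
    return [(qid(row), row["employer"]) for row in rows if is_valid(row)]
```

With the file downloaded locally, this could be called as valid_employers(read_rows(open("employers.wikidata.all.labeled.csv", newline="", encoding="utf-8"))); the UTF-8 encoding is also an assumption.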

Distribution

This dataset is provided as a single CSV file, employers.wikidata.all.labeled.csv (version 1.0, approximately 5.98 MB). The listing reports 60656 values for item_id, 60456 for employer, and 60640 for description.
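
The listing does not say whether the per-column figures are counts of non-empty values or of distinct values. The sketch below computes both so that a downloaded copy can be compared against the figures quoted here. It assumes rows are dicts such as those produced by csv.DictReader; the function name column_counts is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def column_counts(
    rows: Iterable[Dict[str, str]], columns: List[str]
) -> Dict[str, Tuple[int, int]]:
    """Return {column: (non_empty_count, distinct_count)} for the given columns."""
    filled = {column: 0 for column in columns}
    seen = {column: set() for column in columns}
    for row in rows:
        for column in columns:
            # Missing keys and None are treated the same as an empty cell.
            value = (row.get(column) or "").strip()
            if value:
                filled[column] += 1
                seen[column].add(value)
    return {column: (filled[column], len(seen[column])) for column in columns}
```

Calling column_counts(rows, ["item_id", "employer", "description"]) on the full file should show which interpretation, if either, matches the reported numbers.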

Usage

This dataset is ideal for various applications, including:
  • Detecting new trends in employers, occupations, and employment terminology.
  • Automatic error correction of employer entries.
  • Converting plural forms of entities to singular forms.
  • Training Named Entity Recognition (NER) models to identify employer names.
  • Building Question/Answer models that can understand and respond to queries about employers.
  • Improving the accuracy of FastText language detection models.
  • Assessing FastText accuracy with limited data.
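
For the error-correction and plural-to-singular use cases, a reasonable first step might be to see which label_error_reason values occur and to pull out the rows tagged with a given reason. The sketch below does this with the standard library only. It assumes the column holds short strings such as 'domain' or 'plural', as described in the column list, and is empty otherwise; the helper names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def _reason(row: Dict[str, str]) -> str:
    """Return the trimmed label_error_reason, or '' when it is absent."""
    return (row.get("label_error_reason") or "").strip()


def reason_counts(rows: List[Dict[str, str]]) -> Counter:
    """Count how often each non-empty label_error_reason occurs."""
    return Counter(_reason(row) for row in rows if _reason(row))


def rows_with_reason(rows: List[Dict[str, str]], reason: str) -> List[Dict[str, str]]:
    """Return the rows whose label_error_reason equals the given reason."""
    return [row for row in rows if _reason(row) == reason]
```

For example, rows_with_reason(rows, "plural") would collect candidate entries for a singularisation step, which would then need its own rules or model.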

Coverage

The dataset's coverage is global, drawing data from a Wikidata dump dated 2 February 2020. It includes employer entries from various linguistic contexts, as indicated by the language_detected column, showcasing multilingual employer names and descriptions. The content primarily focuses on entities and organisations that meet the definition of an employer, rather than specific demographic groups.

License

CC BY-SA

Who Can Use It

This dataset is suitable for:
  • Data scientists and machine learning engineers working on natural language processing tasks.
  • Researchers interested in data quality, entity resolution, and knowledge graph analysis.
  • Developers building applications that require accurate employer information.
  • Anyone needing to clean and validate employer data for various analytical or operational purposes.

Dataset Name Suggestions

  • Wikidata Labeled Employers
  • ML-Ready Wikidata Employer Data
  • Cleaned Wikidata Employer References
  • Global Employer Dataset (Wikidata)
  • Validated Employer Entities

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 17/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0