Opendatabay APP

Global Named Entity Recognition Training Set

Annotation & Labeling Tasks

Tags and Keywords

NER

Tokens

NLP

Classification

Text

Free

About

Identifying and classifying entities within a text corpus is a foundational task in natural language processing. The labelled examples provided here support the extraction of names of individuals, organisations, and geographical locations, as well as dates and monetary values. Each token carries an entity label, giving models a structured signal that links raw text to its meaning. The records have been curated to support the development of systems that accurately parse diverse types of information in running text.

Columns

  • id: A unique numerical identifier for each entry in the collection.
  • tokens: Individual words or strings that act as the primary building blocks for textual analysis and entity identification.
  • ner_tags: Categorical labels assigned to each token to indicate the specific entity type, such as a person name, location, or monetary value.
  • redundant variations: Additional representations of the tokens and tags, included so that they can be cross-checked against the core columns for consistency during model development.
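
As a rough illustration of how the core columns might be read, the sketch below parses a small in-memory sample with the Python standard library. The sample rows, the tag names (B-PER, B-ORG, and so on), and the assumption that tokens and ner_tags are stored as stringified lists are all hypothetical; the actual files may use a different layout.

```python
import ast
import csv
import io

# Hypothetical sample mimicking the listed columns. The real files may store
# tokens and tags differently; the stringified-list layout is an assumption.
SAMPLE_CSV = '''id,tokens,ner_tags
0,"['Alice', 'joined', 'Acme', 'in', 'Paris']","['B-PER', 'O', 'B-ORG', 'O', 'B-LOC']"
1,"['It', 'cost', '$5']","['O', 'O', 'B-MONEY']"
'''


def read_rows(handle):
    """Yield (id, tokens, ner_tags) tuples with the list columns parsed."""
    for row in csv.DictReader(handle):
        tokens = ast.literal_eval(row["tokens"])   # safe parse of a list literal
        tags = ast.literal_eval(row["ner_tags"])
        yield int(row["id"]), tokens, tags


rows = list(read_rows(io.StringIO(SAMPLE_CSV)))
for row_id, tokens, tags in rows:
    # Pair each token with its label for inspection.
    print(row_id, list(zip(tokens, tags)))
```

To try this on a downloaded file, replace the in-memory sample with an open file handle, for example `open("train.csv", newline="", encoding="utf-8")`, after confirming the column layout.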

Distribution

The information is organised into three distinct CSV files: train.csv, validation.csv, and test.csv. The test file alone is approximately 52.74 MB and contains roughly 108,000 valid tokens. The collection maintains a 100% validity rate with no missing or mismatched entries reported for the core tokens. This resource is provided as a static archive with no future updates expected.
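
The validity figure above can be spot-checked after download. The following is a minimal sketch of such a check, assuming each row has already been parsed into an (id, tokens, tags) tuple; the function names and sample rows are hypothetical, and the three checks are one plausible reading of "missing or mismatched", not the marketplace's own definition.

```python
def find_invalid_rows(rows):
    """Return (row_id, reason) pairs for rows that fail basic checks."""
    problems = []
    for row_id, tokens, tags in rows:
        if not tokens:
            problems.append((row_id, "empty tokens"))
        elif len(tokens) != len(tags):
            problems.append((row_id, "token/tag length mismatch"))
        elif any(tok is None or str(tok).strip() == "" for tok in tokens):
            problems.append((row_id, "blank token"))
    return problems


def validity_rate(rows):
    """Fraction of rows with no detected problem (1.0 for an empty input)."""
    rows = list(rows)
    if not rows:
        return 1.0
    bad_ids = {row_id for row_id, _ in find_invalid_rows(rows)}
    return 1 - len(bad_ids) / len(rows)


# Hypothetical rows: two clean, one mismatched, one with a blank token.
sample = [
    (0, ["Alice", "works"], ["B-PER", "O"]),
    (1, ["Paris"], ["B-LOC", "O"]),
    (2, ["", "x"], ["O", "O"]),
    (3, ["Acme"], ["B-ORG"]),
]
print(find_invalid_rows(sample))
print(validity_rate(sample))
```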

Usage

This resource is suited to training deep learning models that recognise entities automatically in unstructured text. It can also serve as a benchmark for comparing algorithms and neural network architectures. Researchers can apply regular expressions or string matching to filter out specific tags, such as dates, and so focus their analysis on other entity types.
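
The tag filtering mentioned above could look something like the sketch below, which uses a regular expression to relabel date tags as non-entities. The BIO-style tag names (B-DATE, I-DATE) and the "O" outside label are assumptions for illustration; the tag scheme used in the files should be checked first.

```python
import re

# Assumed tag scheme: an optional B-/I- prefix followed by the entity type.
DATE_TAG = re.compile(r"^(?:[BI]-)?DATE$")


def mask_tags(tags, pattern=DATE_TAG, outside="O"):
    """Replace every tag matching `pattern` with the outside label."""
    return [outside if pattern.match(tag) else tag for tag in tags]


tags = ["B-PER", "O", "B-DATE", "I-DATE", "B-LOC"]
print(mask_tags(tags))  # ['B-PER', 'O', 'O', 'O', 'B-LOC']
```

Relabelling rather than deleting keeps the tag list aligned with the token list, so downstream code that zips the two together continues to work.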

Coverage

The scope includes a wide variety of named entity types, with a focus on distinguishing between persons, organisations, and locations. The data contains a large number of unique values (over 95,000 in the test set alone), giving broad coverage of language patterns. While the data is static, its structured format allows for language-specific research and the study of linguistic phenomena across different cultures.
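
Coverage figures like these can be explored with a simple tally. The sketch below counts distinct token strings and tagged tokens per entity type; treating "unique values" as distinct tokens is an assumption, as are the sample sentences and the BIO-style tag names.

```python
from collections import Counter


def summarise(rows):
    """Return (number of distinct tokens, tagged-token counts per entity type)."""
    token_counts = Counter()
    type_counts = Counter()
    for tokens, tags in rows:
        token_counts.update(tokens)
        for tag in tags:
            if tag != "O":
                # Strip an assumed B-/I- prefix to get the entity type.
                type_counts[tag.split("-", 1)[-1]] += 1
    return len(token_counts), type_counts


# Hypothetical sentences for illustration only.
sample = [
    (["Alice", "met", "Bob"], ["B-PER", "O", "B-PER"]),
    (["Alice", "in", "New", "York"], ["B-PER", "O", "B-LOC", "I-LOC"]),
]
unique_tokens, per_type = summarise(sample)
print(unique_tokens, dict(per_type))
```

Note that this counts tagged tokens, not entity spans, so a two-token location contributes two to its type's total.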

License

CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication

Who Can Use It

Machine learning researchers can use the data to experiment with new techniques for entity classification and improve the accuracy of automated systems. Linguists might study the distribution and patterns of different entity types to uncover variations in how information is presented in text. Additionally, software developers building information extraction tools can leverage the labelled tokens to refine their production models.

Dataset Name Suggestions

  • Named Entity Recognition (NER) Gold Standard Corpus
  • AUEB NLP Group Tagged Text Collection
  • Multiclass Entity Classification and Token Registry
  • Standardised NLP Tokens and NER Tags Archive
  • Global Named Entity Recognition Training Set

Attributes

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

29/12/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0

Download Dataset in ZIP Format