English Word POS Tag Dataset

Education & Learning Analytics

Tags and Keywords

Education

Text

Intermediate

NLP

Languages

Free

About

This dataset contains approximately 370,000 English words, each paired with a Part-of-Speech (POS) tag. The tags were generated by applying the NLTK POS tagger to an existing corpus of English words. It can support work in natural language processing (NLP), linguistic analysis, and educational technology that needs a word-level view of grammatical function.
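
The listing does not say how the tagger was invoked. As a minimal sketch, assuming each word was tagged on its own and indexed from 0, a table with the same three columns could be rebuilt as follows. The names `build_rows`, `tag_fn`, and `toy_tagger` are hypothetical; with NLTK installed, `tag_fn` could wrap `nltk.pos_tag`, which takes a list of tokens and returns `(token, tag)` pairs.

```python
from typing import Callable, Iterable, List, Tuple

def build_rows(words: Iterable[str],
               tag_fn: Callable[[str], str]) -> List[Tuple[int, str, str]]:
    """Return (index, word, pos_tag) rows, mirroring the dataset's columns.

    Starting the index at 0 is an assumption, not something the listing states.
    """
    return [(i, word, tag_fn(word)) for i, word in enumerate(words)]

# With NLTK and its tagger model installed, tag_fn could be (assumption):
#   import nltk
#   tag_fn = lambda w: nltk.pos_tag([w])[0][1]
# A stand-in tagger keeps this sketch runnable without NLTK.
def toy_tagger(word: str) -> str:
    return "NNS" if word.endswith("s") else "NN"

rows = build_rows(["apple", "apples"], toy_tagger)
print(rows)
```

A tagger that sees one word at a time has no sentence context to work with, which may be one reason noun tags dominate the dataset.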

Columns

  • index: A numerical identifier for each entry in the dataset.
  • word: The English word itself.
  • pos_tag: The Part-of-Speech tag assigned to the word, adhering to the Penn Treebank tag set. Examples include:
    • CC: coordinating conjunction
    • CD: cardinal number
    • DT: determiner
    • EX: existential there
    • FW: foreign word
    • IN: preposition/subordinating conjunction
    • JJ: adjective (e.g., large)
    • JJR: adjective, comparative (e.g., larger)
    • JJS: adjective, superlative (e.g., largest)
    • LS: list item marker
    • MD: modal (e.g., could, will)
    • NN: noun, singular
    • NNS: noun, plural
    • NNP: proper noun, singular
    • NNPS: proper noun, plural
    • PDT: predeterminer
    • POS: possessive ending (e.g., parent's)
    • PRP: personal pronoun (e.g., hers, himself)
    • PRP$: possessive pronoun (e.g., her, my)
    • RB: adverb (e.g., occasionally, swiftly)
    • RBR: adverb, comparative (e.g., faster)
    • RBS: adverb, superlative (e.g., most)
    • RP: particle (e.g., about)
    • SYM: symbol
    • TO: infinitive marker (e.g., to)
    • UH: interjection (e.g., goodbye)
    • VB: verb, base form (e.g., ask)
    • VBG: verb, gerund or present participle (e.g., judging)
    • VBD: verb, past tense (e.g., pleaded)
    • VBN: verb, past participle (e.g., reunified)
    • VBP: verb, non-3rd person singular present (e.g., wrap)
    • VBZ: verb, 3rd person singular present (e.g., bases)
    • WDT: wh-determiner (e.g., that, what)
    • WP: wh-pronoun (e.g., who)
    • WP$: possessive wh-pronoun
    • WRB: wh-adverb (e.g., how)
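
As a hedged sketch of how the columns might be read and checked, the snippet below parses a small in-memory sample and reports any tag that is not one of the 36 listed above. It assumes the file is a CSV whose header row uses the column names `index`, `word`, and `pos_tag`; the listing does not state the file format, and `read_rows` and `unknown_tags` are hypothetical helper names.

```python
import csv
import io

# The 36 Penn Treebank word-level tags listed in the Columns section.
PENN_TAGS = {
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBG", "VBD", "VBN", "VBP",
    "VBZ", "WDT", "WP", "WP$", "WRB",
}

def read_rows(handle):
    """Yield (word, pos_tag) pairs from a CSV with index, word, pos_tag columns."""
    for row in csv.DictReader(handle):
        yield row["word"], row["pos_tag"]

def unknown_tags(rows):
    """Return the sorted tags that fall outside PENN_TAGS."""
    return sorted({tag for _, tag in rows if tag not in PENN_TAGS})

# Toy sample, not taken from the dataset; "XX" is deliberately invalid.
sample = io.StringIO(
    "index,word,pos_tag\n0,apple,NN\n1,apples,NNS\n2,quickly,RB\n3,oops,XX\n"
)
bad = unknown_tags(read_rows(sample))
print(bad)
```

For the real file, the `io.StringIO` sample would be replaced by an `open(...)` call on the downloaded CSV.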

Distribution

The dataset comprises approximately 370,100 records, each consisting of an English word and its Part-of-Speech tag. Singular nouns (NN) make up 62% of the entries and plural nouns (NNS) 13%; all remaining tag types together account for about 24% (the percentages are rounded, so they sum to 99%). Each record is a single word-to-tag mapping.
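
The tag shares quoted above can be recomputed from the `pos_tag` column. The following is a minimal sketch using toy data rather than the dataset itself; `tag_shares` is a hypothetical helper name.

```python
from collections import Counter

def tag_shares(tags):
    """Return {tag: percentage of entries}, most frequent first, one decimal place."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: round(100 * n / total, 1) for tag, n in counts.most_common()}

# Toy data, not taken from the dataset.
sample_tags = ["NN"] * 6 + ["NNS"] * 3 + ["JJ"]
shares = tag_shares(sample_tags)
print(shares)
```

Because each share is rounded independently, the percentages need not sum to exactly 100.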

Usage

This dataset is well-suited for a variety of applications, including:
  • Natural Language Processing (NLP): Useful as a lexical resource when training or evaluating models for tasks such as POS tagging, text classification, and grammar analysis.
  • Linguistic Research: Facilitates the study of English grammatical structures, word morphology, and syntactic patterns.
  • Educational Tools: Suitable for developing language-learning apps, grammar checkers, and vocabulary-building exercises.
  • Text Mining and Analysis: Enables deeper insights into unstructured text by identifying the grammatical role of individual words.
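
One simple way to put a word-to-tag table to work is a lookup tagger, sketched below under stated assumptions: the words are stored in lowercase, and unknown words fall back to NN because it is the most common tag in the dataset. `make_lookup_tagger` and the tiny `lexicon` are hypothetical; in practice the dictionary would be built from the `word` and `pos_tag` columns.

```python
def make_lookup_tagger(word_to_tag, default="NN"):
    """Build a tagger that looks each token up in a word-to-tag dictionary."""
    def tag(tokens):
        # Lowercasing assumes the lexicon stores lowercase words.
        return [(tok, word_to_tag.get(tok.lower(), default)) for tok in tokens]
    return tag

# Toy lexicon, not taken from the dataset.
lexicon = {"the": "DT", "cat": "NN", "sleeps": "VBZ"}
tagger = make_lookup_tagger(lexicon)
result = tagger(["The", "cat", "sleeps", "soundly"])
print(result)
```

A lookup tagger assigns one fixed tag per word and ignores context, so it is best treated as a baseline or a feature source, not as a replacement for a contextual tagger.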

Coverage

The dataset covers the English language only. The words carry no geographic, temporal, or demographic restrictions, so it serves as a general-purpose English word corpus with global relevance.

License

CC0

Who Can Use It

  • AI and Machine Learning Developers: For creating and improving NLP models and algorithms.
  • Linguists and Academic Researchers: For conducting scholarly investigations into English grammar and lexicography.
  • Educators and Students: For teaching, learning, and developing educational resources related to language arts.
  • Data Scientists and Analysts: For preparing and enriching text data in diverse analytical projects.
  • Software Developers: Especially those creating applications with language processing functionalities.

Dataset Name Suggestions

  • English Word POS Tag Dataset
  • 370k English Word Corpus with Grammatical Tags
  • Annotated English Lexicon for NLP
  • NLTK Tagged English Words
  • Part-of-Speech Tagged English Dictionary

Attributes

Original Data Source: 370k English words corpus

Listing Stats

Views: 1
Downloads: 0
Listed: 16/06/2025
Region: Global
UDQS (Universal Data Quality Score): 5 / 5
Version: 1.0