Bulgarian PoS and Lemma Dataset

Education & Learning Analytics

Tags and Keywords

NLP

Languages

Bulgarian

Free

About

This dataset addresses the limited availability of data for the Bulgarian language, particularly on platforms such as Kaggle. It contains scraped Bulgarian words in their various forms and is intended as a foundational resource for natural language processing tasks. Its primary purpose is to support part-of-speech tagging, lemmatisation, and exploratory data analysis for Bulgarian.

Columns

  • word: The scraped Bulgarian word string itself.
  • lemma: The lemma or basic form of the word.
  • form: The specific grammatical form of the word as it appears in the 'word' column.
  • pos: The part of speech assigned to the word.

Distribution

The dataset is provided as a single CSV file, bg-pos.csv. It includes a substantial collection of Bulgarian words in their various forms. Specific numbers for rows or records are not available.
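A minimal sketch of reading the file with Python's standard library is shown here. It assumes that bg-pos.csv is UTF-8 encoded, comma-delimited, and has a header row using the four column names listed above; none of this is confirmed by the listing. The function names `read_rows`, `load_file`, and `pos_counts` are hypothetical helpers introduced for illustration.

```python
import csv
from collections import Counter
from typing import TextIO

# Column names as listed in the "Columns" section of this listing.
EXPECTED_COLUMNS = ("word", "lemma", "form", "pos")


def read_rows(handle: TextIO) -> list[dict[str, str]]:
    """Read all rows from an open CSV handle, checking the header first."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def load_file(path: str) -> list[dict[str, str]]:
    """Open the CSV at `path` (assumed UTF-8) and return its rows."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_rows(handle)


def pos_counts(rows: list[dict[str, str]]) -> Counter:
    """Count how many rows carry each part-of-speech label."""
    return Counter(row["pos"] for row in rows)
```

Calling `load_file("bg-pos.csv")` and then `len(rows)` and `pos_counts(rows).most_common(10)` would give the row count that the listing does not state, along with a first look at the tag distribution. If the file turns out to use a different delimiter or encoding, the `csv.DictReader` and `open` arguments would need adjusting.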

Usage

This dataset is ideal for:
  • Exploratory data analysis (EDA) of Bulgarian language structures.
  • Developing and training models for part-of-speech tagging and recognition for Bulgarian text.
  • Implementing and improving lemmatisation algorithms for the Bulgarian language.
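
As a rough illustration of the lemmatisation use case, the sketch below builds a dictionary-lookup baseline from rows shaped like the dataset's columns. It is a simple baseline under stated assumptions, not a method described by the dataset's author: lowercasing the word before lookup, choosing the most frequent lemma for ambiguous forms, and returning unknown tokens unchanged are all illustrative choices. The names `build_lemma_lookup` and `lemmatise` are hypothetical.

```python
from collections import Counter, defaultdict


def build_lemma_lookup(rows: list[dict[str, str]]) -> dict[str, str]:
    """Map each lowercased word form to its most frequent lemma."""
    candidates: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        candidates[row["word"].lower()][row["lemma"]] += 1
    # For forms with several lemmas, keep the one seen most often
    # (an assumption; a real system would use context or the pos column).
    return {
        word: counts.most_common(1)[0][0]
        for word, counts in candidates.items()
    }


def lemmatise(tokens: list[str], lookup: dict[str, str]) -> list[str]:
    """Replace each token with its lemma; unknown tokens pass through as-is."""
    return [lookup.get(token.lower(), token) for token in tokens]
```

A lookup of this kind ignores context, so word forms that belong to more than one lemma or part of speech will sometimes be resolved incorrectly. It can still serve as a reference point when evaluating trained taggers or lemmatisers.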

Coverage

The dataset focuses exclusively on the Bulgarian language and is described as covering almost all words in their various forms. It is available globally and is relevant to anyone working with Bulgarian linguistic data. No specific time range or demographic scope is noted for the data.

License

CC0

Who Can Use It

  • Linguists and researchers studying Bulgarian morphology and syntax.
  • Data scientists and machine learning engineers developing NLP applications for the Bulgarian language.
  • Academics and students in fields such as computational linguistics, artificial intelligence, and language studies.
  • Anyone interested in the structural analysis of the Bulgarian language.

Dataset Name Suggestions

  • Bulgarian PoS and Lemma Dataset
  • Bulgarian Word Forms Collection
  • Bulgarian Language Morphology Data
  • Bulgarian NLP Starter Pack

Attributes

Original Data Source: Bulgarian Part Of Speech Dataset

Listing Stats

LISTED

27/06/2025

REGION

GLOBAL

Universal Data Quality Score (UDQS)

5 / 5

VERSION

1.0
