Bulgarian PoS and Lemma Dataset
Education & Learning Analytics
Free
About
This dataset aims to address the limited availability of data for the Bulgarian language, particularly on platforms like Kaggle. It contains a collection of Bulgarian words in various forms, scraped to provide a foundational resource for natural language processing tasks. The primary purpose is to facilitate part-of-speech tagging, lemmatisation, and exploratory data analysis for the Bulgarian language.
Columns
- word: The word string itself, i.e. the surface form of each scraped Bulgarian word.
- lemma: The lemma or basic form of the word.
- form: The specific grammatical form of the word as it appears in the 'word' column.
- pos: The part of speech assigned to the word.
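As a rough illustration of how the four columns fit together, the sketch below reads rows with Python's standard `csv` module and checks that the documented column names are present. It assumes the file has a header row, is comma-delimited, and is UTF-8 encoded; none of this is confirmed by the listing. The sample rows, including the `form` and `pos` values, are hypothetical and only stand in for the real data.

```python
import csv
import io

# Column names as documented in the listing.
EXPECTED_COLUMNS = ("word", "lemma", "form", "pos")


def read_rows(handle):
    """Read rows from an open CSV handle and check the documented columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# Hypothetical sample rows; the real values and tag names in bg-pos.csv may differ.
SAMPLE = io.StringIO(
    "word,lemma,form,pos\n"
    "книги,книга,plural,noun\n"
    "книгата,книга,singular definite,noun\n"
)

rows = read_rows(SAMPLE)
print(rows[0]["word"], "->", rows[0]["lemma"])

# With the real file (assuming UTF-8 and a header row):
# with open("bg-pos.csv", encoding="utf-8", newline="") as f:
#     rows = read_rows(f)
```

If the real file uses a different delimiter or has no header, the `csv.DictReader` call would need its `delimiter` or `fieldnames` argument set accordingly.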
Distribution
The dataset is provided as a single CSV file, bg-pos.csv. It includes a substantial collection of Bulgarian words in their various forms. Specific numbers for rows or records are not available.
Usage
This dataset is ideal for:
- Exploratory data analysis (EDA) of Bulgarian language structures.
- Developing and training models for part-of-speech tagging and recognition for Bulgarian text.
- Implementing and improving lemmatisation algorithms for the Bulgarian language.
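One simple starting point for the tagging and lemmatisation uses above is a dictionary lookup built from the word, lemma and pos columns. The sketch below is a baseline under stated assumptions, not a method described by the dataset's author: the function names `build_lookup` and `analyse` are hypothetical, the sample rows and their `form` and `pos` labels are invented for illustration, and ambiguous word forms are resolved by picking the most frequent (lemma, pos) pair.

```python
from collections import Counter, defaultdict


def build_lookup(rows):
    """Map each lowercased word form to its most frequent (lemma, pos) pair."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["word"].lower()][(row["lemma"], row["pos"])] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def analyse(token, lookup):
    """Return (lemma, pos) for a token, or (token, None) if the form is unseen."""
    return lookup.get(token.lower(), (token, None))


# Hypothetical rows in the documented column layout; real tag names may differ.
rows = [
    {"word": "книги", "lemma": "книга", "form": "plural", "pos": "noun"},
    {"word": "чета", "lemma": "чета", "form": "1st person singular", "pos": "verb"},
]

lookup = build_lookup(rows)
print(analyse("Книги", lookup))
print(analyse("непозната", lookup))
```

A lookup like this ignores sentence context, so it cannot disambiguate forms that belong to more than one lemma or part of speech. It is mainly useful as a baseline against which trained models can be compared.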
Coverage
The dataset covers the Bulgarian language only and is described as containing almost all words in their various forms. It is relevant to anyone working with Bulgarian linguistic data, regardless of location. No time range or demographic scope is specified for the data.
License
CC0
Who Can Use It
- Linguists and researchers studying Bulgarian morphology and syntax.
- Data scientists and machine learning engineers developing NLP applications for the Bulgarian language.
- Academics and students in fields such as computational linguistics, artificial intelligence, and language studies.
- Anyone interested in the structural analysis of the Bulgarian language.
Dataset Name Suggestions
- Bulgarian PoS and Lemma Dataset
- Bulgarian Word Forms Collection
- Bulgarian Language Morphology Data
- Bulgarian NLP Starter Pack
Attributes
Original Data Source: Bulgarian Part Of Speech Dataset