Bulgarian Poetry Dataset for NLP
Free
About
This dataset offers a collection of poems in Bulgarian, originally scraped from Chitanka.info. It serves as a valuable resource for text generation and author categorisation tasks within natural language processing.
Columns
The dataset is structured with three key columns:
- author: The name of the poem's author.
- title: The specific title of the poem.
- poem: The full text of the poem, where a special token denotes a newline.
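Because the poem column stores line breaks as a special token, the text usually needs to be unpacked before it is displayed or tokenised. The description does not name the token, so the sketch below takes it as a parameter; the value `<nl>` and the sample string are hypothetical placeholders, not taken from the file.

```python
def restore_line_breaks(poem: str, newline_token: str) -> str:
    """Replace every occurrence of the newline token with a real line break."""
    # Strip whitespace around each piece so verses do not keep stray blanks
    # that surrounded the token in the raw text.
    lines = [line.strip() for line in poem.split(newline_token)]
    return "\n".join(lines)


# Hypothetical token, used only for this illustration; inspect a few rows
# of the real file to find the token it actually uses.
EXAMPLE_TOKEN = "<nl>"
sample = "първи ред <nl> втори ред"
print(restore_line_breaks(sample, EXAMPLE_TOKEN))
```

Keeping the token as an explicit argument avoids hard-coding a guess about the file's format.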
Distribution
This dataset is provided as a single CSV file named chitanka-corpus.csv, with three columns. The exact number of rows is not specified, but the file contains data on over 15,000 unique authors and over 17,000 unique poem titles, indicating a substantial volume of literary works.
Usage
Ideal applications for this dataset include:
- Developing and training text generation models for Bulgarian poetry.
- Implementing and evaluating author categorisation algorithms.
- Linguistic research and analysis of Bulgarian literary styles.
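As a starting point for the author categorisation use case, the following is a minimal sketch of turning the rows into text and label pairs. It assumes the CSV has a header row with the column names author, title and poem; the helper names, the `min_poems` threshold and the in-memory demo rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from collections import Counter


def read_poems(handle):
    """Read rows with the author, title and poem columns from an open CSV file."""
    return list(csv.DictReader(handle))


def build_author_labels(rows, min_poems=2):
    """Keep authors with at least min_poems poems; pair each text with an integer label."""
    counts = Counter(row["author"] for row in rows)
    kept = sorted(author for author, n in counts.items() if n >= min_poems)
    label_of = {author: index for index, author in enumerate(kept)}
    # Drop rows whose author is too rare to learn from.
    selected = [row for row in rows if row["author"] in label_of]
    texts = [row["poem"] for row in selected]
    labels = [label_of[row["author"]] for row in selected]
    return texts, labels, label_of


# Tiny in-memory stand-in for chitanka-corpus.csv (invented rows, not real data).
demo_csv = io.StringIO(
    "author,title,poem\n"
    "A,t1,text one\n"
    "A,t2,text two\n"
    "B,t3,text three\n"
)
texts, labels, label_of = build_author_labels(read_poems(demo_csv), min_poems=2)
print(label_of, labels)
```

With the real file, the same functions could be fed from `open("chitanka-corpus.csv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded. Filtering out rare authors matters here because many authors are likely to have very few poems each.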
Coverage
The dataset focuses exclusively on Bulgarian poems and authors. Specific time ranges are not detailed. Notable authors include Борис Младенов-Young (5% of the entries in the author column) and Иван Вазов (4%), with the remaining 91% of entries attributed to various other authors.
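The author percentages can be rechecked with a few lines of code, assuming they describe each author's share of rows in the author column. This is a sketch under that assumption; the function name and the demo values are made up for illustration.

```python
from collections import Counter


def author_shares(authors, top_n=2):
    """Return the top_n most frequent authors with their share of all rows, in percent."""
    counts = Counter(authors)
    total = sum(counts.values())
    return [(name, 100.0 * n / total) for name, n in counts.most_common(top_n)]


# Invented author column: three rows for A, two for B, one for C.
demo_authors = ["A", "A", "A", "B", "B", "C"]
shares = author_shares(demo_authors)
for name, percent in shares:
    print(f"{name}: {percent:.1f}%")
```

On the real data, the input would be the list of values from the author column.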
License
CC0
Who Can Use It
This dataset is particularly suitable for:
- Data scientists and machine learning engineers working on natural language processing tasks.
- Linguists and literary scholars interested in Bulgarian language and poetry.
- Researchers developing new algorithms for text analysis and generation.
Dataset Name Suggestions
- Bulgarian Poems Corpus
- Chitanka Literary Collection
- Bulgarian Poetry Dataset for NLP
- Bulgarian Authorial Verse
Attributes
Original Data Source: Bulgarian Poems Dataset