Lotsawa House Tibetan Translation Dataset
Education & Learning Analytics
Free
About
This dataset provides pairs of sentences and phrases, each consisting of a Classical Tibetan text and its English translation. It was programmatically scraped, cleaned, and formatted from texts originally sourced from Lotsawa House. Its structure mimics the popular OPUS books dataset, so it can be used with existing code written for machine translation tutorials. It is a resource for proof-of-concept modelling in natural language processing and machine translation, and was used to train the 'billingsmoore/phonetic-tibetan-to-english-translation' model for the MLotsawa project. Because the data was assembled from texts with varying translation structures, its quality is suitable primarily for proof-of-concept work.
Columns
The dataset is available in two main forms. For typical data processing, it is provided as a CSV file with two distinct columns:
- bo: Contains the Classical Tibetan text.
- en: Contains the English translation of the Tibetan text.
Alternatively, it is also provided as a pickled pandas dataframe, featuring a single column named 'translation', where each entry is a Python dictionary structured as:
{'bo': 'Tibetan text', 'en': 'English text'}
Distribution
The dataset is distributed in two forms: a pickled pandas dataframe and a CSV file. Each record pairs a Classical Tibetan sentence or phrase with its English translation. The number of rows in the dataset is not specified.
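The two forms carry the same information, so one can be derived from the other. The following is a minimal sketch, using only the Python standard library, of turning 'bo'/'en' CSV rows into 'translation' dictionaries and back. The sample rows and the function names are assumptions made up for illustration; they are not taken from the dataset or its tooling.

```python
import csv
import io

# Illustrative stand-in for the CSV file. These two rows are invented for
# the example and are NOT taken from the dataset.
SAMPLE_CSV = "bo,en\nསངས་རྒྱས།,Buddha\nཆོས།,Dharma\n"


def csv_to_translation_records(csv_text):
    """Turn 'bo'/'en' CSV rows into records shaped like the pickled form."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"translation": {"bo": row["bo"], "en": row["en"]}} for row in reader]


def translation_records_to_rows(records):
    """Flatten 'translation' records back into (bo, en) tuples."""
    return [(r["translation"]["bo"], r["translation"]["en"]) for r in records]


records = csv_to_translation_records(SAMPLE_CSV)
rows = translation_records_to_rows(records)
print(len(records), "records converted")
```

With the real CSV file, the same reader could be fed from open(path, newline='', encoding='utf-8') instead of an in-memory string. The pickled form would normally be loaded with pandas.read_pickle; as with any pickle, that should only be done with files from a trusted source.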
Usage
This dataset is suited to machine learning and natural language processing work. Key use cases include:
- Machine translation tutorials and development, especially where compatibility with the OPUS books dataset is beneficial.
- Proof-of-concept modelling for new linguistic or translation algorithms.
- Training and fine-tuning machine learning models focused on Tibetan to English translation.
- Linguistic research and analysis of Classical Tibetan.
- Applications in AI & ML data projects requiring parallel text data.
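For the training and tutorial use cases above, a typical first step is to split the pairs into training and test sets and to build source/target strings for a sequence-to-sequence model. The sketch below shows one way this might look using only the standard library. The task prefix, the split fraction, the function names, and the placeholder records are all assumptions for illustration, not part of the dataset or of any particular tutorial.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split off a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def to_source_target(record, prefix="translate Tibetan to English: "):
    """Build model-ready strings; the prefix is a hypothetical task prompt."""
    pair = record["translation"]
    return {"source": prefix + pair["bo"], "target": pair["en"]}


# Placeholder records in the 'translation' layout (not real dataset rows).
records = [{"translation": {"bo": f"bo-{i}", "en": f"en-{i}"}} for i in range(10)]

train, test = train_test_split(records)
example = to_source_target(train[0])
print(len(train), len(test))
```

A fixed seed keeps the split reproducible between runs, which makes proof-of-concept results easier to compare.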
Coverage
The dataset's content is derived from Classical Tibetan texts, focusing on linguistic data for translation purposes. Information regarding specific geographic regions, time ranges, or demographic scopes is not available in the provided sources.
License
CC-BY-NC
Who Can Use It
This dataset is particularly useful for:
- Machine learning engineers and data scientists developing translation models or engaging in linguistic data processing.
- Researchers and academics in the fields of linguistics, Tibetan studies, and natural language processing.
- Students learning about machine translation or working on related academic projects.
- Developers looking for parallel text data for proof-of-concept AI applications.
- Professionals and enthusiasts involved in education and learning analytics related to languages.
Dataset Name Suggestions
- Classical Tibetan-English Parallel Corpus
- Lotsawa House Tibetan Translation Dataset
- Tibetan-English Sentence Pair Collection
- MLotsawa Translation Training Data
- Tibetan Linguistic Translation Set
Attributes
Original Data Source: Classical Tibetan to English Translation Dataset