Opendatabay APP

Lotsawa House Tibetan Translation Dataset

Education & Learning Analytics

Tags and Keywords

NLP

Languages

Linguistics

Translation

Tibetan

Free

About

This dataset contains paired sentences and phrases. Each pair holds a Classical Tibetan text and its English translation. The pairs were programmatically scraped, cleaned, and formatted from texts originally sourced from Lotsawa House. The structure mimics the popular OPUS books dataset, so the data is easy to use with existing code written for machine translation tutorials. The dataset was used to train the 'billingsmoore/phonetic-tibetan-to-english-translation' model for the MLotsawa project. Because the data was assembled from texts with varying translation structures, its quality is suitable primarily for proof-of-concept work in natural language processing and machine translation.

Columns

The dataset is available in two forms. For typical data processing, it is provided as a CSV file with two columns:
  • bo: The Classical Tibetan text.
  • en: The English translation of the Tibetan text.
It is also provided as a pickled pandas dataframe with a single column named 'translation', where each entry is a Python dictionary structured as {'bo': 'Tibetan text', 'en': 'English text'}.
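
The two forms carry the same information, so one can be derived from the other. The following is a minimal sketch of that conversion using pandas. The helper names to_nested and to_flat are hypothetical, the sample rows are placeholders rather than real dataset content, and the file names in the comments are assumptions, since the listing does not state them.

```python
import pandas as pd

# In practice the frames would be loaded from the downloaded files, e.g.
#   flat = pd.read_csv("data.csv")        # hypothetical file name
#   nested = pd.read_pickle("data.pkl")   # hypothetical file name
# Placeholder rows are used here so the sketch runs on its own.
flat = pd.DataFrame(
    {
        "bo": ["Tibetan text 1", "Tibetan text 2"],
        "en": ["English text 1", "English text 2"],
    }
)


def to_nested(flat_df: pd.DataFrame) -> pd.DataFrame:
    """Turn 'bo'/'en' columns into one 'translation' column of dicts."""
    records = flat_df[["bo", "en"]].to_dict(orient="records")
    return pd.DataFrame({"translation": records})


def to_flat(nested_df: pd.DataFrame) -> pd.DataFrame:
    """Expand the 'translation' dicts back into 'bo' and 'en' columns."""
    return pd.DataFrame(nested_df["translation"].tolist(), columns=["bo", "en"])


nested = to_nested(flat)
restored = to_flat(nested)
```

Unpickling can execute arbitrary code, so the pickled dataframe should only be loaded from a trusted download; the CSV form avoids that concern.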

Distribution

The dataset is distributed as two files with the same content: a pickled pandas dataframe and a CSV file. Each record pairs a Classical Tibetan sentence or phrase with its English translation. The listing does not state the number of rows.

Usage

This dataset is intended for machine learning and natural language processing work, mainly at the proof-of-concept level. Key use cases include:
  • Machine translation tutorials and development, especially where compatibility with the OPUS books dataset is beneficial.
  • Proof-of-concept modelling for new linguistic or translation algorithms.
  • Training and fine-tuning machine learning models focused on Tibetan to English translation.
  • Linguistic research and analysis of Classical Tibetan.
  • Applications in AI & ML data projects requiring parallel text data.
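
Before training or fine-tuning on the pairs, it is common to drop unusable rows and hold out a test set. The following is a minimal sketch of one way to do that with pandas, assuming the two-column CSV form. The function clean_and_split, its parameters, and the placeholder rows are all hypothetical and are not part of the dataset or of any library.

```python
import pandas as pd


def clean_and_split(pairs: pd.DataFrame, test_fraction: float = 0.1, seed: int = 0):
    """Drop empty or duplicate pairs, shuffle, and split into train and test."""
    cleaned = pairs.dropna(subset=["bo", "en"])
    # Remove rows where either side is only whitespace.
    non_empty = (cleaned["bo"].str.strip() != "") & (cleaned["en"].str.strip() != "")
    cleaned = cleaned[non_empty].drop_duplicates(subset=["bo", "en"])
    # A fixed seed keeps the shuffle reproducible between runs.
    shuffled = cleaned.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test = shuffled.iloc[:n_test].reset_index(drop=True)
    train = shuffled.iloc[n_test:].reset_index(drop=True)
    return train, test


# Placeholder rows: ten valid pairs, one duplicate, one blank, one missing.
rows = [(f"Tibetan text {i}", f"English text {i}") for i in range(10)]
rows += [("Tibetan text 0", "English text 0"), ("   ", "English text x"), (None, "English text y")]
pairs = pd.DataFrame(rows, columns=["bo", "en"])

train, test = clean_and_split(pairs, test_fraction=0.2, seed=0)
```

Exact-match deduplication is a deliberately simple choice here; it will not catch near-duplicate passages, which may matter for texts that repeat formulaic phrases.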

Coverage

The dataset's content is derived from Classical Tibetan texts and their English translations. The listing gives no information on geographic regions, time ranges, or demographic scope.

License

CC-BY-NC

Who Can Use It

This dataset is particularly useful for:
  • Machine learning engineers and data scientists developing translation models or engaging in linguistic data processing.
  • Researchers and academics in the fields of linguistics, Tibetan studies, and natural language processing.
  • Students learning about machine translation or working on related academic projects.
  • Developers looking for parallel text data for proof-of-concept AI applications.
  • Professionals and enthusiasts involved in education and learning analytics related to languages.

Dataset Name Suggestions

  • Classical Tibetan-English Parallel Corpus
  • Lotsawa House Tibetan Translation Dataset
  • Tibetan-English Sentence Pair Collection
  • MLotsawa Translation Training Data
  • Tibetan Linguistic Translation Set

Attributes

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

27/06/2025

REGION

GLOBAL

UDQS QUALITY

5 / 5

VERSION

1.0
