Opendatabay APP

Bd Indigenous Languages Dataset

Tags and Keywords

  • Tabular
  • Classification
  • NLP
  • LSTM
  • SVM

Free

About

This dataset contains text entries in various ethnic languages of Bangladesh. It is valuable for tasks related to multilingual Natural Language Processing (NLP), language classification, and the preservation of underrepresented languages. The dataset includes 4,713 entries.

Columns

  • Converted Text: This column holds short text samples in various ethnic languages.
  • Language: This column provides the corresponding language label for each text sample. It includes six distinct languages: Chakma, Marma, Tripura, Santali, Garo, and Rakhine.
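As a minimal sketch of reading this two-column layout, the snippet below uses Python's standard `csv` module. It assumes the CSV header uses exactly the column names listed above and that the file is UTF-8 encoded; neither is confirmed by the listing. The rows in `sample_csv` are placeholders, not real samples from the dataset.

```python
import csv
import io

# Column names as given in the listing; assumed to match the CSV header exactly.
EXPECTED_COLUMNS = ("Converted Text", "Language")


def load_rows(fileobj):
    """Read (text, language) pairs from an open CSV file object."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [(row["Converted Text"], row["Language"]) for row in reader]


# Placeholder rows standing in for the real file; the text values are not real samples.
sample_csv = io.StringIO(
    "Converted Text,Language\n"
    "placeholder text 1,Chakma\n"
    "placeholder text 2,Marma\n"
)
rows = load_rows(sample_csv)
print(len(rows), rows[0][1])  # prints: 2 Chakma
```

With the downloaded file, the same function could be called on `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved.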

Distribution

The dataset consists of 4,713 rows and two columns. Chakma accounts for 22% of entries and Marma for 20%; the remaining four languages (Tripura, Santali, Garo, and Rakhine) together make up 57%. These figures are rounded, so they sum to 99% rather than 100%. The data is available in CSV format.
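The per-language shares can be recomputed from the Language column once the file is loaded. The sketch below is illustrative only: `label_shares` is a hypothetical helper, the `toy` labels are made up, and the implied row counts are approximations derived from the listing's rounded percentages rather than exact figures.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a percentage (one decimal)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy labels for illustration only; this is not the real distribution.
toy = ["Chakma"] * 2 + ["Marma"] + ["Garo"]
print(label_shares(toy))  # {'Chakma': 50.0, 'Marma': 25.0, 'Garo': 25.0}

# Approximate row counts implied by the listing's rounded percentages.
TOTAL_ROWS = 4713
listed_pct = {"Chakma": 22, "Marma": 20, "Other": 57}
implied = {name: round(TOTAL_ROWS * pct / 100) for name, pct in listed_pct.items()}
print(implied)  # {'Chakma': 1037, 'Marma': 943, 'Other': 2686}
```

The implied counts sum to 4,666 rather than 4,713 because the listed percentages are rounded.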

Usage

Ideal applications and use cases for this dataset include:
  • Developing and testing language classification models.
  • Research in multilingual NLP, particularly for less-resourced languages.
  • Projects focused on the preservation and analysis of indigenous languages.
  • Training machine learning algorithms for text analysis in specific Bangladeshi ethnic languages.
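To make the language classification use case concrete, here is a dependency-free sketch of a character n-gram naive Bayes classifier. It is not the method used by the dataset's authors, and a real project would more likely use an established library (the listing's tags mention SVM and LSTM). The class name `NgramNaiveBayes`, the bigram setting, and the toy strings and labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of samples
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_samples = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_samples in self.label_counts.items():
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(n_samples / total_samples)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy strings and labels; real training data would come from the
# "Converted Text" and "Language" columns.
texts = ["abab abab", "baba abab", "xyxy xyxy", "yxyx xyxy"]
labels = ["A", "A", "B", "B"]
model = NgramNaiveBayes(n=2).fit(texts, labels)
print(model.predict("abab"), model.predict("xyxy"))  # prints: A B
```

Character n-grams are a common choice for language identification because they need no tokenizer, which helps when the languages use different scripts or lack standard word segmentation tools.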

Coverage

This dataset covers text samples from several ethnic languages spoken in Bangladesh, specifically Chakma, Marma, Tripura, Santali, Garo, and Rakhine. The focus is on the linguistic diversity within Bangladesh. No specific time range for data collection is provided.

License

CC0

Who Can Use It

This dataset is suitable for:
  • NLP Researchers: To build and evaluate language models for underrepresented languages.
  • Linguists: For academic study and documentation of Bangladeshi ethnic languages.
  • Machine Learning Engineers: To train and deploy language classification systems.
  • Academics: For educational and research purposes related to language diversity and digital humanities.
  • Developers: To integrate language awareness into applications targeting diverse linguistic groups.

Dataset Name Suggestions

  • Bangladeshi Ethnic Languages Text Collection
  • Bd Indigenous Languages Dataset
  • Bangladesh Minority Languages Text Corpus
  • Chakma Marma Tripura Santali Garo Rakhine Text Dataset

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 16/06/2025
  • Region: Asia
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
