Bd Indigenous Languages Dataset
Free
About
This dataset contains text entries in various ethnic languages of Bangladesh. It is valuable for tasks related to multilingual Natural Language Processing (NLP), language classification, and the preservation of underrepresented languages. The dataset includes 4,713 entries.
Columns
- Converted Text: This column holds short text samples in various ethnic languages.
- Language: This column provides the corresponding language label for each text sample. It includes six distinct languages: Chakma, Marma, Tripura, Santali, Garo, and Rakhine.
Distribution
The dataset consists of 4,713 rows and two columns. Chakma accounts for about 22% of entries and Marma for about 20%; the remaining four languages (Tripura, Santali, Garo, and Rakhine) together make up about 57%. The data is typically available in CSV format.
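The per-language shares can be checked directly from the CSV. The following is a minimal sketch, assuming the file's header row uses the column names listed above (`Converted Text` and `Language`); the exact header spelling is an assumption, and the in-memory rows are placeholders standing in for the real file, not actual dataset content.

```python
import csv
import io
from collections import Counter


def label_shares(csv_file, label_column="Language"):
    """Return {label: share of rows} for a CSV file-like object."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Placeholder rows standing in for the real file (not actual dataset content).
sample = io.StringIO(
    "Converted Text,Language\n"
    "placeholder one,Chakma\n"
    "placeholder two,Chakma\n"
    "placeholder three,Marma\n"
    "placeholder four,Garo\n"
)
shares = label_shares(sample)

# Print labels from most to least frequent.
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {share:.0%}")
```

To run this on the downloaded file, pass an opened file instead of the placeholder, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.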
Usage
Ideal applications and use cases for this dataset include:
- Developing and testing language classification models.
- Research in multilingual NLP, particularly for less-resourced languages.
- Projects focused on the preservation and analysis of indigenous languages.
- Training machine learning algorithms for text analysis in specific Bangladeshi ethnic languages.
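As an illustration of the language classification use case, here is a minimal baseline sketch: a multinomial naive Bayes classifier over character trigrams, written with the standard library only. The choice of technique is an assumption made for illustration, not a method prescribed by the dataset; the class name `NgramNaiveBayes`, the toy strings, and the labels `lang_a` and `lang_b` are hypothetical placeholders, not dataset content.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded with a space at each end."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of samples
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data; with the real CSV, texts and labels would come from
# the "Converted Text" and "Language" columns respectively.
texts = ["aaa aab aba", "aab aaa", "zzy zyz", "zyz zzz"]
labels = ["lang_a", "lang_a", "lang_b", "lang_b"]
model = NgramNaiveBayes(n=3).fit(texts, labels)
pred_a = model.predict("aab")
pred_b = model.predict("zzy")
print(pred_a, pred_b)
```

Character n-grams are used here because they need no tokenizer, which is convenient when word segmentation tools for a language are scarce. For a meaningful evaluation, some rows should be held out from training and used only for testing.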
Coverage
This dataset covers text samples from several ethnic languages spoken in Bangladesh, specifically Chakma, Marma, Tripura, Santali, Garo, and Rakhine. The focus is on the linguistic diversity within Bangladesh. No specific time range for data collection is provided.
License
CC0
Who Can Use It
This dataset is suitable for:
- NLP Researchers: To build and evaluate language models for underrepresented languages.
- Linguists: For academic study and documentation of Bangladeshi ethnic languages.
- Machine Learning Engineers: To train and deploy language classification systems.
- Academics: For educational and research purposes related to language diversity and digital humanities.
- Developers: To integrate language awareness into applications targeting diverse linguistic groups.
Dataset Name Suggestions
- Bangladeshi Ethnic Languages Text Collection
- Bd Indigenous Languages Dataset
- Bangladesh Minority Languages Text Corpus
- Chakma Marma Tripura Santali Garo Rakhine Text Dataset
Attributes
Original Data Source: Bd Ethnic Languages Classification