Language identification
Education & Learning Analytics
Free
About
This dataset contains multilingual text in which each record pairs a sentence or phrase with a label identifying its language. It can be used for language detection, translation, and other natural language processing (NLP) tasks. The sentences span a diverse set of languages, offering a realistic challenge for multilingual text analysis and model training.
Dataset Features:
- LI_ID: A unique identifier for each record in the dataset.
- Labels: The language label of the text, represented using ISO 639-1 language codes (e.g., pt for Portuguese, bg for Bulgarian).
- Text: A sentence or phrase in the corresponding language, providing content for NLP tasks.
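The three fields above can be checked when loading the data. The following is a minimal sketch, assuming the dataset is distributed as a CSV file whose header uses the column names LI_ID, Labels, and Text exactly as listed, and that every label is a plain two-letter code. The sample rows and the load_records helper are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """LI_ID,Labels,Text
1,pt,Bom dia a todos.
2,bg,Добро утро на всички.
3,tr,Herkese günaydın.
"""

EXPECTED_COLUMNS = ("LI_ID", "Labels", "Text")


def load_records(handle):
    """Read rows and check that each one has the three documented fields."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in reader:
        label = row["Labels"].strip().lower()
        # Assumption: labels are ISO 639-1 codes, i.e. two ASCII letters.
        if not (len(label) == 2 and label.isascii() and label.isalpha()):
            raise ValueError(f"unexpected label {label!r} for LI_ID {row['LI_ID']}")
        records.append({"LI_ID": row["LI_ID"], "Labels": label, "Text": row["Text"]})
    return records


# For a real file, pass open(path, newline="", encoding="utf-8") instead.
records = load_records(io.StringIO(SAMPLE_CSV))
```

Failing loudly on a malformed label is a design choice that suits a small sketch; a production pipeline might log and skip such rows instead.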
Usage:
This dataset is ideal for multilingual NLP tasks such as:
- Training and testing language detection models.
- Fine-tuning translation systems.
- Analysing language structure and word usage across different languages.
- Benchmarking multilingual NLP algorithms.
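As one illustration of the first use case, a language detection baseline can be sketched with character trigrams and naive Bayes. This is a generic textbook approach, not a method prescribed by the dataset. The class name NgramLanguageIdentifier and the training sentences are hypothetical, and a real experiment would train on the Text and Labels columns with a held-out test split.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences for illustration; not taken from the dataset.
train_texts = [
    "Bom dia, como você está?",
    "Obrigado pela ajuda de hoje.",
    "Günaydın, nasılsın?",
    "Yardımın için teşekkür ederim.",
    "Dzień dobry, jak się masz?",
    "Dziękuję za pomoc.",
]
train_labels = ["pt", "pt", "tr", "tr", "pl", "pl"]

model = NgramLanguageIdentifier().fit(train_texts, train_labels)
prediction = model.predict("Obrigado, bom dia")
```

Character n-grams avoid the need for word segmentation, which matters for languages such as Chinese and Thai that do not separate words with spaces.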
Coverage:
The dataset includes sentences in multiple languages, among them Portuguese, Bulgarian, Chinese, Thai, Russian, Polish, Urdu, Swahili, and Turkish. The content ranges from formal statements to colloquial expressions.
License:
CC0 (Public Domain)
Who Can Use It:
This dataset is intended for NLP researchers, machine learning practitioners, linguists, and students interested in multilingual text processing.
How to Use It:
- Train language identification models to detect the language of a given text.
- Perform text preprocessing and cleaning to develop pipelines for multilingual data.
- Fine-tune translation or sequence-to-sequence models for specific language pairs.
- Explore linguistic patterns and differences across various languages.
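The preprocessing and cleaning step mentioned above might look like the following minimal sketch. It assumes that Unicode NFC normalization and whitespace collapsing are appropriate for the data, which the dataset description does not state. The helper names clean_text and clean_records are hypothetical.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Normalize to NFC, turn control characters into spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Only category Cc (control) is replaced. Format characters (Cf) are kept
    # because some, such as the zero-width non-joiner, are meaningful in
    # scripts like the one used for Urdu.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    return _WHITESPACE.sub(" ", text).strip()


def clean_records(records):
    """Clean the Text field of each record and drop records left empty."""
    cleaned = []
    for record in records:
        text = clean_text(record["Text"])
        if text:
            cleaned.append({**record, "Text": text})
    return cleaned


# Made-up example: a decomposed accent, a tab, and trailing whitespace.
example = clean_text("  Ola\u0301,\tmundo \n")
```

Lowercasing and punctuation removal are deliberately left out, since casing and punctuation can carry signal for language identification and do not apply uniformly across scripts.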