Global Text Language Data
Data Science and Analytics
Free
About
This is a small language detection dataset. It consists of textual details across 17 different languages, designed to facilitate the creation of Natural Language Processing (NLP) models for predicting language from text.
Columns
- Text: This column contains the raw text whose language is to be identified.
- Language: This column gives the language label for the corresponding entry in the 'Text' column. The source reports 10,267 unique values for it, which conflicts with the 17 languages the dataset covers; the figure more plausibly counts records or unique text entries than distinct labels.
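As a minimal sketch of how the two columns might be read and sanity-checked, the snippet below parses a small in-memory CSV with Python's standard `csv` module. The file name is not given in the source, so `SAMPLE_CSV` is a hypothetical stand-in, and its rows are invented for illustration; only the 'Text' and 'Language' headers come from the description above.

```python
import csv
import io

# Hypothetical stand-in for the real file; these rows are invented examples.
SAMPLE_CSV = """Text,Language
"Nature is what we call the natural world.",English
"La nature est le monde physique.",French
"La naturaleza es el mundo natural.",Spanish
"Nature is all around us.",English
"""


def load_rows(handle):
    """Read (text, language) pairs from a CSV with 'Text' and 'Language' headers."""
    reader = csv.DictReader(handle)
    missing = {"Text", "Language"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return [(row["Text"], row["Language"]) for row in reader]


rows = load_rows(io.StringIO(SAMPLE_CSV))
unique_texts = {text for text, _ in rows}
unique_languages = {language for _, language in rows}

# Row count, distinct texts, and distinct labels are three different numbers.
print(len(rows), len(unique_texts), len(unique_languages))
```

Running the same three counts on the real file is a quick way to tell apart the number of records, the number of distinct texts, and the number of distinct language labels.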
Distribution
The data file is typically provided in CSV format. The total number of rows is not stated explicitly. The figure of 10,267 unique values reported for the 'Language' column cannot be a count of distinct labels, since only 17 languages are present; it more likely reflects the number of records, which would put the dataset at approximately 10,267 rows. The 17 languages are English (13%), French (10%), Malayalam, Hindi, Tamil, Kannada, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The 15 languages other than English and French together account for the remaining 77% of the data, or 7,938 entries.
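Per-language shares like the ones quoted here can be recomputed from the label column. The following is a rough sketch using `collections.Counter`; the helper name `language_shares` and the label list are assumptions made for illustration and do not reproduce the real dataset counts.

```python
from collections import Counter


def language_shares(labels):
    """Return {language: (count, percent)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        lang: (n, round(100 * n / total, 1))
        for lang, n in counts.most_common()
    }


# Invented labels for illustration only; not the real distribution.
labels = ["English"] * 5 + ["French"] * 3 + ["Greek"] * 2
shares = language_shares(labels)

for lang, (n, pct) in shares.items():
    print(f"{lang}: {n} rows ({pct}%)")
```

Checking the shares before training is useful because an imbalanced label distribution can bias a classifier towards the most frequent languages.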
Usage
This dataset is well suited to developing and training Natural Language Processing (NLP) models for language prediction and identification. It can be utilised in data science and analytics applications such as text classification and deep learning projects, and as teaching material for working with multilingual data.
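To make the language prediction task concrete, here is a minimal sketch of one possible baseline: a multinomial naive Bayes classifier over character bigrams with add-one smoothing, written with the standard library only. This is an illustrative technique chosen for brevity, not a method described by the source. The class name `NgramLanguageModel` and the training sentences are hypothetical, and the three languages use different scripts, which makes this toy case much easier than separating, say, Spanish from Portuguese.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Lowercase the text and return its overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageModel:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> training samples
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, languages):
        for text, lang in zip(texts, languages):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[lang].update(grams)
            self.vocab.update(grams)
            self.doc_counts[lang] += 1
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[lang] / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Invented training sentences, one per language, for illustration only.
train_texts = [
    "the weather is nice today",
    "сегодня хорошая погода",
    "ο καιρός είναι καλός σήμερα",
]
train_langs = ["English", "Russian", "Greek"]

model = NgramLanguageModel(n=2).fit(train_texts, train_langs)
predictions = [
    model.predict(t) for t in ["nice weather", "хорошая погода", "καλός καιρός"]
]
print(predictions)
```

With the full dataset, the same `fit` and `predict` calls would take the 'Text' and 'Language' columns as inputs, and a held-out split would be needed to measure accuracy honestly.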
Coverage
The dataset's scope is global, encompassing text samples from 17 different languages. The languages covered are English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The sources do not provide specific details regarding a time range or demographic scope for the data content itself.
License
CC0
Who Can Use It
This dataset is primarily intended for data scientists, machine learning engineers, and researchers. It is suitable for anyone working on text analysis, building language recognition systems, or exploring multilingual data. Educational institutions and students can also benefit from this resource for learning and experimentation in NLP and artificial intelligence.
Dataset Name Suggestions
- Language Detection Dataset
- Multilingual Text Identifier
- NLP Language Predictor
- Global Text Language Data
- 17-Language Text Corpus
Attributes
Original Data Source: Language Detection