FREE DATASET LIBRARY

Verified Data Provider

£0

Global Text Language Data

Data Science and Analytics

Tags and Keywords

Computer

Text

NLP

Deep

NLTK

Global Text Language Data Dataset on Opendatabay data marketplace

About

This is a small language detection dataset. It consists of text samples in 17 different languages, designed to support building Natural Language Processing (NLP) models that predict the language of a given text.

Columns

Text: This column contains the raw text content for which language identification is to be performed.
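Language: This column specifies the actual language of the corresponding text in the 'Text' column. It holds one of 17 language labels; the listing's figure of 10,267 unique entries appears to count records or text samples rather than distinct languages.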
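
A minimal sketch of reading the two columns described above, using only the Python standard library. The column names 'Text' and 'Language' come from the listing; the function name `load_rows` and the in-memory sample rows are hypothetical stand-ins for the real CSV file, whose name the listing does not give.

```python
import csv
import io

# Column names as given in the listing.
EXPECTED_COLUMNS = ("Text", "Language")


def load_rows(fileobj):
    """Read (text, language) pairs from an open CSV file object."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record in reader:
        text = (record["Text"] or "").strip()
        language = (record["Language"] or "").strip()
        if text and language:  # skip rows with an empty text or label
            rows.append((text, language))
    return rows


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "Text,Language\n"
    '"Nature, in the broadest sense, is the physical world.",English\n'
    "La nature est belle,French\n"
)
rows = load_rows(sample)
print(len(rows), sorted({language for _, language in rows}))
```

For the downloaded file, the same function could be called on `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded, which the listing does not confirm.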

Distribution

The data file is provided in CSV format. The listing does not state an exact row count. It reports 10,267 unique values for the 'Language' column, but since only 17 languages are covered, that figure more plausibly describes the number of records or unique text samples, suggesting a dataset of roughly 10,000 rows. The 17 languages are English (13%), French (10%), Malayalam, Hindi, Tamil, Kannada, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The 15 languages other than English and French together make up the remaining 77% of the data, or 7,938 entries.
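
The per-language shares quoted above can be recomputed from the 'Language' column once the file is loaded. The sketch below shows one way to do that with `collections.Counter`, and also checks which total row counts are compatible with the listing's "7,938 entries is 77%" figure, assuming the percentage was rounded to the nearest whole number. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Map each language to (row count, percentage of all rows), largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: (n, 100.0 * n / total) for lang, n in counts.most_common()}


def implied_total_range(count, rounded_pct, half_width=0.5):
    """Range of totals for which `count` rounds to `rounded_pct` percent."""
    low = count / ((rounded_pct + half_width) / 100.0)
    high = count / ((rounded_pct - half_width) / 100.0)
    return low, high


# Made-up labels standing in for the real 'Language' column.
labels = ["English"] * 3 + ["French"] * 2 + ["Greek"]
dist = label_distribution(labels)
for lang, (n, pct) in dist.items():
    print(f"{lang}: {n} rows ({pct:.1f}%)")

# Figures taken from the listing: 7,938 entries described as 77% of the data.
low, high = implied_total_range(7938, 77)
print(f"implied total rows: between {low:.0f} and {high:.0f}")
```

Under that rounding assumption, the 10,267 figure falls inside the implied range, so it is at least consistent with being a record count.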

Usage

This dataset is ideal for developing and training Natural Language Processing (NLP) models focused on language prediction and identification. It can be utilised in various data science and analytics applications, including text classification, deep learning projects, and educational purposes for understanding multilingual data.
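
As an illustration of the language prediction task, here is a small baseline sketch: a multinomial naive Bayes classifier over character trigrams with add-one smoothing, written with the standard library only. This is not a method prescribed by the listing, the class name `NgramNaiveBayes` is hypothetical, and the four training sentences are made up; real use would train on the (Text, Language) pairs from the CSV.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercase the text, pad with spaces, and return its character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> sample count
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, samples):
        for text, language in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[language].update(grams)
            self.doc_counts[language] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for language, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_lang, best_score = language, score
        return best_lang


# Made-up training sentences; the real dataset would supply these pairs.
train = [
    ("the weather is nice today", "English"),
    ("this is a small dataset of text", "English"),
    ("le temps est beau aujourd'hui", "French"),
    ("ceci est un petit jeu de données", "French"),
]
model = NgramNaiveBayes().fit(train)
print(model.predict("the text is small"))
print(model.predict("le jeu est petit"))
```

Character n-grams are a common starting point for language identification because they need no tokenizer and work across scripts, which matters for a corpus that mixes Latin, Cyrillic, Arabic, Greek, and Indic scripts.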

Coverage

The dataset's scope is global, encompassing text samples from 17 different languages. The languages covered are English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The listing does not provide specific details regarding a time range or demographic scope for the data content itself.

License

CC0

Who Can Use It

This dataset is primarily intended for data scientists, machine learning engineers, and researchers. It is suitable for anyone working on text analysis, building language recognition systems, or exploring multilingual data. Educational institutions and students can also benefit from this resource for learning and experimentation in NLP and artificial intelligence.

Dataset Name Suggestions

Language Detection Dataset
Multilingual Text Identifier
NLP Language Predictor
Global Text Language Data
17-Language Text Corpus

Attributes

Original Data Source: Language Detection

Listing Stats

Listed: 05/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
