Global Text Language Data

Data Science and Analytics

Tags and Keywords

Computer, Text, NLP, Deep, NLTK

Free

About

This is a small language detection dataset. It contains text samples in 17 languages and is intended for building Natural Language Processing (NLP) models that predict the language of a given text.

Columns

  • Text: This column contains the raw text content for which language identification is to be performed.
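  • Language: This column gives the actual language of the corresponding text in the 'Text' column. The listing reports 10,267 entries for this column; since the dataset covers 17 languages, that figure most plausibly counts rows rather than distinct labels.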
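The two-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the file has a header row with exactly the names `Text` and `Language`; the sample rows and the `load_rows` helper are hypothetical and stand in for the real file.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file; the actual rows differ.
SAMPLE_CSV = """Text,Language
"Nature is beautiful, and we should protect it.",English
"La nature est belle, et nous devons la protéger.",French
"La naturaleza es hermosa y debemos protegerla.",Spanish
"""

def load_rows(handle):
    """Read (text, language) pairs from a CSV with 'Text' and 'Language' columns."""
    reader = csv.DictReader(handle)
    # Fail early if the header does not match the documented columns.
    missing = {"Text", "Language"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return [(row["Text"], row["Language"]) for row in reader]

rows = load_rows(io.StringIO(SAMPLE_CSV))
```

For the downloaded file, the same helper could be called on a handle opened with `newline=""` and `encoding="utf-8"`; the encoding is an assumption, as the listing does not state it.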

Distribution

The data file is provided in CSV format. The listing does not state an exact row count, but it reports 10,267 entries in the 'Language' column, which suggests roughly 10,267 records. The dataset covers 17 languages: English (13%), French (10%), Malayalam, Hindi, Tamil, Kannada, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The 15 languages other than English and French together account for the remaining 77% of the data (7,938 entries).
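Per-language shares like those quoted above can be recomputed from the 'Language' column once the file is loaded. The sketch below uses a hypothetical `language_shares` helper and toy labels (not taken from the dataset), and also checks that the listing's two figures, 7,938 of 10,267, are consistent with the stated 77%.

```python
from collections import Counter

def language_shares(labels):
    """Return {language: percentage of rows}, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders languages from largest to smallest share.
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}

# Toy labels for illustration only; not taken from the dataset.
toy_labels = ["English"] * 5 + ["French"] * 3 + ["Greek"] * 2
shares = language_shares(toy_labels)

# Consistency check on the figures quoted in the listing.
remaining_share = round(100 * 7938 / 10267, 1)  # about 77.3
```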

Usage

This dataset is suited to developing and training Natural Language Processing (NLP) models for language identification. It can be used in data science and analytics work such as text classification and deep learning projects, and in teaching settings for exploring multilingual data.
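As an illustration of the language prediction task, here is a minimal baseline sketch: a multinomial Naive Bayes classifier over character trigrams, written with the standard library only. The technique, the `NgramNaiveBayes` class, and the toy training sentences are assumptions chosen for the example; the listing does not prescribe any particular model.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Return overlapping character n-grams from lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        # Shared vocabulary size is used in the smoothing denominator.
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods of each n-gram.
            score = math.log(self.label_counts[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data for illustration only; not taken from the dataset.
train_texts = [
    "the cat sat on the mat", "this is a small dataset",
    "le chat est sur le tapis", "ceci est un petit jeu de données",
    "el gato está en la alfombra", "este es un pequeño conjunto de datos",
]
train_labels = ["English", "English", "French", "French", "Spanish", "Spanish"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("the dataset is small")
```

With the real data, the 'Text' and 'Language' columns would replace the toy lists, and a held-out split would be needed to estimate accuracy.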

Coverage

The dataset's scope is global, with text samples in 17 languages: English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek. The listing does not specify a time range or demographic scope for the data.

License

CC0

Who Can Use It

This dataset is primarily intended for data scientists, machine learning engineers, and researchers. It is suitable for anyone working on text analysis, building language recognition systems, or exploring multilingual data. Educational institutions and students can also benefit from this resource for learning and experimentation in NLP and artificial intelligence.

Dataset Name Suggestions

  • Language Detection Dataset
  • Multilingual Text Identifier
  • NLP Language Predictor
  • Global Text Language Data
  • 17-Language Text Corpus

Attributes

Original Data Source: Language Detection

Listing Stats

Views: 1

Downloads: 1

Listed: 05/06/2025

Region: Global

UDQS Quality (Universal Data Quality Score): 5 / 5

Version: 1.0
