LinCE Hindi-English LID Dataset

Price: £0 (free)

Data Science and Analytics

Tags and Keywords

Computer, Software, NLP, Data, Text, Linguistics

About

This dataset provides Hindi-English language identification (LID) data intended for testing machine learning models [1]. It is part of the wider LinCE (Linguistic Code-switching Evaluation) collection, a benchmark for evaluating language technology on code-switched text [2]. Across its datasets, LinCE covers four tasks: language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), and sentiment analysis (SA) [2]. The collection spans four code-switched language pairs: Spanish-English, Hindi-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA) [2]. This listing covers only the Hindi-English LID test data, which makes it suited to checking how well a model identifies languages in mixed Hindi-English text [1].

Columns

The dataset contains the following key columns:

words: The words of each record, stored as a string [1].
idx: An index identifier for each record [1].
lid: The language identification label assigned to the text [1].
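
As a rough illustration of how these three columns might be read, the sketch below parses CSV rows with Python's standard library. It assumes, without confirmation from the listing, that `words` and `lid` are serialized either as Python-style list literals or as whitespace-separated text. The helper names `parse_sequence` and `load_records`, the sample row, and the `lang1`/`lang2` label values are illustrative only and are not taken from the dataset.

```python
import ast
import csv
import io

# Column names as documented in the listing.
EXPECTED_COLUMNS = {"words", "idx", "lid"}


def parse_sequence(cell):
    """Turn one CSV cell into a list of strings.

    Assumption: the cell is either a Python-style list literal such as
    "['a', 'b']" or plain whitespace-separated text.
    """
    text = cell.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            return [str(item) for item in ast.literal_eval(text)]
        except (ValueError, SyntaxError):
            pass  # not a valid literal: fall back to whitespace splitting
    return text.split()


def load_records(handle):
    """Read rows from an open CSV file and check the documented columns."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {
            "idx": row["idx"],
            "words": parse_sequence(row["words"]),
            "lid": parse_sequence(row["lid"]),
        }
        for row in reader
    ]


# Tiny in-memory stand-in for the real file (made-up content).
SAMPLE_CSV = (
    "words,idx,lid\n"
    "\"['yeh', 'movie', 'awesome', 'hai']\",0,"
    "\"['lang2', 'lang1', 'lang1', 'lang2']\"\n"
)

records = load_records(io.StringIO(SAMPLE_CSV))
print(records[0]["words"], records[0]["lid"])
```

With the real file, an open file handle would take the place of the in-memory sample. If the file stores sequences in a different style (for example, space-separated items inside brackets), the parser would need adjusting, so it is worth inspecting a few raw rows first.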

Distribution

The data file is provided in CSV format [3]. This test dataset contains 1,853 records (rows) [1]. A sample file will be uploaded to the platform separately; the structured layout of the data allows straightforward integration into analytical workflows [3].

Usage

This dataset is ideal for a variety of applications and use cases, including:

Testing machine learning models developed for language identification [1].
Training ML models for related tasks such as POS tagging or NER on code-switched text, using the wider LinCE collection [2].
Building cross-linguistic models across multiple languages [2].
Exploratory research within natural language processing (NLP) [2].
Developing multilingual sentiment analysis systems [2].
Training models to identify and classify named entities across multiple languages [2].
Developing cross-lingual translation systems [2].
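
For the first use case, testing a language identification model, a minimal scoring sketch might look like the following. It assumes that gold labels are available per word and that the model emits one predicted label per word. The function names, the example sequences, and the label values `lang1`, `lang2`, and `other` are hypothetical placeholders rather than anything specified by the listing.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Fraction of words whose predicted label equals the gold label."""
    total = correct = 0
    for gold_seq, pred_seq in zip(gold, pred):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("gold and predicted sequences differ in length")
        total += len(gold_seq)
        correct += sum(g == p for g, p in zip(gold_seq, pred_seq))
    return correct / total if total else 0.0


def per_label_f1(gold, pred):
    """F1 score for each label, computed over all words."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_seq, pred_seq in zip(gold, pred):
        for g, p in zip(gold_seq, pred_seq):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted label was wrong
                fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Made-up gold labels and model predictions for two short records.
gold = [["lang2", "lang1", "lang1", "lang2"], ["lang1", "other"]]
pred = [["lang2", "lang1", "lang2", "lang2"], ["lang1", "other"]]

accuracy = token_accuracy(gold, pred)
f1_scores = per_label_f1(gold, pred)
print(round(accuracy, 3), {k: round(v, 3) for k, v in sorted(f1_scores.items())})
```

Reporting per-label F1 alongside accuracy helps because word-level accuracy alone can hide poor performance on labels that occur rarely.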

Coverage

The dataset focuses specifically on Hindi-English language identification [1]. The wider LinCE project also covers Spanish-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA) [2]. The listed region of availability is GLOBAL; the geographic and time-range coverage of the data content itself is not specified [4].

License

CC0

Who Can Use It

This dataset is particularly useful for:

Data scientists and machine learning engineers focused on natural language processing.
NLP researchers and linguists interested in language analysis, code-switching, and multilingual models [2].
Developers and academics looking to build and test models for language identification, part-of-speech tagging, named-entity recognition, and sentiment analysis [2].
Anyone working with language data to build multilingual AI applications [2].

Dataset Name Suggestions

Hindi-English Language ID Test Data
LinCE Hindi-English LID Dataset
Multilingual Code-switching Evaluation: Hindi-English
NLP Hindi-English Language Identifier
Cross-lingual Language Detection (Hindi-English)

Attributes

Original Data Source: LinCE (Linguistic Code-switching Evaluation)

Listing Stats

Listed: 17/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
