Tamil-English Code-Switching Dataset

FREE DATASET LIBRARY

Verified Data Provider

£0

Social Media and Networking

Tags and Keywords

Social, Linguistics, Text, NLP, Tamil, Lexicon, Tanglish, Frequency

About

This dataset provides a lexicon of over 600,000 unique Tanglish words in cleaned form, each paired with its frequency. The words were extracted from more than 650,000 comments and transcripts gathered from 1,260 videos. It is a resource for Natural Language Processing (NLP) tasks involving Tamil-English mixed text, commonly called "Tanglish." The text has been preprocessed and cleaned for use as machine learning input, and its focus on Tamil-English text makes it relevant to multilingual and low-resource NLP research. Applicable tasks include text classification, sentiment analysis, and transliteration.

Columns

word: Represents a unique Tanglish or Tamil term.
count: Indicates the frequency of the specific word within the source corpus.
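
As a minimal sketch of reading this two-column layout, the snippet below parses word/count rows into a dictionary. It assumes the CSV has a header row naming the columns `word` and `count`, which the listing does not confirm. The sample rows and the `load_lexicon` helper are hypothetical and only illustrate the shape of the data.

```python
import csv
import io

# Hypothetical sample in the listed two-column layout. These rows and counts
# are invented for illustration and are not taken from the dataset.
SAMPLE_CSV = """word,count
vanakkam,120
nandri,95
super,300
"""

def load_lexicon(handle):
    """Read word,count rows from an open text handle into a dict."""
    lexicon = {}
    for row in csv.DictReader(handle):
        word = row["word"].strip()
        if not word:
            continue  # skip rows with an empty word field
        # Sum counts in case the same word appears on more than one row.
        lexicon[word] = lexicon.get(word, 0) + int(row["count"])
    return lexicon

lexicon = load_lexicon(io.StringIO(SAMPLE_CSV))
print(len(lexicon), "words loaded")
```

To read the downloaded file instead of the inline sample, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.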

Distribution

The dataset is provided in CSV format. It contains over 600,000 unique Tanglish words derived from more than 650,000 comments and transcripts. The exact number of rows in the full dataset is not specified; the provided sample shows two fields per row, a word and its count. The dataset was listed on 08/06/2025.

Usage

Suitable applications and use cases include:

Building and refining language models tailored for Tanglish.
Creating datasets for machine translation and transliteration projects.
Advancing linguistic studies focused on code-switching and low-resource languages.
General NLP tasks such as text classification, sentiment analysis, and transliteration.
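
One way a word-frequency lexicon can feed such tasks is as a vocabulary filter. The sketch below keeps words above a frequency threshold and measures how much of a piece of text the resulting vocabulary covers. The `lexicon` values, the `min_count` threshold, and the helper names are all hypothetical, and the tokenizer is a simplification intended for Latin-script Tanglish only.

```python
import re

# Hypothetical word -> count mapping standing in for the loaded dataset.
lexicon = {"vanakkam": 120, "nandri": 95, "super": 300, "semma": 2}

def build_vocab(lexicon, min_count=5):
    """Keep words seen at least min_count times (the threshold is arbitrary)."""
    return {word for word, count in lexicon.items() if count >= min_count}

def coverage(text, vocab):
    """Return (fraction of tokens found in vocab, list of unknown tokens)."""
    # Runs of letters only. This suits Latin-script text; Tamil-script words
    # contain combining vowel signs and would need a different tokenizer.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0, []
    unknown = [t for t in tokens if t not in vocab]
    return 1 - len(unknown) / len(tokens), unknown

vocab = build_vocab(lexicon)
ratio, unknown = coverage("Vanakkam friends, super video!", vocab)
print(ratio, unknown)
```

A low coverage ratio could flag text that is mostly outside the lexicon, for example plain English, while the unknown tokens show which words a Tanglish model's vocabulary would miss.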

Coverage

Regional coverage is listed as global. The linguistic scope is Tamil-English mixed text ("Tanglish"). The data comes from comments and transcripts collected from 1,260 videos. No breakdown by demographic group or year is given.

License

CC0

Who Can Use It

This dataset is particularly useful for:

The Natural Language Processing (NLP) community.
Researchers and developers working on regional and multilingual languages.
Individuals or teams focused on building and fine-tuning language models for Tanglish.
Those developing solutions for machine translation and transliteration tasks involving Tamil-English content.
Linguists interested in code-switching phenomena and low-resource language studies.

Dataset Name Suggestions

Tamil-Tanglish Word Frequency Lexicon
YouTube Comments Tanglish Word Counts
Tanglish NLP Lexicon
Multilingual Social Media Word List
Tamil-English Code-Switching Dataset

Attributes

Original Data Source: Tamil and Tanglish Transliterated Words Dataset

Listing Stats

Listed: 08/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
