Tamil-English Code-Switching Dataset
Social Media and Networking
Free
About
This dataset contains over 600,000 unique Tanglish words and their cleaned forms, extracted from more than 650,000 comments and transcripts gathered from 1,260 videos. It is a resource for Natural Language Processing (NLP) tasks involving Tamil-English mixed text, often referred to as "Tanglish." Key features are a large lexicon, text that has been preprocessed and cleaned for use as machine learning input, and a specific focus on Tamil-English text, which makes it useful for multilingual and low-resource NLP research. Applicable tasks include text classification, sentiment analysis, and transliteration.
Columns
- word: Represents a unique Tanglish or Tamil term.
- count: Indicates the frequency of the specific word within the source corpus.
Distribution
The dataset is typically provided in CSV format. It comprises over 600,000 unique Tanglish words, derived from over 650,000 comments and transcripts. The exact number of rows in the full dataset is not specified, but the file consists of word-frequency pairs, and the provided sample shows each word alongside its corresponding count. The dataset was listed on 08/06/2025.
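A minimal sketch of reading the word-frequency pairs, assuming a CSV with a header row named `word,count` to match the columns above. The sample rows and the `load_lexicon` helper are hypothetical and only illustrate the structure; they are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows; the real file's header and contents are assumptions.
SAMPLE_CSV = """word,count
vanakkam,120
super,95
nalla,60
"""


def load_lexicon(handle):
    """Read word,count rows into a dict mapping word -> integer frequency."""
    lexicon = {}
    for row in csv.DictReader(handle):
        word = row["word"].strip()
        if not word:
            continue  # skip rows with an empty word field
        # Sum counts in case the same word appears on more than one row.
        lexicon[word] = lexicon.get(word, 0) + int(row["count"])
    return lexicon


lexicon = load_lexicon(io.StringIO(SAMPLE_CSV))

# Words ordered from most to least frequent.
top_words = sorted(lexicon.items(), key=lambda kv: kv[1], reverse=True)
```

For a file on disk, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.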
Usage
This dataset suits applications such as:
- Building and refining language models tailored for Tanglish.
- Creating datasets for machine translation and transliteration projects.
- Advancing linguistic studies focused on code-switching and low-resource languages.
- General NLP tasks such as text classification, sentiment analysis, and transliteration.
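As one illustration of the uses listed above, the sketch below computes the fraction of a comment's tokens that appear in the lexicon, which could serve as a simple feature when flagging Tanglish text for classification. The lexicon entries, the tokenizer, and the `lexicon_coverage` function are all hypothetical choices made for this example.

```python
import re

# Hypothetical lexicon (word -> count); real entries would come from the CSV.
lexicon = {"vanakkam": 120, "nalla": 60, "irukku": 45}


def tokenize(text):
    """Lowercase the text and split it into runs of Latin letters.

    Assumption: only romanized tokens are matched, so words written
    in Tamil script are ignored by this simple tokenizer.
    """
    return re.findall(r"[a-z]+", text.lower())


def lexicon_coverage(text, lexicon):
    """Return the share of tokens in `text` that are present in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


score = lexicon_coverage("Vanakkam, this video nalla irukku!", lexicon)
```

A threshold on this score is only a rough heuristic; words shared between English and Tanglish spellings would need separate handling.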
Coverage
The dataset's regional coverage is global. Its linguistic scope is Tamil-English mixed text, specifically "Tanglish." The data originates from comments and transcripts collected from 1,260 videos. No further detail is given on time span or on coverage of particular groups.
License
CC0
Who Can Use It
This dataset is particularly useful for:
- The Natural Language Processing (NLP) community.
- Researchers and developers working on regional and multilingual languages.
- Individuals or teams focused on building and fine-tuning language models for Tanglish.
- Those developing solutions for machine translation and transliteration tasks involving Tamil-English content.
- Linguists interested in code-switching phenomena and low-resource language studies.
Dataset Name Suggestions
- Tamil-Tanglish Word Frequency Lexicon
- YouTube Comments Tanglish Word Counts
- Tanglish NLP Lexicon
- Multilingual Social Media Word List
- Tamil-English Code-Switching Dataset
Attributes
Original Data Source: Tamil and Tanglish Transliterated Words Dataset