Tamil YouTube Text Analysis Data
Free
About
This dataset contains YouTube comments collected from Tamil videos, provided in both their original form and a cleaned version suited to Natural Language Processing (NLP) tasks. The comments mix Tamil script, written entirely in Tamil Unicode characters, and Tanglish, which is Tamil written in English (Latin-script) transliteration. The collection is designed to support linguistic analysis and model building for code-switched text.
Columns
The dataset primarily includes comments.csv, which provides the original YouTube comments with associated metadata:
- video_id: A unique identifier for the specific YouTube video from which the comment was extracted [1].
- author: The author of the comment [1].
- text: The original comment text, encompassing both Tamil Script and Tanglish content [1].
- likes: The total number of likes the comment received [1].
- published_at: The timestamp indicating when the comment was posted [1].
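The columns listed above can be read with Python's standard csv module. The following is a minimal sketch that assumes the CSV header row uses exactly these column names and that likes holds an integer; the sample rows are made up for illustration and are not taken from the dataset. The timestamp is kept as a string because its format is not documented.

```python
import csv
import io


def load_comments(handle):
    """Read rows from an open comments.csv handle into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "video_id": row["video_id"],
            "author": row["author"],
            "text": row["text"],
            # Assumption: likes is an integer; an empty cell is treated as 0.
            "likes": int(row["likes"] or 0),
            # Kept as a string: the timestamp format is not documented.
            "published_at": row["published_at"],
        })
    return rows


# Made-up sample rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "video_id,author,text,likes,published_at\n"
    "vid001,user_a,semma video bro,3,2024-01-01T10:00:00Z\n"
    "vid001,user_b,அருமை,5,2024-01-02T11:30:00Z\n"
)
comments = load_comments(sample)

# Example query: the most-liked comment in the sample.
top = max(comments, key=lambda c: c["likes"])
print(top["text"])
```

To read the real file, pass a handle opened with open("comments.csv", newline="", encoding="utf-8") in place of the in-memory sample.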
Distribution
The dataset is provided in two main files [2]:
- comments.csv: This file contains the original comments (Tamil and Tanglish) along with metadata such as video IDs, authors, likes, and timestamps. It is in CSV format [2, 3].
- cleaned_text.txt: This is a plain text file containing cleaned versions of the comments, specifically prepared for NLP tasks. The cleaning process focuses on retaining important Tamil transliterated words while removing noise [2].
Specific numbers for rows or records are not available in the provided information [4].
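The exact cleaning procedure behind cleaned_text.txt is not documented. As a rough illustration of what "removing noise while retaining transliterated words" could look like, the sketch below strips URLs, emoji, and punctuation while keeping Tamil-block characters, ASCII letters, and digits. The function name clean_comment and every rule in it are assumptions made for this example, not the dataset author's method.

```python
import re

# Assumption: URLs count as noise and are removed first, before punctuation.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep Tamil block (U+0B80 to U+0BFF), ASCII letters, digits, and whitespace;
# everything else (emoji, punctuation, other symbols) becomes a space.
DROP_RE = re.compile(r"[^\u0B80-\u0BFFA-Za-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_comment(text):
    """Hypothetical cleaner: not the procedure used to build cleaned_text.txt."""
    text = URL_RE.sub(" ", text)
    text = DROP_RE.sub(" ", text)
    # Collapse runs of whitespace and lowercase the Latin-script words.
    return SPACE_RE.sub(" ", text).strip().lower()


print(clean_comment("Semma video!!! 😍 https://example.com/x"))
print(clean_comment("அருமை 👌👌"))
```

Lowercasing only affects Latin-script text, so Tamil-script comments pass through unchanged apart from the removed symbols.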
Usage
This dataset is well-suited for several applications and use cases, including [2]:
- Sentiment analysis of Tamil and Tanglish comments.
- Building machine learning models for code-switched Tamil-English text.
- Performing phonetic and linguistic analysis of Tanglish transliterations.
- Exploring patterns in Tamil transliterations and their alignment with native Tamil text.
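Most of these use cases start by separating Tamil-script comments from Latin-script ones. A minimal sketch of that step is shown below, using the Tamil Unicode block (U+0B80 to U+0BFF). The function name classify_script and the label names are assumptions for this example. Note that a Latin-script comment may be Tanglish or plain English; telling those apart needs a language model or word list and is not attempted here.

```python
def classify_script(text):
    """Label a comment by the scripts it contains (hypothetical helper)."""
    # Count characters in the Tamil Unicode block, U+0B80 to U+0BFF.
    tamil = sum(1 for ch in text if "\u0b80" <= ch <= "\u0bff")
    # Count ASCII letters, which cover Tanglish and plain English alike.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if tamil and latin:
        return "mixed"
    if tamil:
        return "tamil_script"
    if latin:
        return "latin_script"
    return "other"


# Made-up example comments, for illustration only.
for comment in ["அருமை", "semma video bro", "super அருமை", "123 !!!"]:
    print(comment, "->", classify_script(comment))
```

The resulting labels can be used to split the corpus before sentiment analysis or to select Latin-script comments for transliteration studies.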
Coverage
The dataset's region of coverage is Global [5]. It consists of YouTube comments from Tamil videos [2]. The time range and any demographic scope beyond the origin of the comments are not specified in the available information [4].
License
CC-BY
Who Can Use It
This dataset is particularly useful for [2]:
- Researchers focusing on natural language processing, linguistics, and code-switching phenomena.
- Data scientists and machine learning engineers developing models for text classification, sentiment analysis, or text generation in mixed-language contexts.
- Anyone interested in the cultural and linguistic patterns of online communication in Tamil-speaking communities.
Dataset Name Suggestions
- Tamil Tanglish YouTube Comments
- YouTube Tamil NLP Dataset
- Code-Switched Tamil Comments
- Tamil YouTube Text Analysis Data
- Global Tamil Comment Corpus
Attributes
Original Data Source: Tamil and Tanglish YouTube Comments for NLP