Tamil YouTube Text Analysis Data

Tags and Keywords

  • Text
  • NLP
  • Tamil

Free

About

This dataset contains YouTube comments collected from Tamil videos, provided both in their original form and in a cleaned version suited to Natural Language Processing (NLP) tasks. The comments mix Tamil Script, written entirely in Tamil Unicode characters, and Tanglish, which is Tamil written in English (Latin-alphabet) transliteration. The collection is designed to support linguistic analysis and model building for code-switched text.
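
As a rough illustration of the Tamil Script versus Tanglish distinction, the sketch below labels a comment by the scripts it contains. It relies on the Tamil Unicode block (U+0B80 to U+0BFF). The function name classify_script and the label names are hypothetical, and a "latin" result only means the comment uses Latin letters: it may be Tanglish or plain English, which this check cannot tell apart.

```python
def classify_script(text: str) -> str:
    """Label a comment as 'tamil', 'latin', 'mixed', or 'other'.

    Hypothetical helper: 'latin' covers both Tanglish and plain English.
    """
    # Count characters inside the Tamil Unicode block (U+0B80..U+0BFF).
    tamil = sum(1 for ch in text if "\u0b80" <= ch <= "\u0bff")
    # Count ASCII letters, the alphabet Tanglish is written in.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if tamil and latin:
        return "mixed"
    if tamil:
        return "tamil"
    if latin:
        return "latin"
    return "other"


labels = [classify_script(t) for t in ["நன்றி", "romba nalla irukku", "super நன்றி"]]
```

Separating Tanglish from English within the "latin" group would likely need a word list or a trained language identifier, which is outside this sketch.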

Columns

The dataset primarily includes comments.csv, which provides the original YouTube comments with associated metadata:
  • video_id: A unique identifier for the specific YouTube video from which the comment was extracted [1].
  • author: The author of the comment [1].
  • text: The original comment text, encompassing both Tamil Script and Tanglish content [1].
  • likes: The total number of likes the comment received [1].
  • published_at: The timestamp indicating when the comment was posted [1].
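
A minimal sketch of reading these columns with Python's standard csv module follows. It assumes comments.csv is UTF-8 encoded and has a header row using exactly the column names listed above; neither is confirmed by the listing. The Comment class, the read_comments function, and the two sample rows are hypothetical. The published_at value is kept as a raw string because the timestamp format is not documented.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Comment:
    video_id: str
    author: str
    text: str
    likes: int
    published_at: str  # raw string; the timestamp format is not documented


def read_comments(handle):
    """Parse rows from an open comments.csv-style file into Comment objects."""
    comments = []
    for row in csv.DictReader(handle):
        try:
            likes = int(row["likes"])
        except (TypeError, ValueError):
            likes = 0  # assumption: treat a missing or malformed count as zero
        comments.append(
            Comment(
                video_id=row["video_id"],
                author=row["author"],
                text=row["text"],
                likes=likes,
                published_at=row["published_at"],
            )
        )
    return comments


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "video_id,author,text,likes,published_at\n"
    "abc123,user1,romba nalla irukku,3,2024-01-01T00:00:00Z\n"
    "abc123,user2,நன்றி,,2024-01-02T00:00:00Z\n"
)
comments = read_comments(sample)
```

For the real file, the same function could be called as read_comments(open("comments.csv", encoding="utf-8", newline="")), ideally inside a with block.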

Distribution

The dataset is provided in two main files [2]:
  • comments.csv: This file contains the original comments (Tamil and Tanglish) along with metadata such as video IDs, authors, likes, and timestamps. It is in CSV format [2, 3].
  • cleaned_text.txt: A plain text file containing cleaned versions of the comments, prepared for NLP tasks. The cleaning process focuses on retaining important transliterated Tamil words while removing noise [2].

Row or record counts are not available in the provided information [4].
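
The listing does not describe the cleaning steps used to produce cleaned_text.txt. The sketch below is therefore not the dataset's actual pipeline, only one plausible way to remove noise while keeping Tamil and transliterated words. It assumes that "noise" means URLs, emoji, digits, and punctuation; the function name clean_comment and both patterns are hypothetical.

```python
import re

# Assumption: URLs count as noise.
URL_RE = re.compile(r"https?://\S+")
# Keep ASCII letters, the Tamil Unicode block, and whitespace; drop the rest
# (emoji, digits, punctuation).
NOISE_RE = re.compile(r"[^A-Za-z\u0b80-\u0bff\s]")


def clean_comment(text: str) -> str:
    """Return a lowercased, whitespace-normalized comment without noise."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    # Lowercasing only affects Latin letters; Tamil script has no case.
    return " ".join(text.lower().split())


cleaned = clean_comment("Semma!!! 🔥 https://example.com நன்றி")
```

Dropping digits is a design choice that may not suit every task, since Tanglish writers sometimes use numerals inside words.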

Usage

This dataset is well-suited for several applications and use cases, including [2]:
  • Sentiment analysis of Tamil and Tanglish comments.
  • Building machine learning models for code-switched Tamil-English text.
  • Performing phonetic and linguistic analysis of Tanglish transliterations.
  • Exploring patterns in Tamil transliterations and their alignment with native Tamil text.
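
As one small example of exploring transliteration patterns, the sketch below groups Tanglish spellings that differ only in repeated letters, such as elongated vowels. This is a deliberately crude, lossy heuristic chosen for illustration, not a method from the dataset: it also collapses legitimately doubled letters. The function names variant_key and group_variants and the sample tokens are hypothetical.

```python
import re
from collections import defaultdict


def variant_key(token: str) -> str:
    """Collapse runs of a repeated character so 'supeeer' and 'super' match."""
    return re.sub(r"(.)\1+", r"\1", token.lower())


def group_variants(tokens):
    """Map each collapsed key to its spellings, keeping keys with 2+ variants."""
    groups = defaultdict(set)
    for token in tokens:
        groups[variant_key(token)].add(token.lower())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


variants = group_variants(["romba", "rombaa", "Romba", "nalla", "super", "supeeer"])
```

A phonetic key or an edit-distance threshold would likely catch more variants, at the cost of more false matches.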

Coverage

The dataset's region of coverage is Global [5]. It consists of YouTube comments from Tamil videos [2]. The time range and demographic scope, beyond the origin of the comments, are not specified in the available information [4].

License

CC-BY

Who Can Use It

This dataset is particularly useful for [2]:
  • Researchers focusing on natural language processing, linguistics, and code-switching phenomena.
  • Data scientists and machine learning engineers developing models for text classification, sentiment analysis, or text generation in mixed-language contexts.
  • Anyone interested in the cultural and linguistic patterns of online communication in Tamil-speaking communities.

Dataset Name Suggestions

  • Tamil Tanglish YouTube Comments
  • YouTube Tamil NLP Dataset
  • Code-Switched Tamil Comments
  • Tamil YouTube Text Analysis Data
  • Global Tamil Comment Corpus

Attributes

Listing Stats

  • Views: 2
  • Downloads: 0
  • Listed: 08/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
