Movie Subtitle Sentiment Analysis Dataset

Entertainment & Media Consumption

Tags and Keywords

NLP

Statistical Analysis

Segmentation

Sentence Similarity


Free

About

This dataset contains processed movie subtitle data curated for sentiment analysis. It comprises matched pairs of Slovak and English subtitles sourced from ten movies [1]. Each original subtitle is paired with its machine translation, generated by Google Translate, and a sentiment score assigned by an OpenAI GPT model [1]. Sentiment results from the IBM Watson Natural Language Understanding (IBM NLU) service are also provided for each segment [1]. In addition, the dataset includes BLEU and TER validation metrics, which quantify the accuracy and error rates of the machine translations [1]. It is suitable for academic and research work on natural language processing, machine translation, and sentiment analysis [1].

Columns

The dataset typically includes the following columns:
  • sentence_id: A unique identifier for each subtitle segment or sentence pair [2].
  • text_sk: The original subtitle text in Slovak [1, 2].
  • text_en: The original subtitle text in English [1, 2].
  • text_en_mt_gt: The English machine translation of the Slovak text, produced by Google Translate [1, 2].
  • categoryID: An identifier for the sentiment category, such as 'neutral' or 'other' [2].
  • BLEU-1_GT: The BLEU-1 score, which evaluates the quality of machine-translated text by its unigram matches against the reference text [1, 2]. The _GT suffix most likely denotes Google Translate, as in text_en_mt_gt.
  • BLEU-2_GT: The BLEU-2 score, based on bigram matches [1, 2].
  • BLEU-3_GT: The BLEU-3 score, based on trigram matches [1, 2].
  • BLEU-4_GT: The BLEU-4 score, based on four-gram matches [1, 2].
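  • TER_GT: The Translation Edit Rate (TER) score, which measures the number of edits required to change a machine translation into a human reference translation [1, 2]. TER is conventionally normalised by the length of the reference, so lower values indicate better translations.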
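
To make the BLEU-n and TER columns more concrete, the following is a minimal sketch of how such scores can be computed for a single sentence pair. It is an illustration only: the listing does not say which tool, tokenisation, or smoothing produced the values in the dataset, so the functions bleu and simple_ter, the whitespace tokenisation, and the example sentences are all assumptions. The TER variant here is simplified (word-level edit distance without the shift operation of full TER).

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Unsmoothed sentence-level BLEU-max_n with uniform weights and brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if not candidate or min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalise candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


def simple_ter(candidate, reference):
    """Word-level edit distance divided by reference length (assumption: no shifts)."""
    prev = list(range(len(reference) + 1))
    for i, cand_tok in enumerate(candidate, start=1):
        curr = [i]
        for j, ref_tok in enumerate(reference, start=1):
            cost = 0 if cand_tok == ref_tok else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference) if reference else 0.0


# Invented example sentences; they are not taken from the dataset.
reference = "i will see you tomorrow".split()
candidate = "i will see you later".split()

print("BLEU-1:", bleu(candidate, reference, 1))
print("BLEU-2:", bleu(candidate, reference, 2))
print("TER (simplified):", simple_ter(candidate, reference))
```

In this example four of the five candidate words match the reference, and one substitution over five reference words is needed to repair the translation. Values in the dataset may differ from what this sketch produces for the same text if a different tokeniser, smoothing method, or full TER with shifts was used.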

Distribution

The dataset contains processed movie subtitle data from 10 movies [1]. While specific row counts are not detailed, some columns show over 7,000 total values [2]. Data files are typically provided in CSV format [3]. The dataset is designed for global use [4].
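
Because exact row counts are not stated, a quick way to check them is to count rows and non-empty values per column after downloading the CSV. The sketch below uses only the Python standard library. The helper column_fill_counts is hypothetical, and the sample rows are invented for illustration so the block can run without the real file.

```python
import csv
import io

# Invented sample rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """sentence_id,text_sk,text_en,text_en_mt_gt,categoryID,BLEU-1_GT,TER_GT
1,Ahoj.,Hello.,Hi.,neutral,0.0,1.0
2,Ďakujem.,Thank you.,Thank you.,other,1.0,0.0
3,,Wait.,,neutral,,
"""


def column_fill_counts(handle):
    """Return (row_count, {column: number of non-empty values}) for a CSV handle."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in reader.fieldnames or []}
    rows = 0
    for row in reader:
        rows += 1
        for name in counts:
            if (row.get(name) or "").strip():
                counts[name] += 1
    return rows, counts


rows, counts = column_fill_counts(io.StringIO(SAMPLE_CSV))
print("rows:", rows)
for name, filled in counts.items():
    print(f"{name}: {filled} non-empty values")
```

For the downloaded file, the same function can be called on an opened file object, for example open("subtitles.csv", newline="", encoding="utf-8"), where the filename is a placeholder for whatever name the download actually has.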

Usage

This dataset is ideal for applications and research in:
  • Sentiment Analysis: Analysing sentiment in original and machine-translated texts [1].
  • Machine Translation Quality Evaluation: Assessing the performance and error rates of machine translation systems using BLEU and TER metrics [1].
  • Natural Language Processing (NLP): Developing and testing NLP models, particularly those focused on multilingual text processing and sentiment detection [1].
  • AI and Machine Learning Research: Training and validating models related to text understanding, translation, and sentiment prediction [1].
  • Linguistics and Digital Humanities: Studying cross-lingual sentiment and the nuances of movie subtitle translation [1].
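
As one possible starting point for the translation quality and sentiment use cases above, the sketch below averages a metric column per sentiment category. The function mean_by_category is hypothetical, the rows are invented, and it assumes that categoryID and TER_GT are read as strings from the CSV, as csv.DictReader would return them.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows, metric):
    """Average a numeric metric per categoryID, skipping blank or non-numeric values."""
    buckets = defaultdict(list)
    for row in rows:
        try:
            value = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # Missing or unparsable value: leave it out of the average.
        buckets[row.get("categoryID", "")].append(value)
    return {category: mean(values) for category, values in buckets.items()}


# Invented rows for illustration only; they are not taken from the dataset.
rows = [
    {"categoryID": "neutral", "TER_GT": "0.2"},
    {"categoryID": "neutral", "TER_GT": "0.4"},
    {"categoryID": "other", "TER_GT": "1.0"},
    {"categoryID": "other", "TER_GT": ""},
]

result = mean_by_category(rows, "TER_GT")
for category, value in sorted(result.items()):
    print(f"{category}: mean TER_GT = {value:.3f}")
```

A comparison like this can hint at whether machine translation quality varies with sentiment category, although any such pattern would need checking against the full data.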

Coverage

The dataset covers the Slovak and English languages [1] and is derived from 10 different movies [1]. Its scope is global [4]. The data consists of subtitles processed for sentiment analysis and machine translation evaluation [1].

License

CC0

Who Can Use It

  • Data Scientists and AI/ML Engineers: For building and refining sentiment analysis models and machine translation evaluation systems.
  • Researchers and Academics: Those studying natural language processing, computational linguistics, and machine translation quality.
  • Language Technologists: Professionals working on tools for multilingual content and sentiment detection.
  • Students: For educational projects and dissertations in AI, NLP, and data science.

Dataset Name Suggestions

  • Movie Subtitle Sentiment Analysis Dataset
  • Slovak-English MT Sentiment Data
  • Google Translate & GPT Sentiment Scores for Movie Subtitles
  • BLEU-TER Metrics for Slovak Movie Translations
  • Cross-Lingual Movie Sentiment Analysis

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • UDQS Quality (Universal Data Quality Score): 5 / 5
  • Version: 1.0
