Opendatabay APP

Movie Text-to-Genre Dataset

Entertainment & Media Consumption

Tags and Keywords

Arts

Movies

Ratings

NLP

BERT

Free

About

The dataset is structured into two main CSV files:
  • movies_overview.csv:
    • title: The title of the movie.
    • overview: A brief description or synopsis of the movie.
    • genre_ids: One or more numerical identifiers corresponding to the movie's genre(s).
  • movies_genres.csv:
    • id: A unique identifier for each genre.
    • name: The corresponding genre name (e.g., Action, Comedy, Drama). This file enables mapping genre_ids from movies_overview.csv to their actual genre names.
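The mapping between the two files can be sketched as follows. This is a minimal illustration rather than an official loader: the sample rows, the genre IDs 1 to 3, and the way genre_ids is serialised (a bracketed list in one row, a bare number in the other) are all assumptions, since the listing does not show the raw cell format. The helper names are hypothetical.

```python
import csv
import io
import re

# Made-up sample rows standing in for the real files. The titles, IDs and
# the genre_ids cell formats are assumptions for illustration only.
GENRES_CSV = """id,name
1,Action
2,Comedy
3,Drama
"""

OVERVIEW_CSV = """title,overview,genre_ids
Example Film A,"A detective chases a thief across the city.","[1, 3]"
Example Film B,"Two friends open a failing bakery.",2
"""


def load_genre_names(handle):
    """Build an {id: name} lookup from a movies_genres.csv-style file."""
    return {int(row["id"]): row["name"] for row in csv.DictReader(handle)}


def parse_genre_ids(raw):
    """Extract every integer from a genre_ids cell, whatever the delimiter."""
    return [int(token) for token in re.findall(r"\d+", raw)]


def attach_genre_names(handle, id_to_name):
    """Yield each movie row with an added 'genres' list of genre names."""
    for row in csv.DictReader(handle):
        ids = parse_genre_ids(row["genre_ids"])
        # Unknown IDs stay visible instead of being silently dropped.
        row["genres"] = [id_to_name.get(i, f"unknown:{i}") for i in ids]
        yield row


id_to_name = load_genre_names(io.StringIO(GENRES_CSV))
movies = list(attach_genre_names(io.StringIO(OVERVIEW_CSV), id_to_name))
for movie in movies:
    print(movie["title"], "->", movie["genres"])
```

With the real files, each io.StringIO(...) call would be replaced by open("movies_overview.csv", newline="", encoding="utf-8") or the equivalent for movies_genres.csv. The regular expression is a deliberately tolerant choice so the sketch works whether the IDs are comma-separated, bracketed, or pipe-delimited.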

Distribution

The dataset is provided as two CSV files: movies_overview.csv and movies_genres.csv. The movies_genres.csv file contains 19 unique genre identifiers, each mapped to a specific genre name. The number of rows in movies_overview.csv is not stated in the listing.

Usage

This dataset suits a range of NLP tasks and applications, including:
  • Multi-Label Genre Classification: The primary proposed task is to design and train an NLP model that accurately predicts one or more genres given a movie's overview text.
  • Text Classification Research: Experimenting with different text classification methods, from classical techniques like bag-of-words and TF-IDF combined with logistic regression or random forest, to modern deep learning approaches such as LSTM-based networks and transformer models (e.g., BERT, RoBERTa).
  • Data Preprocessing Exercises: Practising text cleaning (lowercasing, removing special characters, tokenisation) and label preparation (transforming genre IDs into a multi-label format using the movies_genres.csv mapping).
  • Model Evaluation: Utilising multi-label specific evaluation metrics like F1 Score (Macro/Micro), Hamming Loss, and Subset Accuracy.
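The label preparation and evaluation steps listed above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the three genre IDs and the true and predicted label sets are invented, the function names are hypothetical, and returning 0.0 for an F1 score with an empty denominator is a convention chosen here. In practice a library such as scikit-learn provides equivalent metrics.

```python
def multi_hot(label_sets, classes):
    """Turn label collections into 0/1 rows, one column per entry in classes."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    slots = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / slots


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose complete label row is predicted exactly."""
    exact = sum(tr == pr for tr, pr in zip(y_true, y_pred))
    return exact / len(y_true)


def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; 0.0 when there is nothing to score (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) computed over all label columns."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts across labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Invented example: three genre IDs, three movies.
classes = [1, 2, 3]
true_sets = [{1, 3}, {2}, {1}]
pred_sets = [{1}, {2}, {1, 2}]

y_true = multi_hot(true_sets, classes)
y_pred = multi_hot(pred_sets, classes)
micro, macro = micro_macro_f1(y_true, y_pred)

print("Hamming loss:   ", round(hamming_loss(y_true, y_pred), 4))
print("Subset accuracy:", round(subset_accuracy(y_true, y_pred), 4))
print("Micro F1:       ", round(micro, 4))
print("Macro F1:       ", round(macro, 4))
```

Macro F1 weights every genre equally, so rare genres that are never predicted pull it down sharply; micro F1 pools all label decisions and is dominated by the frequent genres. Reporting both gives a fuller picture when the 19 genres are unevenly represented.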

Coverage

Coverage is global: the movie titles and overviews are not restricted to a specific geographical region. The time range of the movies and any demographic scope are not stated in the listing.

License

CC-BY

Who Can Use It

This dataset is ideal for:
  • NLP Researchers and Practitioners: For developing, training, and evaluating advanced NLP models for text classification.
  • Data Scientists: Those interested in applying machine learning to textual data and exploring multi-label prediction problems.
  • Students and Learners: Especially those new to multi-label NLP tasks. Starter notebooks and detailed documentation are encouraged, to lower the entry barrier.
  • Competitors in Data Challenges: Individuals participating in competitions focused on text processing and multi-label classification.

Dataset Name Suggestions

  • IMDb Movie Genre Classification Dataset
  • Film Overview Genre Predictor
  • Multi-Label Movie Synopsis Data
  • Movie Text-to-Genre Dataset

Listing Stats

  • Views: 3
  • Downloads: 0
  • Listed: 08/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
