Movie Text-to-Genre Dataset

Entertainment & Media Consumption

Tags and Keywords

Arts

Movies

Ratings

NLP

BERT

Free

About

The dataset is structured into two main CSV files:

movies_overview.csv:
- title: The title of the movie.
- overview: A brief description or synopsis of the movie.
- genre_ids: One or more numerical identifiers corresponding to the movie's genre(s).

movies_genres.csv:
- id: A unique identifier for each genre.
- name: The corresponding genre name (e.g., Action, Comedy, Drama).

The movies_genres.csv file makes it possible to map the genre_ids in movies_overview.csv to their actual genre names.
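The mapping between the two files can be sketched as follows. This is a minimal illustration using only the Python standard library, not the dataset author's own loader. The sample rows, the genre IDs, and the "[28, 18]" string format of the genre_ids cell are assumptions made for the example; the listing does not say how multiple IDs are stored in a cell, so the hypothetical helper parse_genre_ids simply extracts every integer it finds. The helper names load_genre_map and load_movies are likewise made up for illustration.

```python
import csv
import io
import re

# Hypothetical sample rows standing in for the real files. The IDs, titles,
# and the "[28, 18]" cell format for genre_ids are assumptions.
GENRES_CSV = """id,name
28,Action
35,Comedy
18,Drama
"""

OVERVIEW_CSV = """title,overview,genre_ids
Example Heist,"A crew plans one last job.","[28, 18]"
Example Wedding,"Two families clash at a wedding.",[35]
"""


def load_genre_map(handle):
    """Return {genre id: genre name} from a movies_genres.csv-style file."""
    return {int(row["id"]): row["name"] for row in csv.DictReader(handle)}


def parse_genre_ids(raw):
    """Extract integer IDs from a cell such as "[28, 18]", "28,18" or "28"."""
    return [int(token) for token in re.findall(r"\d+", raw)]


def load_movies(handle, genre_map):
    """Yield (title, overview, genre names, multi-hot vector) per movie."""
    ordered_ids = sorted(genre_map)  # fixed column order for the label vector
    for row in csv.DictReader(handle):
        ids = parse_genre_ids(row["genre_ids"])
        names = [genre_map[i] for i in ids if i in genre_map]
        multi_hot = [1 if gid in ids else 0 for gid in ordered_ids]
        yield row["title"], row["overview"], names, multi_hot


genre_map = load_genre_map(io.StringIO(GENRES_CSV))
movies = list(load_movies(io.StringIO(OVERVIEW_CSV), genre_map))
for title, _overview, names, multi_hot in movies:
    print(title, names, multi_hot)
```

To run this against the real files, the io.StringIO objects would be replaced with open("movies_genres.csv", newline="", encoding="utf-8") and the equivalent call for movies_overview.csv, after checking how genre_ids is actually encoded. With all 19 genres loaded, each multi-hot vector would have 19 columns.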

Distribution

The dataset is provided in CSV format as two files: movies_overview.csv and movies_genres.csv. The movies_genres.csv file contains 19 unique genre identifiers, each mapped to a specific genre name. The number of rows in movies_overview.csv is not stated.

Usage

This dataset is well suited to a range of NLP tasks and applications, including:

- Multi-Label Genre Classification: The primary proposed task is to design and train an NLP model that accurately predicts one or more genres given a movie's overview text.
- Text Classification Research: Experimenting with different text classification methods, from classical techniques such as bag-of-words and TF-IDF combined with logistic regression or random forest, to modern deep learning approaches such as LSTM-based networks and transformer models (e.g., BERT, RoBERTa).
- Data Preprocessing Exercises: Practising text cleaning (lowercasing, removing special characters, tokenisation) and label preparation (transforming genre IDs into a multi-label format using the movies_genres.csv mapping).
- Model Evaluation: Utilising multi-label-specific evaluation metrics such as F1 Score (Macro/Micro), Hamming Loss, and Subset Accuracy.
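The evaluation metrics named above can be illustrated with a small pure-Python sketch. It assumes that true and predicted labels are already encoded as multi-hot lists with one column per genre; the toy vectors below use three columns instead of 19 and are invented for the example. The function names are hypothetical, and the choice to score a label as 0.0 when it has no true or predicted positives is an assumption of this sketch.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


def _f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) over all label columns."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 values, so rare genres count equally.
    macro = sum(_f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts across labels first, so frequent genres dominate.
    micro = _f1(*(sum(col) for col in zip(*counts)))
    return micro, macro


# Toy data: 4 movies, 3 genre columns (invented for illustration).
y_true = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(f"Hamming loss:    {hamming_loss(y_true, y_pred):.3f}")
print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.3f}")
print(f"Micro F1:        {micro_f1:.3f}")
print(f"Macro F1:        {macro_f1:.3f}")
```

In practice, scikit-learn provides the same measures through sklearn.metrics.hamming_loss, sklearn.metrics.f1_score (with average="micro" or average="macro"), and sklearn.metrics.accuracy_score, which computes subset accuracy when given multi-label indicator arrays. Subset accuracy is the strictest of the four because a single wrong genre makes the whole prediction count as incorrect.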

Coverage

Coverage is global: the movie titles and overviews are not restricted to a specific geographical region. The time range of the movies and any demographic scope are not specified.

License

CC-BY

Who Can Use It

This dataset is ideal for:

- NLP Researchers and Practitioners: For developing, training, and evaluating advanced NLP models for text classification.
- Data Scientists: Those interested in applying machine learning to textual data and exploring multi-label prediction problems.
- Students and Learners: Especially those new to multi-label NLP tasks; starter notebooks and detailed documentation are encouraged to lower the barrier to entry.
- Competitors in Data Challenges: Individuals participating in competitions focused on text processing and multi-label classification.

Dataset Name Suggestions

IMDb Movie Genre Classification Dataset
Film Overview Genre Predictor
Multi-Label Movie Synopsis Data
Movie Text-to-Genre Dataset

Attributes

Original Data Source: IMDb Movie Genre Classification Dataset

Listing Stats

LISTED

08/06/2025

REGION

GLOBAL

QUALITY

5 / 5

VERSION

1.0