Movie Text-to-Genre Dataset
Entertainment & Media Consumption
Free
About
The dataset is structured into two main CSV files:
- movies_overview.csv:
  - title: The title of the movie.
  - overview: A brief description or synopsis of the movie.
  - genre_ids: One or more numerical identifiers corresponding to the movie's genre(s).
- movies_genres.csv:
  - id: A unique identifier for each genre.
  - name: The corresponding genre name (e.g., Action, Comedy, Drama).
  This file enables mapping genre_ids from movies_overview.csv to their actual genre names.
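The mapping between the two files can be sketched with Python's standard csv module. This is a minimal illustration, not the dataset's official loader: the sample rows, the specific id values, and the bracketed genre_ids format are all assumptions, since the listing does not say how multiple ids are stored in one cell. The helper names (load_genre_names, parse_genre_ids, attach_genre_names) are hypothetical.

```python
import csv
import io
import re

# Tiny in-memory stand-ins for the two files. The rows, the id values and the
# bracketed genre_ids format are assumptions, not taken from the dataset.
GENRES_CSV = """id,name
28,Action
35,Comedy
18,Drama
"""

OVERVIEW_CSV = '''title,overview,genre_ids
Example Movie A,"A retired agent is pulled back for one last job.","[28, 18]"
Example Movie B,"Two roommates start a doomed catering business.",[35]
'''


def load_genre_names(handle):
    """Build {genre id: genre name} from movies_genres.csv-style rows."""
    return {int(row["id"]): row["name"] for row in csv.DictReader(handle)}


def parse_genre_ids(raw):
    """Pull every integer out of the raw cell, whatever the delimiters are."""
    return [int(tok) for tok in re.findall(r"\d+", raw)]


def attach_genre_names(handle, id_to_name):
    """Yield each movie row with an added 'genres' list of genre names."""
    for row in csv.DictReader(handle):
        ids = parse_genre_ids(row["genre_ids"])
        # Unknown ids stay visible instead of being silently dropped.
        row["genres"] = [id_to_name.get(i, f"unknown:{i}") for i in ids]
        yield row


id_to_name = load_genre_names(io.StringIO(GENRES_CSV))
movies = list(attach_genre_names(io.StringIO(OVERVIEW_CSV), id_to_name))
for movie in movies:
    print(movie["title"], "->", movie["genres"])
```

To run this against the real files, the io.StringIO objects would be replaced with open("movies_genres.csv", newline="", encoding="utf-8") and the equivalent for movies_overview.csv. The regex-based parser is deliberately tolerant, so it should cope with comma-, space-, or pipe-separated ids as well as bracketed lists.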
Distribution
The dataset is provided in CSV file format. It comprises two files: movies_overview.csv and movies_genres.csv. The movies_genres.csv file contains 19 unique genre identifiers, each mapped to a specific genre name. Specific numbers for rows or records within movies_overview.csv are not available in the provided information.
Usage
This dataset is well suited to a variety of NLP tasks and applications, including:
- Multi-Label Genre Classification: The primary proposed task is to design and train an NLP model that accurately predicts one or more genres given a movie's overview text.
- Text Classification Research: Experimenting with different text classification methods, from classical techniques like bag-of-words and TF-IDF combined with logistic regression or random forest, to modern deep learning approaches such as LSTM-based networks and transformer models (e.g., BERT, RoBERTa).
- Data Preprocessing Exercises: Practising text cleaning (lowercasing, removing special characters, tokenisation) and label preparation (transforming genre IDs into a multi-label format using the movies_genres.csv mapping).
- Model Evaluation: Utilising multi-label specific evaluation metrics like F1 Score (Macro/Micro), Hamming Loss, and Subset Accuracy.
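The label preparation and evaluation steps listed above can be illustrated in plain Python. The sketch below turns lists of genre IDs into multi-hot rows and computes Hamming loss, subset accuracy, and micro/macro F1 from their definitions. The genre IDs and the "predictions" are made-up toy values, and the function names are hypothetical; in practice scikit-learn's hamming_loss, accuracy_score, and f1_score (with average="micro" or "macro") would normally be used on such indicator matrices instead.

```python
def to_multi_hot(genre_id_lists, all_ids):
    """Turn lists of genre ids into fixed-length 0/1 rows, one column per id."""
    column = {gid: k for k, gid in enumerate(sorted(all_ids))}
    rows = []
    for ids in genre_id_lists:
        row = [0] * len(column)
        for gid in ids:
            row[column[gid]] = 1
        rows.append(row)
    return rows


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose complete label set is predicted exactly."""
    return sum(yt == yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


def _f1(tp, fp, fn):
    # Convention chosen here: F1 is 0.0 when a label has no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) computed over the label columns."""
    counts = []
    for col in range(len(y_true[0])):
        pairs = [(yt[col], yp[col]) for yt, yp in zip(y_true, y_pred)]
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        counts.append((tp, fp, fn))
    # Micro: pool the counts across labels; macro: average per-label F1.
    micro = _f1(*(sum(c[k] for c in counts) for k in range(3)))
    macro = sum(_f1(*c) for c in counts) / len(counts)
    return micro, macro


# Toy values only: three movies, three assumed genre ids.
all_ids = [18, 28, 35]
y_true = to_multi_hot([[28, 18], [35], [18]], all_ids)
y_pred = to_multi_hot([[28], [35], [18, 35]], all_ids)

micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Subset accuracy:", subset_accuracy(y_true, y_pred))
print("Micro F1:", micro_f1, "Macro F1:", macro_f1)
```

Subset accuracy is the strictest of the three, since a single wrong genre makes the whole sample count as a miss, while Hamming loss and F1 give partial credit per label. Macro F1 weights every genre equally, which matters if some of the 19 genres turn out to be rare.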
Coverage
The dataset's coverage is global: the movie titles and overviews are not restricted to a specific geographical region. Specific time ranges for the movies and demographic scopes are not available in the provided sources.
License
CC-BY
Who Can Use It
This dataset is ideal for:
- NLP Researchers and Practitioners: For developing, training, and evaluating advanced NLP models for text classification.
- Data Scientists: Those interested in applying machine learning to textual data and exploring multi-label prediction problems.
- Students and Learners: Especially those new to multi-label NLP tasks. Starter notebooks and detailed documentation are encouraged to lower the entry barrier.
- Competitors in Data Challenges: Individuals participating in competitions focused on text processing and multi-label classification.
Dataset Name Suggestions
- IMDb Movie Genre Classification Dataset
- Film Overview Genre Predictor
- Multi-Label Movie Synopsis Data
- Movie Text-to-Genre Dataset
Attributes
Original Data Source: IMDb Movie Genre Classification Dataset