Opendatabay APP

Movie Genre Classification Dataset

Product Reviews & Feedback

Tags and Keywords

Movie, Genre, Classification, Machine, Learning, Synthetic

Free

About

This dataset contains 50,000 fictional movie records designed for education and experimentation in machine learning. It helps beginners practise classification by predicting a movie's genre from both a textual description and metadata such as rating, duration, and country. Because it mixes free text with tabular fields, it can also be used to prototype recommendation systems and content organisation tools, and to explore supervised learning problems in natural language processing and tabular feature modelling. It is lightweight, clean, and ready to use without further preparation.

Columns

The dataset contains 17 columns:
  • Title: The fictional name of the movie.
  • Year: The year the movie was released, ranging from 1980 to 2023.
  • Director: The name of the movie's director.
  • Duration: The length of the movie in minutes, typically between 80 and 180 minutes.
  • Rating: An IMDb-style rating, ranging from 4.0 to 9.9.
  • Votes: The estimated number of audience votes, from 516 to 500,000.
  • Description: A short plot summary of the movie, providing textual context.
  • Language: The primary language of the movie.
  • Country: The country where the movie was produced.
  • Budget_USD: The estimated production budget in US dollars, from 1.14 million to 198 million.
  • BoxOffice_USD: The estimated box office revenue in US dollars, from 3.29 million to 993 million.
  • Genre: The target variable, representing one of seven predefined genres: Action, Comedy, Drama, Romance, Thriller, Horror, and Fantasy.
  • Production_Company: The name of the studio or distributor.
  • Content_Rating: The movie's content rating (e.g., PG, NC-17).
  • Lead_Actor: The name of the main actor or actress.
  • Num_Awards: The number of awards the movie has won, ranging from 0 to 20.
  • Critic_Reviews: The number of critic reviews received, from 0 to 1000.
The 'Genre' column serves as the target variable, with all other fields acting as potential inputs for prediction.
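
As a minimal sketch of how the column list above maps onto a feature/target split, the following uses only the Python standard library. It assumes the CSV header uses exactly the column names shown in this listing; the helper names `read_records` and `split_features_target` and the single sample record are hypothetical, made up for illustration.

```python
import csv
import io

# Column names as given in this listing (assumed to match the CSV header).
EXPECTED_COLUMNS = [
    "Title", "Year", "Director", "Duration", "Rating", "Votes",
    "Description", "Language", "Country", "Budget_USD", "BoxOffice_USD",
    "Genre", "Production_Company", "Content_Rating", "Lead_Actor",
    "Num_Awards", "Critic_Reviews",
]
TARGET = "Genre"


def read_records(handle):
    """Read CSV records and check that every listed column is present."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def split_features_target(rows, target=TARGET):
    """Separate each record into a feature dict and a target label."""
    features, labels = [], []
    for row in rows:
        row = dict(row)  # copy so the caller's records stay untouched
        labels.append(row.pop(target))
        features.append(row)
    return features, labels


# Hypothetical one-row sample standing in for the real file.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    + "Example Film,1999,A. Director,120,7.5,1000,A short plot.,English,Japan,"
      "5000000,20000000,Drama,Example Studio,PG,A. Actor,2,150\n"
)
rows = read_records(sample)
X, y = split_features_target(rows)
print(len(X[0]), y)  # 16 ['Drama']
```

To run this against the real file, pass a handle opened with `newline=""` and `encoding="utf-8"` instead of the in-memory sample. All values are read as strings, so numeric columns such as Year or Rating would still need converting before modelling.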

Distribution

The dataset consists of 50,000 fictional movie records in 17 columns. It is provided as a single UTF-8 encoded CSV file, movie_genre_classification_final.csv, with a size of 8.56 MB. It includes both structured metadata and textual summaries, making it suitable for a variety of analytical and machine learning tasks. All 50,000 records are valid across all columns, with no missing or mismatched values.
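
The record count and the absence of missing values can be checked after download. The sketch below is one way to do that with the standard library; the function name `summarize_csv` and the two-row demo data are hypothetical, and "missing" is assumed here to mean an empty or whitespace-only cell.

```python
import csv
import io


def summarize_csv(handle):
    """Count records and empty cells per column in a CSV stream."""
    reader = csv.DictReader(handle)
    empty_cells = {name: 0 for name in reader.fieldnames or []}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in empty_cells:
            value = row.get(name)
            # Short rows yield None; blank cells yield an empty string.
            if value is None or value.strip() == "":
                empty_cells[name] += 1
    return n_rows, empty_cells


# Hypothetical demo data with one deliberately empty Genre cell.
demo = io.StringIO("Title,Genre\nFilm A,Drama\nFilm B,\n")
n_rows, empty_cells = summarize_csv(demo)
print(n_rows, empty_cells)  # 2 {'Title': 0, 'Genre': 1}
```

For the real file, open movie_genre_classification_final.csv with `newline=""` and `encoding="utf-8"` and pass the handle to `summarize_csv`. If the listing is accurate, the result should be 50,000 rows and a zero count for every column.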

Usage

This dataset suits a range of use cases, including:
  • Practising movie genre classification.
  • Building recommendation systems and content organisation tools.
  • Engaging in supervised learning problems, particularly in natural language processing (NLP) and tabular feature modelling.
  • Conducting exploratory data analysis (EDA).
  • Implementing text preprocessing and vectorisation techniques.
  • Developing multiclass classification models using various machine learning algorithms.
  • Applying NLP techniques such as TF-IDF, word embeddings, or transformers.
  • Creating models that combine both structured and unstructured data effectively.
  • Learning workflows related to data cleaning, feature engineering, model evaluation and validation, and deploying classification pipelines.
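
To illustrate the text vectorisation step mentioned in the list above, here is a small from-scratch TF-IDF sketch in plain Python. It uses one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); library implementations such as scikit-learn's `TfidfVectorizer` use slightly different smoothing and normalisation, so their numbers will differ. The two descriptions are hypothetical and not taken from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical stand-ins for values of the Description column.
descriptions = [
    "A detective hunts a killer in the city",
    "A couple falls in love in the city",
]
weights = tfidf(descriptions)
print(weights[0]["city"], weights[0]["detective"] > 0)  # 0.0 True
```

Words that occur in every description, such as "city" here, receive a weight of zero, while words specific to one description receive a positive weight. That is why TF-IDF features can help a classifier tell genres apart from plot summaries.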

Coverage

This synthetic dataset covers fictional movies with release years from 1980 to 2023. Production countries include South Korea and Japan, among others. The content is entirely synthetic: it does not incorporate or reproduce any real-world copyrighted material. Intended audience is indicated only indirectly, through content ratings such as PG and NC-17.

License

CC BY 4.0

Who Can Use It

This dataset is primarily intended for beginners in machine learning, as well as anyone interested in data science, natural language processing, or film data analysis. Ideal users include:
  • Machine learning students and practitioners learning classification tasks.
  • Researchers and developers building recommendation systems or content management solutions.
  • Data analysts performing exploratory data analysis on movie characteristics.
  • Those practising text preprocessing, feature engineering, and model evaluation.
  • Individuals exploring multiclass classification and NLP techniques on mixed data types.

Dataset Name Suggestions

  • Movie Genre Classification Dataset
  • Fictional Film Genre Data
  • ML Movie Genre Predictor
  • Synthetic Movie Data for Classification

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 08/09/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
