Movie Genre Classification Dataset
About
This dataset contains 50,000 fictional movie records designed for educational and experimentation purposes in machine learning. It aims to help beginners practice classification tasks by predicting a movie's genre from both textual descriptions and metadata such as rating, duration, and country. The dataset simulates real-world scenarios, which makes it useful for developing recommendation systems and content organisation tools, and for exploring supervised learning problems in natural language processing and tabular feature modelling. It is lightweight, clean, and ready for a range of learning workflows.
Columns
The dataset contains 17 columns:
- Title: The fictional name of the movie.
- Year: The year the movie was released, ranging from 1980 to 2023.
- Director: The name of the movie's director.
- Duration: The length of the movie in minutes, typically between 80 and 180 minutes.
- Rating: An IMDb-style rating, ranging from 4.0 to 9.9.
- Votes: The estimated number of audience votes, from 516 to 500,000.
- Description: A short plot summary of the movie, providing textual context.
- Language: The primary language of the movie.
- Country: The country where the movie was produced.
- Budget_USD: The estimated production budget in US dollars, from 1.14 million to 198 million.
- BoxOffice_USD: The estimated box office revenue in US dollars, from 3.29 million to 993 million.
- Genre: The target variable, representing one of seven predefined genres: Action, Comedy, Drama, Romance, Thriller, Horror, and Fantasy.
- Production_Company: The name of the studio or distributor.
- Content_Rating: The movie's content rating (e.g., PG, NC-17).
- Lead_Actor: The name of the main actor or actress.
- Num_Awards: The number of awards the movie has won, ranging from 0 to 20.
- Critic_Reviews: The number of critic reviews received, from 0 to 1000.
The 'Genre' column serves as the target variable, with all other fields acting as potential inputs for prediction.
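The column descriptions above can be turned into a quick sanity check before modelling. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses the column names exactly as listed, and it checks only the ranges stated without qualification (Duration is described as "typically" 80 to 180 minutes, so it is left out). The function name validate_rows and the sample row are hypothetical, for illustration only.

```python
import csv
import io

# Column names as listed above; exact header spelling in the file is an assumption.
EXPECTED_COLUMNS = [
    "Title", "Year", "Director", "Duration", "Rating", "Votes",
    "Description", "Language", "Country", "Budget_USD", "BoxOffice_USD",
    "Genre", "Production_Company", "Content_Rating", "Lead_Actor",
    "Num_Awards", "Critic_Reviews",
]

# The seven genres documented for the target column.
GENRES = {"Action", "Comedy", "Drama", "Romance", "Thriller", "Horror", "Fantasy"}

# Documented (min, max) ranges for columns with firm bounds.
RANGES = {
    "Year": (1980, 2023),
    "Rating": (4.0, 9.9),
    "Num_Awards": (0, 20),
    "Critic_Reviews": (0, 1000),
}


def validate_rows(handle):
    """Return a list of human-readable problems found in a CSV file object."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Genre"] not in GENRES:
            problems.append(f"line {line_no}: unexpected genre {row['Genre']!r}")
        for column, (low, high) in RANGES.items():
            try:
                value = float(row[column])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {column} is not numeric")
                continue
            if not low <= value <= high:
                problems.append(
                    f"line {line_no}: {column}={value} outside [{low}, {high}]"
                )
    return problems


# Hypothetical in-memory sample with one made-up record.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "Test Movie,1999,A. Director,120,7.5,1000,A plot.,English,Japan,"
    "5000000,20000000,Drama,Studio X,PG,An Actor,3,150\n"
)
print(validate_rows(sample))  # prints []

# With the real file, something like this should work:
# with open("movie_genre_classification_final.csv", encoding="utf-8", newline="") as f:
#     print(validate_rows(f))
```

An empty list means every checked row matched the documented genres and ranges; any other result lists the offending lines.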
Distribution
The dataset consists of 50,000 fictional movie records, organised into 17 columns. It is provided as a UTF-8 encoded CSV file named movie_genre_classification_final.csv, with a size of 8.56 MB. It includes both structured metadata and textual summaries, making it suitable for a variety of analytical and machine learning tasks. All 50,000 records are valid across all columns, with no missing or mismatched values.
Usage
This dataset is ideal for a wide range of applications and use cases, including:
- Practicing movie genre classification.
- Building recommendation systems and content organisation tools.
- Engaging in supervised learning problems, particularly in natural language processing (NLP) and tabular feature modelling.
- Conducting exploratory data analysis (EDA).
- Implementing text preprocessing and vectorization techniques.
- Developing multiclass classification models using various machine learning algorithms.
- Applying NLP techniques such as TF-IDF, word embeddings, or transformers.
- Creating models that combine both structured and unstructured data effectively.
- Learning workflows related to data cleaning, feature engineering, model evaluation and validation, and deploying classification pipelines.
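Several of the use cases above (TF-IDF, multiclass classification, combining structured and unstructured data) can be sketched in one scikit-learn pipeline. This is one possible baseline and not a recommended or official approach. It assumes pandas and scikit-learn are installed; the choice of feature columns, the logistic regression model, and the helper name build_genre_pipeline are illustrative assumptions. The toy rows are made up so the snippet runs without the real file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative feature selection; other columns could be added the same way.
TEXT_COLUMN = "Description"
NUMERIC_COLUMNS = ["Year", "Duration", "Rating", "Votes"]
CATEGORICAL_COLUMNS = ["Language", "Country", "Content_Rating"]


def build_genre_pipeline():
    """Combine TF-IDF text features with scaled and one-hot tabular features."""
    preprocess = ColumnTransformer([
        # A single column name (not a list) hands the vectorizer a 1-D series of text.
        ("text", TfidfVectorizer(stop_words="english"), TEXT_COLUMN),
        ("num", StandardScaler(), NUMERIC_COLUMNS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLUMNS),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


# Hypothetical toy records, only to show that the pipeline fits and predicts.
toy = pd.DataFrame({
    "Description": [
        "A detective hunts a killer through the city",
        "Two friends stumble into hilarious trouble",
        "A detective follows a killer across the border",
        "Friends plan a hilarious wedding prank",
    ],
    "Year": [1995, 2004, 2011, 2019],
    "Duration": [120, 95, 130, 100],
    "Rating": [7.8, 6.5, 8.1, 6.9],
    "Votes": [12000, 8000, 30000, 5000],
    "Language": ["Korean", "Japanese", "Korean", "Japanese"],
    "Country": ["South Korea", "Japan", "South Korea", "Japan"],
    "Content_Rating": ["NC-17", "PG", "NC-17", "PG"],
    "Genre": ["Thriller", "Comedy", "Thriller", "Comedy"],
})

features = toy.drop(columns="Genre")
model = build_genre_pipeline()
model.fit(features, toy["Genre"])
predictions = model.predict(features)
print(list(predictions))

# With the real file, a stratified hold-out split is a reasonable starting point:
# from sklearn.model_selection import train_test_split
# df = pd.read_csv("movie_genre_classification_final.csv")
# X_train, X_test, y_train, y_test = train_test_split(
#     df.drop(columns="Genre"), df["Genre"],
#     test_size=0.2, stratify=df["Genre"], random_state=0,
# )
# print(build_genre_pipeline().fit(X_train, y_train).score(X_test, y_test))
```

Keeping preprocessing inside the pipeline means the TF-IDF vocabulary and scaling statistics are learned from the training split only, which avoids leaking information from held-out data during evaluation.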
Coverage
This synthetic dataset covers fictional movies with release years from 1980 to 2023. Production countries include South Korea and Japan, among others. The content is entirely synthetic, so it does not incorporate or reproduce any real-world copyrighted material. There are no demographic fields; the intended audience is indicated only indirectly through content ratings such as PG and NC-17.
License
CC BY 4.0
Who Can Use It
This dataset is primarily intended for beginners in machine learning, as well as anyone interested in data science, natural language processing, or film data analysis. Ideal users include:
- Machine learning students and practitioners learning classification tasks.
- Researchers and developers building recommendation systems or content management solutions.
- Data analysts performing exploratory data analysis on movie characteristics.
- Those practicing text preprocessing, feature engineering, and model evaluation.
- Individuals exploring multiclass classification and NLP techniques on mixed data types.
Dataset Name Suggestions
- Movie Genre Classification Dataset
- Fictional Film Genre Data
- ML Movie Genre Predictor
- Synthetic Movie Data for Classification
Attributes
Original Data Source: Movie Genre Classification Dataset