Global Movie Popularity Dataset
Entertainment & Media Consumption
Free
About
This dataset contains details for 10,000 top-rated movies from TMDB, updated as of 26th July 2022. Its primary purpose is to support text preprocessing and cleansing for Natural Language Processing (NLP) tasks on movie data. It is also suitable for developing content-based and collaborative filtering recommendation engines, and its popularity, genre, and vote columns give context for studying movie popularity and audience reception.
Columns
- id: The unique identification number for the movie on the website.
- title: The name of the movie.
- genre: The categorisation of the movie, such as crime, adventure, or drama.
- original_language: The initial language in which the movie was released.
- overview: A brief summary or synopsis of the movie.
- popularity: A metric indicating the movie's popularity.
- release_date: The date when the movie was first released.
- vote_average: The average rating given to the movie by voters.
- vote_count: The total number of votes received by the movie.
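As a rough illustration of how the columns above might be read into typed records, here is a minimal sketch using only the Python standard library. The sample rows are synthetic placeholders, not records from the dataset. The comma-separated genre format and the ISO `YYYY-MM-DD` date format are assumptions, and the `Movie` class and `parse_movies` helper are hypothetical names introduced for this example.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

# Synthetic placeholder rows, NOT taken from the real dataset.
SAMPLE_CSV = """id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
1,Example Movie A,"Crime,Drama",en,A detective reopens an old case.,35.2,1999-03-12,8.1,12000
2,Example Movie B,Adventure,fr,Two friends cross a desert.,12.7,2015-07-01,7.4,830
"""


@dataclass
class Movie:
    id: int
    title: str
    genre: list          # list of genre names
    original_language: str
    overview: str
    popularity: float
    release_date: date
    vote_average: float
    vote_count: int


def parse_movies(handle):
    """Read CSV rows from a file-like object and convert each column to a typed field."""
    movies = []
    for row in csv.DictReader(handle):
        movies.append(
            Movie(
                id=int(row["id"]),
                title=row["title"],
                # Assumption: multiple genres are stored as one comma-separated string.
                genre=[g.strip() for g in row["genre"].split(",") if g.strip()],
                original_language=row["original_language"],
                overview=row["overview"],
                popularity=float(row["popularity"]),
                # Assumption: dates are stored in ISO format (YYYY-MM-DD).
                release_date=date.fromisoformat(row["release_date"]),
                vote_average=float(row["vote_average"]),
                vote_count=int(row["vote_count"]),
            )
        )
    return movies


movies = parse_movies(io.StringIO(SAMPLE_CSV))
```

To read the actual file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) in place of the `io.StringIO` wrapper. If the real file stores genres or dates differently, adjust the two lines marked as assumptions.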
Distribution
This dataset comprises approximately 10,000 records, typically provided in a CSV file format. Specific row counts for a sample file are updated separately. The dataset includes unique values for movie IDs, with original_language predominantly being English (around 78%) and French (7%). Movie genres include Comedy (7%) and Drama (6%), with a wide array of other genres. Release dates span a broad period from 1902 to 2022, with the majority of entries from 1998 onwards. Popularity scores range from 0.6 to over 10,000, and vote averages are generally between 4.6 and 8.7, with vote counts reaching up to 31,900.
Usage
This dataset is ideal for:
- Performing extensive text preprocessing and cleansing for NLP applications on movie descriptions and titles.
- Building various movie recommendation systems, including content-based recommenders and collaborative filtering engines.
- Analysing trends in movie popularity, audience ratings, and language distribution.
- Developing data science projects focused on entertainment and media consumption.
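The first two uses above, text cleansing and content-based recommendation, can be sketched together in a few lines of standard-library Python. This is one possible approach under stated assumptions, not a prescribed method: the overviews are synthetic placeholders, the stopword list is a tiny illustrative set, and `clean`, `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical helper names. The sketch weights overview tokens with a smoothed TF-IDF and ranks other movies by cosine similarity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to", "is", "on", "with"}

# Synthetic placeholder overviews, NOT taken from the real dataset.
OVERVIEWS = [
    "A detective hunts a killer in the city.",
    "A retired detective tracks a killer across the city at night.",
    "Two friends sail across the ocean.",
]


def clean(text):
    """Lowercase, keep alphanumeric tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (smoothed TF-IDF)."""
    tokenised = [clean(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        term_freq = Counter(toks)
        vectors.append(
            {
                term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                for term, count in term_freq.items()
            }
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, docs, top_n=2):
    """Return indices of the top_n documents most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return [i for _, i in sorted(scores, reverse=True)[:top_n]]


recommended = most_similar(0, OVERVIEWS, top_n=1)
```

In practice the `overview` column would replace `OVERVIEWS`, and the returned indices would be mapped back to `title` values. For the full 10,000 rows, building the vectors once and reusing them across queries would avoid repeated work.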
Coverage
The dataset's geographic scope is global. It covers movies released between 17th April 1902 and 13th July 2022, with data collected up to 26th July 2022. No specific demographic notes are available; the dataset broadly covers top-rated films from the TMDB database.
License
CC0
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers working on recommendation systems or NLP projects.
- Researchers studying film industry trends, audience engagement, or language processing.
- Developers looking to integrate movie data into applications.
- Anyone interested in exploratory data analysis within the entertainment sector.
Dataset Name Suggestions
- TMDB Top Movies Dataset
- Movie Data for NLP & Recommendations
- Global Movie Popularity Dataset
- Film Data Hub
Attributes
Original Data Source: TMDB Movies Dataset