Wikipedia Movie Plot Collection
Entertainment & Media Consumption
Free
About
This dataset contains movie plots extracted from Wikipedia, along with other key metadata. It covers movies released between 1950 and 2023 that have more than 1,000 ratings on IMDb. Its primary purpose is to support the development of Large Language Model (LLM) applications such as movie search and recommendation systems. The plot summaries have been cleaned to remove links, references, and other non-plot elements, leaving plain text. Where a Wikipedia plot was unavailable, the IMDb synopsis was used as a fallback. Detailed plot information is available for 89% of the movies, and every movie includes a short summary taken unaltered from Wikipedia, which is useful for matching metadata in retriever applications. The 'stars', 'directors', and 'genres' columns are provided as lists of values, making them suitable for direct loading into vector databases.
Columns
- title: The title of the film, presented in lowercase.
- stars: The names of the actors featured in the film, also in lowercase.
- directors: The names of the film's directors, in lowercase.
- year: The year when the movie was released.
- genre: The genres associated with the film, listed in lowercase.
- runtime: The duration of the film, measured in minutes.
- ratingCount: An indication of the film's popularity, showing the number of people who have rated it on IMDb.
- plot: Detailed storyline of the film.
- summary: A short overview and additional details about the film.
- imdb_rating: The film's rating on IMDb, on a scale of 1 to 10.
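As a rough sketch of how these columns might be read, the snippet below parses one made-up CSV row with Python's standard library. It assumes the list-valued columns are serialized as Python-style list literals (for example `['drama', 'mystery']`), that the genre column header is `genre` as in the list above, and that the runtime of -1 reported under Distribution marks a missing value. None of these details is confirmed by this description, so check the actual file first. The sample row and the `parse_row` and `load_movies` helpers are invented for illustration.

```python
import ast
import csv
import io

# Hypothetical sample row; the real file's headers and list encoding may differ.
SAMPLE_CSV = '''title,stars,directors,year,genre,runtime,ratingCount,plot,summary,imdb_rating
example film,"['jane doe', 'john roe']","['alex poe']",1999,"['drama', 'mystery']",104,1500,"A detective returns home.","example film is a 1999 drama.",7.1
'''

LIST_COLUMNS = ("stars", "directors", "genre")  # assumed to hold list literals


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed Python values."""
    parsed = dict(row)
    for col in LIST_COLUMNS:
        raw = row.get(col) or "[]"
        # literal_eval parses literals only; it does not execute arbitrary code
        parsed[col] = list(ast.literal_eval(raw))
    parsed["year"] = int(row["year"])
    runtime = int(row["runtime"])
    # Assumption: a negative runtime (such as -1) stands for "unknown"
    parsed["runtime"] = runtime if runtime >= 0 else None
    parsed["ratingCount"] = int(row["ratingCount"])
    parsed["imdb_rating"] = float(row["imdb_rating"])
    return parsed


def load_movies(text):
    """Parse CSV text into a list of typed movie records."""
    return [parse_row(r) for r in csv.DictReader(io.StringIO(text))]


movies = load_movies(SAMPLE_CSV)
print(movies[0]["stars"])  # ['jane doe', 'john roe']
```

For a real file, the same `parse_row` function could be applied to rows from `csv.DictReader(open(path, newline="", encoding="utf-8"))`, where `path` is wherever the CSV was saved.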
Distribution
The data file is typically in CSV format. The dataset spans movies released from 1950 up to 2023. There are 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names. The genres column contains 21,675 unique values. Movie runtimes range from -1 to 776 minutes, with a significant majority (17,433 entries) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) varies widely, starting from 1,001 and going up to 2.73 million. IMDb ratings range from 1.2 to 9.3. While specific total row/record counts are not available, the distribution data for year, runtime, ratingCount, and imdb_rating show various value counts within different ranges.
Usage
This dataset is ideal for:
- Developing demonstration projects leveraging Large Language Models (LLMs).
- Creating movie search applications, such as cinemattr.ca.
- Building retriever applications where the 'summary' column can be used for metadata matching.
- Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.
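The retriever and vector-database uses can be sketched without committing to any particular database. The example below splits each record into searchable text plus filterable metadata, filters on a list-valued field, and then ranks by simple token overlap with the 'summary' text. The token-overlap score is only a stand-in for embedding similarity. The records, the `to_document` and `search` helpers, and the `genre` key are hypothetical.

```python
import re

# Invented records for illustration only; these are not rows from the dataset.
RECORDS = [
    {"title": "example heist", "stars": ["jane doe"], "directors": ["alex poe"],
     "genre": ["crime", "thriller"],
     "summary": "a crew plans a bank robbery in paris"},
    {"title": "example romance", "stars": ["john roe"], "directors": ["sam lee"],
     "genre": ["romance"],
     "summary": "two strangers meet on a train to paris"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def to_document(record):
    """Split a record into searchable text and filterable metadata."""
    return {
        "text": record["summary"],
        "metadata": {k: record[k] for k in ("title", "stars", "directors", "genre")},
    }


def search(documents, query, genre=None, top_k=3):
    """Filter by genre metadata, then rank by token overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for doc in documents:
        if genre is not None and genre not in doc["metadata"]["genre"]:
            continue
        score = len(query_tokens & tokenize(doc["text"]))
        if score > 0:
            scored.append((score, doc["metadata"]["title"]))
    # Highest overlap first; sort is stable, so ties keep input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


documents = [to_document(r) for r in RECORDS]
print(search(documents, "bank robbery in paris"))
print(search(documents, "paris", genre="romance"))
```

In a real application the `text` field would be embedded and stored alongside `metadata`, so that the metadata filter narrows the candidates before similarity ranking.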
Coverage
The dataset's geographic scope is global. It includes movies released from 1950 to 2023 that have more than 1,000 ratings on IMDb. Detailed plot information is available for 89% of the movies, and all movies (100%) include a short summary.
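These coverage figures suggest a simple pattern for consumers: use `plot` when it is present and fall back to `summary` otherwise. The sketch below does that and measures plot coverage on a few invented rows. It assumes a missing plot appears as an empty string or `None`; the real file may encode missing values differently. `has_plot`, `best_text`, and `plot_coverage` are hypothetical helper names.

```python
# Invented rows for illustration; only the text columns are shown.
ROWS = [
    {"title": "film a", "plot": "A long, detailed storyline.", "summary": "film a is a drama."},
    {"title": "film b", "plot": "", "summary": "film b is a comedy."},
    {"title": "film c", "plot": None, "summary": "film c is a western."},
    {"title": "film d", "plot": "Another detailed storyline.", "summary": "film d is a thriller."},
]


def has_plot(row):
    """Return True when the row carries a non-blank detailed plot."""
    plot = row.get("plot")
    return bool(plot and plot.strip())


def best_text(row):
    """Prefer the detailed plot; fall back to the short summary."""
    return row["plot"] if has_plot(row) else row["summary"]


def plot_coverage(rows):
    """Fraction of rows with a detailed plot (0.0 for an empty input)."""
    if not rows:
        return 0.0
    return sum(has_plot(r) for r in rows) / len(rows)


print(plot_coverage(ROWS))
print(best_text(ROWS[1]))
```

Running the same check on the full file is a quick way to confirm whether the stated 89% figure matches the copy you downloaded.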
License
CC0
Who Can Use It
This dataset is suitable for:
- AI and machine learning developers who are building models based on natural language processing.
- Data scientists and researchers interested in film data and entertainment analytics.
- Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines.
- Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.
Dataset Name Suggestions
- IMDb Verified Movie Plots
- Historical Film Summaries (1950-2023)
- Wikipedia Movie Plot Collection
- LLM-Ready Movie Dataset
- Global Cinema Plot Archive
Attributes
Original Data Source: Movie Plots from Wikipedia