Opendatabay APP

Wikipedia Movie Plot Collection

Entertainment & Media Consumption

Tags and Keywords

  • Movies and TV Shows
  • NLP
  • Recommender Systems

Free

About

This dataset contains movie plots extracted from Wikipedia, along with key metadata. It covers movies released between 1950 and 2023 that have more than 1,000 ratings on IMDb. Its primary purpose is to support development with Large Language Models (LLMs), for applications such as movie search or recommendation systems. The plot text has been cleaned of links, references, and other irrelevant elements, leaving plain text. Where a Wikipedia plot was unavailable, the IMDb synopsis was used as a fallback. Detailed plot information is available for 89% of movies, and 100% include a short summary taken unmodified from Wikipedia, which is useful for matching metadata in retriever applications. The 'stars', 'directors', and 'genres' columns are provided as lists of values, making them suitable for direct loading into vector databases.

Columns

  • title: The title of the film, presented in lowercase.
  • stars: The names of the actors featured in the film, also in lowercase.
  • directors: The names of the film's directors, in lowercase.
  • year: The year when the movie was released.
  • genre: The genres associated with the film, listed in lowercase.
  • runtime: The duration of the film, measured in minutes.
  • ratingCount: An indication of the film's popularity, showing the number of people who have rated it on IMDb.
  • plot: Detailed storyline of the film.
  • summary: A short overview and additional details about the film.
  • imdb_rating: The film's rating on IMDb, on a scale of 1 to 10.
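
As a minimal sketch of reading these columns, the snippet below parses a CSV row and converts the list-valued columns into Python lists. It assumes the lists are serialized as Python-style literals (for example `['a', 'b']`), which the listing does not confirm, and it uses the column name `genre` as given in the list above. The sample row and the `parse_list_cell` and `parse_row` helpers are hypothetical.

```python
import ast
import csv
import io

# Columns the listing describes as lists of values.
LIST_COLUMNS = ("stars", "directors", "genre")

# Hypothetical sample in the assumed on-disk format; not real dataset rows.
SAMPLE_CSV = '''title,stars,directors,year,genre,runtime,ratingCount,plot,summary,imdb_rating
example film,"['actor one', 'actor two']","['director one']",1999,"['drama', 'comedy']",101,1500,A long plot.,A short summary.,7.1
'''


def parse_list_cell(cell):
    """Turn a serialized list into a list of strings; tolerate plain text."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Assumption: anything that is not a list literal is comma-separated text.
        return [part.strip() for part in cell.split(",") if part.strip()]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def parse_row(row):
    """Convert one csv.DictReader row into typed Python values."""
    parsed = dict(row)
    for column in LIST_COLUMNS:
        parsed[column] = parse_list_cell(row.get(column, ""))
    parsed["year"] = int(row["year"])
    parsed["runtime"] = int(row["runtime"])
    parsed["ratingCount"] = int(row["ratingCount"])
    parsed["imdb_rating"] = float(row["imdb_rating"])
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["stars"])
```

If the real file stores lists in another form (for example, pipe-separated text), only `parse_list_cell` would need to change.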

Distribution

The data file is typically in CSV format and covers movies released from 1950 to 2023. It contains 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names; the genres column contains 21,675 unique values. Runtimes range from -1 to 776 minutes, with most entries (17,433) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) ranges from 1,001 to 2.73 million, and IMDb ratings range from 1.2 to 9.3. The total row count is not stated; only per-range value counts for year, runtime, ratingCount, and imdb_rating are available.
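
The minimum runtime of -1 suggests a placeholder for missing values rather than a real duration; that reading is an assumption, not something the listing states. Under that assumption, a sketch of normalizing the value and checking a record against the documented ranges could look like this (`clean_runtime` and `range_problems` are hypothetical helpers):

```python
def clean_runtime(runtime):
    """Return the runtime in minutes, or None for non-positive values.

    Assumption: -1 (and any other non-positive value) marks a missing runtime.
    """
    return runtime if runtime > 0 else None


def range_problems(record):
    """List the fields of a record that fall outside the documented ranges."""
    problems = []
    if not 1950 <= record["year"] <= 2023:
        problems.append("year")
    if record["ratingCount"] <= 1000:
        problems.append("ratingCount")
    if not 1.0 <= record["imdb_rating"] <= 10.0:
        problems.append("imdb_rating")
    if clean_runtime(record["runtime"]) is None:
        problems.append("runtime")
    return problems


# Hypothetical record, not taken from the dataset.
record = {"year": 1987, "ratingCount": 1001, "imdb_rating": 6.4, "runtime": -1}
print(range_problems(record))  # ['runtime']
```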

Usage

This dataset is ideal for:
  • Developing demonstration projects leveraging Large Language Models (LLMs).
  • Creating movie search applications, such as cinemattr.ca.
  • Building retriever applications where the 'summary' column can be used for metadata matching.
  • Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.
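
To illustrate the retriever idea, here is a small sketch that ranks movies by how closely their 'summary' matches a query, with an optional genre filter. It uses plain bag-of-words cosine similarity as a stand-in for the embeddings a vector database would store, so it is not how any particular application works. The `MOVIES` records and the `tokenize`, `cosine`, and `search` names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, movies, genre=None, top_k=3):
    """Rank movies by summary similarity, optionally filtered by genre."""
    query_vec = Counter(tokenize(query))
    scored = []
    for movie in movies:
        if genre is not None and genre not in movie["genre"]:
            continue  # metadata filter on the list-valued genre column
        score = cosine(query_vec, Counter(tokenize(movie["summary"])))
        if score > 0:
            scored.append((score, movie["title"]))
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Hypothetical records shaped like the dataset's columns; not real rows.
MOVIES = [
    {"title": "film a", "genre": ["sci-fi"], "summary": "A crew travels to a distant planet."},
    {"title": "film b", "genre": ["drama"], "summary": "A family reunites in a small town."},
    {"title": "film c", "genre": ["sci-fi", "drama"], "summary": "A family flees a dying planet."},
]

print(search("family on a planet", MOVIES))
print(search("family on a planet", MOVIES, genre="drama"))
```

Because the list columns are kept as lists, the genre filter is a simple membership test. A real system would likely replace `cosine` over token counts with similarity over learned embeddings.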

Coverage

The dataset's geographic scope is global, and it covers movies released from 1950 to 2023 that have more than 1,000 ratings on IMDb. Detailed plot information is available for 89% of the movies, and all of them (100%) include a short summary.
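
Since the plot is missing for roughly one movie in ten while the summary is always present, code that builds text for indexing may want a fallback. The sketch below assumes a missing plot appears as an empty string or `None`, which the listing does not specify; `retrieval_text` and `plot_coverage` are hypothetical helpers.

```python
def retrieval_text(record):
    """Prefer the detailed plot; fall back to the short summary."""
    plot = (record.get("plot") or "").strip()
    return plot if plot else record["summary"].strip()


def plot_coverage(records):
    """Fraction of records with a non-empty plot."""
    if not records:
        return 0.0
    with_plot = sum(1 for r in records if (r.get("plot") or "").strip())
    return with_plot / len(records)


# Hypothetical records; not real dataset rows.
records = [
    {"title": "film a", "plot": "A detailed storyline.", "summary": "Short overview."},
    {"title": "film b", "plot": "", "summary": "Only a summary."},
]
print([retrieval_text(r) for r in records])
print(plot_coverage(records))  # 0.5
```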

License

CC0

Who Can Use It

This dataset is suitable for:
  • AI and machine learning developers who are building models based on natural language processing.
  • Data scientists and researchers interested in film data and entertainment analytics.
  • Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines.
  • Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.

Dataset Name Suggestions

  • IMDb Verified Movie Plots
  • Historical Film Summaries (1950-2023)
  • Wikipedia Movie Plot Collection
  • LLM-Ready Movie Dataset
  • Global Cinema Plot Archive

Attributes

Original Data Source: Movie Plots from Wikipedia

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0