
Wikipedia Plot Summary Stopword Registry

Website Analytics & User Experience

Tags and Keywords

  • Names
  • NLP
  • Stopwords
  • Movies
  • Wikipedia


Free

About

Identifying and filtering proper names from text is a vital step in refining natural language processing outcomes. Extracted from a vast archive of 35,000 Wikipedia movie plot summaries, these records provide a collection of over 54,000 unique personal names and named entities. By isolating these terms, researchers can improve topic modelling accuracy, ensuring that insights focus on the thematic content of a narrative rather than being obscured by specific character or location identifiers.

Columns

  • name: The specific identifier for a person (first name, last name, or both) or a named entity found within the source material.

Distribution

The information is delivered as a single CSV file titled Names-from-35k-WikipediaMoviePlots-Abbrivia.com-CC-BY-SA-4.0.csv, with a file size of approximately 433.11 kB. It contains 54,215 unique records within a single column. The data maintains 100% validity with no mismatched or missing entries and is provided as a static resource with no expected updates.
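As a rough illustration of how the stated counts could be checked after download, the Python sketch below reads the single column and reports row, unique, and empty counts. It assumes the file has a header row named name and is UTF-8 encoded; the listing confirms neither, and load_names and summarise are hypothetical helper names.

```python
import csv
from collections.abc import Iterable


def load_names(lines: Iterable[str], column: str = "name") -> list[str]:
    """Read one column from CSV lines (ASSUMPTION: header row is `name`)."""
    reader = csv.DictReader(lines)
    # Skip rows where the column is absent; strip stray whitespace.
    return [row[column].strip() for row in reader if row.get(column) is not None]


def summarise(names: list[str]) -> dict[str, int]:
    """Count total rows, distinct values, and empty strings."""
    return {
        "rows": len(names),
        "unique": len(set(names)),
        "empty": sum(1 for n in names if not n),
    }


def check_file(path: str) -> dict[str, int]:
    """Open the downloaded CSV (ASSUMPTION: UTF-8) and summarise it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarise(load_names(handle))
```

If the description above holds, calling check_file on the downloaded CSV should report 54,215 rows, the same number of unique values, and no empty entries.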

Usage

This resource is suited for use as a custom stopword list in natural language processing pipelines, where it filters out noise caused by proper names. Removing those names improves the quality of topic models and allows more generalised insights into textual content. Developers can also use the list to enhance named entity recognition systems or as a reference for cleaning scraped web data before performing sentiment analysis.
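The stopword use case can be sketched in a few lines of Python. This is a minimal illustration, not a method prescribed by the dataset author: build_stopwords and remove_names are hypothetical helpers, the tokeniser is a simple letters-only regular expression, and multi-word names are split into individual words so that "Rick Blaine" also removes "rick" and "blaine" on their own.

```python
import re
from collections.abc import Iterable

# Letters only (Unicode-aware): digits, underscores, and punctuation split tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def build_stopwords(names: Iterable[str]) -> set[str]:
    """Lower-case every name and add each of its words to the stopword set."""
    stopwords: set[str] = set()
    for name in names:
        stopwords.update(name.lower().split())
    return stopwords


def remove_names(text: str, stopwords: set[str]) -> list[str]:
    """Tokenise the text in lower case and drop tokens found in the stopword set."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in stopwords]


# Toy example; in practice the names would come from the CSV's `name` column.
stop = build_stopwords(["Rick Blaine", "Ilsa"])
print(remove_names("Rick Blaine meets Ilsa in a cafe", stop))
```

One caveat with this approach: any name that doubles as an ordinary word (for example "Will" or "Rose", if present in the list) would be filtered from the text as well, so it may be worth reviewing the list against a general vocabulary before applying it.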

Coverage

The scope is based on 35,000 diverse movie plot summaries sourced from Wikipedia. It covers a broad range of Western and international names, spanning various eras of cinematic history. The entries include first names, surnames, and other named entities commonly found in digital storytelling and media summaries.

License

CC BY-SA 4.0

Who Can Use It

Natural Language Processing (NLP) engineers can use these records to refine their preprocessing steps. Data scientists working on topic modelling can use the list to remove dominant names from their clusters and reveal deeper thematic insights. Linguistic researchers and developers can use the data to train or benchmark named entity recognition tools against cinema-specific vocabulary.
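For the benchmarking use case, one simple check is how many entity strings produced by a recognition tool also appear in the list. The sketch below is only an assumption about how such a comparison might be set up: registry_coverage is a hypothetical helper, the predicted strings stand in for the output of whatever tool is being evaluated, and matching is exact apart from letter case.

```python
from collections.abc import Iterable


def registry_coverage(
    predicted: Iterable[str], registry: Iterable[str]
) -> tuple[float, list[str]]:
    """Return the share of distinct predicted entities found in the registry,
    together with the sorted list of predictions that were not found."""
    known = {name.strip().casefold() for name in registry}
    # De-duplicate predictions and ignore blank strings.
    unique = sorted({p.strip() for p in predicted if p.strip()})
    misses = [p for p in unique if p.casefold() not in known]
    share = (len(unique) - len(misses)) / len(unique) if unique else 0.0
    return share, misses


# Toy data; real input would be a tool's output and the CSV's `name` column.
share, misses = registry_coverage(
    ["Rick Blaine", "Ilsa", "Casablanca", "Victor"],
    ["Rick Blaine", "Ilsa", "Victor"],
)
print(share, misses)
```

A low share does not by itself mean the tool is wrong, since the list is drawn only from movie plot summaries; the misses are mainly useful as a starting point for manual inspection.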

Dataset Name Suggestions

  • Wikipedia Movie Plot Names: NLP Stopword List
  • Cinematic Named Entity Archive for Topic Modelling
  • 54k Person and Entity Names for Text Refinement
  • Wikipedia Plot Summary Stopword Registry
  • Named Entity Recognition Reference for Movie Plots

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 28/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
