Wikipedia Plot Summary Stopword Registry
Free
About
Identifying and filtering proper names from text is a vital step in refining natural language processing outcomes. Extracted from a vast archive of 35,000 Wikipedia movie plot summaries, these records provide a collection of over 54,000 unique personal names and named entities. By isolating these terms, researchers can improve topic modelling accuracy, ensuring that insights focus on the thematic content of a narrative rather than being obscured by specific character or location identifiers.
Columns
- name: The specific identifier for a person (first name, last name, or both) or a named entity found within the source material.
Distribution
The information is delivered as a single CSV file titled Names-from-35k-WikipediaMoviePlots-Abbrivia.com-CC-BY-SA-4.0.csv, with a file size of approximately 433.11 kB. It contains 54,215 unique records within a single column. The data maintains 100% validity with no mismatched or missing entries and is provided as a static resource with no expected updates.
Usage
This resource is ideal for use as a custom stopword list in natural language processing pipelines to filter out noise caused by proper names. It is well-suited for improving the quality of topic maps, allowing for more generalised insights into textual content. Additionally, developers can use the list to enhance named entity recognition systems or as a reference for cleaning scraped web data before performing sentiment analysis.
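As a rough illustration of the stopword use case, the sketch below loads the name column into a set and drops matching tokens from a piece of text. It rests on a few assumptions that are not confirmed by the listing: that the CSV has a header row called name, that matching should be case-insensitive, and that multi-word entries should be split into their individual words. The helper names load_names and strip_names are hypothetical, and an in-memory sample stands in for the real file.

```python
import csv
import io
import re


def load_names(csv_file):
    """Read the single `name` column into a lowercase set of tokens.

    Assumption: the file has a header row named `name`. Multi-word
    entries (e.g. first name plus surname) are split so that each
    word can be matched against single tokens.
    """
    names = set()
    for row in csv.DictReader(csv_file):
        entry = (row.get("name") or "").strip().lower()
        names.update(entry.split())
    return names


def strip_names(text, names):
    """Return the tokens of `text` that are not in the name set."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in names]


# In-memory stand-in for the real CSV; with the actual file, one would
# pass an open file handle (newline="", encoding="utf-8") instead.
sample_csv = io.StringIO("name\nRick Blaine\nIlsa\nCasablanca\n")
names = load_names(sample_csv)

filtered = strip_names("Rick meets Ilsa again in Casablanca", names)
print(filtered)  # ['meets', 'again', 'in']
```

Case-insensitive matching is a trade-off: it catches names at the start of lowercased text, but it also removes ordinary words that happen to double as names, so a case-sensitive variant may suit some pipelines better.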
Coverage
The scope is based on 35,000 diverse movie plot summaries sourced from Wikipedia. It covers a broad range of Western and international names, spanning various eras of cinematic history. The demographic focus includes first names, surnames, and other named entities commonly found in digital storytelling and media summaries.
License
CC BY-SA 4.0
Who Can Use It
Natural Language Processing (NLP) engineers can use these records to refine their preprocessing steps. Data scientists working on topic modelling can use the list to remove dominant names from their clusters and reveal deeper thematic insights. Linguistic researchers and developers can also use the data to train or benchmark named entity recognition tools against cinema-specific vocabulary.
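One simple way to use the list as a reference for a named entity recognition tool is to measure how many of the tool's predicted entities appear in it. The sketch below is only a sanity check under stated assumptions, not a full benchmark: the function list_coverage is hypothetical, the predicted entities are made-up sample strings, and matching is an exact, case-insensitive comparison against whole entries.

```python
def list_coverage(predicted, names):
    """Fraction of predicted entity strings found in the name list.

    `names` is assumed to be a set of lowercase entries from the CSV.
    """
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if p.strip().lower() in names)
    return hits / len(predicted)


# Illustrative values only; real input would come from the CSV and
# from the output of whichever NER tool is being checked.
reference = {"rick", "ilsa lund", "casablanca"}
predicted = ["Rick", "Ilsa Lund", "Paris", "Casablanca"]

coverage = list_coverage(predicted, reference)
print(coverage)  # 0.75
```

A low score can mean either that the tool is producing spurious entities or that the list simply lacks those names, so the figure is best read alongside a manual look at the misses.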
Dataset Name Suggestions
- Wikipedia Movie Plot Names: NLP Stopword List
- Cinematic Named Entity Archive for Topic Modelling
- 54k Person and Entity Names for Text Refinement
- Wikipedia Plot Summary Stopword Registry
- Named Entity Recognition Reference for Movie Plots
Attributes
Original Data Source: Wikipedia Plot Summary Stopword Registry
