
The Jedi Master Dialogue Collection

Data Science and Analytics

Tags and Keywords

Linguistics

Dialogue

Star

Wars

Yoda

Speech

The Jedi Master Dialogue Collection Dataset on Opendatabay data marketplace

Free

About

This collection of dialogues, known as the Yoda Speech Corpus, is structured for studying the Jedi Master Yoda's distinctive English in its discourse context. The content is drawn from the scripts of the Star Wars movies, Episodes I through VI. It captures not only the lines spoken by Yoda but also the lines of the other characters in the same conversations, so each Yoda utterance can be analysed alongside what prompted it. Researchers should note that the dataset was compiled by automated procedures without subsequent manual correction, so it may contain typos carried over from the original scripts as well as processing errors.

Columns

The dataset contains seven distinct fields:
  • movie: Indicates the Star Wars movie number (Episode I-VI).
  • scene: Represents the scene number as referenced in the scripts. Values range from 48 up to 234.
  • line: The specific dialogue line number, spanning a range from 341 to 1705.
  • character: The name of the character speaking the line. There are 29 unique characters in total; Yoda accounts for approximately 27% of the lines and the narrator for 25%.
  • text: The actual text spoken by the character, featuring 367 unique values.
  • slug: The scene slug, which provides a brief descriptive header for the setting (e.g., 50 INT YODA'S HOUSE).
  • component: Identifies whether the line is attributed to a character (75% of entries) or describes an action (25% of entries).
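As a rough illustration of how these fields might be read and summarised, the sketch below parses a small in-memory CSV with Python's standard library and computes each speaker's share of rows. The header names are assumed to match the field names listed above, and the labels `YODA`, `LUKE`, `NARRATOR`, `character` and `action` are assumptions about how values are encoded; the sample rows are invented placeholders, not records from the dataset.

```python
import csv
import io
from collections import Counter

# Header assumed to match the seven fields listed above.
# The rows are invented placeholders, NOT records from yoda-corpus.csv.
SAMPLE_CSV = """movie,scene,line,character,text,slug,component
5,50,341,YODA,placeholder text a,50 INT YODA'S HOUSE,character
5,50,342,LUKE,placeholder text b,50 INT YODA'S HOUSE,character
5,50,343,NARRATOR,placeholder text c,50 INT YODA'S HOUSE,action
5,50,344,YODA,placeholder text d,50 INT YODA'S HOUSE,character
"""


def speaker_shares(rows):
    """Return each character's share of the rows as a fraction between 0 and 1."""
    counts = Counter(row["character"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = speaker_shares(rows)
```

Run against the real file, the same function would be expected to give figures close to the 27% and 25% shares quoted above, provided the column names and labels match these assumptions.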

Distribution

The data is delivered as a single CSV file, yoda-corpus.csv (54.66 kB). It has seven columns and 371 valid records across all primary fields. The resource is static, with an expected update frequency of "Never."
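A downloaded copy can be checked against these figures before use. The following is a minimal sketch, assuming the file has a header row whose names match the seven fields described in the Columns section; `summarize_csv` is a hypothetical helper, and the one-row sample is a placeholder rather than real data.

```python
import csv
import io

# Assumed header names, taken from the field list in the Columns section.
EXPECTED_COLUMNS = ["movie", "scene", "line", "character", "text", "slug", "component"]


def summarize_csv(stream):
    """Report which expected columns are missing and how many records follow the header."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    n_records = sum(1 for _ in reader)
    return {"missing_columns": missing, "records": n_records}


# Placeholder input with a single invented row, used only to exercise the helper.
sample = io.StringIO(
    "movie,scene,line,character,text,slug,component\n"
    "5,50,341,YODA,placeholder,50 INT YODA'S HOUSE,character\n"
)
report = summarize_csv(sample)
```

For the real file, the stream could be opened with `open("yoda-corpus.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption. If the listing is accurate, the report should show no missing columns and 371 records.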

Usage

This data product is ideally suited for academic and research applications focusing on linguistics and natural language processing. Key applications include:
  • Linguistic Study: Analysing non-standard syntax and morphology, specifically the unique speech patterns observed in Yoda English.
  • Discourse Analysis: Examining how specific characters interact and converse within the context of a film script.
  • Machine Learning: Training models for stylistic text generation or character-specific speech emulation.
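For the discourse-analysis and machine-learning uses above, one possible preprocessing step is to pair each Yoda line with the most recent spoken line by another character in the same scene. The sketch below is one way to do that, not a method prescribed by the dataset; the labels `YODA` and `character` are assumptions about how the character and component fields are encoded, and the rows are invented placeholders.

```python
def yoda_response_pairs(rows, yoda_label="YODA", spoken_label="character"):
    """Pair each Yoda line with the latest earlier spoken line by someone else in the same scene."""
    pairs = []
    previous = {}  # (movie, scene) -> text of the last spoken line not by Yoda
    for row in sorted(rows, key=lambda r: (r["movie"], int(r["line"]))):
        if row["component"] != spoken_label:
            continue  # skip action descriptions
        key = (row["movie"], row["scene"])
        if row["character"] == yoda_label:
            if key in previous:
                pairs.append((previous[key], row["text"]))
        else:
            previous[key] = row["text"]
    return pairs


# Invented placeholder rows; labels and values are assumptions, not real data.
example_rows = [
    {"movie": "5", "scene": "50", "line": "1", "character": "LUKE",
     "text": "prompt one", "component": "character"},
    {"movie": "5", "scene": "50", "line": "2", "character": "NARRATOR",
     "text": "an action description", "component": "action"},
    {"movie": "5", "scene": "50", "line": "3", "character": "YODA",
     "text": "reply one", "component": "character"},
    {"movie": "5", "scene": "51", "line": "4", "character": "YODA",
     "text": "scene opener", "component": "character"},
]
pairs = yoda_response_pairs(example_rows)
```

A Yoda line that opens a scene has no preceding speaker and is left unpaired. Consecutive Yoda lines are all paired with the same preceding prompt, which may or may not suit a given model.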

Coverage

The dataset covers dialogue from the six core Star Wars movies, Episodes I through VI, and consists entirely of linguistic data derived from their scripts. It is limited to conversations in which Yoda takes part as an interlocutor.

License

The usage of this dataset is governed by the CC BY-NC-SA 4.0 license.

Who Can Use It

The dataset carries a usability rating of 10.00 and is designed for a variety of users:
  • Academics and Linguists: For research into cinematic dialogue structure and grammatical anomalies.
  • Data Scientists/NLP Engineers: For developing character-specific language models or testing text classification algorithms.
  • Media and Cultural Researchers: For quantitative analysis of film scripts and character prominence.

Dataset Name Suggestions

  • Yoda Speech Corpus
  • Star Wars Dialogue Data (Episodes 1-6)
  • Yoda's Linguistic Patterns
  • The Jedi Master Dialogue Collection

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 11/11/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
