
Francophone Conversational Dataset

Entertainment & Media Consumption

Tags and Keywords

Arts

Entertainment

Text

NLP

French

Conversation


Free

About

This dataset contains full scripts, written in French, from numerous series and plays. It was built through a combination of manual transcription and automatic extraction, so users should expect errors in the data, including inaccurate separation of narrator names from dialogue. It is suited to artificial intelligence and machine learning applications, especially natural language processing.
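One way the separation errors mentioned above might show up is a speaker name left inside the dialogue text while the speaker field is empty. The following is a minimal sketch under that assumption: the "Name : text" pattern, the 30-character name limit, and the helper name `split_embedded_speaker` are all hypothetical and are not documented by the listing.

```python
import re

# Assumed pattern: a short name, then a colon, then the spoken text.
# The listing does not document the real error shapes; this is a guess.
SPEAKER_PREFIX = re.compile(r"^\s*([^\W\d_][\w .'’-]{0,30}?)\s*:\s*(.+)$")

def split_embedded_speaker(speaker, utterance):
    """Recover a speaker embedded in the utterance when the speaker is missing.

    Rows that already have a speaker, or that do not match the assumed
    pattern, are returned unchanged.
    """
    if speaker:
        return speaker, utterance
    match = SPEAKER_PREFIX.match(utterance)
    # Require an initial capital so phrases like "il a dit : oui" are skipped.
    if match and match.group(1)[0].isupper():
        return match.group(1), match.group(2)
    return speaker, utterance

print(split_embedded_speaker(None, "Joey : Salut !"))
print(split_embedded_speaker(None, "il a dit : oui"))
```

This heuristic will also match ordinary sentences that begin with a capitalised word followed by a colon, so any rows it changes should be reviewed before they are used.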

Columns

The dataset is structured with the following key columns:
  • source: Identifies the original source material of the dialogue, such as "Les_frères_scott" or "Friends". A significant portion of the data falls under "Other" sources.
  • episode: Specifies the particular episode within a series, e.g., "S01E01" or "S01E03". The majority of episodes are categorised as "Other".
  • speaker: Indicates the character or entity speaking a line, for example "Joey". A notable portion of these entries are null.
  • utterance (also referred to as line of dialog): Contains the actual lines of dialogue. There are 451,877 unique dialogue lines within the dataset.
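As a rough illustration of the column layout above, the sketch below reads rows with Python's standard `csv` module and treats empty speaker cells as null. It assumes the CSV header uses exactly the names `source`, `episode`, `speaker`, and `utterance`; the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Illustrative rows only; they are not taken from the dataset.
SAMPLE_CSV = """source,episode,speaker,utterance
Friends,S01E01,Joey,Salut tout le monde !
Friends,S01E01,,Je ne sais pas quoi dire.
Les_frères_scott,S01E03,Lucas,On se voit demain.
"""

def load_rows(handle):
    """Read dialogue rows, turning empty speaker cells into None."""
    rows = []
    for record in csv.DictReader(handle):
        record["speaker"] = record["speaker"] or None
        rows.append(record)
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
unique_utterances = {row["utterance"] for row in rows}
print(len(rows), "rows,", len(unique_utterances), "unique utterances")
```

For the downloaded file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the UTF-8 encoding is an assumption to verify against the actual file.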

Distribution

The data is provided in CSV format. Total row or record counts are not stated; the only count available is the number of unique utterance values. By source, "Other" accounts for 74% of the data, "Les_frères_scott" for 15%, and "Friends" for 11%. By episode, "Other" accounts for 98%, with "S01E01" and "S01E03" at 1% each. By speaker, 78% of entries are "Other", 19% are null, and 2% are "Joey".
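Percentage breakdowns like those above can be recomputed from any single column. The sketch below is one possible way to do it with `collections.Counter`; the helper name `percentage_shares` and the sample values are hypothetical, and grouping rare values under "Other" is assumed to have been done already.

```python
from collections import Counter

def percentage_shares(values):
    """Return {value: share of rows in whole percent}, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: round(100 * n / total) for value, n in counts.most_common()}

# Illustrative speaker column; None stands for a null speaker.
speakers = ["Joey", None, "Other", "Other", "Other",
            None, "Other", "Other", "Other", "Other"]
shares = percentage_shares(speakers)
print(shares)
```

Because each share is rounded independently, the totals need not add up to exactly 100; that may be why the speaker figures in the listing sum to 99%.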

Usage

This dataset is ideally suited for:
  • Training natural language processing (NLP) models.
  • Developing conversational AI systems.
  • Conducting linguistic research on French dialogue structures.
  • Analysing character interactions and script patterns in French media.
  • Sentiment analysis on conversational text.

Coverage

The dataset is listed with global regional coverage. It comprises full scripts of various series and plays written in French. The listing does not specify the time range of the original series and plays, and gives no notes on data availability by demographic group or year.

License

CC-BY-NC

Who Can Use It

This dataset is intended for a range of users, including:
  • AI Researchers and Developers: For creating and refining models related to natural language understanding and generation in French.
  • Data Scientists: For data exploration, pattern recognition, and feature engineering in textual data.
  • Linguists and Academics: For studying French syntax, semantics, and pragmatics within a natural dialogue context.
  • Entertainment Industry Analysts: For understanding dialogue characteristics and trends in French series and plays.

Dataset Name Suggestions

  • French Script Dialogue Corpus
  • Francophone Conversational Dataset
  • French Series & Play Transcripts
  • Dialogue Data for French NLP
  • French Conversational Scripts

Attributes

Original Data Source: french dialogs

Listing Stats

Views: 2
Downloads: 0
Listed: 24/06/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
