Francophone Conversational Dataset
Entertainment & Media Consumption
Free
About
This dataset provides full scripts from numerous series and plays written in French. It was created through a combination of manual transcription and automatic extraction, so users should expect errors in the data, in particular incorrect separation between speaker names and dialogue. The dataset is mainly useful for artificial intelligence and machine learning applications, especially those focused on natural language processing.
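One way to look for the speaker/dialogue separation errors mentioned above is to flag lines where a name appears to have been left at the start of the dialogue text. The sketch below is a heuristic only: the "Name : text" pattern and the function name split_leaked_speaker are assumptions for illustration, not a documented property of this dataset, and ordinary sentences such as "Attention : danger" would be matched too, so results need manual review.

```python
import re

# Assumed pattern: a capitalised name of up to three words, then a colon,
# then the spoken text. This format is a guess, not taken from the dataset.
SPEAKER_PREFIX = re.compile(
    r"^\s*([A-ZÀ-ÖØ-Þ][^\W\d_]*(?:[ '-][^\W\d_]+){0,2})\s*:\s*(\S.*)$"
)

def split_leaked_speaker(speaker, utterance):
    """Return (speaker, utterance), moving a leading 'Name :' prefix
    into the speaker field when the speaker is missing."""
    if speaker:
        # A speaker is already present; leave the row untouched.
        return speaker, utterance
    match = SPEAKER_PREFIX.match(utterance)
    if match:
        return match.group(1), match.group(2)
    return speaker, utterance

fixed = split_leaked_speaker(None, "Joey : Salut !")
unchanged = split_leaked_speaker(None, "Il est 10:30.")
```

Rows that already have a speaker are returned as they are, so the function can be applied to the whole dataset without overwriting existing values.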
Columns
The dataset is structured with the following key columns:
- source: Identifies the original source material of the dialogue, such as "Les_frères_scott" or "Friends". A significant portion of the data falls under "Other" sources.
- episode: Specifies the particular episode within a series, e.g. "S01E01" or "S01E03". The majority of episodes are categorised as "Other".
- speaker: Indicates the character or entity speaking a line, e.g. "Joey". A notable portion of these entries are null.
- utterance (also referred to as line of dialog): Contains the actual lines of dialogue. There are 451,877 unique dialogue lines within the dataset.
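The four columns above can be read with Python's standard csv module. This is a minimal sketch under stated assumptions: the sample rows are invented for illustration, the header names are assumed to match the column names listed here, and load_rows is a hypothetical helper. A real file would be opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.

```python
import csv
import io

# Invented sample mirroring the four documented columns.
SAMPLE_CSV = """source,episode,speaker,utterance
Friends,S01E01,Joey,"Salut, ça va ?"
Friends,S01E01,,Très bien et toi ?
Les_frères_scott,S01E03,,On y va.
"""

def load_rows(handle):
    """Read dialogue rows, mapping empty speaker cells to None."""
    rows = []
    for record in csv.DictReader(handle):
        speaker = (record.get("speaker") or "").strip()
        rows.append({
            "source": record["source"],
            "episode": record["episode"],
            "speaker": speaker or None,  # null speakers become None
            "utterance": record["utterance"],
        })
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
```

Mapping empty speaker cells to None keeps the null entries distinguishable from real speaker names in later processing.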
Distribution
The data files are typically in CSV format. The total number of rows is not stated; the only count available is the number of unique utterance values. By source, "Other" accounts for 74% of the data, "Les_frères_scott" for 15%, and "Friends" for 11%. By episode, "Other" accounts for 98%, with "S01E01" and "S01E03" at 1% each. By speaker, "Other" accounts for 78%, null for 19%, and "Joey" for 2%. Percentages are rounded, so a breakdown may not sum to exactly 100%.
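Percentages of this kind can be recomputed from a column once the data is loaded. The sketch below assumes that "Other" is an aggregate of every value outside the most frequent ones, which the listing does not confirm; value_shares and the sample values are hypothetical.

```python
from collections import Counter

def value_shares(values, top_n=2):
    """Return rounded percentage shares for the top_n most frequent
    values, folding everything else into an "Other" bucket."""
    counts = Counter("null" if v is None else v for v in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {}
    other = 0
    for rank, (value, count) in enumerate(counts.most_common()):
        if rank < top_n:
            shares[value] = round(100 * count / total)
        else:
            other += count
    if other:
        shares["Other"] = round(100 * other / total)
    return shares

# Invented speaker column: 4 x "Joey", 3 x null, 3 one-off speakers.
speakers = ["Joey"] * 4 + [None] * 3 + ["A", "B", "C"]
shares = value_shares(speakers)
```

Because each share is rounded independently, the totals can drift from 100% by a point or two on real data.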
Usage
This dataset is ideally suited for:
- Training natural language processing (NLP) models.
- Developing conversational AI systems.
- Conducting linguistic research on French dialogue structures.
- Analysing character interactions and script patterns in French media.
- Sentiment analysis on conversational text.
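For conversational AI work, a common first step is to turn a script into (context, response) pairs of consecutive lines. The sketch below assumes rows are stored in script order and that source and episode together identify one continuous script; neither is confirmed by the listing. adjacent_pairs and the sample rows are hypothetical.

```python
from itertools import groupby

def adjacent_pairs(rows):
    """Pair each utterance with the next one in the same episode."""
    pairs = []
    # groupby only merges consecutive rows, so script order is preserved.
    for _, group in groupby(rows, key=lambda r: (r["source"], r["episode"])):
        lines = [r["utterance"] for r in group]
        pairs.extend(zip(lines, lines[1:]))
    return pairs

# Invented rows using the documented column names.
sample_rows = [
    {"source": "Friends", "episode": "S01E01", "utterance": "Salut !"},
    {"source": "Friends", "episode": "S01E01", "utterance": "Ça va ?"},
    {"source": "Friends", "episode": "S01E01", "utterance": "Oui, merci."},
    {"source": "Friends", "episode": "S01E03", "utterance": "On y va ?"},
    {"source": "Friends", "episode": "S01E03", "utterance": "D'accord."},
]
pairs = adjacent_pairs(sample_rows)
```

Pairs never cross an episode boundary, which avoids treating the last line of one script as context for the first line of the next.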
Coverage
The dataset has no regional restriction. It covers full scripts of various series and plays written in French. The time range of the original series and plays is not specified, and no notes are given on data availability by demographic group or year.
License
CC-BY-NC
Who Can Use It
This dataset is intended for a range of users, including:
- AI Researchers and Developers: For creating and refining models related to natural language understanding and generation in French.
- Data Scientists: For data exploration, pattern recognition, and feature engineering in textual data.
- Linguists and Academics: For studying French syntax, semantics, and pragmatics within a natural dialogue context.
- Entertainment Industry Analysts: For understanding dialogue characteristics and trends in French series and plays.
Dataset Name Suggestions
- French Script Dialogue Corpus
- Francophone Conversational Dataset
- French Series & Play Transcripts
- Dialogue Data for French NLP
- French Conversational Scripts
Attributes
Original Data Source: french dialogs