Empathetic Dialogue Dataset
Data Science and Analytics
Free
About
This dataset is a collection of conversations designed to support research into dialogue systems and the dynamics of human conversation. It is split into three sets (training, validation, and test), each containing detailed conversations. Each utterance carries a speaker identifier, so the flow of a conversation can be followed turn by turn. Each entry also includes an utterance index, the prompt or topic that initiated the conversation, a self-evaluation of the utterance, and assigned tags. Together, these fields provide a foundation for studying conversation topics and developing conversational AI.
Columns
The dataset is organised across several key columns, found consistently within the train, validation, and test CSV files:
- context: A string detailing the surrounding context of the conversation.
- prompt: A string indicating the specific prompt or topic that drives the conversation.
- utterance: A string representing the individual statement or response made by a speaker.
- selfeval: An integer score assigned as a self-evaluation for each utterance.
- tags: Associated string tags used to categorise or label dialogues.
Additionally, certain files like test.csv may include:
- utterance_idx: An index for each utterance within a conversation.
- speaker_idx: Identifiers for individual speakers within the conversation.
- conv_id: A unique identifier for each conversation.
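The column layout above can be checked when a file is loaded. The following is a minimal sketch using only the Python standard library. It assumes the CSV headers match the column names listed here exactly, which the description does not guarantee. The `read_rows` helper and the two sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in this description; real file headers may differ.
CORE_COLUMNS = ["context", "prompt", "utterance", "selfeval", "tags"]
OPTIONAL_COLUMNS = ["utterance_idx", "speaker_idx", "conv_id"]


def read_rows(handle):
    """Read CSV rows as dicts and report which expected columns are absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing_core = [c for c in CORE_COLUMNS if c not in header]
    missing_optional = [c for c in OPTIONAL_COLUMNS if c not in header]
    rows = list(reader)
    return rows, missing_core, missing_optional


# Made-up sample standing in for a real file such as test.csv.
SAMPLE = io.StringIO(
    "conv_id,utterance_idx,speaker_idx,context,prompt,utterance,selfeval,tags\n"
    "c1,1,0,proud,I passed my exam.,I finally passed!,5,casual chat\n"
    "c1,2,1,proud,I passed my exam.,Congratulations!,4,casual chat\n"
)

rows, missing_core, missing_optional = read_rows(SAMPLE)
print(len(rows), missing_core, missing_optional)
```

For a real file, the same helper could be called inside `with open("test.csv", newline="", encoding="utf-8") as f:`. Note that `csv.DictReader` returns every value as a string, so numeric columns such as selfeval need an explicit conversion.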
Distribution
The dataset is provided in CSV format, organised into three separate files: train.csv, validation.csv, and test.csv. Row counts for each file are not stated, but the dataset is substantial: the test.csv file, for instance, contains thousands of unique values across various attributes, which indicates a considerable volume of conversation data suitable for in-depth analysis and model development. Each row in these files contains the eight columns described above, structured to facilitate the development and evaluation of conversational models.
Usage
This dataset is ideal for a wide range of applications and research endeavours, including:
- Developing Machine Learning Models: Train models to generate natural conversations based on context, and assign empathy scores to generated responses using sentiment analysis techniques.
- Model Evaluation: Use the validation set to check model behaviour during development and the test set for final performance evaluation.
- Dialogue Categorisation: Employ the 'tags' column to label and categorise different conversations, such as 'casual chat' or 'career advice', enabling comparisons between standard and ML models.
- Building Empathetic AI: Develop empathetic open-domain conversation models for applications like virtual assistants or chatbots, including sorting conversations by topics and training models to respond appropriately.
- Linguistic Atmosphere Analysis: Use the self-evaluation scores to observe shifts in language atmosphere, mood, and tonality within conversations.
- Advanced NLP Research: Conduct research focusing on advanced architectural models like convolutional attention models, LSTMs, seq2seq architectures, Gated Recurrent Units (GRUs), and Transformer Networks to enhance conversation model performance and accuracy.
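Several of the uses above start from the same preparation step: rebuilding each conversation from its rows. The sketch below shows one possible approach under stated assumptions. It assumes that conv_id, utterance_idx, selfeval, and tags are present, that utterance_idx and selfeval parse as integers, and that a conversation's tags can be read from its first turn. All function names and sample rows are hypothetical.

```python
from collections import defaultdict

# Made-up rows using the column names from this description (values as strings,
# as they would arrive from a CSV reader).
ROWS = [
    {"conv_id": "c1", "utterance_idx": "2", "speaker_idx": "1",
     "utterance": "Congratulations!", "selfeval": "4", "tags": "casual chat"},
    {"conv_id": "c1", "utterance_idx": "1", "speaker_idx": "0",
     "utterance": "I finally passed!", "selfeval": "5", "tags": "casual chat"},
    {"conv_id": "c2", "utterance_idx": "1", "speaker_idx": "2",
     "utterance": "Should I change jobs?", "selfeval": "3", "tags": "career advice"},
]


def group_conversations(rows):
    """Group rows by conv_id and order each group by utterance_idx."""
    convs = defaultdict(list)
    for row in rows:
        convs[row["conv_id"]].append(row)
    for turns in convs.values():
        turns.sort(key=lambda r: int(r["utterance_idx"]))
    return dict(convs)


def context_response_pairs(turns):
    """Pair each utterance after the first with the utterances preceding it."""
    pairs = []
    for i in range(1, len(turns)):
        history = [t["utterance"] for t in turns[:i]]
        pairs.append((history, turns[i]["utterance"]))
    return pairs


def mean_selfeval(turns):
    """Average the self-evaluation scores of one conversation."""
    scores = [int(t["selfeval"]) for t in turns]
    return sum(scores) / len(scores)


conversations = group_conversations(ROWS)
pairs = context_response_pairs(conversations["c1"])

# Index conversation ids by the tag found on their first turn.
by_tag = defaultdict(list)
for conv_id, turns in conversations.items():
    by_tag[turns[0]["tags"]].append(conv_id)
```

The (history, response) pairs are one common input shape for training response generators. The per-conversation mean of selfeval and the tag index are starting points for the tonality analysis and dialogue categorisation uses listed above.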
Coverage
The dataset's geographic scope is global, making it suitable for research and applications worldwide. The dataset was listed on 24 June 2025. There are no specific notes on data availability for particular groups or years beyond this.
License
CC0
Who Can Use It
This dataset is primarily intended for data scientists, machine learning engineers, and researchers focused on:
- Conversational AI Development: Those building or improving chatbots, virtual assistants, and other automated dialogue systems.
- Natural Language Processing (NLP): Professionals working on text analysis, sentiment analysis within dialogues, and understanding conversational dynamics.
- Academic Research: Scholars and students exploring advanced machine learning architectures for dialogue generation and evaluation.
Dataset Name Suggestions
- Empathetic Dialogue Dataset
- Conversational AI Benchmark
- Open Dialogue Research Data
- Empathic Chatbot Training Data
- Global Conversation Model Dataset
Attributes
Original Data Source: Empathetic Conversational Model Benchmark