Dolly 15K AI Chat Data
Free
About
This dataset provides over 15,000 prompt-response records designed to support ChatGPT-style interactive applications. It was created by Databricks employees to help large language models (LLMs) handle interactive dialogue. The dataset contains prompt-response pairs across eight distinct instruction categories and deliberately avoids information from external web sources, with the exception of Wikipedia for specific instruction sets. This open-source resource is suited to exploring text-based conversation and other natural language processing tasks.
Columns
- Instruction (Text): The text prompt that a machine learning model or chatbot is expected to answer. It represents what one individual says in a conversation.
- Context (Text): Additional information that gives the model more detail about the ongoing conversation or request, which can improve the accuracy of the response. Like the instruction, it captures what is said by one individual.
- Response (Text): This column holds the conversational reply or what is said back by the other individual in the dialogue.
- Category (Text): Each prompt-response pair is classified into one of eight distinct categories based on its content. Examples of unique category values include 'open_qa' and 'general_qa'.
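As a minimal sketch of how the four columns fit together, the Python snippet below reads rows with the standard csv module and joins each instruction with its context into a single prompt. The lowercase header names, the sample rows, and the helper names load_records and build_prompt are assumptions made for illustration; they are not taken from the dataset itself.

```python
import csv
import io

# Assumed header names (lowercase); the listing shows them capitalised.
EXPECTED_COLUMNS = ["instruction", "context", "response", "category"]

# Two made-up rows standing in for the real file; not actual dataset records.
SAMPLE_CSV = """instruction,context,response,category
"What is the capital of France?",,"Paris.",open_qa
"Who wrote the book?","The book was written by Jane Doe in 1990.","Jane Doe.",general_qa
"""


def load_records(handle):
    """Read rows from a CSV file object and check the four expected columns."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def build_prompt(record):
    """Join the instruction and the context (if any) into one prompt string."""
    context = record["context"].strip()
    if context:
        return f"{record['instruction']}\n\nContext: {context}"
    return record["instruction"]


records = load_records(io.StringIO(SAMPLE_CSV))
for record in records:
    print(build_prompt(record), "->", record["response"])
```

To run the same helpers against a downloaded file, pass an open file object instead of the in-memory sample, for example open("train.csv", newline="", encoding="utf-8"), adjusting EXPECTED_COLUMNS if the real header names differ.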
Distribution
The dataset is typically provided as a data file, usually in CSV format. The main train.csv file contains over 15,000 records. Each record represents a unique prompt-response pair, or a single turn in a conversation between two individuals. All columns are of a string data type.
Usage
This dataset is suited for a variety of applications and use cases:
- Training dialogue systems by building data pipelines that supply models with human-written conversations.
- Creating intelligent chatbot interactions.
- Generating natural language answers as part of Q&A systems.
- Utilising excerpts from Wikipedia for particular subsets of instruction categories.
- Leveraging the classification labels with supervised learning techniques, such as multi-class classification neural networks or logistic regression classifiers.
- Developing deep learning models to detect and respond to conversational intent.
- Training language models for customer service queries using natural language processing (NLP).
- Creating custom dialogue agents capable of handling more intricate conversational interactions.
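The supervised-learning use case above can be sketched with the category column as the label. The example below is a rough illustration only: it swaps in a small bag-of-words naive Bayes classifier, written with the Python standard library, in place of the logistic regression or neural network classifiers mentioned in the list, and it trains on a handful of made-up instructions rather than real records. The names TRAIN, tokenize, train and predict are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up (instruction, category) pairs; not actual dataset records.
TRAIN = [
    ("What is the capital of France?", "open_qa"),
    ("Who invented the telephone?", "open_qa"),
    ("What is the tallest mountain on Earth?", "open_qa"),
    ("How do I keep a houseplant healthy?", "general_qa"),
    ("How should I prepare for a job interview?", "general_qa"),
    ("How do I learn to cook at home?", "general_qa"),
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (t.strip("?.,!").lower() for t in text.split())
    return [t for t in tokens if t]


def train(examples):
    """Count labels and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "How do I bake bread at home?"))
print(predict(model, "What is the capital of Spain?"))
```

With the full dataset, the same pattern would apply to all eight categories, although a library classifier trained on a proper train/test split would be a more realistic choice than this toy model.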
Coverage
The dataset has a global reach. It was listed on 17/06/2025, and its content focuses on general conversational and Q&A interactions, without specific demographic limitations.
License
CC0
Who Can Use It
This dataset is valuable for a wide range of users, including AI/ML developers, researchers, and data scientists looking to:
- Build and train conversational AI models.
- Develop advanced chatbot applications.
- Explore new insights in natural language processing.
- Create bespoke dialogue agents for various sectors, such as customer service.
- Apply supervised learning to classify conversational data.
Dataset Name Suggestions
- Databricks Dolly (15K) Dialogue Data
- LLM Training Conversation Dataset
- Dolly 15K AI Chat Data
- Prompt-Response Pairs for LLMs
Attributes
Original Data Source: Databricks Dolly (15K)