Databricks Human Instruction Dataset
Data Science and Analytics
Free
About
This dataset is a collection of over 15,000 records written by Databricks employees so that large language models can be trained to exhibit the interactive qualities of conversational AI. It is an open-source, human-generated instruction corpus intended for fine-tuning such models. The contributors created prompt and response pairs across eight distinct instruction categories, avoiding external web sources (with the exception of Wikipedia for certain subsets) and generative AI in their formulations. The dataset can be used for instruction fine-tuning, synthetic data generation, and data augmentation, and is openly available for any purpose, including academic and commercial applications.
Columns
- instruction: Represents the prompt or question provided.
- context: Serves as reference material relevant to the instruction.
- response: Contains the human-written response to the instruction.
- category: Indicates which of the eight instruction categories the record belongs to; the categories are derived from the annotator behavioural categories in the InstructGPT paper.
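The four columns above can be read into typed records with the Python standard library. The following is a minimal sketch, not an official loader: the `Record` class, the `load_records` helper, and the two sample rows (including their category values) are hypothetical and stand in for the real CSV file, whose name is not given here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the dataset, using the four documented column names."""
    instruction: str
    context: str
    response: str
    category: str


# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = """instruction,context,response,category
What is the capital of France?,,Paris is the capital of France.,open_qa
Summarise the passage.,"Cats sleep a lot, often 12 hours a day.",Cats sleep about half the day.,summarization
"""


def load_records(handle):
    """Parse an open CSV handle into a list of Record objects."""
    reader = csv.DictReader(handle)
    return [
        Record(row["instruction"], row["context"], row["response"], row["category"])
        for row in reader
    ]


records = load_records(io.StringIO(SAMPLE_CSV))
for record in records:
    print(record.category, "|", record.instruction)
```

To read the actual file, pass `open(path, newline="", encoding="utf-8")` to `load_records`, where `path` is wherever the CSV was saved. Note that `context` may be empty, as in the first sample row.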
Distribution
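The dataset is provided as a CSV file with four fields: instruction, context, response, and category. It comprises over 15,000 records, with 14,781 unique values for 'instruction'. The listing also reports 14,944 unique values for 'category'; since the dataset defines only eight instruction categories, that count most likely describes a different column.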
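The record and unique-value counts quoted above can be checked against a downloaded copy with a short script. This is a sketch under stated assumptions: `unique_counts` is a hypothetical helper, and the three-row sample below is invented for illustration rather than taken from the dataset.

```python
import csv
import io


def unique_counts(handle):
    """Return (row_count, {column_name: number_of_distinct_values})."""
    reader = csv.DictReader(handle)
    # Accessing fieldnames reads the header row.
    seen = {name: set() for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in seen:
            seen[name].add(row[name])
    return total, {name: len(values) for name, values in seen.items()}


# Hypothetical sample: three rows, one repeated instruction, two categories.
SAMPLE_CSV = """instruction,context,response,category
Name a primary colour.,,Red.,brainstorming
Name a primary colour.,,Blue.,brainstorming
What is 2 + 2?,,4.,open_qa
"""

total, counts = unique_counts(io.StringIO(SAMPLE_CSV))
print("rows:", total)
for name, n in counts.items():
    print(f"{name}: {n} unique")
```

Run against the real file, a 'category' count far above eight would indicate a parsing problem or mislabelled column rather than a property of the data.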
Usage
This dataset is ideal for several applications, including:
- Instruction fine-tuning of large language models to enhance their interactive capabilities.
- Generating synthetic data by using the human-generated prompts as few-shot examples for large open language models.
- Data augmentation techniques, such as paraphrasing prompts or short responses to regularise the dataset and improve model robustness.
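The first two uses listed above both come down to turning records into text prompts. The sketch below shows one possible way to do that; the `### Instruction:` / `### Context:` / `### Response:` template is an assumption chosen for illustration, not a format prescribed by the dataset, and `format_example` and `few_shot_prompt` are hypothetical helper names.

```python
def format_example(instruction, context="", response=""):
    """Render one record as text; the context section is skipped when empty."""
    parts = ["### Instruction:\n" + instruction.strip()]
    if context.strip():
        parts.append("### Context:\n" + context.strip())
    parts.append("### Response:\n" + response.strip())
    return "\n\n".join(parts)


def few_shot_prompt(examples, new_instruction, new_context=""):
    """Join solved examples, then append an unanswered instruction for a model to complete."""
    shots = [
        format_example(e["instruction"], e["context"], e["response"])
        for e in examples
    ]
    shots.append(format_example(new_instruction, new_context))
    return "\n\n".join(shots)


# Hypothetical records in the dataset's four-column shape.
examples = [
    {"instruction": "Name a primary colour.", "context": "",
     "response": "Red.", "category": "brainstorming"},
    {"instruction": "What is 2 + 2?", "context": "",
     "response": "4.", "category": "open_qa"},
]

training_text = format_example(**{k: examples[0][k] for k in ("instruction", "context", "response")})
prompt = few_shot_prompt(examples, "Name a large ocean.")
print(prompt)
```

For instruction fine-tuning, `format_example` with a filled-in response yields one training string per record. For synthetic data generation, `few_shot_prompt` leaves the final response blank so that a model can supply it.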
Coverage
The dataset is available globally and was listed on 11/06/2025. All records were written by Databricks employees. The language is American English, although some annotators may not be native English speakers. The demographic profile and subject matter of the data may reflect the composition of Databricks employees. Because Wikipedia was consulted for certain categories, the dataset may also reflect biases, factual errors, or topical focuses present in Wikipedia.
License
CC-BY-SA
Who Can Use It
This dataset is intended for a wide range of users, including:
- Data Scientists and Machine Learning Engineers: For fine-tuning and developing large language models.
- Researchers: For studies on instruction-following, synthetic data generation, and data augmentation in natural language processing.
- Developers: Building applications that require interactive or instruction-based language model capabilities.
- Organisations: For commercial product development involving custom language models.
Dataset Name Suggestions
- Dolly 15K Instruction Corpus
- Databricks Human Instruction Data
- LLM Fine-tuning Prompt Dataset
- Opendatabay Dolly 15K
- Interactive AI Training Data
Attribute
Original Data Source: Databricks Dolly 15K Dataset