Databricks Human Instruction Dataset

Category: Data Science and Analytics

Tags and Keywords: Software, NLP, Research, Text, Instructions, Human, Dataset


About

This dataset contains over 15,000 records written by Databricks employees to help large language models exhibit the interactive qualities of conversational AI. It is an open-source, human-generated instruction corpus for fine-tuning large language models. Contributors wrote prompt and response pairs across eight instruction categories without using generative AI or external web sources, with the exception of Wikipedia for certain subsets. The dataset is suited to instruction fine-tuning, synthetic data generation, and data augmentation, and may be used for any purpose, including academic and commercial applications, under the terms of its licence.

Columns

instruction: Represents the prompt or question provided.
context: Serves as reference material relevant to the instruction.
response: Contains the human-written response to the instruction.
category: Indicates the instruction category of the record, drawn from the behavioural categories in the InstructGPT paper.

Distribution

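The dataset is provided as a CSV file with four fields: instruction, context, response, and category. It comprises over 15,000 records, with 14,781 unique values for 'instruction'. The listing also reports 14,944 unique values for 'category', but that figure is inconsistent with the eight instruction categories the dataset defines and most likely belongs to a different column.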
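The column layout and unique-value counts described above can be checked directly once the file is downloaded. The following is a minimal sketch using only the Python standard library; the function names and the inline sample rows are made up for illustration and are not part of the dataset or the listing.

```python
import csv
import io
from collections import Counter

# Column names as given in the listing.
EXPECTED_COLUMNS = ["instruction", "context", "response", "category"]


def load_records(handle):
    """Read rows from an open CSV handle and check the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    return list(reader)


def summarise(records):
    """Return the row count, unique values per column, and rows per category."""
    unique = {c: len({r[c] for r in records}) for c in EXPECTED_COLUMNS}
    per_category = Counter(r["category"] for r in records)
    return {"rows": len(records), "unique": unique, "per_category": dict(per_category)}


# Tiny made-up sample standing in for the real file (hypothetical rows).
SAMPLE_CSV = """instruction,context,response,category
What is a CSV file?,,A plain-text table format.,open_qa
Name three colours.,,"Red, green, blue.",brainstorming
What is a CSV file?,,A comma-separated text file.,open_qa
"""

summary = summarise(load_records(io.StringIO(SAMPLE_CSV)))
print(summary)
```

To run the same check on the downloaded file, pass a handle opened with `open(path, newline="", encoding="utf-8")` to `load_records`, where `path` is wherever the CSV was saved. Comparing the reported unique count for each column against the figures in the listing is a quick way to see which column a given number belongs to.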

Usage

This dataset is ideal for several applications, including:

Instruction fine-tuning of large language models to enhance their interactive capabilities.
Generating synthetic data by using the human-generated prompts as few-shot examples for large open language models.
Data augmentation techniques, such as paraphrasing prompts or short responses to regularise the dataset and improve model robustness.
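The first two uses above both come down to turning records into text. The sketch below shows one way to do that: rendering a record as a single training string, and concatenating a few records into a few-shot prompt that ends with an open response slot. The `### Instruction:` style template and both function names are assumptions chosen for illustration; the listing does not prescribe a prompt format.

```python
def format_example(record):
    """Render one record as a single string; context is included only when present."""
    parts = [f"### Instruction:\n{record['instruction'].strip()}"]
    context = (record.get("context") or "").strip()
    if context:
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{record['response'].strip()}")
    return "\n\n".join(parts)


def few_shot_prompt(examples, new_instruction):
    """Concatenate formatted examples, then leave the response open for a new instruction."""
    shots = [format_example(e) for e in examples]
    shots.append(f"### Instruction:\n{new_instruction.strip()}\n\n### Response:\n")
    return "\n\n".join(shots)
```

For fine-tuning, `format_example` would be applied to every row of the CSV; for synthetic data generation, `few_shot_prompt` would be given a handful of rows from one category plus a new instruction, and the resulting string sent to whichever model is being used.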

Coverage

The dataset is listed with global coverage and was added on 11/06/2025. The data was written by Databricks employees in American English, although some annotators may not be native English speakers. The demographic profile and subject matter of the data may reflect the composition of Databricks employees. Because Wikipedia was consulted for certain categories, the dataset may also reflect biases, factual errors, or topical focuses present in Wikipedia.

License

CC-BY-SA

Who Can Use It

This dataset is intended for a wide range of users, including:

Data Scientists and Machine Learning Engineers: For fine-tuning and developing large language models.
Researchers: For studies on instruction-following, synthetic data generation, and data augmentation in natural language processing.
Developers: Building applications that require interactive or instruction-based language model capabilities.
Organisations: For commercial product development involving custom language models.

Dataset Name Suggestions

Dolly 15K Instruction Corpus
Databricks Human Instruction Data
LLM Fine-tuning Prompt Dataset
Opendatabay Dolly 15K
Interactive AI Training Data

Attribute

Original Data Source: Databricks Dolly 15K Dataset

Listing Stats

Listed: 11/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
