Multilingual Databricks Dolly 15k Parallel Corpus
Free
About
This corpus provides the human-generated prompts of the original Databricks Dolly 15k dataset in a multilingual format, as a resource for instruction-tuning large language models. It extends the English original into five additional languages (Russian, Kazakh, Spanish, Italian, and French), machine-translated with the googletrans library. Every translation shares a unique identifier with its source prompt, so researchers can align records across languages and develop models that understand and respond to instructions in diverse languages, including lower-resource ones such as Kazakh.
Columns
- uid: A unique identifier assigned to each entry, allowing for the precise alignment of translations across all six included languages.
- instruction: The human-generated task or prompt that the model is expected to follow.
- context: Optional background text that gives the model the information it needs to complete the instruction; null where no context is required.
- response: The high-quality human output that serves as the correct answer or target for the instruction.
- category: The type of task being performed, with common classifications including open question-answering and general question-answering.
- lang: The language code (e.g., en, ru, kk, es, it, fr) representing the specific language used for the text in that record.
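As a minimal sketch of how the uid and lang columns support alignment, the snippet below groups rows by uid so that all translations of one prompt sit together. The sample rows are invented for illustration, the helper name align_by_uid is hypothetical, and the uid values are treated as strings because the listing does not state their type.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for the real CSV; these rows are invented examples.
SAMPLE_CSV = """uid,instruction,context,response,category,lang
0,What is the capital of France?,,Paris.,open_qa,en
0,Quelle est la capitale de la France ?,,Paris.,open_qa,fr
1,Name a primary colour.,,Red.,general_qa,en
"""


def align_by_uid(csv_file):
    """Return {uid: {lang: row}} so translations of one prompt are grouped."""
    aligned = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        aligned[row["uid"]][row["lang"]] = row
    return dict(aligned)


aligned = align_by_uid(io.StringIO(SAMPLE_CSV))
print(sorted(aligned["0"]))  # languages available for uid "0"
```

To run this against the real file, pass an open file handle (opened with newline="" and encoding="utf-8") instead of the io.StringIO sample. Note that the csv module reads an empty context cell as an empty string rather than a null.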
Distribution
The data is delivered as a single CSV file, databricks-dolly-15k-parallel-corpus-6.csv, with a file size of 95.87 MB. It contains approximately 88,000 valid records across 6 columns, with a 100% validity rate for key fields such as instruction and response. The resource has a usability score of 10.00 and is provided as a finished archive with no future updates planned.
Usage
This collection is ideal for training multilingual large language models to follow instructions and for fine-tuning chatbots for specific regional markets. It can be used as a parallel corpus for benchmarking machine translation quality or for research into cross-lingual transfer learning. Developers can also utilise the records to build evaluation sets for testing how well models maintain intent and factual accuracy across different languages.
Coverage
The scope is primarily linguistic: parallel data for English, Russian, Kazakh, Spanish, Italian, and French. The foundation is the 15,000-prompt Databricks Dolly dataset, but up to 600 records were lost for some languages during the translation process, so users should not assume that every prompt is available in all six languages. The corpus covers several major global languages alongside Kazakh and spans a varied range of topics and task categories.
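Because some records were lost in translation, it may be worth checking which prompts are missing per language before treating the corpus as fully parallel. The sketch below assumes English is the complete reference set, which the listing does not confirm; the helper name missing_by_language and the sample rows are hypothetical.

```python
from collections import defaultdict


def missing_by_language(rows, reference_lang="en"):
    """Map each non-reference language to the uids it lacks."""
    uids = defaultdict(set)
    for row in rows:
        uids[row["lang"]].add(row["uid"])
    reference = uids[reference_lang]
    return {
        lang: reference - seen
        for lang, seen in uids.items()
        if lang != reference_lang
    }


# Invented rows: uid "1" has no Kazakh translation.
rows = [
    {"uid": "0", "lang": "en"},
    {"uid": "1", "lang": "en"},
    {"uid": "0", "lang": "kk"},
]
gaps = missing_by_language(rows)
print(gaps)
```

The rows argument can be any iterable of dictionaries with uid and lang keys, for example the output of csv.DictReader over the CSV file.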
License
CC BY-SA 4.0
Who Can Use It
Machine learning engineers can leverage these records to improve the performance of instruction-following agents in non-English contexts. Academic researchers in the field of natural language processing can use the parallel structure to study translation nuances and model biases. Additionally, developers creating localised AI solutions for European or Central Asian markets will find the inclusion of Kazakh and Russian particularly beneficial for their specialised workflows.
Dataset Name Suggestions
- Multilingual Databricks Dolly 15k Parallel Corpus
- Six-Language Instruction-Following Prompt Database
- Dolly 15k Parallel Translation Dataset for LLM Tuning
- Global Dolly: Human-Generated Multilingual Instructions
- Cross-Lingual Instruction-Response Parallel Corpus
Attributes
Original Data Source: Multilingual Databricks Dolly 15k Parallel Corpus
