Opendatabay APP

English-to-Korean DeepL Model Training Dataset

NLP / Natural Language Processing

Tags and Keywords

Korean

Translation

NLP

Instruction

Linguistics

Free

About

This dataset contains English-to-Korean translations curated for instruction-based models such as GPT4All, Dolly, and Vicuna. The translations were generated with the DeepL API. Each record is an instruction-input-output triplet, so the collection supports training and evaluating automated translation and conversational AI models for Korean.

Columns

  • id: A unique alphanumeric identifier for each record, often categorised by the source model type such as Vicuna or Alpaca.
  • instruction: The translated command or task description provided to the model in Korean.
  • input: The original English text that serves as the basis for the translation task or provides necessary context.
  • output: The resulting translated text in Korean, generated after processing the given input and instruction.
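
As a rough illustration of the four-column layout described above, the sketch below reads records with Python's standard csv module and checks that the expected columns are present. The sample rows and the helper name read_records are hypothetical and stand in for the real file; only the column names come from the listing.

```python
import csv
import io

# Column names as listed above.
EXPECTED_COLUMNS = ["id", "instruction", "input", "output"]

# Hypothetical sample rows standing in for train.csv; the real values differ.
SAMPLE_CSV = """id,instruction,input,output
vicuna_0,다음 문장을 요약하세요.,The cat sat on the mat.,고양이가 매트 위에 앉았다.
alpaca_1,건강을 유지하는 세 가지 팁을 알려주세요.,,균형 잡힌 식사를 하세요.
"""


def read_records(handle):
    """Read all rows as dicts and fail early if a listed column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


records = read_records(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["id"])
```

For the real file, the same function could be called on open("train.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption, as the listing does not state it.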

Distribution

The data is delivered as a single CSV file, train.csv, of approximately 254.75 MB. It contains 153,000 valid records, with a 100% validity rate for the primary identification fields. The output and instruction columns are largely populated, but approximately 83% of the input fields are intentionally blank or marked as missing, reflecting instruction-only tasks.
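
One way to check the share of blank input fields on a loaded copy is sketched below. The helper blank_input_share and the sample rows are hypothetical, and the sketch treats only empty or whitespace-only values as blank; if the file marks missing values differently (for example with a literal placeholder string), the check would need adjusting.

```python
def blank_input_share(records, field="input"):
    """Return the fraction of records whose `field` is empty or whitespace."""
    if not records:
        return 0.0
    blank = sum(1 for row in records if not (row.get(field) or "").strip())
    return blank / len(records)


# Made-up rows for illustration only; they are not taken from train.csv.
sample = [
    {"id": "a", "instruction": "인사말을 작성하세요.", "input": "", "output": "안녕하세요."},
    {"id": "b", "instruction": "속담을 하나 알려주세요.", "input": "  ", "output": "시작이 반이다."},
    {"id": "c", "instruction": "다음을 번역하세요.", "input": "Good morning.", "output": "좋은 아침입니다."},
    {"id": "d", "instruction": "색깔 세 가지를 나열하세요.", "input": None, "output": "빨강, 파랑, 노랑"},
]

share = blank_input_share(sample)
print(f"{share:.0%} of sample inputs are blank")
```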

Usage

This resource is ideal for training and fine-tuning machine translation models specifically focused on English-to-Korean linguistic pairs. It is well-suited for benchmarking the performance of various translation APIs or models against established DeepL-generated outputs. Additionally, developers can use these records to build and evaluate the accuracy of conversational agents and large language models.
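
For fine-tuning, each triplet is typically rendered into a prompt and a target completion. The sketch below uses an Alpaca-style template, which is an assumption and not something the listing prescribes; the function name format_example and the example records are likewise hypothetical. The input section is omitted when the field is blank, which matters here because most records are instruction-only.

```python
def format_example(record):
    """Render one instruction-input-output record as a prompt/completion pair.

    The "### ..." section headers are an assumed template, not part of the dataset.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        # Instruction-only task: skip the empty input section entirely.
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {"prompt": prompt, "completion": record["output"].strip()}


# Hypothetical records for illustration.
with_input = format_example(
    {"instruction": "다음을 번역하세요.", "input": "Good morning.", "output": "좋은 아침입니다."}
)
without_input = format_example(
    {"instruction": "인사말을 작성하세요.", "input": "", "output": " 안녕하세요. "}
)
print(with_input["prompt"] + with_input["completion"])
```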

Coverage

The geographic and demographic scope focuses on the Korean-speaking population and those requiring English-to-Korean translation services. Temporally, the collection is a static snapshot with no further updates expected, providing a consistent baseline for research. The data covers a wide range of instruction types derived from multiple NLP datasets, ensuring a diverse array of linguistic patterns.

License

CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication

Who Can Use It

Natural language processing researchers can use these records to develop more accurate translation algorithms and study cross-lingual instruction following. Language learners may use the input-output pairs to practise their translation skills and compare their work against the DeepL-generated results. Data scientists can also use the structured text for sentiment analysis or linguistic modelling in Korean.

Dataset Name Suggestions

  • Korean Instruction and Translation Corpus for NLP
  • English-to-Korean DeepL Model Training Dataset
  • Instruction-Based Korean Translation Registry
  • Multimodel English-Korean Linguistic Pairs
  • High-Fidelity Korean NLP Translation Archive

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 29/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
