English-to-Korean DeepL Model Training Dataset
NLP / Natural Language Processing
Free
About
These records contain English-to-Korean translations curated for instruction-based models such as GPT4All, Dolly, and Vicuna. The Korean text was generated with the DeepL API. Each record is structured as an instruction-input-output triplet, so the collection can be used to train models for automated translation and conversational AI in Korean.
Columns
- id: A unique alphanumeric identifier for each record, often categorised by the source model type such as Vicuna or Alpaca.
- instruction: The translated command or task description provided to the model in Korean.
- input: The original English text that serves as the basis for the translation task or provides necessary context.
- output: The resulting translated text in Korean, generated after processing the given input and instruction.
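As a rough illustration of how the four columns above might be consumed, the sketch below reads rows with Python's standard csv module and turns each one into a training prompt. The sample rows, the id values, and the Alpaca-style prompt template are assumptions made for the example; none of them come from train.csv or from the dataset's documentation.

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout; the values are
# invented for illustration and are not taken from train.csv.
SAMPLE_CSV = """id,instruction,input,output
alpaca_0,다음 문장을 요약하세요.,The cat sat on the mat all day.,고양이가 하루 종일 매트 위에 앉아 있었습니다.
vicuna_1,한국의 수도는 어디인가요?,,한국의 수도는 서울입니다.
"""


def load_records(handle):
    """Yield one dict per CSV row, keeping only the four documented columns."""
    for row in csv.DictReader(handle):
        yield {
            "id": row["id"],
            "instruction": row["instruction"].strip(),
            # A blank input is treated as an instruction-only task.
            "input": (row.get("input") or "").strip(),
            "output": row["output"].strip(),
        }


def build_prompt(record):
    """Format a record as a prompt (assumed Alpaca-style template)."""
    if record["input"]:
        return (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{record['instruction']}\n\n### Response:\n"


records = list(load_records(io.StringIO(SAMPLE_CSV)))
prompts = [build_prompt(r) for r in records]
```

To run this against the real file, the in-memory sample could be replaced with open("train.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.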
Distribution
The information is delivered in a single CSV file titled train.csv with a file size of approximately 254.75 MB. It contains 153,000 valid records with a 100% validity rate for the primary identification fields. While the output and instruction columns are largely populated, approximately 83% of the input fields are intentionally blank or marked as missing, reflecting instruction-only tasks.
Usage
This resource is ideal for training and fine-tuning machine translation models specifically focused on English-to-Korean linguistic pairs. It is well-suited for benchmarking the performance of various translation APIs or models against established DeepL-generated outputs. Additionally, developers can use these records to build and evaluate the accuracy of conversational agents and large language models.
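One way the benchmarking use case might look in practice is sketched below: candidate translations from another system are scored against the DeepL-generated output column. The character-level ratio from Python's difflib is used here only as a simple stand-in; it is not BLEU, chrF, or any standard translation metric, and the example strings and function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(candidate, reference):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, candidate, reference).ratio()


def score_system(candidates, references):
    """Average the similarity of candidate translations against references."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        return 0.0
    total = sum(char_similarity(c, r) for c, r in zip(candidates, references))
    return total / len(references)


# Invented example strings: references stand in for the dataset's output
# column, candidates for translations produced by the system under test.
references = ["한국의 수도는 서울입니다.", "고양이가 매트 위에 앉아 있습니다."]
candidates = ["한국의 수도는 서울입니다.", "고양이가 매트에 앉아 있어요."]

average = score_system(candidates, references)
```

A surface-level ratio like this rewards matching characters rather than matching meaning, so for a real evaluation an established metric would be the more defensible choice.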
Coverage
The geographic and demographic scope focuses on the Korean-speaking population and those requiring English-to-Korean translation services. Temporally, the collection is a static snapshot with no further updates expected, providing a consistent baseline for research. The data covers a wide range of instruction types derived from multiple NLP datasets, ensuring a diverse array of linguistic patterns.
License
CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
Who Can Use It
Natural language processing researchers can use these records to develop more accurate translation algorithms and study cross-lingual instruction following. Language learners may utilise the input-output pairs to practise their translation skills and compare their work against DeepL API results. Furthermore, data scientists can use the structured text for linguistic modelling in the Korean language.
Dataset Name Suggestions
- Korean Instruction and Translation Corpus for NLP
- English-to-Korean DeepL Model Training Dataset
- Instruction-Based Korean Translation Registry
- Multimodel English-Korean Linguistic Pairs
- High-Fidelity Korean NLP Translation Archive
Attributes
Original Data Source: English-to-Korean DeepL Model Training Dataset
