Cross-Lingual Question & Answer Data
Software and Technology
Free
About
The Mintaka dataset by AmazonScience supports the development of end-to-end models for multilingual, complex natural language question answering. It includes the English original along with 8 additional languages: Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish, for a total of 9 languages. Train, development (dev), and test splits are provided for each language, so models can be trained and evaluated on separate data. Each data point also records the question category, the complexity type, and the entities related to both the question and the answer, which makes it possible to filter the data or break down results by these attributes.
Columns
The dataset includes the following columns:
- id: A unique identifier for each question and answer entry.
- lang: The language of the question, typically represented by a language code (e.g., 'eng' for English).
- question: The natural language question posed in the respective language.
- answerText: The textual answer corresponding to the question.
- category: The type of question, such as "Who" or "When".
- complexityType: An indicator of whether the question is simple or complex.
- questionEntity: A brief description or definition about entities related to the question.
- answerEntity: A brief description or definition about entities related to the answer.
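As a minimal sketch of how the column layout above might be checked when reading a split, the following uses only the Python standard library. The helper name `load_split` and the sample row are hypothetical and made up for illustration; the column names and the 'eng' language code are taken from this description.

```python
import csv
import io

# Column names as listed in this description.
EXPECTED_COLUMNS = [
    "id", "lang", "question", "answerText",
    "category", "complexityType", "questionEntity", "answerEntity",
]

def load_split(handle):
    """Read one split from an open text handle into a list of dicts.

    Raises ValueError if any expected column is missing from the header.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# Made-up sample content for illustration only; not taken from the dataset.
sample = io.StringIO(
    "id,lang,question,answerText,category,complexityType,"
    "questionEntity,answerEntity\n"
    "x1,eng,Example question?,Example answer,Who,simple,"
    "example entity,example entity\n"
)
rows = load_split(sample)
print(rows[0]["question"])  # Example question?
```

With a real file, the same helper could be called as `load_split(open("train.csv", newline="", encoding="utf-8"))`; the UTF-8 encoding is an assumption, not something the listing states.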
The dataset is organised into train.csv, dev.csv, and test.csv files for different splits of the data.
Distribution
The dataset is typically provided in CSV format and is partitioned into train, development (dev), and test splits for each of the supported languages. For instance, the test.csv file contains 4000 unique question and answer records. While specific file sizes for all splits are not detailed, the structure supports distinct sets for model training, development, and evaluation. The dataset is described as having a global region coverage.
Usage
This dataset is suited for a variety of applications and use cases:
- Developing machine learning pipelines that answer natural language questions accurately and with attention to context across the 9 supported languages.
- Designing complex question-answering applications that can be deployed on diverse language platforms to offer multilingual support for natural language queries.
- Extending Natural Language Processing (NLP) systems to handle complex questions and answers in multiple languages.
- Training advanced end-to-end models in multilingual, complex natural language question answering systems.
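A common first step for the use cases above is to see how records are distributed across languages and complexity types, and to pull out a single language. The sketch below assumes rows have already been read into dictionaries keyed by the column names; the helper names `count_by` and `filter_language`, the rows themselves, the "simple" and "complex" values, and the placeholder code "xx" are all hypothetical and only for illustration.

```python
from collections import Counter

def count_by(rows, *keys):
    """Count records for each combination of the given column values."""
    return Counter(tuple(row[key] for key in keys) for row in rows)

def filter_language(rows, lang):
    """Keep only the records whose 'lang' column equals the given code."""
    return [row for row in rows if row["lang"] == lang]

# Made-up rows for illustration; "xx" is a placeholder language code and
# "simple"/"complex" are assumed values for complexityType.
rows = [
    {"lang": "eng", "complexityType": "simple"},
    {"lang": "eng", "complexityType": "complex"},
    {"lang": "xx", "complexityType": "simple"},
]

counts = count_by(rows, "lang", "complexityType")
print(counts[("eng", "simple")])  # 1

english = filter_language(rows, "eng")
print(len(english))  # 2
```

Counting by `lang` alone (`count_by(rows, "lang")`) gives a quick check that each language is present in a split before training or evaluation.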
Coverage
The dataset covers natural language questions and answers across 9 languages: English, Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Its scope is global, without specific notes on geographic limitations or demographic groups. The data quality is assessed at 5 out of 5.
License
CC0
Who Can Use It
This dataset is intended for a broad range of users and applications:
- Data scientists and machine learning engineers working on multilingual Natural Language Processing (NLP) models.
- Researchers focused on advancements in natural language understanding, question answering systems, and cross-lingual AI.
- Developers building and deploying multi-language conversational AI agents, chatbots, or virtual assistants.
- Academics exploring linguistic complexity, cross-lingual information retrieval, and the nuances of human language across different cultures.
Dataset Name Suggestions
- Multilingual Complex QA
- Mintaka Multilingual QA
- Global NLP Question Answer Set
- Cross-Lingual Question & Answer Data
- AmazonScience Multilingual Q&A
Attributes
Original Data Source: Mintaka by AmazonScience (Multilingual Q&A)