Cross-Lingual Question & Answer Data

Software and Technology

Tags and Keywords

Computer, Science, Programming, NLP, Languages

Free

About

The Mintaka dataset by AmazonScience supports the development of end-to-end models for multilingual, complex natural language question answering. It includes the English original along with 8 additional languages (Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish), for a total of 9 languages. The dataset provides train, dev, and test splits for each language, so models can be developed and evaluated per language. Each data point carries a question category, a complexity type, and the entities related to both the question and the answer, which makes it possible to filter the data or break down results by these attributes.

Columns

The dataset includes the following columns:

id: A unique identifier for each question and answer entry.
lang: The language of the question, typically represented by a language code (e.g., 'eng' for English).
question: The natural language question posed in the respective language.
answerText: The textual answer corresponding to the question.
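category: The category of the question. In the original Mintaka release this is a topic, such as movies, music, or sports.
complexityType: The kind of reasoning the question requires. The original Mintaka release uses labels such as count, comparative, superlative, ordinal, and multi-hop.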
questionEntity: A brief description or definition of the entities related to the question.
answerEntity: A brief description or definition of the entities related to the answer.

The dataset is organised into train.csv, dev.csv, and test.csv files for different splits of the data.
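As a minimal sketch of how the columns above might be read, the following uses Python's standard csv module. The QARecord class, the read_records helper, and the sample row are hypothetical names and values introduced for illustration; they are not part of the dataset or of any official loader.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO

# Column names as listed in the Columns section of this listing.
COLUMNS = (
    "id", "lang", "question", "answerText",
    "category", "complexityType", "questionEntity", "answerEntity",
)


@dataclass(frozen=True)
class QARecord:
    """One row of a split file (hypothetical container, not an official type)."""
    id: str
    lang: str
    question: str
    answerText: str
    category: str
    complexityType: str
    questionEntity: str
    answerEntity: str


def read_records(handle: TextIO) -> Iterator[QARecord]:
    """Yield one QARecord per CSV row; fail early if a listed column is missing."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield QARecord(**{name: row[name] for name in COLUMNS})


# Made-up sample row for illustration only; it is not taken from the dataset.
SAMPLE = io.StringIO(
    "id,lang,question,answerText,category,complexityType,questionEntity,answerEntity\n"
    "ex-1,eng,Example question?,Example answer,example-category,example-type,,\n"
)
records = list(read_records(SAMPLE))
```

For a real split, the same helper could be given a file opened with open("train.csv", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption, since the listing does not state it.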

Distribution

The dataset is provided in CSV format and is partitioned into train, development (dev), and test splits for each of the supported languages. The test.csv file, for instance, contains 4000 unique question and answer records. File sizes for the individual splits are not detailed. Region coverage is listed as global.
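A quick sanity check on a downloaded split could count records per language and look for repeated entries. The sketch below is one way to do that; summarize_split is a hypothetical helper, and treating the pair (id, lang) as the uniqueness key is an assumption, since the listing does not say whether an id is shared across languages.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def summarize_split(rows: Iterable[Mapping[str, str]]) -> Dict[str, object]:
    """Count rows overall and per language, and list repeated (id, lang) pairs."""
    per_lang: Counter = Counter()
    seen: Counter = Counter()
    total = 0
    for row in rows:
        total += 1
        per_lang[row["lang"]] += 1
        # Assumption: an entry is identified by its id together with its language.
        seen[(row["id"], row["lang"])] += 1
    duplicates = sorted(key for key, count in seen.items() if count > 1)
    return {"total": total, "per_lang": dict(per_lang), "duplicates": duplicates}


# Made-up rows for illustration only; real rows would come from csv.DictReader.
demo_rows = [
    {"id": "a", "lang": "eng"},
    {"id": "a", "lang": "fra"},
    {"id": "a", "lang": "eng"},
]
summary = summarize_split(demo_rows)
```

Running the same function over test.csv would let the stated record count be compared against summary["total"].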

Usage

This dataset is suited for a variety of applications and use cases:

Developing machine learning pipelines that answer natural language questions accurately and in context across the 9 languages.
Designing complex question-answering applications that offer multilingual support for natural language queries on diverse language platforms.
Improving Natural Language Processing (NLP) techniques by exposing systems to complex questions and answers in several languages.
Training end-to-end models for multilingual, complex natural language question answering.
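For the evaluation side of these use cases, one simple option is to score predictions by exact match and break the result down by language or complexity type. The sketch below assumes this metric for illustration; the listing does not name an official metric, and normalize, exact_match_by, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(text.casefold().split())


def exact_match_by(
    rows: Iterable[Mapping[str, str]],
    predictions: Mapping[Tuple[str, str], str],
    field: str = "lang",
) -> Dict[str, float]:
    """Return exact-match accuracy grouped by the given column (e.g. lang)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        key = row[field]
        totals[key] += 1
        # A missing prediction is scored as an empty answer.
        predicted = predictions.get((row["id"], row["lang"]), "")
        if normalize(predicted) == normalize(row["answerText"]):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up gold rows and predictions for illustration only.
gold = [
    {"id": "1", "lang": "eng", "answerText": "Example Answer"},
    {"id": "2", "lang": "eng", "answerText": "Another answer"},
    {"id": "1", "lang": "fra", "answerText": "Exemple"},
]
preds = {("1", "eng"): "example  answer", ("2", "eng"): "wrong", ("1", "fra"): "Exemple"}
scores = exact_match_by(gold, preds, field="lang")
```

Passing field="complexityType" or field="category" would group the same scores by those columns instead.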

Coverage

The dataset covers natural language questions and answers across 9 languages: English, Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Its scope is global, with no stated geographic or demographic limitations. The data quality is rated 5 out of 5.

License

CC0

Who Can Use It

This dataset is intended for a broad range of users and applications:

Data scientists and machine learning engineers working on multilingual Natural Language Processing (NLP) models.
Researchers focused on advancements in natural language understanding, question answering systems, and cross-lingual AI.
Developers building and deploying multi-language conversational AI agents, chatbots, or virtual assistants.
Academics exploring linguistic complexity, cross-lingual information retrieval, and the nuances of human language across different cultures.

Dataset Name Suggestions

Multilingual Complex QA
Mintaka Multilingual QA
Global NLP Question Answer Set
Cross-Lingual Question & Answer Data
AmazonScience Multilingual Q&A

Attributes

Original Data Source: Mintaka by AmazonScience (Multilingual Q&A)

Listing Stats

Listed: 26/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0