TyDi QA Extension Dataset

Tags and Keywords

Earth

Nature

NLP

Data

Type

Free

About

This dataset, known as Answerable-TyDiQA, extends the GoldP subtask of the original TyDi QA dataset. It is intended for training and evaluating natural language processing models, particularly question answering systems. The collection comprises question-answer pairs extracted from the Tashkeela Giclée Web Corpus. Researchers, developers, and data scientists can use it for language engineering and AI research across multiple languages.

Columns

The dataset typically includes the following columns:
  • question_text: The text of the question. (String)
  • document_title: The title of the document associated with the question. (String)
  • language: The language in which the question is posed. (String)
  • annotations: The annotations attached to the question. (String)
  • document_plaintext: The plain text content of the document linked to the question. (String)
  • document_url: The URL of the document associated with the question. (String)
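
As a minimal sketch of how the columns above might be checked when reading the file, the following uses Python's standard csv module. The helper name read_rows and the sample row are illustrative assumptions, not part of the dataset; the sample values are made up and only stand in for a real file such as train.csv.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "question_text",
    "document_title",
    "language",
    "annotations",
    "document_plaintext",
    "document_url",
]


def read_rows(handle):
    """Read CSV rows as dicts; fail early if a listed column is missing.

    `read_rows` is a hypothetical helper written for this example.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny made-up sample standing in for train.csv (illustrative values only).
SAMPLE = io.StringIO(
    "question_text,document_title,language,annotations,"
    "document_plaintext,document_url\n"
    "What is X?,Title A,finnish,{},Some text,https://example.org/a\n"
)

rows = read_rows(SAMPLE)
print(rows[0]["language"])  # prints "finnish"
```

For a real file, the same function could be called on a handle opened with open("train.csv", newline="", encoding="utf-8"). Because document_plaintext can hold long documents, the csv module's default field size limit may need to be raised with csv.field_size_limit.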

Distribution

The dataset is typically provided in CSV format, for example as a train.csv file containing the question-answer pairs. The total row count is not stated, but the listing reports 57,645 unique plain text documents and 35,185 unique document titles. By language, Arabic accounts for approximately 26% of the dataset, Finnish for 12%, and all other languages together for about 63% (the figures are rounded, so they sum to slightly more than 100%).
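
The figures above (unique values per column and language shares) can be recomputed from the rows once the file is loaded. The sketch below assumes the rows are available as a list of dictionaries keyed by column name; the function names and the four sample rows are hypothetical and do not reflect real dataset values.

```python
from collections import Counter


def language_shares(rows):
    """Return each language's share of rows as a percentage (1 decimal)."""
    counts = Counter(row["language"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


def unique_count(rows, column):
    """Count distinct values in one column."""
    return len({row[column] for row in rows})


# Made-up rows for illustration only; real language labels may differ.
sample_rows = [
    {"language": "arabic", "document_title": "A"},
    {"language": "arabic", "document_title": "B"},
    {"language": "finnish", "document_title": "A"},
    {"language": "other", "document_title": "C"},
]

print(language_shares(sample_rows))
# {'arabic': 50.0, 'finnish': 25.0, 'other': 25.0}
print(unique_count(sample_rows, "document_title"))
# 3
```

Rounding each share independently is also why published percentages, such as those in the listing, need not sum to exactly 100.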

Usage

This dataset is ideal for various applications, including:
  • AI-based question answering systems: It can be used to train and test AI models to understand question formatting, language usage, and potential answer identification.
  • Natural language processing research: Researchers can leverage this data to identify language usage trends and extract insights for developing advanced applications such as sentiment analysis or machine translation.
  • Search engine optimisation (SEO): Businesses can utilise the dataset to craft content based on commonly asked questions and answers, potentially improving their organic search engine rankings.

Coverage

The dataset has a global scope, drawing from a variety of linguistic sources. No specific time range or demographic scope is provided, but its multi-language nature suggests broad applicability.

License

CC0

Who Can Use It

The dataset is primarily intended for:
  • AI researchers: For exploring and gaining insights into AI language understanding.
  • Language engineers: For developing and refining language-related technologies.
  • NLP enthusiasts: For experimenting with question answering, information extraction, and text summarisation tasks.

Dataset Name Suggestions

  • Multi-Lingual QA Pairs
  • Answerable NLP Corpus
  • Global Question-Answer Dataset
  • TyDi QA Extension Dataset
  • Web Corpus Q&A

Attributes

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

21/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0