TyDi QA Extension Dataset
About
This dataset, known as Answerable-TyDiQA, extends the GoldP subtask of the original TyDi QA. It is intended for training and evaluating natural language processing models, in particular question answering systems. According to the listing, its question-answer pairs were extracted from the Tashkeela Giclée Web Corpus. The collection is aimed at researchers, developers, and data scientists working in language engineering and AI research.
Columns
The dataset typically includes the following columns:
- question_text: This column contains the actual text of the questions asked. (String)
- document_title: This column provides the title of the document associated with each question. (String)
- language: This column indicates the language in which the question is posed. (String)
- annotations: This column holds annotations pertinent to the question. (String)
- document_plaintext: This column includes the plain text content of the document linked to the question. (String)
- document_url: This column provides the URL of the document associated with the question. (String)
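As a minimal sketch of how the column list above might be checked when loading the data, the following Python reads CSV rows with the standard library and fails early if a listed column is absent. The helper name read_rows and the sample row are hypothetical and for illustration only; the real file contents may differ.

```python
import csv
import io

# Column names as given in the Columns list of this listing.
EXPECTED_COLUMNS = [
    "question_text",
    "document_title",
    "language",
    "annotations",
    "document_plaintext",
    "document_url",
]


def read_rows(handle):
    """Read CSV rows as dicts; raise ValueError if a listed column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample standing in for the real file (hypothetical values).
SAMPLE = io.StringIO(
    "question_text,document_title,language,annotations,"
    "document_plaintext,document_url\n"
    'Where is Helsinki?,Helsinki,finnish,"{}",'
    "Helsinki is the capital of Finland.,https://example.org/helsinki\n"
)

rows = read_rows(SAMPLE)
print(rows[0]["question_text"])
```

For a file on disk, the same function could be called as read_rows(open("train.csv", newline="", encoding="utf-8")), assuming the file is UTF-8 encoded.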
Distribution
The dataset is typically provided in CSV format, with a file such as train.csv containing the question-answer pairs. Total row counts are not stated, but the listing reports 57,645 unique plain text documents and 35,185 unique document titles. By language, Arabic accounts for approximately 26% of the dataset, Finnish for 12%, and all other languages together for about 63% (these are rounded figures, so they sum to slightly more than 100%).
Usage
This dataset is ideal for various applications, including:
- AI-based question answering systems: It can be used to train and test AI models to understand question formatting, language usage, and potential answer identification.
- Natural language processing research: Researchers can leverage this data to identify language usage trends and extract insights for developing advanced applications such as sentiment analysis or machine translation.
- Search engine optimisation (SEO): Businesses can utilise the dataset to craft content based on commonly asked questions and answers, potentially improving their organic search engine rankings.
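For the question answering use case, each row has to be turned into a question, a context passage, and an answer. The sketch below shows one way this might look in Python. It rests on an assumption not stated in this listing: that the annotations string is a Python-style dict literal with answer_start and answer_text keys, as in the GoldP format of TyDi QA. The function names are hypothetical, and rows that do not parse simply yield no answer.

```python
import ast


def parse_annotations(raw):
    """Return (answer_start, answer_text), or None if the string does not parse.

    ASSUMPTION: raw looks like
    "{'answer_start': [0], 'answer_text': ['Helsinki']}".
    The listing only says the column is a string, so verify this on real data.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(value, dict):
        return None
    start = value.get("answer_start")
    text = value.get("answer_text")
    # Accept either a single value or a list holding the first answer.
    if isinstance(start, (list, tuple)):
        start = start[0] if start else None
    if isinstance(text, (list, tuple)):
        text = text[0] if text else None
    if start is None or text is None:
        return None
    return start, text


def to_qa_example(row):
    """Turn one CSV row (a dict) into a question/context/answer record."""
    parsed = parse_annotations(row["annotations"])
    return {
        "question": row["question_text"],
        "context": row["document_plaintext"],
        "language": row["language"],
        "answer_start": parsed[0] if parsed else None,
        "answer": parsed[1] if parsed else None,
    }


# Made-up row for illustration (hypothetical values).
example_row = {
    "question_text": "What is the capital of Finland?",
    "document_title": "Helsinki",
    "language": "finnish",
    "annotations": "{'answer_start': [0], 'answer_text': ['Helsinki']}",
    "document_plaintext": "Helsinki is the capital of Finland.",
    "document_url": "https://example.org/helsinki",
}

example = to_qa_example(example_row)
print(example["answer"], example["answer_start"])
```

Returning None instead of raising keeps a training pipeline running when a row has an unexpected annotation format; whether such rows should be dropped or inspected is a choice for the user.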
Coverage
The dataset has a global scope, drawing from a variety of linguistic sources. No specific time range or demographic scope is provided, but its multi-language nature suggests broad applicability.
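The language shares and unique-value counts quoted under Distribution can be recomputed from the data. A minimal Python sketch, assuming the rows are dicts keyed by the column names listed under Columns (for example, as produced by csv.DictReader); the function names and sample rows are hypothetical:

```python
from collections import Counter


def language_shares(rows):
    """Return {language: percentage of rows}, largest share first."""
    counts = Counter(row["language"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}


def unique_count(rows, column):
    """Count the distinct values found in one column."""
    return len({row[column] for row in rows})


# Made-up rows for illustration (hypothetical values).
sample_rows = [
    {"language": "arabic", "document_title": "A"},
    {"language": "arabic", "document_title": "B"},
    {"language": "finnish", "document_title": "A"},
    {"language": "english", "document_title": "C"},
]

shares = language_shares(sample_rows)
titles = unique_count(sample_rows, "document_title")
print(shares, titles)
```

Computing shares from unrounded counts avoids the small discrepancy that appears when rounded percentages are added together.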
License
CC0
Who Can Use It
The dataset is primarily intended for:
- AI researchers: For exploring and gaining insights into AI language understanding.
- Language engineers: For developing and refining language-related technologies.
- NLP enthusiasts: For experimenting with question answering, information extraction, and text summarisation tasks.
Dataset Name Suggestions
- Multi-Lingual QA Pairs
- Answerable NLP Corpus
- Global Question-Answer Dataset
- TyDi QA Extension Dataset
- Web Corpus Q&A
Attributes
Original Data Source: TyDi QA (Questions & Answers in 11 Languages)