Question Answering Text Spans Dataset
Education & Learning Analytics
Free
About
This is a reading comprehension dataset designed to assess a machine's ability to answer questions by extracting information from text [1, 2]. It comprises questions posed by crowdworkers on a collection of Wikipedia articles [1, 2]. The answer to each question is a specific segment of text, or span, taken directly from the corresponding reading passage [1, 2]. The dataset is a resource for developing and testing natural language processing models that understand and respond to text-based queries [2].
Columns
- title: The title of the Wikipedia article from which the context is drawn. (String) [1, 3]
- context: The reading passage, taken from the Wikipedia article, that the question is asked about and from which the answer is extracted. (String) [1, 3]
- question: The query formulated by a crowdworker, requiring an answer from the context. (String) [1, 3]
- answers: The answer to the question, presented as a list of text spans extracted from the context. (List of strings) [1, 3]
- id: A unique identifier for each individual question-answer pair within the dataset. (Integer) [3]
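The column layout above can be illustrated with a minimal Python sketch that reads rows into a small record type. It is not an official loader. The `QARow` class, the `parse_answers` and `read_rows` helpers, and the in-memory sample row are all hypothetical. The sketch also assumes that the answers column is stored as a list literal such as `['some span']`, which the listing does not confirm, and it keeps id as text because the on-disk representation is not confirmed either.

```python
import ast
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class QARow:
    """One question-answer pair, mirroring the columns listed above."""
    id: str  # kept as text; the on-disk type is not confirmed
    title: str
    context: str
    question: str
    answers: List[str]


def parse_answers(cell: str) -> List[str]:
    """Turn the raw answers cell into a list of strings.

    ASSUMPTION: the cell holds a Python-style list literal. If it does
    not parse as a literal, the whole cell is treated as one answer.
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def read_rows(handle) -> List[QARow]:
    """Read every CSV row from an open text handle into QARow records."""
    reader = csv.DictReader(handle)
    return [
        QARow(
            id=r["id"],
            title=r["title"],
            context=r["context"],
            question=r["question"],
            answers=parse_answers(r["answers"]),
        )
        for r in reader
    ]


# Invented one-row sample standing in for the real file.
SAMPLE_CSV = '''id,title,context,question,answers
1,Example_Article,"The river flows north into the lake.","Where does the river flow?","['into the lake']"
'''

rows = read_rows(io.StringIO(SAMPLE_CSV))
unique_ids = {row.id for row in rows}
```

To try this against the real split, pass an open file handle (for example `open("train.csv", newline="", encoding="utf-8")`) to `read_rows` instead of the in-memory sample, and adjust `parse_answers` if the answers column turns out to be stored differently.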
Distribution
The dataset is typically structured in a CSV file format, with train.csv and validation.csv splits available [1, 2, 4]. Each row within these files represents a single question-answer pair, providing a clear and structured layout for analysis [2]. While specific total row counts for the entire dataset are not explicitly provided, the train.csv split alone contains 87,355 unique IDs, indicating a substantial volume of data [3].
Usage
This dataset is ideal for a variety of applications and use cases, including:
- Developing Reading Comprehension models capable of answering open-ended questions based on provided text passages [2].
- Training models to extract specific text spans as answers for multiple-choice questions from source materials [2].
- Building systems that generate large training datasets for reading comprehension by creating synthetic questions from existing passages [2].
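Because every answer is a span copied from its passage, the span-extraction use cases above can be sketched in a few lines of Python. The `find_span` and `exact_match` helpers below are hypothetical names. The matching rule (lowercasing and collapsing whitespace) is a simplified assumption and not the scoring method of any particular benchmark.

```python
from typing import List, Optional, Tuple


def find_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of
    answer in context, or None if the answer text does not occur."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (simplified assumption)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the prediction equals any reference after normalization."""
    return _normalize(prediction) in {_normalize(r) for r in references}


# Invented example passage and answer.
context = "The river flows north into the lake."
answer = "into the lake"

span = find_span(context, answer)
if span is not None:
    start, end = span
    recovered = context[start:end]  # slicing by offsets gives back the answer
```

A model that predicts start and end offsets can be checked the same way: slice the context with the predicted offsets and compare the result to the reference answers with `exact_match`.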
Coverage
The dataset's coverage is global, drawing contexts from a wide array of Wikipedia articles [2, 5]. It does not specify a particular time range or demographic scope, as its content is derived from the broad and diverse nature of Wikipedia.
License
CC0
Who Can Use It
This dataset is intended for:
- Researchers and developers in the fields of Artificial Intelligence and Machine Learning, particularly those focusing on Natural Language Processing (NLP) [2].
- Educators and analysts interested in education and learning analytics, as it supports the development of tools for understanding textual information [2].
- Individuals and teams working on text mining and text pre-processing tasks, where the ability to extract precise information from text is crucial [2].
Dataset Name Suggestions
- Stanford Question Answering Dataset (SQuAD) [2]
- Reading Comprehension Wikipedia Dataset
- Question Answering Text Spans
- Crowdworker QA Dataset
Attributes
Original Data Source: Stanford Question Answering Dataset (SQuAD)