Opendatabay APP

Question Answering Text Spans Dataset

Education & Learning Analytics

Tags and Keywords

Education

Text

NLP

Free

About

This reading comprehension dataset is designed to assess a machine's ability to answer questions by extracting information from text [1, 2]. The questions were posed by crowdworkers on a collection of Wikipedia articles [1, 2], and each answer is a segment of text, or span, taken directly from the corresponding reading passage [1, 2]. The dataset is a resource for developing and testing natural language processing models that understand and respond to text-based queries [2].

Columns

  • title: The title of the Wikipedia article from which the context is drawn. (String) [1, 3]
  • context: The reading passage, taken from the Wikipedia article, from which the answer is extracted. (String) [1, 3]
  • question: The query formulated by a crowdworker, requiring an answer from the context. (String) [1, 3]
  • answers: The answer to the question, presented as a list of text spans extracted from the context. (List of strings) [1, 3]
  • id: A unique identifier for each individual question-answer pair within the dataset. (Integer) [3]
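As a rough illustration of the column layout above, one row can be modeled as a small record type. This is a minimal sketch, not part of the dataset's tooling: the class name QARecord, the helper spans_in_context, and the example values are all hypothetical, and the check simply tests the stated property that answers are spans taken from the context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QARecord:
    """One row of the dataset, mirroring the five listed columns."""
    title: str          # Wikipedia article title
    context: str        # text the answers are drawn from
    question: str       # crowdworker question
    answers: List[str]  # one or more answer spans
    id: int             # unique identifier (listed as Integer)

    def spans_in_context(self) -> bool:
        """Return True if every answer is a literal substring of the context."""
        return all(answer in self.context for answer in self.answers)


# Hypothetical record, invented for illustration only.
example = QARecord(
    title="Example_Article",
    context="The Nile flows north through Egypt into the Mediterranean Sea.",
    question="Into which sea does the Nile flow?",
    answers=["the Mediterranean Sea", "Mediterranean Sea"],
    id=1,
)

print(example.spans_in_context())
```

A check like this can flag rows whose answers were altered by whitespace or encoding changes during export.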

Distribution

The dataset is typically provided in CSV format, with train.csv and validation.csv splits available [1, 2, 4]. Each row represents a single question-answer pair [2]. The total row count across both splits is not stated, but the train.csv split alone contains 87,355 unique IDs [3].
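The row and ID counts described here can be checked with the standard library alone. The sketch below is an assumption-laden example: the in-memory sample stands in for train.csv, its column order and rows are invented, and summarize_split is a hypothetical helper name. It does not parse the answers column, because the listing does not say how the list of spans is serialized in the CSV.

```python
import csv
import io

# Stand-in for train.csv; the column order and both rows are assumptions.
SAMPLE_CSV = """title,context,question,answers,id
Example_Article,"The Nile flows north through Egypt.",Which way does the Nile flow?,north,1
Example_Article,"The Nile flows north through Egypt.",Which country does the Nile cross?,Egypt,2
"""


def summarize_split(handle):
    """Count rows and distinct ids in one CSV split."""
    rows = 0
    ids = set()
    for row in csv.DictReader(handle):
        rows += 1
        ids.add(row["id"])
    return rows, len(ids)


row_count, unique_ids = summarize_split(io.StringIO(SAMPLE_CSV))
print(row_count, unique_ids)
```

For the real files, the same function could be given a handle opened with open("train.csv", newline="", encoding="utf-8"); the encoding is an assumption. If the two counts differ, some IDs are repeated across rows.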

Usage

This dataset is ideal for a variety of applications and use cases, including:
  • Developing Reading Comprehension models capable of answering open-ended questions based on provided text passages [2].
  • Training models to extract specific text spans as answers for multiple-choice questions from source materials [2].
  • Building systems that generate large training datasets for reading comprehension by creating synthetic questions from existing passages [2].
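For the span-extraction use cases above, models are commonly trained on start and end positions rather than on raw answer strings. Since the listing describes only the answer text, one possible way to derive character offsets is a plain substring search. The function names locate_span and locate_all and the sample passage are hypothetical, and the first match may not be the intended one when the answer text occurs more than once in the context.

```python
from typing import List, Optional, Tuple


def locate_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first match, or None."""
    if not answer:
        return None
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


def locate_all(context: str, answers: List[str]) -> List[Optional[Tuple[int, int]]]:
    """Locate each answer independently; unmatched answers map to None."""
    return [locate_span(context, answer) for answer in answers]


# Hypothetical passage and answers, invented for illustration only.
context = "The Nile flows north through Egypt into the Mediterranean Sea."
spans = locate_all(context, ["Egypt", "Red Sea"])
print(spans)
```

Slicing the context with a returned pair, as in context[start:end], gives back the answer text, which makes the offsets easy to verify.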

Coverage

The dataset's coverage is global, drawing contexts from a wide array of Wikipedia articles [2, 5]. No particular time range or demographic scope is specified, since the content is drawn from Wikipedia's broad and diverse range of articles.

License

CC0

Who Can Use It

This dataset is intended for:
  • Researchers and developers in the fields of Artificial Intelligence and Machine Learning, particularly those focusing on Natural Language Processing (NLP) [2].
  • Educators and analysts interested in education and learning analytics, as it supports the development of tools for understanding textual information [2].
  • Individuals and teams working on text mining and text pre-processing tasks, where the ability to extract precise information from text is crucial [2].

Dataset Name Suggestions

  • Stanford Question Answering Dataset (SQuAD) [2]
  • Reading Comprehension Wikipedia Dataset
  • Question Answering Text Spans
  • Crowdworker QA Dataset

Attributes

Listing Stats

LISTED

17/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
