Opendatabay APP

NLP Coreference Reasoning Dataset

Data Science and Analytics

Tags and Keywords

Text

Nlp

Mining

Classification

Pre-processing

Free

About

This dataset, known as Quoref, is designed to evaluate the coreferential reasoning abilities of reading comprehension systems. It consists of 24,000 questions based on 4,700 paragraphs extracted from Wikipedia pages. To answer a question, a system must first resolve complex coreferences and only then select the relevant span(s) within the paragraph. Systems can therefore return not only an answer but also the supporting evidence from the given context.

Columns

The dataset is distributed as CSV files such as train.csv and validation.csv, with the following columns:
  • id: An identifier for each question. (String)
  • question: The full text of the question. (String)
  • context: The paragraph or paragraphs serving as context for the question. (String)
  • title: The title of the Wikipedia page from which the context was sourced. (String)
  • url: The URL of the original Wikipedia page. (String)
  • answers: The designated answer span or spans for the question. (List of strings)
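A minimal sketch of reading one split with Python's standard csv module and checking for the question, context, title, url, and answers columns. The listing does not say how the answers list is serialized inside the CSV, so the parser below assumes a Python-style list literal such as "['Ann']" and falls back to treating the cell as a single answer. The names parse_answers, load_rows, and SAMPLE are illustrative, and the sample row is made up.

```python
import ast
import csv
import io

EXPECTED_COLUMNS = ["question", "context", "title", "url", "answers"]


def parse_answers(cell):
    """Turn a serialized answers cell into a list of strings.

    Assumption: the cell holds a Python-style list literal. If it does not
    parse as one, the raw text is kept as a single answer.
    """
    if cell is None or cell == "":
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return [str(cell)]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def load_rows(handle):
    """Read all rows from an open CSV handle and parse the answers column."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["answers"] = parse_answers(row["answers"])
        rows.append(row)
    return rows


# Made-up one-row sample standing in for train.csv.
SAMPLE = '''question,context,title,url,answers
"Who left?","Ann met Bob. She left.","Demo page","https://example.org/demo","['Ann']"
'''

rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["question"], rows[0]["answers"])
```

For the real file, the same function can be called on a handle opened with open("train.csv", newline="", encoding="utf-8").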

Distribution

The dataset files are in CSV format. The listing cites 24,000 questions spanning 4,700 distinct Wikipedia paragraphs for train.csv, but the reported unique-value counts are lower: 19,399 for 'id', 19,375 for 'question', 3,771 for 'context', 2,146 for 'title', 2,146 for 'url', and 13,841 for 'answers'. The larger totals therefore most likely describe the dataset as a whole rather than train.csv alone. Exact row counts for the individual files are not provided.
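The unique-value counts above can be checked against a downloaded file. The sketch below counts distinct non-empty values per column using only the standard library and reports any column that differs from the listing. It assumes the counts refer to raw cell text; unique_counts, compare_with_listing, and the two-row SAMPLE are illustrative names and made-up data.

```python
import csv
import io

# Unique-value counts as reported in the listing.
REPORTED_UNIQUE = {
    "id": 19399,
    "question": 19375,
    "context": 3771,
    "title": 2146,
    "url": 2146,
    "answers": 13841,
}


def unique_counts(handle):
    """Count distinct non-empty raw values in every column of a CSV handle."""
    reader = csv.DictReader(handle)
    seen = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            if name in seen and value not in (None, ""):
                seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}


def compare_with_listing(counts):
    """Return {column: (observed, reported)} for columns that differ."""
    return {
        name: (counts.get(name), reported)
        for name, reported in REPORTED_UNIQUE.items()
        if counts.get(name) != reported
    }


# Made-up two-row sample standing in for train.csv.
SAMPLE = '''id,question,context,title,url,answers
q1,"Who left?","Ann met Bob. She left.","Demo","https://example.org/demo","['Ann']"
q2,"Who stayed?","Ann met Bob. She left.","Demo","https://example.org/demo","['Bob']"
'''

counts = unique_counts(io.StringIO(SAMPLE))
print(counts)
print(compare_with_listing(counts))
```

On the real train.csv, an empty dictionary from compare_with_listing would indicate that the file matches the reported figures.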

Usage

This dataset is ideal for testing the coreferential reasoning capability of reading comprehension systems. It can be utilised for research in which a system must resolve hard coreferences before selecting the appropriate span(s) in a paragraph to answer a question. It is particularly valuable for developing and evaluating advanced Natural Language Processing (NLP) models.
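The listing does not name an evaluation metric. As one possible way to score a system's predicted span against the answers column, the sketch below uses exact match and token-level F1, which are common for span-selection question answering. It treats each listed span as an acceptable alternative, which is an assumption; questions whose spans are all jointly required would need a set-based comparison instead. All function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    target = normalize(prediction)
    return float(any(target == normalize(gold) for gold in gold_answers))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall for one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, gold_answers):
    """Best token F1 over the gold answers (assumes they are alternatives)."""
    return max((token_f1(prediction, gold) for gold in gold_answers), default=0.0)


print(exact_match("The Ann", ["Ann"]))
print(best_f1("Ann Smith", ["Ann Jones", "Bob"]))
```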

Coverage

The dataset's content is derived from Wikipedia pages, providing a broad and varied textual scope. Its listed region is Global, implying its applicability is not limited to specific geographical areas. While a direct time range for the data collection isn't specified, the dataset was listed on 26/06/2025.

License

The dataset is available under a CC0 license.

Who Can Use It

This dataset is suitable for:
  • Data scientists and AI/ML researchers developing and assessing reading comprehension systems.
  • Natural Language Processing (NLP) practitioners working on coreference resolution tasks.
  • Academics and students engaged in research on AI reasoning and text understanding.
  • Developers looking to build more sophisticated question-answering systems that can handle complex linguistic phenomena.

Dataset Name Suggestions

  • Quoref Coreference QA
  • Wikipedia Coreference Question Answering Dataset
  • Reading Comprehension Coreference Test Set
  • NLP Coreference Reasoning Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 26/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
