NLP Coreference Reasoning Dataset
Data Science and Analytics
Free
About
This dataset, known as Quoref, is designed to evaluate the coreferential reasoning abilities of reading comprehension systems. It consists of 24,000 questions based on 4,700 paragraphs extracted from Wikipedia pages. The dataset's primary purpose is to challenge systems to resolve complex coreferences before they can accurately select the relevant span(s) within the paragraphs to answer questions. It enables systems not only to provide answers but also to offer supporting evidence from the given context.
Columns
The dataset is typically structured in files such as train.csv and validation.csv, featuring the following columns:
- question: The full text of the question. (String)
- context: The paragraph or paragraphs serving as context for the question. (String)
- title: The title of the Wikipedia page from which the context was sourced. (String)
- url: The URL of the original Wikipedia page. (String)
- answers: The designated answer span or spans for the question. (List of strings)
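As a minimal sketch of reading these columns, the snippet below parses a tiny in-memory sample with the same header. The sample rows are invented for illustration, and the serialization of the answers column is an assumption: the listing only says it is a list of strings, so the code assumes a Python-style list literal such as "['Mary Poe']" and falls back to the raw cell text otherwise.

```python
import ast
import csv
import io

# Hypothetical two-row sample standing in for train.csv; real rows will differ.
SAMPLE_CSV = '''question,context,title,url,answers
"Who did the captain's brother marry?","Captain Ann Lee had a brother, Tom. He married Mary Poe.","Example Page","https://en.wikipedia.org/wiki/Example","['Mary Poe']"
"Who is the captain's brother?","Captain Ann Lee had a brother, Tom. He married Mary Poe.","Example Page","https://en.wikipedia.org/wiki/Example","Tom"
'''


def parse_answers(raw):
    """Turn a serialized answers cell into a list of strings.

    Assumption: the cell holds a Python-style list literal. Anything that
    does not parse as a list or tuple is kept as a single raw answer.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [raw]


def load_rows(handle):
    """Read rows from an open CSV handle and decode the answers column."""
    rows = []
    for row in csv.DictReader(handle):
        row["answers"] = parse_answers(row["answers"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["question"], "->", row["answers"])
```

To read the real file, pass an open handle instead, for example `open("train.csv", newline="", encoding="utf-8")`, and check a few cells first to confirm how answers are actually stored.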
Distribution
The dataset files are usually in CSV format. In total, Quoref contains about 24,000 questions spanning about 4,700 distinct paragraphs from Wikipedia. Specific row counts are not provided for every file, but train.csv holds most of the data. Unique values reported for its key columns are 19,399 for 'id', 19,375 for 'question', 3,771 for 'context', 2,146 for 'title', 2,146 for 'url', and 13,841 for 'answers'.
Usage
This dataset is ideal for testing the coreferential reasoning capability of reading comprehension systems. It suits research in which a system must resolve hard coreferences before selecting the appropriate span(s) in a paragraph to answer a question. It is particularly valuable for developing and evaluating advanced Natural Language Processing (NLP) models.
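One way to score a system's predicted spans against the answers column is sketched below, using SQuAD-style exact match and token-level F1. This is a simplified illustration, not the official Quoref scorer. The function names are hypothetical, and the code assumes each listed span is an acceptable alternative; questions that require several spans at once would need a set-based comparison instead.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction equals any gold span after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, gold_answers):
    """Score the prediction against its closest gold span (assumed alternatives)."""
    return max(token_f1(prediction, gold) for gold in gold_answers)


print(exact_match("The Mary Poe.", ["Mary Poe"]))
print(best_f1("Mary", ["Mary Poe"]))
```

Averaging these two scores over all questions gives a rough picture of how often a model resolves the coreference chain well enough to land on the right span.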
Coverage
The dataset's content is derived from Wikipedia pages, providing a broad and varied textual scope. Its listed region is Global, implying its applicability is not limited to specific geographical areas. While a direct time range for the data collection isn't specified, the dataset was listed on 26/06/2025.
License
The dataset is available under a CC0 license.
Who Can Use It
This dataset is suitable for:
- Data scientists and AI/ML researchers developing and assessing reading comprehension systems.
- Natural Language Processing (NLP) practitioners working on coreference resolution tasks.
- Academics and students engaged in research on AI reasoning and text understanding.
- Developers looking to build more sophisticated question-answering systems that can handle complex linguistic phenomena.
Dataset Name Suggestions
- Quoref Coreference QA
- Wikipedia Coreference Question Answering Dataset
- Reading Comprehension Coreference Test Set
- NLP Coreference Reasoning Dataset
Attributes
Original Data Source: Quoref (Q&A for Coreference Resolution)