Woodchuck Science Quiz Dataset
Education & Learning Analytics
About
This dataset provides a unique opportunity for NLP researchers to develop models capable of answering multiple-choice questions based on a given context paragraph. It is particularly well-suited for the development and testing of question-answering systems that can handle real-world, noisy data. Originating from grade school science content, this dataset can be utilised to create interactive tools such as a question-answering chatbot, a multiple-choice quiz game, or systems that generate multiple-choice questions for students.
Columns
The dataset is primarily composed of three files: validation.csv, train.csv, and test.csv. Each file contains the following columns:
- id: A unique identifier for each question record.
- question: The text of the question (String).
- choices: A list of multiple-choice answers for the question (List of Strings).
- answerKey: The integer index corresponding to the correct answer within the choices list.
- fact1: The first piece of supporting information (String).
- fact2: The second piece of supporting information (String).
- combinedfact: A combined piece of supporting information (String).
- formatted_question: The question text with the multiple-choice answers inserted into it (String).
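As a rough illustration of how these columns might be read, the sketch below parses one CSV row into typed values using only the Python standard library. Several things here are assumptions rather than facts from the listing: that choices is serialized as a Python-style list literal, that answerKey is zero-based, and that empty fact cells should be treated as missing. The sample row, including the layout of its formatted_question, is invented for the example and is not taken from the dataset.

```python
import ast
import csv
import io

# Columns described as possibly empty in some distributions.
FACT_COLUMNS = ("fact1", "fact2", "combinedfact")


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed values."""
    parsed = dict(row)
    # Assumption: choices is stored as a Python-style list literal.
    parsed["choices"] = ast.literal_eval(row["choices"])
    # Assumption: answerKey is a zero-based index into choices.
    parsed["answerKey"] = int(row["answerKey"])
    # Treat empty fact cells as missing values.
    for col in FACT_COLUMNS:
        parsed[col] = row.get(col) or None
    return parsed


def load_rows(handle):
    """Read every row from an open CSV file-like object."""
    return [parse_row(r) for r in csv.DictReader(handle)]


# Hypothetical sample row, invented for illustration only.
SAMPLE = """id,question,choices,answerKey,fact1,fact2,combinedfact,formatted_question
q1,What do plants need to make food?,"['sunlight', 'sand', 'plastic']",0,,,,What do plants need to make food? (A) sunlight (B) sand (C) plastic
"""

rows = load_rows(io.StringIO(SAMPLE))
correct = rows[0]["choices"][rows[0]["answerKey"]]
print(correct)
```

To read a real file, the same load_rows function could be given an open handle, for example open("test.csv", newline="", encoding="utf-8"). If the actual files serialize choices differently, only the literal_eval line would need to change.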
Distribution
The data files are typically provided in CSV format. For the test.csv file, there are 920 unique records for the id, question, choices, answerKey, and formatted_question columns. The fact1, fact2, and combinedfact columns are noted as having 100% null values in some distributions. This is a free dataset, listed on a data marketplace with a quality rating of 5 out of 5, and is available globally. The current version is 1.0.
Usage
This dataset is ideal for:
- Developing and evaluating Natural Language Processing (NLP) models for question answering.
- Creating question-answering chatbots that can respond to science-based queries.
- Designing multiple-choice quiz games for educational purposes.
- Generating multiple-choice questions to aid student learning and assessment.
- Research into handling noisy, real-world data in Q&A systems.
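For the model evaluation use case, a minimal sketch of scoring a predictor against answerKey might look like the following. The longest_choice_baseline function is a deliberately trivial, hypothetical stand-in for a real model, the two example records are invented, and answerKey is assumed to be a zero-based index into choices.

```python
from typing import Callable, Dict, List, Sequence

# A predictor maps (question, choices) to the index of the chosen answer.
Predictor = Callable[[str, Sequence[str]], int]


def longest_choice_baseline(question: str, choices: Sequence[str]) -> int:
    """Hypothetical baseline: always pick the longest answer option."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))


def accuracy(records: List[Dict], predict: Predictor) -> float:
    """Fraction of records where the predicted index equals answerKey."""
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if predict(r["question"], r["choices"]) == r["answerKey"]
    )
    return hits / len(records)


# Invented records for illustration; not taken from the dataset.
examples = [
    {
        "question": "What do plants need to make food?",
        "choices": ["sunlight", "sand", "ice"],
        "answerKey": 0,
    },
    {
        "question": "Which state of matter is ice?",
        "choices": ["solid", "liquid", "gas"],
        "answerKey": 0,
    },
]

score = accuracy(examples, longest_choice_baseline)
print(f"accuracy: {score:.2f}")
```

Keeping the predictor as a plain function makes it straightforward to swap the baseline for a trained model without changing the scoring code.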
Coverage
The dataset's scope is global in terms of availability. Its content focuses on grade school science, making it relevant for primary and secondary education contexts. While a specific time range for data collection is not provided, the dataset was listed on 16/06/2025.
License
CC0
Who Can Use It
- NLP Researchers and Data Scientists focusing on question answering, text classification, and natural language understanding.
- Educators and Content Developers looking to create educational tools, quizzes, or automated question generation systems.
- Game Developers interested in building educational quiz games.
- Anyone working on AI and Machine Learning models that require structured question-answer pairs for training and testing.
Dataset Name Suggestions
- Grade School Science Q&A
- Educational NLP Challenge Data
- Multi-Choice Science Questions
- Woodchuck Science Quiz Dataset
- Primary/Secondary Science QA
Attributes
Original Data Source: Woodchuck (Grade School Science Multi-Choice Q&A)