Woodchuck Science Quiz Dataset
Education & Learning Analytics
About
This dataset lets NLP researchers develop models that answer multiple-choice questions based on a given context paragraph. It is particularly well-suited to developing and testing question-answering systems that must handle real-world, noisy data. Because it originates from grade school science content, it can also be used to build interactive tools such as a question-answering chatbot, a multiple-choice quiz game, or systems that generate multiple-choice questions for students.
Columns
The dataset is primarily composed of three files: validation.csv, train.csv, and test.csv. Each file contains the following columns:
- id: A unique identifier for each question record.
- question: The text of the question (String).
- choices: A list of multiple-choice answers for the question (List of Strings).
- answerKey: The integer index corresponding to the correct answer within the choices list.
- fact1: The first piece of supporting information (String).
- fact2: The second piece of supporting information (String).
- combinedfact: A combined piece of supporting information (String).
- formatted_question: The question text with the multiple-choice answers inserted into it (String).
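The column layout above can be read with Python's standard library. The sketch below is only illustrative and rests on several assumptions that the listing does not confirm: that choices is serialized as a Python-style list literal inside the CSV cell, that answerKey is zero-based, and that the two sample rows (invented here, not taken from the dataset) resemble real records. The helper names load_records and correct_answer are hypothetical.

```python
import ast
import csv
import io

# Illustrative rows only; these are NOT taken from the dataset.
SAMPLE_CSV = '''id,question,choices,answerKey,fact1,fact2,combinedfact,formatted_question
q1,What do plants need to make food?,"['sunlight', 'sand', 'plastic']",0,,,,What do plants need to make food? (A) sunlight (B) sand (C) plastic
q2,Which state of matter is ice?,"['gas', 'solid', 'liquid']",1,,,,Which state of matter is ice? (A) gas (B) solid (C) liquid
'''


def load_records(handle):
    """Read quiz rows from an open CSV file and decode the typed columns."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: `choices` is stored as a Python-style list literal.
        row["choices"] = ast.literal_eval(row["choices"])
        # Assumption: `answerKey` is a zero-based index into `choices`.
        row["answerKey"] = int(row["answerKey"])
        records.append(row)
    return records


def correct_answer(record):
    """Return the text of the correct choice for one record."""
    return record["choices"][record["answerKey"]]


records = load_records(io.StringIO(SAMPLE_CSV))
for record in records:
    print(record["id"], "->", correct_answer(record))
```

To try this against a real file, replace io.StringIO(SAMPLE_CSV) with an opened train.csv handle. If the choices cells turn out to use a different encoding (for example JSON), the ast.literal_eval call would need to be swapped accordingly.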
Distribution
The data files are typically provided in CSV format. For the test.csv file, there are 920 unique records for the id, question, choices, answerKey, and formatted_question columns. The fact1, fact2, and combinedfact columns are noted as having 100% null values in some distributions. This is a free dataset, listed on a data marketplace with a quality rating of 5 out of 5 and is available globally. The current version is 1.0.
Usage
This dataset is ideal for:
- Developing and evaluating Natural Language Processing (NLP) models for question answering.
- Creating question-answering chatbots that can respond to science-based queries.
- Designing multiple-choice quiz games for educational purposes.
- Generating multiple-choice questions to aid student learning and assessment.
- Research into handling noisy, real-world data in Q&A systems.
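Since the fact1, fact2, and combinedfact columns are reported as entirely null in some distributions, it may be worth measuring missing values before relying on those columns in any of the uses above. The following is a minimal sketch using only the standard library; the helper null_fraction is hypothetical, and the sample rows are invented for illustration rather than drawn from the dataset.

```python
import csv
import io

# Illustrative rows only; these are NOT taken from the dataset.
SAMPLE_CSV = """id,question,fact1,fact2,combinedfact
q1,What do plants need to make food?,,,
q2,Which state of matter is ice?,,,
"""


def null_fraction(rows, column):
    """Return the share of rows whose value in `column` is missing or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column in ("question", "fact1", "fact2", "combinedfact"):
    print(f"{column}: {null_fraction(rows, column):.0%} missing")
```

A column that comes back fully missing can then be dropped or ignored, so that a model is not trained on supporting facts that are absent from the file at hand.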
Coverage
The dataset's scope is global in terms of availability. Its content focuses on grade school science, making it relevant for primary and secondary education contexts. While a specific time range for data collection is not provided, the dataset was listed on 16/06/2025.
License
CC0
Who Can Use It
- NLP Researchers and Data Scientists focusing on question answering, text classification, and natural language understanding.
- Educators and Content Developers looking to create educational tools, quizzes, or automated question generation systems.
- Game Developers interested in building educational quiz games.
- Anyone working on AI and Machine Learning models that require structured question-answer pairs for training and testing.
Dataset Name Suggestions
- Grade School Science Q&A
- Educational NLP Challenge Data
- Multi-Choice Science Questions
- Woodchuck Science Quiz Dataset
- Primary/Secondary Science QA
Attributes
Original Data Source: Woodchuck (Grade School Science Multi-Choice Q&A)
