
Science Exam NLP Dataset

Education & Learning Analytics

Tags and Keywords

Earth and Nature, Education, NLP, Data Cleaning, Text Mining, Clustering


Free

About

This dataset, SciTail (Multiple-choice science exams), is a resource for developing and training Natural Language Inference (NLI) models on science text. It compiles 27,026 examples built from multiple-choice science exam questions and web sentences, arranged as premise-hypothesis pairs. Its primary purpose is to let researchers and data scientists train and evaluate models that predict the label assigned to each pair, such as entailment or neutral.

Columns

The dataset is organised into several file formats, each with its own set of columns. Every format holds labelled premise-hypothesis pairs, but the column names and the supplementary fields (parses, graph structures) differ from one format to the next.
  • dgem_format_test.csv and dgem_format_train.csv:
    • premise: The premise of the statement (String).
    • hypothesis: The hypothesis of the statement (String).
    • label: The label of the statement – either entailment, neutral or contradiction (String).
    • hypothesis_graph_structure: A graph structure of the hypothesis (Graph).
  • predictor_format_validation.csv and predictor_format_train.csv:
    • answer: The answer to the question (String).
    • sentence2_structure: A graph structure of the second sentence (Graph).
    • sentence1: The first sentence of the statement (String).
    • gold_label: The label of the statement – either entailment, neutral or contradiction (String).
  • tsv_format_test.csv and tsv_format_validation.csv:
    • premise: The premise of the statement (String).
    • hypothesis: The hypothesis of the statement (String).
    • label: The label of the statement – either entailment, neutral or contradiction (String).
  • snli_format_validation.csv, snli_format_train.csv, and snli_format_test.csv:
    • sentence1: The first sentence of the statement (String).
    • sentence2_structure: A graph structure of the second sentence (Graph).
    • gold_label: The label of the statement – either entailment, neutral or contradiction (String).
    • sentence1_binary_parse: Binary parse of first sentence (String).
    • sentence1_parse: Parse of first sentence (String).
    • sentence2_parse: Parse of second sentence (String).
    • annotator_labels: Labels assigned by annotators (String).
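Because the column names differ between formats, it can help to map every file onto one shared schema before further processing. The following is a minimal sketch using only the Python standard library. The alias table is an assumption based on the column lists above; in particular, a plain `sentence2` column is assumed to exist in the SNLI-style files even though the listing names only `sentence2_structure`. The sample rows are invented for illustration.

```python
import csv
import io

# Assumed mapping from format-specific column names to one shared schema.
# "sentence2" is an assumption: the listing only names sentence2_structure.
COLUMN_ALIASES = {
    "sentence1": "premise",
    "sentence2": "hypothesis",
    "gold_label": "label",
}


def normalize_row(row):
    """Rename format-specific columns and keep only premise/hypothesis/label."""
    renamed = {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
    return {field: renamed.get(field) for field in ("premise", "hypothesis", "label")}


def read_pairs(handle):
    """Yield normalized rows from any open CSV file handle."""
    for row in csv.DictReader(handle):
        yield normalize_row(row)


# Invented sample standing in for an snli_format_*.csv file.
sample = io.StringIO(
    "sentence1,sentence2,gold_label\n"
    "Plants absorb sunlight.,Plants use light.,entailment\n"
)
pairs = list(read_pairs(sample))
print(pairs[0])
```

Fields that a given file does not provide come back as `None`, so downstream code can detect a mismatch between the assumed aliases and the real headers.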

Distribution

The SciTail dataset consists of 27,026 examples drawn from multiple-choice science exams and web sentences. It is distributed as a set of CSV (Comma Separated Values) files covering training, validation, and test splits in the formats listed under Columns. Each row is a single premise-hypothesis pair with its assigned label. Column names and supplementary fields vary between formats.
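A quick way to check a downloaded split is to count its rows per label. The sketch below does this with the standard library. The `label` column name comes from the TSV-style files described above, and the in-memory sample is invented so the example runs without the real files.

```python
import csv
import io
from collections import Counter


def label_counts(handle, label_column="label"):
    """Count how many rows carry each label in an open CSV file handle."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row[label_column]] += 1
    return counts


# Invented three-row sample in the premise/hypothesis/label layout.
sample = io.StringIO(
    "premise,hypothesis,label\n"
    "Water boils at 100 C at sea level.,Water can boil.,entailment\n"
    "The moon orbits Earth.,Mars has two moons.,neutral\n"
    "Ice is frozen water.,Ice is a solid form of water.,entailment\n"
)
counts = label_counts(sample)
print(counts)
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, and `label_column="gold_label"` would be needed for the predictor and SNLI-style files.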

Usage

This dataset is ideally suited for a variety of Natural Language Inference (NLI) applications and research, including:
  • Developing and training NLI models to understand complex science statements.
  • Fine-tuning NLI models to handle varying levels of scientific language complexity.
  • Creating machine learning models that predict statement labels (entailment, contradiction, or neutral).
  • Developing human-in-the-loop approaches for NLI using the annotator labels.
  • Integrating hypothesis graph structures into existing models to improve accuracy and reduce errors when comparing a premise with a hypothesis.
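As a reference point for the label-prediction use case, a trivial word-overlap baseline can be sketched in a few lines. This is not a method described by the listing; it is a hypothetical heuristic that serves only as a floor to compare trained models against. The threshold of 0.6 is an arbitrary assumption, and the heuristic only ever outputs `entailment` or `neutral`.

```python
import re


def tokens(text):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(premise, hypothesis):
    """Fraction of hypothesis tokens that also appear in the premise."""
    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & tokens(premise)) / len(hyp)


def predict(premise, hypothesis, threshold=0.6):
    """Guess 'entailment' when overlap is high, else 'neutral' (assumed threshold)."""
    if overlap_score(premise, hypothesis) >= threshold:
        return "entailment"
    return "neutral"


# Invented example pairs.
print(predict("Plants absorb sunlight to make food.", "Plants absorb sunlight."))
print(predict("The moon orbits Earth.", "Mars has two moons."))
```

A learned model should beat this comfortably. If it does not, the data loading or the label mapping is a likely place to look for mistakes.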

Coverage

The dataset's content is derived from multiple-choice science exam questions and web sentences, giving it a thematic scope focused on science. The data is available for global use. Specific time ranges or demographic scopes are not detailed in the available information.

License

CC0

Who Can Use It

The SciTail dataset is a useful resource for:
  • Scientists and researchers interested in exploring NLI on science text.
  • Machine learning engineers and data scientists focused on developing and training NLI algorithms.
  • Anyone working on Natural Language Processing (NLP) tasks, particularly those involving inference, entailment, and contradiction detection in text.
  • Developers aiming to build systems for understanding and processing complex, domain-specific language.

Dataset Name Suggestions

  • SciTail NLI Dataset
  • Science Language Inference Corpus
  • Multiple-Choice Science NLI
  • Science Exam NLP Dataset
  • Hugging Face SciTail Data

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • UDQS (Universal Data Quality Score): 5 / 5
  • Version: 1.0
