Question Answering Text Classification Dataset
Data Science and Analytics
Free
About
This dataset is designed for training and evaluating text classification models for question answering. Each record pairs a current question, the specific query needing an answer, with the previous questions from the same conversation, so models can follow the conversation flow. Gold terms mark the terms considered correct or highly relevant for answering the question, and semantic terms add context by listing related concepts. Overlapping terms are the terms shared by the question and the answer text. The answer text with window column gives the answer together with its surrounding context, so models can consider a broader passage. The BERT NER overlap column lists named entities recognised by a BERT model that appear in both the question and the answer. Researchers can use the dataset to train, validate, and test text classification models for question answering tasks.
Columns
The dataset comprises several columns providing relevant information:
- Test: Not documented in detail; likely a flag or identifier marking test records.
- id / ID: Unique identifier for each record.
- prev_questions: Contains the previous questions asked in a conversation, providing conversational context.
- cur_question: Contains the current question being asked, which requires an answer.
- gold_terms: These are terms considered correct or highly relevant for answering each question effectively.
- semantic_terms: Provides terms that are semantically related to each question, offering additional conceptual context.
- overlapping_terms: These are terms that are common between each question and its corresponding answer.
- answer_text_with_window: Supplies the answer text along with a segment of the surrounding context from which it was derived.
- answer_text: The core answer text for the question.
- bert_ner_overlap: Highlights named entities recognised by a BERT model that overlap between the question and its corresponding answer.
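As an illustration of what a column such as overlapping_terms could contain, the sketch below computes the terms shared by a question and an answer. The listing does not document how the column was actually produced, so the tokenisation, the stopword list, and the function names here are assumptions made for the example only.

```python
import re

# Hypothetical stopword list; the dataset's real filtering rules are not documented.
STOPWORDS = {
    "a", "an", "the", "is", "was", "of", "in", "on", "to", "and",
    "who", "what", "when", "where", "which", "did", "do", "does",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlapping_terms(question, answer):
    """Return the sorted non-stopword tokens shared by question and answer."""
    question_terms = set(tokenize(question)) - STOPWORDS
    answer_terms = set(tokenize(answer)) - STOPWORDS
    return sorted(question_terms & answer_terms)


if __name__ == "__main__":
    print(overlapping_terms(
        "Who wrote the novel Dracula?",
        "Dracula is a novel written by Bram Stoker.",
    ))
```

Comparing the output of a function like this with the values stored in overlapping_terms is one way to check how closely a chosen tokenisation matches the dataset's own.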
Distribution
The dataset is distributed in CSV format and divided into three files: train.csv, validation.csv, and test.csv. Row counts for each split are not stated, but the dataset contains over 3,000 unique records.
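A minimal sketch of loading one split and checking its header against the columns listed above, using only the Python standard library. The exact header spelling in the files and the UTF-8 encoding are assumptions; the helper names are hypothetical.

```python
import csv
import io

# Column names as given in the listing; the exact header spelling is an assumption.
EXPECTED_COLUMNS = [
    "prev_questions", "cur_question", "gold_terms", "semantic_terms",
    "overlapping_terms", "answer_text_with_window", "answer_text",
    "bert_ner_overlap",
]


def read_records(handle):
    """Read CSV rows from an open text handle into a list of dicts."""
    return list(csv.DictReader(handle))


def missing_columns(records, expected=EXPECTED_COLUMNS):
    """Return the expected column names that the first record lacks."""
    if not records:
        return list(expected)
    present = set(records[0].keys())
    return [name for name in expected if name not in present]


def load_split(path):
    """Load one split file (for example train.csv) as a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return read_records(handle)


if __name__ == "__main__":
    # In-memory stand-in for a real file, so the sketch runs without the data.
    sample = io.StringIO(
        "id,cur_question,answer_text\n"
        "1,Who wrote Dracula?,Bram Stoker\n"
    )
    rows = read_records(sample)
    print(len(rows), missing_columns(rows))
```

With the real files in place, calling load_split on train.csv, validation.csv, and test.csv in turn and printing the length of each result would also recover the per-split row counts that the listing leaves unstated.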
Usage
This dataset is ideal for various applications in natural language processing and machine learning:
- Text classification model training: Utilise the dataset to build and train text classification models specifically for question answering.
- Performance validation: Evaluate the performance of trained models using the validation set to assess generalisation on unseen samples.
- Model testing: Test the effectiveness of a final trained model on new, unseen data using the provided test set.
- Feature engineering: Extract meaningful features like n-gram features, part-of-speech tags, or syntactic dependencies to enhance model performance.
- Experimentation: Experiment with different models and architectures, including deep learning models, traditional machine learning algorithms (e.g., Random Forests), or pre-trained language models (e.g., BERT).
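The n-gram feature engineering mentioned above might be sketched as follows, again with the standard library only. How prev_questions is stored (a single string, a delimited list, or something else) is not documented, so the example treats it as plain text; the feature prefixes and function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(text, n=2):
    """Count the word n-grams of length n that occur in text."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def question_features(prev_questions, cur_question, max_n=2):
    """Build a sparse feature dict from the current question and its context.

    The current question contributes 1- to max_n-grams; the previous
    questions contribute unigrams only. Prefixes keep the two sources apart.
    """
    features = Counter()
    for n in range(1, max_n + 1):
        for gram, count in ngram_counts(cur_question, n).items():
            features[f"cur:{gram}"] += count
    for gram, count in ngram_counts(prev_questions, 1).items():
        features[f"prev:{gram}"] += count
    return features


if __name__ == "__main__":
    feats = question_features("Who wrote Dracula?", "When was it published?")
    for name in sorted(feats):
        print(name, feats[name])
```

Feature dicts of this shape can be passed to a vectoriser and then to a traditional classifier, or replaced entirely by a pre-trained language model's tokeniser when fine-tuning.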
Coverage
The dataset has a global region scope. It was listed on 17/06/2025, with a stated quality rating of 5/5 and version 1.0. Specific geographic, time range, or demographic notes beyond this are not available.
License
CC0
Who Can Use It
This dataset is intended for:
- Researchers: To train, validate, and test text classification models for question answering.
- Data Scientists and Analysts: For developing and evaluating models in the field of data science and analytics.
- Machine Learning Engineers: To fine-tune and assess deep learning models and pre-trained language models.
- NLP Practitioners: For tasks involving natural language processing, such as understanding conversation flow, identifying key terms, and generating accurate responses.
Dataset Name Suggestions
- Question Answering Text Classification Dataset
- Conversational QA Dataset
- Semantic Question Answering Data
- BERT NER QA Classification Dataset
- Contextual Text Classification for QA
Attributes
Original Data Source: Text Classification for QA Dataset