Semantic Question Categorisation Data

Data Science and Analytics

Tags and Keywords

Question, Classification, NLP, QA, Category

Free

About

This dataset classifies natural language questions into subcategories based on the expected type of answer, such as a numerical value, a text description, a place name, or a human name. It is a resource for training systems in question entity recognition and for building Question Answering (QA) capabilities, and it gives machine learning practitioners labelled data for multi-class classification tasks.

Columns

The dataset has 5 columns and 5452 valid entries:
  • Index: A numerical index column used for ordering records.
  • Questions: The core text field containing the question itself. It holds 5381 unique questions, so some questions appear more than once; one frequently repeated example is "What is the latitude and longitude of El Paso , Texas ?".
  • Category0: The primary category assigned to the question, which determines the overall type of expected answer. There are 6 unique values in this field; common categories include ENTITY (23%) and HUMAN (22%).
  • Category1: The abbreviated key for the Category0 value (e.g., ENTY for ENTITY, HUM for HUMAN). This column also holds 6 unique values.
  • Category2: Granular subcategories describing the nature of the required answer, with 47 unique values. The most frequent subcategories are 'ind' (18%) and 'other' (13%).
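
The column layout above can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the listing: the header names are assumed to match the column names given above (the real file's header may differ), and the sample rows are made up for illustration so the snippet runs without the download.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the layout described above. To inspect the real data,
# open Question_Classification_Dataset.csv instead of this in-memory sample.
SAMPLE_CSV = """Index,Questions,Category0,Category1,Category2
0,Who wrote Hamlet ?,HUMAN,HUM,ind
1,What fruit is dried to make a prune ?,ENTITY,ENTY,other
2,Who painted the Mona Lisa ?,HUMAN,HUM,ind
3,Who wrote Hamlet ?,HUMAN,HUM,ind
"""

def summarise(csv_file):
    """Count rows, unique questions, and label frequencies per category column."""
    rows = list(csv.DictReader(csv_file))
    summary = {
        "rows": len(rows),
        "unique_questions": len({row["Questions"] for row in rows}),
        # Each Category0 value should pair with exactly one Category1 key.
        "category_pairs": {(row["Category0"], row["Category1"]) for row in rows},
    }
    for column in ("Category0", "Category1", "Category2"):
        summary[column] = Counter(row[column] for row in rows)
    return summary

summary = summarise(io.StringIO(SAMPLE_CSV))
print(summary["rows"], summary["unique_questions"])
print(summary["Category0"].most_common())
```

Run against the real file, the same function should report the row count, unique-question count, and label distribution, which can then be compared with the figures listed above.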

Distribution

The data is delivered as a single CSV file (Question_Classification_Dataset.csv) of approximately 408.75 kB, containing 5452 records. The dataset is static, with no expected updates, so results obtained on it can be replicated against the same records.

Usage

This data product is ideally suited for:
  • Developing and evaluating Natural Language Processing (NLP) models focused on question understanding.
  • Building efficient Question Answering (QA) systems by enabling better routing and processing of user queries.
  • Training multi-class classification algorithms to categorise text based on semantic requirements.
  • Academic research into syntactic structures and information retrieval methodologies.
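
To illustrate the multi-class classification use case, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is an assumption-laden example, not a method prescribed by the dataset: the class name, the whitespace tokenizer, and the six training questions are all hypothetical, and a real experiment would train on the Questions and Category0 columns with a held-out test split.

```python
import math
from collections import Counter, defaultdict

def tokenize(question):
    # Questions in the dataset appear pre-tokenised with spaces around
    # punctuation, so a lowercase whitespace split is assumed to be enough.
    return question.lower().split()

class NaiveBayesQuestionClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # questions seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, questions, labels):
        for question, label in zip(questions, labels):
            self.class_counts[label] += 1
            for token in tokenize(question):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, question):
        total_questions = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_questions in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_questions / total_questions)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(question):
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training examples using two of the Category0 labels.
train_questions = [
    "Who wrote Hamlet ?",
    "Who painted the Mona Lisa ?",
    "Who was the first man on the moon ?",
    "What fruit is dried to make a prune ?",
    "What color is a ripe banana ?",
    "What animal is the fastest on land ?",
]
train_labels = ["HUMAN", "HUMAN", "HUMAN", "ENTITY", "ENTITY", "ENTITY"]

model = NaiveBayesQuestionClassifier().fit(train_questions, train_labels)
print(model.predict("Who discovered penicillin ?"))
print(model.predict("What color is the sky ?"))
```

In a QA pipeline, the predicted label could be used to route a query to the component that handles that answer type. The same structure extends to the finer-grained Category2 labels by changing the label column.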

Coverage

The dataset covers the structural and semantic classification of questions written in English. It has no specific geographic, temporal, or demographic scope, because its value comes from the linguistic properties of the questions themselves rather than from external factors.

License

CC0: Public Domain

Who Can Use It

  • NLP Researchers: For creating novel algorithms related to text classification and QA model architecture.
  • Data Scientists: For developing supervised machine learning solutions for text categorisation.
  • Software Engineers: For integrating robust question classification modules into applications requiring sophisticated natural language interaction.

Dataset Name Suggestions

  • Question Classification for QA Systems
  • Semantic Question Categorisation Data
  • Multi-Class Question Type Dataset
  • Entity Recognition Question Bank

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 10/10/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0