Dialogue-based Question Dataset
Data Science and Analytics
Free
About
This dataset, known as HQA-data, is a question-answer generation dataset built from historical multi-perspective conversations. It is based on "The Ubuntu Dialog Corpus": user chats that share a unique dialogue identifier (dialogueID) are merged to form a single context. Questions and their corresponding answers are then derived from each context, and the start and end positions of each answer within the context are recorded. The dataset is intended for data science and analytics work and is freely available from a verified data provider on the Opendatabay platform.
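The merging step described above can be sketched as follows. This is only an illustration under stated assumptions: the input rows, the `merge_by_dialogue` helper, and the single-space separator are hypothetical, since the listing does not say how messages were ordered or joined.

```python
from collections import defaultdict


def merge_by_dialogue(rows, sep=" "):
    """Group chat messages by dialogueID and join them into one context each.

    `rows` is an iterable of (dialogue_id, text) pairs in chat order.
    The separator is an assumption; the listing does not specify one.
    """
    grouped = defaultdict(list)
    for dialogue_id, text in rows:
        grouped[dialogue_id].append(text)
    return {dialogue_id: sep.join(parts) for dialogue_id, parts in grouped.items()}


# Made-up example messages, not taken from the dataset.
rows = [
    ("1.tsv", "my wifi stopped working"),
    ("2.tsv", "how do I list installed packages?"),
    ("1.tsv", "try restarting network-manager"),
]
contexts = merge_by_dialogue(rows)
print(contexts["1.tsv"])
```

Messages with the same identifier end up in one string regardless of how they are interleaved in the input, which is the property the Context column relies on.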
Columns
- dialogueID: A unique identifier for each chat room or conversation thread.
- Context: The merged text from a single dialogueID, forming the conversational background for question and answer pairs.
- QuestionID: A unique identifier for each question generated.
- Question: The question generated from the context.
- Answer: The answer extracted from the context for the given question.
- Answer Start: The character offset indicating where the answer begins within the Context.
- Answer End: The character offset indicating where the answer ends within the Context.
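A quick consistency check on the offset columns might look like the sketch below. The listing does not state whether Answer End is inclusive or exclusive, so the hypothetical `span_matches` helper accepts either convention; the context and answer strings are made up for illustration.

```python
def locate_answer(context, answer):
    """Return (start, end) of the first occurrence of answer, end exclusive.

    Returns None when the answer text does not appear in the context.
    """
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


def span_matches(context, answer, start, end):
    """Check the stored offsets against the answer text.

    Whether Answer End is inclusive or exclusive is an assumption left open
    here, so both readings are accepted.
    """
    return context[start:end] == answer or context[start:end + 1] == answer


# Made-up example, not taken from the dataset.
context = "try sudo apt-get update first"
answer = "sudo apt-get update"
span = locate_answer(context, answer)
print(span, span_matches(context, answer, *span))
```

Running such a check over a few rows of the real files is a simple way to find out which end convention the dataset actually uses.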
Distribution
The dataset is available in two formats: CSV (comma-separated values) and JSON. The training set contains 7323 contexts and 29150 question-answer pairs, and the test set contains 2041 contexts and 7288 question-answer pairs. In total, the dataset has 9364 contexts and 36438 question-answer pairs.
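The sketch below shows one way to count contexts and question-answer pairs in a CSV split, and confirms that the reported split sizes add up to the stated totals. The header names are assumed to match the column names listed on this page, and the sample rows are made up; the real files may use different headers.

```python
import csv
import io

# Made-up sample in the assumed CSV layout; real header names may differ.
SAMPLE_CSV = """dialogueID,Context,QuestionID,Question,Answer,Answer Start,Answer End
1.tsv,"hello, try sudo apt-get update first",q1,What command should be tried?,sudo apt-get update,11,30
1.tsv,"hello, try sudo apt-get update first",q2,When should it be run?,first,31,36
2.tsv,my wifi is down,q3,What is down?,wifi,3,7
"""


def summarise(csv_file):
    """Return (number of distinct contexts, number of question-answer pairs)."""
    contexts = set()
    pairs = 0
    for row in csv.DictReader(csv_file):
        contexts.add(row["dialogueID"])
        pairs += 1
    return len(contexts), pairs


n_contexts, n_pairs = summarise(io.StringIO(SAMPLE_CSV))

# Split sizes as reported on this page: (contexts, question-answer pairs).
REPORTED = {"train": (7323, 29150), "test": (2041, 7288)}
total_contexts = sum(c for c, _ in REPORTED.values())
total_pairs = sum(p for _, p in REPORTED.values())
print(n_contexts, n_pairs, total_contexts, total_pairs)
```

To run the same count on a downloaded file, pass an open file handle (opened with `newline=""`) to `summarise` instead of the in-memory sample.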
Usage
This dataset is suited to training and evaluating models for Natural Language Processing (NLP) tasks. Specific use cases include question answering, question and text generation, conversational modelling, chat analysis, and text extraction. It can be applied in both deep learning and machine learning research and development.
Coverage
The dataset is global in its intended region of use. While it is described as "historical", specific dates or a precise time range for the conversations are not detailed in the provided information. The data originates from user chat logs, but no specific demographic details beyond "users" are provided.
License
CC0
Who Can Use It
The dataset is suitable for data scientists, researchers, and developers working on projects related to Natural Language Processing (NLP), Deep Learning, and Artificial Intelligence (AI). It can be used by those aiming to build or improve conversational agents, question-answering systems, or text summarisation tools. It is also relevant for academic studies in linguistics and human-computer interaction based on textual dialogues.
Dataset Name Suggestions
- Conversational QA Dataset
- Ubuntu Dialog Question Answering
- Multi-Perspective Chat Log QA
- Historical Chat QA Pairs
- Dialogue-based Question Dataset
Attributes
Original Data Source: HQA Dataset from Multi-Perspective Conversations.