Opendatabay APP

Intelligent Question Answering Dataset

Data Science and Analytics

Tags and Keywords

  • Data
  • Analytics
  • Classification
  • NLP
  • Text
  • Mining

Free

About

This dataset, LC-QuAD 2.0, is a collection of 30,000 distinct pairs of natural language questions and their corresponding SPARQL queries. The queries target Wikidata and the 2018 release of DBpedia, matching the sparql_wikidata and sparql_dbpedia18 columns. The dataset is intended for training and evaluating systems that translate natural language questions into executable SPARQL queries. It serves as a foundational resource for developing question-answering systems, constructing knowledge graphs, and improving search functionality.
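To illustrate what "executable" means here, the following minimal sketch wraps a SPARQL string in an HTTP request for a SPARQL endpoint. The public Wikidata Query Service URL, the helper name build_sparql_request, the User-Agent value, and the sample query are assumptions chosen for illustration; none of them come from the listing. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public Wikidata Query Service endpoint (not named in the listing).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_request(sparql, endpoint=WIKIDATA_ENDPOINT):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": sparql, "format": "json"})
    headers = {
        # Hypothetical client name; public endpoints usually expect a descriptive one.
        "User-Agent": "lcquad-example/0.1",
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)


# Illustrative query, not taken from the dataset: the country (P17) of Paris (Q90).
request = build_sparql_request("SELECT ?c WHERE { wd:Q90 wdt:P17 ?c }")
print(request.full_url)
```

Sending the request, for example with urllib.request.urlopen(request), needs network access and is left out of the sketch. In practice the query string would come from the sparql_wikidata column.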

Columns

The dataset is structured across several key columns, designed to facilitate question-answering and semantic parsing tasks:
  • NNQT_question: The template-generated form of the question (NNQT, normalised natural question template), provided as a string.
  • uid: A unique identifier for each entry.
  • subgraph: Information related to the subgraph of the question, represented as a graph.
  • template_index: An index identifying the template used for the question.
  • question: The natural language question itself, provided as a string.
  • sparql_wikidata: The SPARQL query corresponding to the question, specifically for Wikidata.
  • sparql_dbpedia18: The SPARQL query for the question, specifically for the 2018 release of DBpedia.
  • template: The template from which the SPARQL query was generated, provided as a string.
  • paraphrased_question: A paraphrased version of the original natural language question, presented as a string.
The dataset is organised into two main files: a training file (train.csv) used for training intelligent systems, and a testing file (test.csv) for evaluating them.
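The column layout above can be checked when a file is loaded. The sketch below uses only the Python standard library. The helper name load_rows is an assumption, and the one-row sample is made up to stand in for train.csv; it is not real dataset content.

```python
import csv
import io

# Column names as given in the column list above.
EXPECTED_COLUMNS = [
    "NNQT_question", "uid", "subgraph", "template_index", "question",
    "sparql_wikidata", "sparql_dbpedia18", "template", "paraphrased_question",
]


def load_rows(csv_file):
    """Read all rows from an open CSV file object after checking the header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample row for illustration only (not taken from the dataset).
SAMPLE_CSV = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    '"What is {country} of {Paris}?",1,simple,0,"Which country is Paris in?",'
    '"SELECT ?c WHERE { wd:Q90 wdt:P17 ?c }","SELECT ?c WHERE { ?s ?p ?c }",'
    '"<S P ?O>","Paris belongs to which country?"\n'
)

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["question"])
```

For the real files, the same function could be called as load_rows(open("train.csv", newline="", encoding="utf-8")), assuming the file names given above and UTF-8 encoding.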

Distribution

The LC-QuAD 2.0 dataset comprises 30,000 pairs of questions and their respective SPARQL queries. It is provided in CSV format, split into a training file and a testing file. The listing does not give a breakdown of rows per file.
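Since the per-file row counts are not stated, they can be measured after download. The following is a minimal sketch using the standard library; the helper names count_rows and split_summary are assumptions, and the in-memory samples are made up so the example runs without the real files.

```python
import csv
import io


def count_rows(csv_file):
    """Count non-empty data rows (header excluded) in an open CSV file object."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if present
    return sum(1 for row in reader if row)


def split_summary(counts):
    """Map each split name to (row count, share of all rows)."""
    total = sum(counts.values())
    return {name: (n, n / total if total else 0.0) for name, n in counts.items()}


# Made-up stand-ins for train.csv and test.csv.
train_sample = io.StringIO("uid,question\n1,a\n2,b\n3,c\n")
test_sample = io.StringIO("uid,question\n4,d\n")

counts = {"train": count_rows(train_sample), "test": count_rows(test_sample)}
print(split_summary(counts))
```

The csv reader is used instead of counting raw lines because a quoted field may contain line breaks.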

Usage

This dataset suits applications that translate natural language questions into structured queries:
  • Training and evaluating systems that map natural language questions to SPARQL.
  • Building robust question-answering systems.
  • Building advanced search systems.
  • Adding a retrieval step to AI applications such as chatbots and document summarisation programs, where a natural language question is converted into a SPARQL query.
  • Enabling natural language search, rather than traditional keyword search, in scholarly search engines and academic digital libraries.
  • Supporting the construction of knowledge graphs, which store entities along with their attributes, categories, and relations, and so expose complex relationships within data.
  • Developing AI agents that answer specific questions or provide personalised recommendations.
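For the question-to-SPARQL use cases above, a common first step is turning rows into (input, target) pairs. The sketch below is one possible way to do that and is not a method described by the listing. The helper name make_pairs is an assumption, as are the choices to use both question and paraphrased_question as inputs and to skip empty fields. The sample rows are made up.

```python
def make_pairs(rows, target_column="sparql_wikidata"):
    """Build (question text, SPARQL query) pairs from row dictionaries.

    Each row may yield up to two pairs: one for the original question and
    one for its paraphrase. Rows or fields with empty text are skipped.
    """
    pairs = []
    for row in rows:
        target = (row.get(target_column) or "").strip()
        if not target:
            continue
        for source_column in ("question", "paraphrased_question"):
            source = (row.get(source_column) or "").strip()
            if source:
                pairs.append((source, target))
    return pairs


# Made-up rows for illustration; real rows would come from train.csv.
sample_rows = [
    {"question": "Q1", "paraphrased_question": "P1", "sparql_wikidata": "S1"},
    {"question": "", "paraphrased_question": "P2", "sparql_wikidata": "S2"},
    {"question": "Q3", "paraphrased_question": "P3", "sparql_wikidata": ""},
]

for source, target in make_pairs(sample_rows):
    print(source, "->", target)
```

Passing target_column="sparql_dbpedia18" would produce pairs for the DBpedia queries instead.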

Coverage

The dataset's coverage is global: its applicability is not restricted by geographic boundaries. The questions are designed to work with Wikidata and the 2018 release of DBpedia. The listing gives no notes on demographic scope or on data availability for particular groups or years. The dataset is part of a free dataset library and is listed as Version 1.0.

License

CC0

Who Can Use It

LC-QuAD 2.0 is intended for a broad range of users, including:
  • Individuals and researchers exploring natural language querying of knowledge bases.
  • Developers and other technical professionals aiming to build or enhance intelligent systems.
  • Data scientists and machine learning engineers focusing on natural language processing, semantic parsing, and question-answering.
  • Academics and researchers developing chatbots, document summarisation tools, or advanced search engines.
  • Anyone interested in creating or expanding knowledge graphs and building AI agents capable of intelligent information retrieval and recommendation.

Dataset Name Suggestions

  • Natural Language to SPARQL Query Pairs
  • Intelligent Question Answering Dataset
  • Knowledge Base Query Generation
  • Wikidata & DBpedia QA
  • Semantic Query Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 27/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
