Opendatabay APP

Questions to SQL Dataset

Data Science and Analytics

Tags and Keywords

Business

Computer

Programming

Nlp

Free

About

This dataset is a large, crowd-sourced collection designed for developing natural language interfaces for relational databases. It contains hand-annotated examples of natural language questions paired with their corresponding SQL queries. The data is derived from Wikipedia tables, providing a rich context for understanding how natural language can be translated into database queries. It serves as a valuable resource for training and testing models that aim to bridge the gap between human language and structured database interactions.

Columns

  • phase: The stage of the data collection process. (String)
  • question: The user's question posed in natural language. (String)
  • table: The specific database table relevant to the question. (String)
  • sql: The SQL query that corresponds to the user's question. (String)
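As a rough illustration of how the four columns above might be read, the sketch below parses a CSV with Python's standard `csv` module and checks that the listed columns are present. The helper name `load_rows` and the sample rows are assumptions made for illustration; the rows are invented and not taken from the dataset, and the real `table` values may be formatted differently.

```python
import csv
import io

# Column names as given in the listing.
EXPECTED_COLUMNS = ["phase", "question", "table", "sql"]

def load_rows(source):
    """Read rows from an open CSV file object and check the listed columns exist."""
    reader = csv.DictReader(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Keep only the four documented columns, all as strings.
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]

# Tiny in-memory stand-in for the real file (made-up rows).
SAMPLE_CSV = io.StringIO(
    "phase,question,table,sql\n"
    '1,How many players are listed?,players,"SELECT COUNT(name) FROM players"\n'
    '1,Which city hosted in 2004?,games,"SELECT city FROM games WHERE year = 2004"\n'
)

rows = load_rows(SAMPLE_CSV)
print(rows[0]["question"])
```

With the downloaded file, the same helper could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.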

Distribution

The dataset is typically provided as a CSV file. It comprises 80,654 hand-annotated pairs of questions and SQL queries, distributed across 24,241 distinct tables originating from Wikipedia. No row or record counts are given beyond this total; the listing reports 5,069 unique values for questions and 15,595 for SQL queries.
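One way to check figures such as the total and the unique-value counts against a downloaded copy is sketched below. The function name `summarize` and the sample rows are hypothetical; the rows are invented and only mimic the four-column layout.

```python
def summarize(rows):
    """Return the row count and the number of distinct values per column."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "rows": len(rows),
        "unique": {c: len({r[c] for r in rows}) for c in columns},
    }

# Made-up rows in the listed column layout (the first two are deliberate duplicates).
sample = [
    {"phase": "1", "question": "How many players are listed?",
     "table": "players", "sql": "SELECT COUNT(name) FROM players"},
    {"phase": "1", "question": "How many players are listed?",
     "table": "players", "sql": "SELECT COUNT(name) FROM players"},
    {"phase": "2", "question": "Which city hosted in 2004?",
     "table": "games", "sql": "SELECT city FROM games WHERE year = 2004"},
]

stats = summarize(sample)
print(stats)
```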

Usage

This dataset is ideal for several applications:
  • Developing and improving natural language interfaces for relational databases.
  • Building a knowledge base of frequently used SQL queries.
  • Generating training sets for neural networks that convert natural language into SQL queries.
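Two of the uses above can be sketched in a few lines: counting the most frequent SQL queries, and turning rows into input/target pairs for a text-to-SQL model. The helper names, the `input`/`target` keys, and the prompt layout are assumptions chosen for illustration, not a format prescribed by the dataset, and the sample rows are invented.

```python
from collections import Counter

def most_common_queries(rows, n=3):
    """Return the n most frequent SQL strings with their counts."""
    return Counter(r["sql"] for r in rows).most_common(n)

def to_training_pairs(rows):
    """Build input/target pairs; the prompt layout here is an arbitrary choice."""
    return [
        {
            "input": f"table: {r['table']} | question: {r['question']}",
            "target": r["sql"],
        }
        for r in rows
    ]

# Made-up rows in the listed column layout.
sample = [
    {"phase": "1", "question": "How many players are listed?",
     "table": "players", "sql": "SELECT COUNT(name) FROM players"},
    {"phase": "1", "question": "What is the number of players?",
     "table": "players", "sql": "SELECT COUNT(name) FROM players"},
    {"phase": "2", "question": "Which city hosted in 2004?",
     "table": "games", "sql": "SELECT city FROM games WHERE year = 2004"},
]

top = most_common_queries(sample, n=1)
pairs = to_training_pairs(sample)
print(top)
print(pairs[2]["input"])
```

Keeping the table identifier in the input is one possible design, since the same question can map to different queries depending on the table it is asked against.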

Coverage

The dataset's scope is global, since the Wikipedia tables it draws on are not limited to any one region. The listing gives no geographic, time-range, or demographic restrictions, and no notes on data availability for particular groups or years. The focus is the general relationship between natural language questions and SQL queries.

License

CC0

Who Can Use It

This dataset is intended for:
  • Data scientists developing machine learning models for language processing.
  • AI and ML researchers focused on natural language understanding (NLU) and natural language generation (NLG) in the context of databases.
  • Software developers creating intelligent database query tools or conversational AI agents that interact with databases.
  • Academics and students conducting research in areas like computational linguistics, database systems, and artificial intelligence.

Dataset Name Suggestions

  • WikiSQL Natural Language Interface Data
  • Questions to SQL Dataset
  • NLP2SQL Database Interface Dataset
  • Structured Query Language Question Bank
  • Wiki Table Query Data

Attributes

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

16/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
