Questions to SQL Dataset
Data Science and Analytics
Free
About
This dataset is a large, crowd-sourced collection designed for developing natural language interfaces for relational databases. It contains hand-annotated examples of natural language questions paired with their corresponding SQL queries. The data is derived from Wikipedia tables, providing a rich context for understanding how natural language can be translated into database queries. It serves as a valuable resource for training and testing models that aim to bridge the gap between human language and structured database interactions.
Columns
- phase: The stage of the data collection process. (String)
- question: The user's question posed in natural language. (String)
- table: The specific database table relevant to the question. (String)
- sql: The SQL query that corresponds to the user's question. (String)
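A minimal sketch of how the four columns could be read from the CSV, using only the Python standard library. The header names are assumed to match the column names listed above exactly, and the sample rows are invented for illustration; values in the real file may be formatted differently.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical sample rows; the real file's values and formatting may differ.
SAMPLE_CSV = """phase,question,table,sql
1,How many players are from Spain?,players,SELECT COUNT(*) FROM players WHERE country = 'Spain'
2,Which team won in 2010?,seasons,SELECT winner FROM seasons WHERE year = 2010
"""


@dataclass
class Example:
    """One row of the dataset; every column is kept as a string."""
    phase: str
    question: str
    table: str
    sql: str


def load_examples(handle):
    """Read rows with the four listed columns into Example records."""
    reader = csv.DictReader(handle)
    return [
        Example(row["phase"], row["question"], row["table"], row["sql"])
        for row in reader
    ]


examples = load_examples(io.StringIO(SAMPLE_CSV))
for ex in examples:
    print(ex.question, "->", ex.sql)
```

To read an actual download, the `io.StringIO` object could be replaced with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.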
Distribution
The dataset is typically provided in CSV format. It comprises 80,654 hand-annotated examples of questions and SQL queries, distributed across 24,241 distinct tables from Wikipedia. No further row or record counts are given, but the question column has 5,069 unique values and the sql column has 15,595.
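Counts like the ones in this section can be recomputed from the CSV itself. The sketch below is one way to do it with the standard library; the column names are assumed to match the listing, and the three sample rows are made up, so the numbers it prints say nothing about the real file.

```python
import csv
import io

# Hypothetical sample rows; the second and third rows share a question on purpose.
SAMPLE_CSV = """phase,question,table,sql
1,How many players are from Spain?,players,SELECT COUNT(*) FROM players WHERE country = 'Spain'
1,Which team won in 2010?,seasons,SELECT winner FROM seasons WHERE year = 2010
2,Which team won in 2010?,finals,SELECT winner FROM finals WHERE year = 2010
"""


def summarize(handle):
    """Count total rows and distinct values for the main columns."""
    rows = list(csv.DictReader(handle))
    return {
        "examples": len(rows),
        "unique_questions": len({row["question"] for row in rows}),
        "unique_sql": len({row["sql"] for row in rows}),
        "unique_tables": len({row["table"] for row in rows}),
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
for name, count in summary.items():
    print(f"{name}: {count}")
```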
Usage
This dataset is ideal for several applications:
- Developing and improving natural language interfaces for relational databases.
- Building a knowledge base of frequently used SQL queries.
- Generating training sets for neural networks that convert natural language into SQL queries.
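As an illustration of the training-set use case, the sketch below turns each row into an input/target pair and prints it as JSON Lines. The prompt layout (`table: ... | question: ...`) and the `input`/`target` key names are assumptions made for this example, not something the dataset prescribes, and the sample rows are invented.

```python
import csv
import io
import json

# Hypothetical sample rows for illustration only.
SAMPLE_CSV = """phase,question,table,sql
1,How many players are from Spain?,players,SELECT COUNT(*) FROM players WHERE country = 'Spain'
2,Which team won in 2010?,seasons,SELECT winner FROM seasons WHERE year = 2010
"""


def to_training_pairs(handle):
    """Turn each row into an input/target pair for a text-to-SQL model."""
    pairs = []
    for row in csv.DictReader(handle):
        # Assumed prompt layout: table name first, then the question.
        source = f"table: {row['table']} | question: {row['question']}"
        pairs.append({"input": source, "target": row["sql"]})
    return pairs


pairs = to_training_pairs(io.StringIO(SAMPLE_CSV))
for pair in pairs:
    print(json.dumps(pair))  # one JSON object per line (JSON Lines)
```

Keeping the table in the input is a design choice: the same question can map to different SQL depending on which table it refers to, so a model trained on questions alone would be missing context.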
Coverage
The dataset's scope is global, reflecting its origin in Wikipedia tables, which cover topics from around the world. The dataset carries no specific geographical, time-range, or demographic notes about data availability for particular groups or years. It focuses on the general relationship between questions and SQL queries.
License
CC0
Who Can Use It
This dataset is intended for:
- Data scientists developing machine learning models for language processing.
- AI and ML researchers focused on natural language understanding (NLU) and natural language generation (NLG) in the context of databases.
- Software developers creating intelligent database query tools or conversational AI agents that interact with databases.
- Academics and students conducting research in areas like computational linguistics, database systems, and artificial intelligence.
Dataset Name Suggestions
- WikiSQL Natural Language Interface Data
- Questions to SQL Dataset
- NLP2SQL Database Interface Dataset
- Structured Query Language Question Bank
- Wiki Table Query Data
Attributes
Original Data Source: WikiSQL (Questions and SQL Queries)