Portuguese Text2SQL database
Data Science and Analytics
Free
About
Overview
This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.
The main goal of this dataset is to help Portuguese-language models generate precise, schema-grounded SQL queries and to prevent the hallucination of column and table names, a common issue in models trained on text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to ground models without requiring actual data rows, which limits token use and exposure to private, sensitive, or proprietary data.
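One way to see what grounding on the CREATE TABLE statement means in practice is to check whether a generated query only refers to tables and columns that the schema declares. The sketch below is an illustration, not part of the dataset tooling: it assumes the schema and query are valid in SQLite's dialect, the helper name query_fits_schema is hypothetical, and the sample schema and queries are made up for the example.

```python
import sqlite3


def query_fits_schema(create_table_sql: str, query_sql: str) -> bool:
    """Return True if query_sql compiles against the schema in create_table_sql.

    Assumption: both strings are valid SQLite syntax. No data rows are needed,
    because the query is only compiled (via EXPLAIN), never run against data.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(create_table_sql)   # create the empty table(s)
        conn.execute("EXPLAIN " + query_sql)   # fails on unknown tables/columns
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema and queries, for illustration only.
contexto = "CREATE TABLE head (age INTEGER)"
grounded_query = "SELECT COUNT(*) FROM head WHERE age > 56"
hallucinated_column = "SELECT COUNT(*) FROM head WHERE idade > 56"
hallucinated_table = "SELECT COUNT(*) FROM chefes WHERE age > 56"

print(query_fits_schema(contexto, grounded_query))
print(query_fits_schema(contexto, hallucinated_column))
print(query_fits_schema(contexto, hallucinated_table))
```

A check like this only catches names that do not exist in the schema; it does not tell you whether the query actually answers the question.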
Dataset Details
Total Examples: 78,577
Columns:
pergunta ("question"): The question in natural language, in Portuguese.
contexto ("context"): The SQL CREATE TABLE statement that provides the schema needed to answer the question.
resposta ("answer"): The SQL query that answers the question using the provided context.
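The three columns map naturally onto a prompt and a target for text-to-SQL training. The following is a minimal sketch of that mapping, assuming each record is available as a plain dictionary keyed by the column names above. The prompt template, the function names build_prompt and build_training_text, and the sample record are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def build_prompt(record: Dict[str, str]) -> str:
    """Combine the schema (contexto) and the question (pergunta) into a prompt.

    The section headers are an arbitrary template, not a format
    prescribed by the dataset.
    """
    return (
        "### Esquema\n"
        f"{record['contexto']}\n\n"
        "### Pergunta\n"
        f"{record['pergunta']}\n\n"
        "### SQL\n"
    )


def build_training_text(record: Dict[str, str]) -> str:
    """Append the reference query (resposta) to the prompt as the target."""
    return build_prompt(record) + record["resposta"]


# Hypothetical record using the dataset's column names.
example = {
    "pergunta": "Quantos chefes de departamento têm mais de 56 anos?",
    "contexto": "CREATE TABLE head (age INTEGER)",
    "resposta": "SELECT COUNT(*) FROM head WHERE age > 56",
}

print(build_training_text(example))
```

Placing the schema before the question keeps every table and column name the model may use in view before it starts generating the query.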
Translation Process
The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, with the aim of preserving the meaning and context of the original English questions.
Objective and Applications
This dataset is suited to training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to improve model performance on text-to-SQL tasks by supplying the schema as explicit context, which helps avoid hallucinated column and table names.
Original Projects
@misc{b-mc2_2023_sql-create-context,
title = {sql-create-context Dataset},
author = {b-mc2},
year = {2023},
url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}
@article{zhongSeq2SQL2017,
author = {Victor Zhong and Caiming Xiong and Richard Socher},
title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
journal = {CoRR},
volume = {abs/1709.00103},
year = {2017}
}
@article{yu2018spider,
title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
journal = {arXiv preprint arXiv:1809.08887},
year = {2018}
}
License
CC-BY-NC
Original Data Source: Portuguese Text2SQL database