Portuguese Text2SQL database

Data Science and Analytics

Tags and Keywords

nlp

brazil

sql

text

generation

portuguese

Free

About

Overview

This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.

The main goal of this dataset is to help Portuguese-language models generate precise, contextualized SQL queries without hallucinating column and table names, a common issue for models trained on text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to ground the models better without supplying actual data rows, which limits token use and exposure to private, sensitive, or proprietary data.
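One way to see what grounding on the CREATE TABLE statement buys is to check a generated query against that schema alone. The sketch below is an illustration, not part of the dataset or its tooling: the helper name `query_is_grounded`, the schema, and both queries are hypothetical. It assumes the context and query are valid in SQLite's dialect, builds the empty tables in an in-memory database, and asks SQLite to compile the query with EXPLAIN, so no data rows are needed.

```python
import sqlite3


def query_is_grounded(create_table_sql: str, query_sql: str) -> bool:
    """Return True if query_sql compiles against the schema in create_table_sql.

    Hypothetical helper: it only checks that the tables and columns the
    query names exist, not that the query answers the question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Create empty tables from the context (may hold several statements).
        conn.executescript(create_table_sql)
        # EXPLAIN compiles the statement; unknown tables or columns raise an error.
        conn.execute("EXPLAIN " + query_sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema and queries, invented for illustration (not dataset rows).
contexto = "CREATE TABLE funcionarios (nome VARCHAR, salario INTEGER)"
ok = query_is_grounded(contexto, "SELECT nome FROM funcionarios WHERE salario > 1000")
bad = query_is_grounded(contexto, "SELECT sobrenome FROM funcionarios")
```

Here `ok` is True and `bad` is False, because `sobrenome` is not a column of the schema. The check inherits SQLite's quirks (for example, it may accept a double-quoted unknown identifier as a string literal), so it is a rough filter rather than a full validator.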

Dataset Details

Total Examples: 78,577

Columns:

pergunta: The question in natural language.

contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question.

resposta: The SQL query that answers the question using the provided context.

Translation Process

The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, ensuring that the natural language queries maintain the same meaning and context as the original English questions.
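The three columns (pergunta, contexto, resposta) map naturally onto a prompt and a target for fine-tuning. The following is a minimal sketch under stated assumptions: the prompt template, the function names `build_prompt` and `build_training_text`, and the sample record are all hypothetical and are not taken from the dataset or prescribed by its authors.

```python
def build_prompt(pergunta: str, contexto: str) -> str:
    """Format the schema and the question into a prompt.

    The section headers are an assumed template, not a dataset convention.
    """
    return (
        "### Esquema\n"
        f"{contexto}\n\n"
        "### Pergunta\n"
        f"{pergunta}\n\n"
        "### SQL\n"
    )


def build_training_text(record: dict) -> str:
    """Concatenate the prompt with the reference query (the training target)."""
    return build_prompt(record["pergunta"], record["contexto"]) + record["resposta"]


# Hypothetical record with the three column names; invented for illustration.
record = {
    "pergunta": "Quantos funcionários ganham mais de 1000?",
    "contexto": "CREATE TABLE funcionarios (nome VARCHAR, salario INTEGER)",
    "resposta": "SELECT COUNT(*) FROM funcionarios WHERE salario > 1000",
}

training_text = build_training_text(record)
print(training_text)
```

At inference time only `build_prompt` would be used, and the model would be expected to continue the text with the SQL query. Placing the schema before the question keeps the table and column names in view when the query is generated.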

Objective and Applications

This dataset is ideal for training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to enhance model performance in text-to-SQL tasks, providing clear context and avoiding common hallucination errors.

Original Projects

@misc{b-mc2_2023_sql-create-context,
  title = {sql-create-context Dataset},
  author = {b-mc2},
  year = {2023},
  url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}

@article{zhongSeq2SQL2017,
  author = {Victor Zhong and Caiming Xiong and Richard Socher},
  title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
  journal = {CoRR},
  volume = {abs/1709.00103},
  year = {2017}
}

@article{yu2018spider,
  title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
  author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  journal = {arXiv preprint arXiv:1809.08887},
  year = {2018}
}

License

CC-BY-NC

Original Data Source: Portuguese Text2SQL database

Listing Stats

Listed: 22/06/2025

Region: Global

Quality: 5 / 5

Version: 1.0