Contextual Portuguese Text2SQL
Data Science and Analytics
Free
About
This dataset is a Portuguese translation of the b-mc2/sql-create-context dataset, which was itself built from the WikiSQL and Spider datasets. Each example pairs a question in Portuguese with one or more SQL CREATE TABLE statements and the SQL query that answers the question using those statements as context. The main goal is to help Portuguese natural language models generate precise, contextualised SQL queries without hallucinating column and table names, a common failure of text-to-SQL models. Because only the CREATE TABLE statement is supplied as context, models can be grounded in the schema without seeing actual data rows, which limits token use and avoids exposing private, sensitive, or proprietary data.
Columns
- pergunta: The question in natural language about the database, in Portuguese.
- contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question, representing the schema or structure of the database tables, in Portuguese.
- resposta: The SQL query that answers the question based on the provided context, in Portuguese.
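As a rough illustration of how the three columns fit together, the sketch below turns one record into a model input and a target. The row is invented for this example and is not taken from the dataset, and the prompt layout and the helper name build_prompt are assumptions, not something the listing prescribes.

```python
def build_prompt(row: dict) -> str:
    """Combine the schema context and the question into one input string."""
    return (
        "-- Schema\n"
        f"{row['contexto']}\n"
        "-- Pergunta\n"
        f"{row['pergunta']}\n"
        "-- SQL\n"
    )


# Invented row for illustration only; it does not come from the dataset.
example_row = {
    "pergunta": "Quantos funcionários têm mais de 40 anos?",
    "contexto": "CREATE TABLE funcionarios (nome VARCHAR, idade INTEGER)",
    "resposta": "SELECT COUNT(*) FROM funcionarios WHERE idade > 40",
}

prompt = build_prompt(example_row)   # what the model would see
target = example_row["resposta"]     # what the model should produce
print(prompt + target)
```

Placing the schema before the question keeps every table and column name the query may use in view, which is the grounding idea described in the About section.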
Distribution
This dataset consists of 78,577 entries. Each entry represents a question about a database, the context of the database schema, and the corresponding SQL query. Data files are typically in CSV format. The 'pergunta' column contains 78,220 unique values, 'contexto' has 72,947 unique values, and 'resposta' has 78,577 unique values.
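The row and unique-value counts above could be checked with a short script along these lines. This is a sketch that assumes a CSV file with a header row naming the three columns; the helper unique_counts and the in-memory sample are invented for the example.

```python
import csv
import io


def unique_counts(csv_file, columns=("pergunta", "contexto", "resposta")):
    """Return (row_count, {column: number_of_distinct_values})."""
    seen = {name: set() for name in columns}
    rows = 0
    for record in csv.DictReader(csv_file):
        rows += 1
        for name in columns:
            seen[name].add(record[name])
    return rows, {name: len(values) for name, values in seen.items()}


# Tiny invented sample standing in for the real data file.
sample = io.StringIO(
    "pergunta,contexto,resposta\n"
    "Quantos alunos existem?,CREATE TABLE alunos (id INTEGER),SELECT COUNT(*) FROM alunos\n"
    "Qual é o maior id?,CREATE TABLE alunos (id INTEGER),SELECT MAX(id) FROM alunos\n"
)
rows, counts = unique_counts(sample)
print(rows, counts)
```

For the real data, the same function can be given a file opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.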
Usage
This dataset is ideal for:
- Training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial.
- Enhancing model performance in text-to-SQL tasks.
- Supporting natural language processing and machine learning tasks related to generating structured queries from natural language.
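One way to measure whether a model names columns and tables accurately is to compile its output against the schema in the contexto column. The sketch below does this with Python's built-in sqlite3 module: it creates the tables in an in-memory database and asks SQLite to plan the query without running it. It assumes the CREATE TABLE statements and queries are acceptable to SQLite, which may not hold for every entry; the function name query_matches_schema is hypothetical.

```python
import sqlite3


def query_matches_schema(contexto: str, query: str) -> bool:
    """Return True if `query` compiles against the tables defined in `contexto`."""
    conn = sqlite3.connect(":memory:")
    try:
        # Create the (empty) tables described by the schema context.
        conn.executescript(contexto)
        # EXPLAIN compiles the statement, so unknown tables or columns raise an error.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Invented schema and queries for illustration.
schema = "CREATE TABLE funcionarios (nome VARCHAR, idade INTEGER)"
print(query_matches_schema(schema, "SELECT nome FROM funcionarios WHERE idade > 40"))
print(query_matches_schema(schema, "SELECT salario FROM funcionarios"))
```

A check like this only catches names that do not exist in the schema. It does not confirm that the query actually answers the question, so it complements rather than replaces comparison with the resposta column.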
Coverage
The dataset is global in scope and covers the Portuguese language. The questions were translated into Portuguese with the facebook/nllb-200-distilled-1.3B model. It was listed on 22/06/2025.
License
CC-BY-NC
Who Can Use It
This dataset is suitable for:
- Data scientists and analysts focused on developing and refining natural language processing models.
- Researchers and developers working on text-to-SQL solutions.
- Anyone aiming to build or improve AI models that translate natural language queries into SQL, particularly for Portuguese.
Dataset Name Suggestions
- Portuguese Text2SQL Database
- NL to SQL Portuguese Dataset
- SQL Query Generation from Portuguese Text
- Portuguese Natural Language to SQL
- Contextual Portuguese Text2SQL
Attributes
Original Data Source: Portuguese Text2SQL database