Contextual Portuguese Text2SQL

Data Science and Analytics

Tags and Keywords

NLP

Brazil

SQL

Text

Generation

Portuguese

Contextual Portuguese Text2SQL Dataset on Opendatabay data marketplace

Free

About

This dataset is a Portuguese translation of the b-mc2/sql-create-context dataset, which was itself built from the WikiSQL and Spider datasets. Each example contains a question in Portuguese, a SQL CREATE TABLE statement, and a SQL query that answers the question using the CREATE TABLE statement as context. The main goal is to help Portuguese-language models generate precise, contextualised SQL queries and to prevent the hallucination of column and table names, a common issue in models trained on text-to-SQL datasets. Because only the CREATE TABLE statement is provided as context, models can be grounded in the schema without seeing actual data rows, which limits token use and exposure to private, sensitive, or proprietary data.

Columns

  • pergunta: The question in natural language about the database, in Portuguese.
  • contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question, representing the schema or structure of the database tables, in Portuguese.
  • resposta: The SQL query that answers the question based on the provided context, in Portuguese.
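
As a rough illustration of how the three columns fit together, the sketch below reads a CSV with the column names listed above and assembles a model input from contexto and pergunta, with resposta as the target. The sample row, the prompt layout, and the helper names load_rows and build_prompt are assumptions made for this example; they are not taken from the dataset or prescribed by it.

```python
import csv
import io

# Hypothetical sample row, invented for illustration; not taken from the dataset.
SAMPLE_CSV = '''pergunta,contexto,resposta
"Quantos funcionários existem?","CREATE TABLE funcionarios (id INTEGER, nome VARCHAR)","SELECT COUNT(*) FROM funcionarios"
'''

EXPECTED_COLUMNS = {"pergunta", "contexto", "resposta"}


def load_rows(handle):
    """Read rows as dicts and check that the three documented columns exist."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def build_prompt(row):
    """Put the schema first, then the question (an assumed prompt layout)."""
    return (
        f"-- Schema:\n{row['contexto']}\n"
        f"-- Pergunta: {row['pergunta']}\n"
        f"-- SQL:"
    )


rows = load_rows(io.StringIO(SAMPLE_CSV))
prompt = build_prompt(rows[0])   # model input
target = rows[0]["resposta"]     # expected SQL output
```

To use the real file, pass an open file handle (for example open(path, newline="", encoding="utf-8")) to load_rows in place of the in-memory sample.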

Distribution

This dataset consists of 78,577 entries. Each entry represents a question about a database, the context of the database schema, and the corresponding SQL query. Data files are typically in CSV format. The 'pergunta' column contains 78,220 unique values, 'contexto' has 72,947 unique values, and 'resposta' has 78,577 unique values.
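
The row and unique-value counts above can be re-checked on a downloaded copy with a short script. The following is a minimal sketch, assuming the file is a CSV with the three documented column names; the helper name unique_counts and the three sample rows are hypothetical and exist only to make the example self-contained.

```python
import csv
import io

# Hypothetical rows for illustration only; the real file has 78,577 entries.
SAMPLE_CSV = '''pergunta,contexto,resposta
"Qual é o nome?","CREATE TABLE a (nome VARCHAR)","SELECT nome FROM a"
"Qual é o nome?","CREATE TABLE b (nome VARCHAR)","SELECT nome FROM b"
"Qual é a idade?","CREATE TABLE b (nome VARCHAR)","SELECT idade FROM b"
'''


def unique_counts(handle, columns=("pergunta", "contexto", "resposta")):
    """Return (total rows, {column: number of distinct values})."""
    seen = {column: set() for column in columns}
    total = 0
    for row in csv.DictReader(handle):
        total += 1
        for column in columns:
            seen[column].add(row[column])
    return total, {column: len(values) for column, values in seen.items()}


total, counts = unique_counts(io.StringIO(SAMPLE_CSV))
```

On the reported figures, 78,577 - 78,220 = 357 rows repeat a question text that already appears in another row, and 78,577 - 72,947 = 5,630 rows reuse a schema context, while every query is distinct.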

Usage

This dataset is ideal for:
  • Training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial.
  • Enhancing model performance in text-to-SQL tasks.
  • Supporting natural language processing and machine learning tasks related to generating structured queries from natural language.
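
One way to measure accuracy in column and table naming is to check whether a generated query compiles against the schema in contexto. The sketch below does this with Python's built-in sqlite3 module by creating the tables in an in-memory database and asking SQLite to plan the query. It assumes the statements are valid SQLite, which may not hold for every entry, and the function name query_matches_schema and the example schema are hypothetical.

```python
import sqlite3


def query_matches_schema(create_table_sql, query_sql):
    """Return True if query_sql compiles against the given schema in SQLite.

    The query is only planned with EXPLAIN, not run for results. A False
    result can also mean the schema or query uses syntax SQLite rejects.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(create_table_sql)   # create the empty tables
        conn.execute(f"EXPLAIN {query_sql}")   # fails on unknown tables or columns
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema and queries, invented for illustration.
schema = "CREATE TABLE funcionarios (id INTEGER, nome VARCHAR)"
grounded = query_matches_schema(schema, "SELECT nome FROM funcionarios WHERE id = 1")
hallucinated = query_matches_schema(schema, "SELECT salario FROM funcionarios")
```

Because no data rows are needed, this kind of check fits the dataset's design of providing only the schema as context. It verifies names and syntax only, not whether the query actually answers the question.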

Coverage

The dataset has a global region scope and focuses on the Portuguese language. The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model. It was listed on 22/06/2025.

License

CC-BY-NC

Who Can Use It

This dataset is suitable for:
  • Data scientists and analysts focused on developing and refining natural language processing models.
  • Researchers and developers working on text-to-SQL solutions.
  • Anyone aiming to build or improve AI models that translate natural language queries into SQL, particularly for Portuguese.

Dataset Name Suggestions

  • Portuguese Text2SQL Database
  • NL to SQL Portuguese Dataset
  • SQL Query Generation from Portuguese Text
  • Portuguese Natural Language to SQL
  • Contextual Portuguese Text2SQL

Attributes

Original Data Source: Portuguese Text2SQL database

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

22/06/2025

REGION

GLOBAL

UDQS QUALITY (Universal Data Quality Score)

5 / 5

VERSION

1.0
