
gretelai/synthetic_text_to_sql

The gretelai/synthetic_text_to_sql dataset is a synthetic Text-to-SQL dataset generated with Gretel Navigator. It contains 105,851 records, split into 100,000 training records and 5,851 test records, covers 100 domains, and spans SQL tasks including data definition, retrieval, manipulation, analysis, and reporting. Each record also carries a natural-language explanation of its SQL query, along with contextual tags intended to support model training. Quality was assessed with an LLM-as-a-judge approach, in which the dataset scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence.

Source

hugging_face

Created

Nov 28, 2025

Updated

May 10, 2024

Overview

Dataset description and usage context

Dataset Overview

Basic Information

Name: gretelai/synthetic_text_to_sql
License: Apache-2.0
Language: English
Tags: synthetic, SQL, text‑to‑SQL, code
Task Types: question answering, table QA, text generation
Size Category: 100K < size < 1M

Dataset Content

Number of Records: 105,851 (100,000 training, 5,851 test)
Total Tokens: ~23 M, including ~12 M SQL tokens
Covered Domains: 100 different domains/verticals
SQL Task Types: data definition, retrieval, manipulation, analysis, reporting
SQL Complexity: includes sub‑queries, single joins, multiple joins, aggregation, window functions, set operations
Database Context: includes table and view creation statements
Natural‑Language Explanation: explanations of the SQL queries
Contextual Tags: used to optimize model training
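
Because each record ships with its own database context, a query can be sanity-checked by running it against an in-memory database. The following is a minimal sketch using Python's standard `sqlite3` module. The record is invented for illustration and is not taken from the dataset; the `INSERT` statement inside the context is an assumption added so that the query returns rows, and real records may use SQL syntax that SQLite does not accept.

```python
import sqlite3

# Hypothetical record shaped like the dataset's sql_context / sql fields.
# The values are made up for this sketch.
record = {
    "sql_context": (
        "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL); "
        "INSERT INTO sales VALUES "
        "(1, 'north', 10.0), (2, 'south', 5.5), (3, 'north', 4.5);"
    ),
    "sql": "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region;",
}


def run_record(rec):
    """Build the schema from sql_context in memory, then execute sql."""
    conn = sqlite3.connect(":memory:")
    try:
        # executescript accepts several semicolon-separated statements.
        conn.executescript(rec["sql_context"])
        return conn.execute(rec["sql"]).fetchall()
    finally:
        conn.close()


rows = run_record(record)
```

A record whose query raises `sqlite3.Error` here is not necessarily wrong; it may simply target a different SQL dialect.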

Dataset Characteristics

Diversity: broad range of SQL complexities and task types
Quality: evaluated with LLM‑as‑a‑judge; scores higher than the b‑mc2/sql‑create‑context dataset on SQL standard compliance, correctness, and instruction adherence
Applications: suitable for developers, researchers, and data enthusiasts building or refining text‑to‑SQL models

Dataset Structure

Number of Fields: 11
Fields: id, domain, domain_description, sql_complexity, sql_complexity_description, sql_task_type, sql_task_type_description, sql_prompt, sql_context, sql, sql_explanation
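
To show how these fields might be used together, here is a small Python sketch that checks a record against the field list and turns it into an (input, target) pair for text-to-SQL training. The helper names `missing_fields` and `to_training_pair`, the prompt layout, and the example record are all hypothetical choices made for illustration; none of them come from the dataset or its authors.

```python
# Field names as listed in the dataset description.
FIELDS = [
    "id", "domain", "domain_description", "sql_complexity",
    "sql_complexity_description", "sql_task_type",
    "sql_task_type_description", "sql_prompt", "sql_context",
    "sql", "sql_explanation",
]


def missing_fields(record):
    """Return the expected field names that the record lacks."""
    return [name for name in FIELDS if name not in record]


def to_training_pair(record):
    """Turn a record into an (input, target) pair.

    The prompt layout below is an arbitrary choice for this sketch.
    """
    source = (
        f"-- Schema:\n{record['sql_context']}\n"
        f"-- Question: {record['sql_prompt']}\n"
    )
    return source, record["sql"]


# Hypothetical record with placeholder values (not taken from the dataset).
example = {name: "" for name in FIELDS}
example.update(
    id=1,
    sql_prompt="How many orders are there?",
    sql_context="CREATE TABLE orders (id INT);",
    sql="SELECT COUNT(*) FROM orders;",
)

source, target = to_training_pair(example)
```

With the Hugging Face `datasets` library installed, real records could presumably be obtained via `load_dataset("gretelai/synthetic_text_to_sql")` and passed through the same helpers.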

Data Quality Evaluation

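Evaluation Method: GPT-4, acting as the judge, scored 1,000 randomly sampled records; the scores were compared with those of the b-mc2/sql-create-context dataset
Evaluation Results: scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence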
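
The comparison described above can be sketched roughly as follows: draw a random sample from each dataset, have a judge score every sampled record on each criterion, and compare the per-criterion means. This is only an illustration of the general idea, not the authors' evaluation code. The `judge` callable stands in for an LLM call and is entirely hypothetical, as are the function names, the score scale, and the fixed random seed.

```python
import random
from statistics import mean

# Criterion names follow the dataset description; the scoring scale is
# left to the judge and is an assumption of this sketch.
CRITERIA = ("sql_standard_compliance", "correctness", "instruction_adherence")


def sample_records(records, k=1000, seed=0):
    """Draw up to k records uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def mean_scores(records, judge):
    """Average the judge's per-criterion scores over the given records."""
    scored = [judge(r) for r in records]
    return {c: mean(s[c] for s in scored) for c in CRITERIA}


def compare(dataset_a, dataset_b, judge, k=1000, seed=0):
    """Per-criterion difference in mean score (A minus B)."""
    a = mean_scores(sample_records(dataset_a, k, seed), judge)
    b = mean_scores(sample_records(dataset_b, k, seed), judge)
    return {c: a[c] - b[c] for c in CRITERIA}
```

Here `judge` is expected to take one record and return a dict with a numeric score for each name in `CRITERIA`; a positive difference for a criterion means dataset A scored higher on it.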

Citation Information

@misc{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}
