Dataset asset | Open Source Community | Natural Language Processing | Database Query

gretelai/synthetic_text_to_sql

The gretelai/synthetic_text_to_sql dataset is a high-quality synthetic Text-to-SQL dataset generated with Gretel Navigator. It contains 105,851 records, split into 100,000 training records and 5,851 test records. The records span 100 domains and cover SQL tasks that include data definition, retrieval, manipulation, analysis, and reporting. Each record also carries a natural-language explanation of its SQL query and contextual tags intended to optimize model training. Quality was assessed with an LLM-as-a-judge technique, under which the dataset scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence.

  • Source: hugging_face
  • Created: Nov 28, 2025
  • Updated: May 10, 2024
  • Signals: 197 views
  • Availability: Linked source ready

Dataset Overview

Basic Information

  • Name: gretelai/synthetic_text_to_sql
  • License: Apache-2.0
  • Language: English
  • Tags: synthetic, SQL, text‑to‑SQL, code
  • Task Types: question answering, table QA, text generation
  • Size Category: 100K < size < 1M

Dataset Content

  • Number of Records: 105,851 (100,000 training, 5,851 test)
  • Total Tokens: ~23 M, including ~12 M SQL tokens
  • Covered Domains: 100 different domains/verticals
  • SQL Task Types: data definition, retrieval, manipulation, analysis, reporting
  • SQL Complexity: includes sub‑queries, single joins, multiple joins, aggregation, window functions, set operations
  • Database Context: includes table and view creation statements
  • Natural‑Language Explanation: explanations of the SQL queries
  • Contextual Tags: per-record labels (such as domain, SQL complexity, and SQL task type) intended to optimize model training
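
Because each record pairs a query with the table and view creation statements it depends on, a query can be sanity-checked by running both against a throwaway database. The following is a minimal sketch, assuming the context and query happen to be SQLite-compatible (the dataset does not promise this for every record). The `runs_in_sqlite` helper and the example record are hypothetical illustrations, not tooling or rows from the dataset.

```python
import sqlite3


def runs_in_sqlite(sql_context: str, sql: str) -> bool:
    """Return True if `sql` executes against the schema built by `sql_context`."""
    conn = sqlite3.connect(":memory:")
    try:
        # The context may hold several statements (CREATE TABLE, INSERT, ...).
        conn.executescript(sql_context)
        # The query itself is expected to be a single statement.
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical record, written for illustration only.
example_context = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders (id, region, amount)
VALUES (1, 'EU', 10.0), (2, 'EU', 5.5), (3, 'US', 7.0);
"""
example_sql = (
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY region;"
)

print(runs_in_sqlite(example_context, example_sql))
print(runs_in_sqlite(example_context, "SELECT missing_column FROM orders;"))
```

A check like this only confirms that a query parses and runs on one engine; it says nothing about whether the query answers the prompt.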

Dataset Characteristics

  • Diversity: broad range of SQL complexities and task types
  • Quality: evaluated with LLM‑as‑a‑judge; scores higher than the b‑mc2/sql‑create‑context dataset on SQL standard compliance, correctness, and instruction adherence
  • Applications: suitable for developers, researchers, and data enthusiasts building or refining text‑to‑SQL models

Dataset Structure

  • Number of Fields: 11
  • Fields: id, domain, domain_description, sql_complexity, sql_complexity_description, sql_task_type, sql_task_type_description, sql_prompt, sql_context, sql, sql_explanation
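
The field list above can be turned into a typed record and a training prompt. This is a sketch under assumptions: the value types in `TextToSqlRecord`, the prompt layout in `build_prompt`, and the sample record are all hypothetical choices made for illustration. `load_split` relies on the Hugging Face `datasets` library and network access, so it is defined here but not called.

```python
from typing import TypedDict


class TextToSqlRecord(TypedDict):
    """The 11 fields listed above; the value types are assumptions."""
    id: int
    domain: str
    domain_description: str
    sql_complexity: str
    sql_complexity_description: str
    sql_task_type: str
    sql_task_type_description: str
    sql_prompt: str
    sql_context: str
    sql: str
    sql_explanation: str


def build_prompt(record: TextToSqlRecord) -> str:
    """Combine schema and question into one model input (layout is illustrative)."""
    return (
        "-- Schema\n"
        f"{record['sql_context']}\n"
        "-- Question\n"
        f"{record['sql_prompt']}\n"
        "-- SQL\n"
    )


def load_split(split: str = "train"):
    """Load one split from the Hugging Face Hub (requires `datasets` and network)."""
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    return load_dataset("gretelai/synthetic_text_to_sql", split=split)


# Hypothetical record, written for illustration only.
sample: TextToSqlRecord = {
    "id": 1,
    "domain": "retail",
    "domain_description": "Sales data for a small shop.",
    "sql_complexity": "aggregation",
    "sql_complexity_description": "Uses an aggregate function with GROUP BY.",
    "sql_task_type": "analysis",
    "sql_task_type_description": "Generates a summary from stored data.",
    "sql_prompt": "What is the total order amount per region?",
    "sql_context": "CREATE TABLE orders (id INT, region TEXT, amount REAL);",
    "sql": "SELECT region, SUM(amount) FROM orders GROUP BY region;",
    "sql_explanation": "Sums the amount column for each region.",
}

print(build_prompt(sample))
```

The target for supervised fine-tuning would then be the record's `sql` field, optionally followed by `sql_explanation`.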

Data Quality Evaluation

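  • Evaluation Method: GPT-4, used as an LLM judge, scored 1,000 randomly sampled records; the scores were compared with those of the b-mc2/sql-create-context dataset
  • Evaluation Results: scored higher than the comparison dataset on SQL standard compliance, correctness, and instruction adherence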
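
The sample-and-score procedure described above can be outlined as follows. This is a rough sketch, not the evaluation code used for the dataset: the criterion names are taken from this page, while the `evaluate_sample` helper, the score scale, and `stub_judge` are hypothetical. A real judge would send each record to an LLM with a scoring rubric; the stub returns a constant so the example runs offline.

```python
import random
from statistics import mean
from typing import Callable, Mapping, Sequence

CRITERIA = ("sql_standard_compliance", "correctness", "instruction_adherence")

Judge = Callable[[Mapping[str, str]], Mapping[str, float]]


def evaluate_sample(
    records: Sequence[Mapping[str, str]],
    judge: Judge,
    sample_size: int = 1000,
    seed: int = 0,
) -> dict[str, float]:
    """Score a random sample of records and return the mean score per criterion."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(list(records), min(sample_size, len(records)))
    scores = [judge(record) for record in sample]
    return {name: mean(score[name] for score in scores) for name in CRITERIA}


def stub_judge(record: Mapping[str, str]) -> Mapping[str, float]:
    """Placeholder judge: returns the same score for every criterion."""
    return {name: 1.0 for name in CRITERIA}


# Hypothetical records, written for illustration only.
records = [{"sql_prompt": f"question {i}", "sql": "SELECT 1;"} for i in range(10)]
print(evaluate_sample(records, stub_judge, sample_size=5))
```

Running the same function over a second dataset with the same judge and seed gives the per-criterion means needed for a side-by-side comparison.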

Citation Information

@misc{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}