gretelai/synthetic_text_to_sql
The gretelai/synthetic_text_to_sql dataset is a high‑quality synthetic Text‑to‑SQL dataset generated with Gretel Navigator. It contains 105,851 records, split into 100,000 training records and 5,851 test records, covers 100 different domains, and includes SQL tasks such as data definition, retrieval, manipulation, analysis, and reporting. Each record also provides a natural‑language explanation of the SQL query and contextual tags to optimize model training. Quality was evaluated with LLM‑as‑a‑judge techniques, where the dataset scored higher than the b‑mc2/sql‑create‑context dataset on SQL standard compliance, correctness, and instruction adherence.
Dataset Overview
Basic Information
- Name: gretelai/synthetic_text_to_sql
- License: Apache-2.0
- Language: English
- Tags: synthetic, SQL, text‑to‑SQL, code
- Task Types: question answering, table QA, text generation
- Size Category: 100K < size < 1M
Dataset Content
- Number of Records: 105,851 (100,000 training, 5,851 test)
- Total Tokens: ~23 M, including ~12 M SQL tokens
- Covered Domains: 100 different domains/verticals
- SQL Task Types: data definition, retrieval, manipulation, analysis, reporting
- SQL Complexity: includes sub‑queries, single joins, multiple joins, aggregation, window functions, set operations
- Database Context: includes table and view creation statements
- Natural‑Language Explanation: explanations of the SQL queries
- Contextual Tags: used to optimize model training
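Because each record carries its own database context, a query can be sanity-checked by building that context and then running the query against it. The following is a minimal sketch using Python's built-in sqlite3 module. The record shown is invented for illustration (only the field names `sql_context` and `sql` come from the dataset), and it assumes the SQL is compatible with SQLite, which may not hold for every record.

```python
import sqlite3

# Hypothetical record in the dataset's shape; the values are invented for illustration.
record = {
    "sql_context": (
        "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL);"
        "INSERT INTO sales VALUES (1, 'north', 120.0), (2, 'south', 80.0), (3, 'north', 50.0);"
    ),
    "sql": "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region;",
}


def run_record(rec: dict) -> list:
    """Build the schema from sql_context in an in-memory database, then run sql."""
    conn = sqlite3.connect(":memory:")
    try:
        # sql_context may hold several statements, so executescript is used here.
        conn.executescript(rec["sql_context"])
        return conn.execute(rec["sql"]).fetchall()
    finally:
        conn.close()


rows = run_record(record)
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

A record whose context or query raises `sqlite3.Error` is not necessarily wrong; it may simply use syntax from another SQL dialect.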
Dataset Characteristics
- Diversity: broad range of SQL complexities and task types
- Quality: evaluated with LLM‑as‑a‑judge; scores higher than the b‑mc2/sql‑create‑context dataset on SQL standard compliance, correctness, and instruction adherence
- Applications: suitable for developers, researchers, and data enthusiasts building or refining text‑to‑SQL models
Dataset Structure
- Number of Fields: 11
- Fields: id, domain, domain_description, sql_complexity, sql_complexity_description, sql_task_type, sql_task_type_description, sql_prompt, sql_context, sql, sql_explanation
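The field list above can be used to check records and to turn them into training pairs. This is a rough sketch, not a recommended recipe: the prompt template, the helper names `load_splits` and `to_training_pair`, and the example record are all assumptions made for illustration. Loading relies on `load_dataset` from the Hugging Face `datasets` library, which needs network access and is therefore wrapped in a function that is not called here.

```python
# The 11 field names listed in this card.
FIELDS = (
    "id", "domain", "domain_description", "sql_complexity",
    "sql_complexity_description", "sql_task_type", "sql_task_type_description",
    "sql_prompt", "sql_context", "sql", "sql_explanation",
)


def load_splits():
    """Download the dataset from the Hugging Face Hub (needs `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline
    return load_dataset("gretelai/synthetic_text_to_sql")


def to_training_pair(record: dict) -> dict:
    """Turn one record into a prompt/completion pair (the template is an assumption)."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    prompt = (
        f"-- Schema\n{record['sql_context']}\n"
        f"-- Question\n{record['sql_prompt']}\n"
        f"-- SQL\n"
    )
    return {"prompt": prompt, "completion": record["sql"]}


# Invented example record: every field present, most left empty.
example = {name: "" for name in FIELDS}
example.update(
    id=1,
    sql_context="CREATE TABLE t (x INT);",
    sql_prompt="How many rows are in t?",
    sql="SELECT COUNT(*) FROM t;",
)
pair = to_training_pair(example)
print(pair["prompt"] + pair["completion"])
```

With the real data, `load_splits()` should return a mapping with `train` and `test` splits, and `to_training_pair` can be applied to each record in a split.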
Data Quality Evaluation
- Evaluation Method: GPT‑4 acted as the judge, scoring 1,000 randomly sampled records; the scores were compared with those of the b‑mc2/sql‑create‑context dataset
- Evaluation Results: scored higher than the comparison dataset on SQL standard compliance, correctness, and instruction adherence
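The sampling and comparison steps of such an evaluation can be sketched as follows. This is not Gretel's evaluation code: the judge itself is left out, the metric identifiers are illustrative names for the three criteria mentioned in this card, and the scores in the usage example are toy values, not reported results.

```python
import random
from statistics import mean

# Illustrative identifiers for the three criteria named in this card.
METRICS = ("sql_standard_compliance", "correctness", "instruction_adherence")


def sample_indices(n_records: int, k: int = 1000, seed: int = 0) -> list:
    """Pick k distinct record indices reproducibly."""
    rng = random.Random(seed)
    return rng.sample(range(n_records), k)


def mean_scores(scores: list) -> dict:
    """Average per-metric scores over a list of {metric: score} dicts."""
    return {m: mean(s[m] for s in scores) for m in METRICS}


def compare(ours: list, baseline: list) -> dict:
    """Per-metric difference in mean score (positive means `ours` is higher)."""
    a, b = mean_scores(ours), mean_scores(baseline)
    return {m: a[m] - b[m] for m in METRICS}


# 105,851 is the total record count; which split was sampled is not stated here.
indices = sample_indices(105_851)

# Toy judge outputs for two records per dataset, for illustration only.
ours = [{m: 4 for m in METRICS}, {m: 5 for m in METRICS}]
baseline = [{m: 3 for m in METRICS}, {m: 4 for m in METRICS}]
deltas = compare(ours, baseline)
print(deltas)
```

In a real run, each sampled record would be sent to the judge model with a scoring rubric, and the returned scores would replace the toy lists.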
Citation Information
@misc{gretel-synthetic-text-to-sql-2024,
author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
month = {April},
year = {2024},
url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}