gretelai/synthetic_text_to_sql
The gretelai/synthetic_text_to_sql dataset is a synthetic Text-to-SQL dataset generated using Gretel Navigator. It contains 105,851 records, split into 100,000 training records and 5,851 test records. The dataset covers 100 different domains and includes SQL tasks such as data definition, retrieval, manipulation, analysis, and reporting. Each record also provides a natural-language explanation of its SQL query and contextual tags intended to optimize model training. Quality was evaluated with an LLM-as-a-judge technique, in which the dataset scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence.
Description
Dataset Overview
Basic Information
- Name: gretelai/synthetic_text_to_sql
- License: Apache-2.0
- Language: English
- Tags: synthetic, SQL, text‑to‑SQL, code
- Task Types: question answering, table QA, text generation
- Size Category: 100K < size < 1M
Dataset Content
- Number of Records: 105,851 (100,000 training, 5,851 test)
- Total Tokens: ~23 M, including ~12 M SQL tokens
- Covered Domains: 100 different domains/verticals
- SQL Task Types: data definition, retrieval, manipulation, analysis, reporting
- SQL Complexity: includes sub‑queries, single joins, multiple joins, aggregation, window functions, set operations
- Database Context: includes table and view creation statements
- Natural‑Language Explanation: explanations of the SQL queries
- Contextual Tags: used to optimize model training
Dataset Characteristics
- Diversity: broad range of SQL complexities and task types
- Quality: evaluated with LLM‑as‑a‑judge; scores higher than the b‑mc2/sql‑create‑context dataset on SQL standard compliance, correctness, and instruction adherence
- Applications: suitable for developers, researchers, and data enthusiasts building or refining text‑to‑SQL models
Dataset Structure
- Number of Fields: 11
- Fields: id, domain, domain_description, sql_complexity, sql_complexity_description, sql_task_type, sql_task_type_description, sql_prompt, sql_context, sql, sql_explanation
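As a rough illustration of how the listed fields might be combined for fine-tuning, the sketch below turns one record into a prompt/completion pair. The field names come from the list above; the `build_example` helper, the prompt layout, and the sample record are assumptions made up for this example and are not part of the dataset or of any Gretel tooling.

```python
# Field names as listed in the dataset structure above.
FIELDS = [
    "id", "domain", "domain_description", "sql_complexity",
    "sql_complexity_description", "sql_task_type",
    "sql_task_type_description", "sql_prompt", "sql_context", "sql",
    "sql_explanation",
]


def build_example(record: dict) -> dict:
    """Turn one record into a prompt/completion pair (hypothetical layout)."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    # The schema (sql_context) comes first, then the natural-language request.
    prompt = (
        "-- Schema\n"
        f"{record['sql_context']}\n"
        "-- Question\n"
        f"{record['sql_prompt']}\n"
        "-- SQL\n"
    )
    return {"prompt": prompt, "completion": record["sql"]}


# Hypothetical record, invented for illustration (not taken from the dataset).
sample = {
    "id": 1,
    "domain": "retail",
    "domain_description": "Sales and inventory data for a retail store.",
    "sql_complexity": "aggregation",
    "sql_complexity_description": "Aggregate functions with GROUP BY.",
    "sql_task_type": "analytics and reporting",
    "sql_task_type_description": "Generating reports and summaries.",
    "sql_prompt": "What is the total revenue per product?",
    "sql_context": "CREATE TABLE sales (product TEXT, revenue REAL);",
    "sql": "SELECT product, SUM(revenue) FROM sales GROUP BY product;",
    "sql_explanation": "Groups sales by product and sums the revenue.",
}

example = build_example(sample)
print(example["prompt"] + example["completion"])
```

In practice the records would come from the dataset itself, for instance through the Hugging Face `datasets` library, rather than from a hand-written dictionary.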
Data Quality Evaluation
- Evaluation Method: GPT‑4 scored 1,000 random samples and compared results with the b‑mc2/sql‑create‑context dataset
- Evaluation Results: scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence
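The evaluation procedure described above (draw a random sample, have a judge score each record, average per criterion) could be sketched roughly as follows. This is not the authors' evaluation code: the criterion keys, the 0-to-1 scale, the `evaluate` helper, and the `stub_judge` placeholder are all assumptions for illustration. A real run would replace `stub_judge` with a call to an LLM that returns rubric scores.

```python
import random
from statistics import mean

# Criteria named in the dataset card; the key spellings are assumptions.
CRITERIA = ("sql_standard_compliance", "correctness", "instruction_adherence")


def evaluate(records, judge, sample_size=1000, seed=0):
    """Score a random sample of records and return the mean per criterion."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(records))
    sampled = rng.sample(records, k)
    scores = [judge(record) for record in sampled]
    return {c: mean(s[c] for s in scores) for c in CRITERIA}


def stub_judge(record):
    """Placeholder judge (hypothetical): 1.0 if the SQL ends with ';'.

    A real judge would send the prompt, context, and SQL to an LLM
    and parse the scores it returns.
    """
    ok = 1.0 if record["sql"].strip().endswith(";") else 0.0
    return {c: ok for c in CRITERIA}


# Toy records invented for illustration only.
records = [{"sql": "SELECT 1;"}, {"sql": "SELECT 2"}] * 10
result = evaluate(records, stub_judge, sample_size=20)
print(result)
```

Running the same function over samples from two datasets with the same judge and seed would give per-criterion means that can be compared side by side.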
Citation Information
@misc{gretel-synthetic-text-to-sql-2024,
author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
month = {April},
year = {2024},
url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}
Source
Organization: hugging_face