
gretelai/synthetic_text_to_sql

The gretelai/synthetic_text_to_sql dataset is a high-quality synthetic Text-to-SQL dataset generated with Gretel Navigator. It contains 105,851 records (100,000 training, 5,851 test) spanning 100 domains and covers SQL tasks including data definition, retrieval, manipulation, analysis, and reporting. Each record also carries a natural-language explanation of its SQL query and contextual tags intended to help model training. Quality was assessed with an LLM-as-a-judge approach, in which the dataset scored higher than b-mc2/sql-create-context on SQL standard compliance, correctness, and instruction adherence.

Updated 5/10/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Name: gretelai/synthetic_text_to_sql
  • License: Apache-2.0
  • Language: English
  • Tags: synthetic, SQL, text‑to‑SQL, code
  • Task Types: question answering, table QA, text generation
  • Size Category: 100K < size < 1M

Dataset Content

  • Number of Records: 105,851 (100,000 training, 5,851 test)
  • Total Tokens: ~23 M, including ~12 M SQL tokens
  • Covered Domains: 100 different domains/verticals
  • SQL Task Types: data definition, retrieval, manipulation, analysis, reporting
  • SQL Complexity: includes sub‑queries, single joins, multiple joins, aggregation, window functions, set operations
  • Database Context: includes table and view creation statements
  • Natural‑Language Explanation: explanations of the SQL queries
  • Contextual Tags: used to optimize model training
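Because each record pairs a query with the table and view creation statements it depends on, one simple sanity check is to run both in a throwaway database. The sketch below is only an illustration: it assumes the statements are compatible with SQLite (the dataset does not promise this), the helper name runs_in_sqlite is hypothetical, and the example record is hand-written rather than taken from the dataset.

```python
import sqlite3


def runs_in_sqlite(sql_context: str, sql: str) -> bool:
    """Return True if the context and the query both execute in a fresh in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # CREATE TABLE / CREATE VIEW / INSERT statements
        conn.execute(sql).fetchall()     # run the target query and drain any result rows
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, missing tables/columns, or dialect features SQLite lacks.
        return False
    finally:
        conn.close()


# Hypothetical record, written for illustration only (not copied from the dataset).
example = {
    "sql_context": (
        "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL); "
        "INSERT INTO sales VALUES (1, 'north', 10.0), (2, 'south', 5.5);"
    ),
    "sql": "SELECT region, SUM(amount) FROM sales GROUP BY region;",
}

print(runs_in_sqlite(example["sql_context"], example["sql"]))
```

A False result here only means the statements did not run under SQLite; a record written for another SQL dialect can fail this check and still be valid.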

Dataset Characteristics

  • Diversity: broad range of SQL complexities and task types
  • Quality: evaluated with LLM-as-a-judge; scores higher than the b-mc2/sql-create-context dataset on SQL standard compliance, correctness, and instruction adherence
  • Applications: suitable for developers, researchers, and data enthusiasts building or refining text‑to‑SQL models

Dataset Structure

  • Number of Fields: 11
  • Fields: id, domain, domain_description, sql_complexity, sql_complexity_description, sql_task_type, sql_task_type_description, sql_prompt, sql_context, sql, sql_explanation
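The field list above maps naturally onto a typed record. The following is a minimal sketch of how such a record might be turned into a prompt/completion pair for fine-tuning. The field names come from the list above; the Python types, the TextToSqlRecord and to_training_pair names, and the prompt template are assumptions made for illustration, not part of the dataset.

```python
from typing import Tuple, TypedDict


class TextToSqlRecord(TypedDict):
    """The 11 fields listed on the dataset card; the types here are assumed."""
    id: int
    domain: str
    domain_description: str
    sql_complexity: str
    sql_complexity_description: str
    sql_task_type: str
    sql_task_type_description: str
    sql_prompt: str
    sql_context: str
    sql: str
    sql_explanation: str


# Hypothetical prompt layout; any template that exposes schema and question would do.
PROMPT_TEMPLATE = "-- Schema:\n{context}\n\n-- Question: {question}\n-- SQL:\n"


def to_training_pair(record: TextToSqlRecord) -> Tuple[str, str]:
    """Build (prompt, completion): schema and question in, target SQL out."""
    prompt = PROMPT_TEMPLATE.format(
        context=record["sql_context"].strip(),
        question=record["sql_prompt"].strip(),
    )
    return prompt, record["sql"].strip()


# Hand-written placeholder record (not taken from the dataset).
record: TextToSqlRecord = {
    "id": 1,
    "domain": "retail",
    "domain_description": "placeholder description",
    "sql_complexity": "aggregation",
    "sql_complexity_description": "placeholder description",
    "sql_task_type": "analysis",
    "sql_task_type_description": "placeholder description",
    "sql_prompt": "What is the total sales amount per region?",
    "sql_context": "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL);",
    "sql": "SELECT region, SUM(amount) FROM sales GROUP BY region;",
    "sql_explanation": "placeholder explanation",
}

prompt, completion = to_training_pair(record)
print(prompt + completion)
```

Records in this shape can be obtained by loading gretelai/synthetic_text_to_sql with the Hugging Face datasets library and iterating over a split; the sql_explanation field could be appended to the completion if explanations are wanted as part of the training target.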

Data Quality Evaluation

  • Evaluation Method: GPT-4, acting as judge, scored 1,000 randomly sampled records; the scores were compared with those of the b-mc2/sql-create-context dataset
  • Evaluation Results: scored higher than the comparison dataset on SQL standard compliance, correctness, and instruction adherence
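The sampling and aggregation side of such an evaluation can be sketched in a few lines. This is not the authors' evaluation code: the function names, the fixed seed, the criterion keys, and the numeric scale are all assumptions, and the two judgement dictionaries are placeholder numbers, not real evaluation results. The call to the judge model itself is left out.

```python
import random
from statistics import mean
from typing import Dict, List


def sample_indices(n_records: int, k: int = 1000, seed: int = 0) -> List[int]:
    """Pick k distinct record indices reproducibly (sampling without replacement)."""
    return sorted(random.Random(seed).sample(range(n_records), k))


def mean_scores(judgements: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each criterion over all judged samples."""
    criteria = judgements[0].keys()
    return {c: mean(j[c] for j in judgements) for c in criteria}


# 105,851 is the record count stated on the dataset card.
indices = sample_indices(105_851)
print(len(indices))

# Placeholder judge outputs with assumed criterion names and an assumed numeric scale.
placeholder_judgements = [
    {"sql_standard_compliance": 5, "correctness": 4, "instruction_adherence": 5},
    {"sql_standard_compliance": 3, "correctness": 2, "instruction_adherence": 3},
]
print(mean_scores(placeholder_judgements))
```

Running the same sampling and scoring over both datasets and comparing the per-criterion means gives the kind of side-by-side comparison the card describes.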

Citation Information

@misc{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}


Topics

Natural Language Processing
Database Query

Source

Organization: hugging_face

Created: Unknown
