json-training

This dataset is intended to support fine‑tuning of small yet powerful models (e.g., Qwen2 0.5B and SmolLM 135M/360M) that struggle with JSON‑structured data generation tasks. It contains three fields—`query`, `schema`, and `response`—representing the user's plain‑text query, the desired output JSON schema, and an LLM response that conforms to the schema. The data were synthesized by large language models such as Llama 3.1 8B and Claude 3.5 Sonnet and will be updated regularly.

Updated 8/22/2024

huggingface

Description

JSON Training Data

数据集概述

该数据集旨在为小型但功能强大的模型（如Qwen2 0.5B和SmolLM 135M/360M）提供微调数据，特别是在JSON结构化数据生成方面。这些模型在处理JSON输出时表现不佳，因此需要专门的数据集进行微调。

数据收集

数据完全由大型语言模型（LLMs）合成生成，主要使用Llama 3.1 8B生成，并由Claude 3.5 Sonnet贡献约2000个示例。

数据字段

数据集包含以下字段：

query：用户的纯文本查询，无结构化组件。
schema：期望的输出JSON模式。
response：符合schema的LLM对query的示例响应。

用户可以根据需要将这些字段转换为任何格式进行微调，例如将模式放入系统提示中，或将模式注入用户消息中。

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Please login to view download links and access full dataset details.

Topics

Model Training

JSON Data Processing

Source

Organization: huggingface

Created: 8/21/2024

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.

Check Prices →