Back to datasets
Dataset assetOpen Source CommunityModel TrainingJSON Data Processing

json-training

This dataset is intended to support fine‑tuning of small yet powerful models (e.g., Qwen2 0.5B and SmolLM 135M/360M) that struggle with JSON‑structured data generation tasks. It contains three fields—`query`, `schema`, and `response`—representing the user's plain‑text query, the desired output JSON schema, and an LLM response that conforms to the schema. The data were synthesized by large language models such as Llama 3.1 8B and Claude 3.5 Sonnet and will be updated regularly.

Source
huggingface
Created
Aug 21, 2024
Updated
Aug 22, 2024
Signals
218 views
Availability
Linked source ready
Overview

Dataset description and usage context

JSON Training Data

数据集概述

该数据集旨在为小型但功能强大的模型(如Qwen2 0.5B和SmolLM 135M/360M)提供微调数据,特别是在JSON结构化数据生成方面。这些模型在处理JSON输出时表现不佳,因此需要专门的数据集进行微调。

数据收集

数据完全由大型语言模型(LLMs)合成生成,主要使用Llama 3.1 8B生成,并由Claude 3.5 Sonnet贡献约2000个示例。

数据字段

数据集包含以下字段:

  • query:用户的纯文本查询,无结构化组件。
  • schema:期望的输出JSON模式。
  • response:符合schema的LLM对query的示例响应。

用户可以根据需要将这些字段转换为任何格式进行微调,例如将模式放入系统提示中,或将模式注入用户消息中。

Need downstream help?

Pair the dataset with AI analysis and content workflows.

Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.

Explore AI studio