
smoltalk-chinese

smoltalk‑chinese is a Chinese fine‑tuning dataset modeled on the SmolTalk dataset, providing more than 700,000 entries of high‑quality synthetic data for training large language models (LLMs). It combines tasks adapted from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity, and experiments show that models fine‑tuned on smoltalk‑chinese achieve significant advantages on multiple metrics.

Source
Hugging Face
Created
Dec 25, 2024
Updated
Jan 2, 2025
Overview

Dataset description and usage context

Chinese SmolTalk Dataset Overview

Basic Information

  • Language: Chinese (zh)
  • Task Category: Text Generation (text‑generation)
  • License: Apache‑2.0
  • Scale: 10B < n < 100B

Description

smoltalk‑chinese is a Chinese fine‑tuning dataset modeled on the SmolTalk dataset and designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, comprising more than 700,000 entries, and is specifically designed to improve the performance of Chinese LLMs on a variety of tasks, enhancing model versatility and adaptability. It is composed of several parts: tasks adapted from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogues, and mathematics problem data from the Chinese version of Math23K. The generation process adheres to high standards to ensure data quality and diversity, and experimental validation shows that models fine‑tuned on smoltalk‑chinese exhibit significant advantages across multiple metrics.
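To make the dataset's shape concrete, the sketch below shows a hypothetical entry in a SmolTalk‑style chat format (a list of role/content messages) and a minimal helper for flattening it into a fine‑tuning string. The field names (`messages`, `role`, `content`) and the example content are assumptions for illustration; consult the dataset card on Hugging Face for the actual schema.

```python
# Hypothetical record in an assumed SmolTalk-style chat format:
# each entry holds a list of {"role", "content"} messages.
record = {
    "messages": [
        {"role": "user", "content": "请解释什么是机器学习？"},
        {"role": "assistant", "content": "机器学习是让计算机从数据中学习规律的方法。"},
    ]
}

def to_training_text(messages):
    """Flatten a message list into one plain-text training string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

text = to_training_text(record["messages"])
print(text)
```

In practice a chat template specific to the base model (rather than this plain `role: content` layout) would typically be applied before fine‑tuning.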

Composition

  1. Magpie‑Ultra Reference Tasks

    • Three‑turn dialogues synthesized with Magpie; task types include:
      • Information‑seeking
      • Reasoning
      • Planning
      • Editing
      • Coding
      • Mathematics
      • Role‑playing
      • Data‑analysis
      • Creative‑writing
      • Advice‑seeking
      • Brainstorming
  2. SmolTalk Reference Tasks

    • Single‑turn dialogues synthesized with Magpie; task types include:
      • Format‑constrain
      • Rewrite
      • Summary
      • Safe
      • Translate
      • Document QA
  3. Simulated Daily Dialogue

    • Generates five‑turn dialogues simulating everyday conversation style.
  4. Mathematics Problems

    • Mathematics questions from the Chinese version of Math23K; answers include detailed reasoning steps generated by deepseek‑v2.5.

Generation Method

  • Data Generation: Synthetic data is created with Magpie using the deepseek‑v2.5 and qwen2.5‑72b‑instruct models; the Distilabel library is used to ensure richness and diversity.
  • Data Screening: qwen2‑7b‑instruct evaluates the clarity and fluency of the first instruction in each entry; only entries scoring at least 2 are retained.
  • Deduplication: gte‑large‑zh encodes the first instruction of each entry; near‑duplicates are removed based on embedding similarity.
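The deduplication step above can be sketched as a greedy filter over embedding vectors: keep an entry only if its cosine similarity to every already‑kept entry stays below a threshold. The toy vectors and the 0.9 cutoff below are assumptions for illustration (the card does not state the exact threshold); in the real pipeline the vectors would come from gte‑large‑zh.

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.9):
    """Greedy dedup: drop any item whose cosine similarity to an
    already-kept item meets or exceeds the threshold."""
    kept = []
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i, v in enumerate(normed):
        if all(float(v @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy vectors standing in for gte-large-zh embeddings of first instructions;
# the first two are near-duplicates, the third is distinct.
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(dedup_by_embedding(emb))  # -> [0, 2]
```

A greedy pass like this is order‑dependent; large pipelines often use approximate nearest‑neighbor indexes instead of the O(n²) comparison shown here.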

Experimental Validation

  • Base Model: opencsg/csg‑wukong‑ablation‑chinese‑fineweb‑edu (a 2B model pre‑trained on chinese‑fineweb‑edu)

  • Fine‑Tuning: Conducted separately on the smoltalk‑chinese, Magpie‑Qwen2‑Pro‑200K‑Chinese, and infinity‑instruct datasets for comparison; settings:

    • Epochs: 2
    • Learning Rate: 3e‑4
    • Scheduler: Cosine decay
    • Global Batch Size: 32
  • Evaluation Results: Evaluated on AlignBench; the model fine‑tuned on smoltalk‑chinese achieves notably better results across multiple metrics.
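The reported schedule (base learning rate 3e‑4 with cosine decay) can be reproduced with a few lines of Python. The total step count and the zero floor (`min_lr=0.0`) are illustrative assumptions, not values stated in the card.

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine-decayed learning rate: starts at base_lr, ends at min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000  # illustrative total number of optimizer steps
lr_start = cosine_lr(0, total)      # 3e-4 at the first step
lr_end = cosine_lr(total, total)    # decays to min_lr (0.0 here)
print(lr_start, lr_end)
```

Trainer frameworks such as Hugging Face `transformers` expose the same shape via `lr_scheduler_type="cosine"`, usually combined with a short warmup phase.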

License

Using the Chinese SmolTalk dataset requires compliance with the OpenCSG community license. Commercial use is permitted, but authorization must first be obtained by emailing lorraineg@opencsg.com.
