JUHE API Marketplace
DATASET
Open Source Community

smoltalk-chinese

smoltalk‑chinese is a Chinese fine‑tuning dataset referenced from the SmolTalk dataset, designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, covering more than 700,000 entries, and is composed of multiple parts including tasks referenced from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity. Experiments show that models fine‑tuned on smoltalk‑chinese achieve significant advantages on multiple metrics.

Updated 1/2/2025
huggingface

Description

Chinese SmolTalk Dataset Overview

Basic Information

  • Language: Chinese (zh)
  • Task Category: Text Generation (text‑generation)
  • License: Apache‑2.0
  • Scale: 10B < n < 100B

Description

smoltalk‑chinese is a Chinese fine‑tuning dataset referenced from the SmolTalk dataset, designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, covering more than 700,000 entries, and is specifically designed to improve performance of Chinese LLMs on various tasks, enhancing model versatility and adaptability. The dataset comprises multiple parts, including tasks referenced from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogue style, and mathematics problem data from the Chinese version of Math23K. The generation process adheres to high standards, ensuring data quality and diversity. Experimental validation shows that models fine‑tuned on smoltalk‑chinese exhibit significant advantages across multiple metrics.

Composition

  1. Magpie‑Ultra Reference Tasks

    • Uses three‑turn dialogues synthesized by Magpie, tasks include:
      • Information‑seeking
      • Reasoning
      • Planning
      • Editing
      • Coding
      • Mathematics
      • Role‑playing
      • Data‑analysis
      • Creative‑writing
      • Advice‑seeking
      • Brainstorming
  2. SmolTalk Reference Tasks

    • Uses one‑turn dialogues synthesized by Magpie, tasks include:
      • Format‑constrain
      • Rewrite
      • Summary
      • Safe
      • Translate
      • Document QA
  3. Simulated Daily Dialogue

    • Generates five‑turn dialogues simulating everyday conversation style.
  4. Mathematics Problems

    • Mathematics questions from the Chinese version of Math23K, answers include detailed reasoning steps generated by deepseek‑v2.5.

Generation Method

  • Data Generation: Synthetic data created with Magpie; models used include deepseek‑v2.5 and qwen2.5‑72b‑instruct; Distilabel library ensures richness and diversity.
  • Data Screening: qwen2‑7b‑instruct evaluates clarity and fluency of the first instruction; only entries scoring ≥2 are retained.
  • Deduplication: gte‑large‑zh encodes the first instruction; duplicates are removed based on embedding similarity.

Experimental Validation

  • Base Model: opencsg/csg‑wukong‑ablation‑chinese‑fineweb‑edu (a 2B model pre‑trained on chinese‑fineweb‑edu)

  • Fine‑Tuning: Conducted on smoltalk‑chinese, Magpie‑Qwen2‑Pro‑200K‑Chinese and infinity‑instruct datasets; settings:

    • Epochs: 2
    • Learning Rate: 3e‑4
    • Scheduler: Cosine decay
    • Global Batch Size: 32
  • Evaluation Results: Evaluated on Alignbench; results show models fine‑tuned on smoltalk‑chinese achieve notable superiority across multiple metrics.

License

Using the Chinese SmolTalk dataset requires compliance with the OpenCSG community license. The dataset permits commercial use, but an email must be sent to lorraineg@opencsg.com to obtain permission.

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Login to Access

Please login to view download links and access full dataset details.

Topics

Language Models
Chinese Language Processing

Source

Organization: huggingface

Created: 12/25/2024

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.