Datasets | JuheAPI

smoltalk-chinese

Language Models

Chinese Language Processing

smoltalk-chinese is a Chinese fine-tuning dataset modeled on the SmolTalk dataset and designed to provide high-quality synthetic data for training large language models (LLMs). It consists entirely of synthetic data, totaling more than 700,000 entries, and combines several parts: tasks modeled on magpie-ultra, other SmolTalk tasks, simulated daily-life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity. In reported experiments, models fine-tuned on smoltalk-chinese show significant gains on multiple metrics.

huggingface

