botp/Azure99_blossom-chat-v3
Blossom Chat V3 is a bilingual Chinese‑English dialogue dataset derived from ShareGPT 90K, suitable for multi‑turn dialogue fine‑tuning. The dataset is fully distilled using GPT‑4, addressing the scarcity of Chinese dialogue data and the output truncation problem. Chinese and English data are mixed in roughly a 1:1 ratio; each record represents a complete multi‑turn conversation containing an `id` and a `conversations` field. The `conversations` field includes `role` and `content`, representing user input and assistant output respectively. The dataset exhibits issues such as incoherent multi‑turn dialogues and inaccurate answers.
Dataset description and usage context
Dataset Overview
Dataset Name
BLOSSOM CHAT V3
Dataset Source
Derived from ShareGPT 90K, specifically designed for bilingual Chinese‑English multi‑turn dialogue fine‑tuning.
Dataset Characteristics
- Fully distilled with GPT‑4.
- Solves the problems of limited Chinese dialogue data and output truncation caused by ChatGPT’s length limits.
- The released version contains 50 % of the total data, amounting to 5 K records.
Language
The dataset primarily contains Chinese and English, mixed at approximately a 1:1 ratio.
Dataset Structure
- id: Unique identifier starting from 1.
- conversations: Array of objects, each with
roleandcontentfields.role: Eitheruserorassistant, indicating user input or assistant output.content: The corresponding textual content.
Dataset Limitations
- May contain incoherent multi‑turn dialogues, especially in conversations involving randomness.
- All responses are generated by gpt‑4‑0125‑preview without rigorous data verification; they may include inaccurate or severely erroneous answers.
Pair the dataset with AI analysis and content workflows.
Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.