
zhengr/ultrachat_200k

UltraChat 200k is a heavily filtered subset of UltraChat, a dialogue dataset of 1.4 M ChatGPT‑generated conversations covering a wide range of topics. Processing consisted of selecting a subset of the data, correcting letter case, and removing dialogues that contain certain phrases. The result is suited to supervised fine‑tuning and generation ranking tasks.

Source

hugging_face

Created

Nov 28, 2025

Updated

Nov 8, 2023

Overview

Dataset description and usage context

UltraChat 200k Dataset Overview

Dataset Description

UltraChat 200k is a filtered version of the UltraChat dataset, and it was used to train Zephyr‑7B‑β, a 7B-parameter chat model. The original UltraChat dataset contains 1.4 M dialogues generated by ChatGPT across diverse topics. To create UltraChat 200k, the following processing steps were applied:

Selected a subset of the data so that supervised fine‑tuning runs faster.
Corrected letter case (truecasing), because about 5 % of entries contained capitalization errors such as "Hello. how are you?" instead of "Hello. How are you?".
Removed dialogues in which the assistant replied with phrases like "I do not have emotions" or "I don't have opinions", including cases where the prompt was purely factual and did not call for either.
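The last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the dataset authors actually ran: the phrase list holds just the two phrases quoted above, the regex is a naive stand-in for a real truecasing model, and the names `BLOCKED_PHRASES`, `fix_sentence_case`, and `keep_dialogue` are hypothetical.

```python
import re

# Assumed phrase list: only the two phrases quoted in the text.
# The full list used to build the dataset is not given here.
BLOCKED_PHRASES = ("i do not have emotions", "i don't have opinions")


def fix_sentence_case(text: str) -> str:
    """Upper-case a lowercase letter that opens the text or follows . ! or ?"""
    # Naive stand-in for truecasing; it will also capitalize after
    # abbreviations such as "e.g. ", which a real truecaser would avoid.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def keep_dialogue(messages: list[dict]) -> bool:
    """Return False if any assistant turn contains a blocked phrase."""
    for message in messages:
        if message["role"] != "assistant":
            continue  # only assistant replies are checked
        content = message["content"].lower()
        if any(phrase in content for phrase in BLOCKED_PHRASES):
            return False
    return True


print(fix_sentence_case("Hello. how are you?"))
```

Matching on lowercased text keeps the phrase check case-insensitive. A dialogue is dropped as a whole when any one assistant turn matches, which mirrors the description of removing dialogues rather than single replies.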

Dataset Structure

The dataset is divided into four splits, a train/test pair for each of two uses:

Supervised fine‑tuning (sft).
Generation ranking (gen), usable with rejection sampling or PPO techniques.

Sample counts per split:

train_sft	test_sft	train_gen	test_gen
207,865	23,110	256,032	28,304
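As a quick check on how the splits relate, the short sketch below computes the share of each task's examples held out for testing. The counts are copied from the table above; the helper name `held_out_fraction` is hypothetical.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "train_sft": 207_865,
    "test_sft": 23_110,
    "train_gen": 256_032,
    "test_gen": 28_304,
}


def held_out_fraction(task: str) -> float:
    """Fraction of a task's examples that sit in its test split."""
    train = SPLIT_SIZES[f"train_{task}"]
    test = SPLIT_SIZES[f"test_{task}"]
    return test / (train + test)


for task in ("sft", "gen"):
    print(f"{task}: {held_out_fraction(task):.2%} held out for testing")
```

Both tasks come out at roughly a 90/10 train/test division.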

The data is stored in Parquet format. Each entry has the fields shown in this example record:

{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {"content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...", "role": "user"},
        {"content": "Name: Ava\n\n Ava was just 16 years old ...", "role": "assistant"},
        {"content": "Wow, Ava's story is so intense ...", "role": "user"},
        {"content": "Certainly! ...", "role": "assistant"},
        {"content": "That's really interesting! ...", "role": "user"},
        {"content": "Certainly! ...", "role": "assistant"}
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
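A record in this shape can be turned into (user, assistant) turn pairs, a common first step before supervised fine‑tuning. The sketch below uses only the standard library and an abbreviated placeholder record. It assumes two things inferred from the single sample above rather than stated by the dataset: that the first message repeats the `prompt` field, and that messages strictly alternate between `user` and `assistant`. The function name `to_turn_pairs` and the `prompt_id` value are hypothetical.

```python
import json

# Abbreviated placeholder record in the shape shown above.
RECORD_JSON = """
{
  "prompt": "Create a protagonist ...",
  "messages": [
    {"content": "Create a protagonist ...", "role": "user"},
    {"content": "Name: Ava ...", "role": "assistant"},
    {"content": "Wow, Ava's story is so intense ...", "role": "user"},
    {"content": "Certainly! ...", "role": "assistant"}
  ],
  "prompt_id": "example-id"
}
"""


def to_turn_pairs(record: dict) -> list[tuple[str, str]]:
    """Pair each user message with the assistant reply that follows it."""
    messages = record["messages"]
    # Assumption from the sample: the first message repeats the prompt.
    if messages[0]["content"] != record["prompt"]:
        raise ValueError("first message does not match prompt")
    pairs = []
    # Even positions are expected to be user turns, odd positions assistant turns.
    for user_msg, assistant_msg in zip(messages[0::2], messages[1::2]):
        if user_msg["role"] != "user" or assistant_msg["role"] != "assistant":
            raise ValueError("messages do not alternate user/assistant")
        pairs.append((user_msg["content"], assistant_msg["content"]))
    return pairs


record = json.loads(RECORD_JSON)
pairs = to_turn_pairs(record)
print(len(pairs))
```

Raising on a mismatch, rather than silently skipping, makes it easy to spot records that break either assumption when the function is run over a full split.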

Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

You may also cite the Zephyr‑7B technical report:

@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment},
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
