zhengr/ultrachat_200k
UltraChat 200k is a rigorously filtered subset of UltraChat, a dataset of 1.4 M ChatGPT‑generated dialogues covering a wide range of topics. Processing included selecting a subset of the data, correcting letter case, and removing dialogues that contain certain phrases. The result is suited to supervised fine‑tuning and generation ranking tasks.
Description
UltraChat 200k Dataset Overview
Dataset Description
UltraChat 200k is a filtered version of the UltraChat dataset and was used to train Zephyr‑7B‑β, a 7B‑parameter chat model. The original UltraChat dataset contains 1.4 M dialogues generated by ChatGPT across diverse topics. To create UltraChat 200k, the following processing steps were applied:
- Selected a subset of data to accelerate supervised fine‑tuning.
- Corrected letter case (truecasing), because about 5 % of entries contained capitalization errors such as "Hello. how are you?" instead of "Hello. How are you?".
- Removed dialogues in which the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even in response to fact‑based prompts that involve neither emotions nor opinions.
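The last two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline that was actually used to build the dataset: the function names are hypothetical, the phrase list contains only the two phrases quoted above, and a simple regular expression stands in for real truecasing.

```python
import re

# Assumption: only these two phrases are quoted in the description;
# the full list used for the real dataset is not given here.
BANNED_PHRASES = ("I do not have emotions", "I don't have opinions")


def capitalize_sentence_starts(text: str) -> str:
    """Naively uppercase a lowercase letter that follows '.', '!' or '?' plus whitespace."""
    return re.sub(
        r"([.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def assistant_uses_banned_phrase(messages, phrases=BANNED_PHRASES) -> bool:
    """Return True if any assistant turn contains one of the phrases (case-insensitive)."""
    lowered = [p.lower() for p in phrases]
    return any(
        m["role"] == "assistant" and any(p in m["content"].lower() for p in lowered)
        for m in messages
    )


def clean_dialogues(dialogues):
    """Drop dialogues with banned assistant phrases, then fix casing in the rest."""
    kept = []
    for messages in dialogues:
        if assistant_uses_banned_phrase(messages):
            continue
        kept.append(
            [{**m, "content": capitalize_sentence_starts(m["content"])} for m in messages]
        )
    return kept
```

The regex is deliberately crude: it would also capitalize the word after an abbreviation such as "e.g.", so a production pipeline would likely rely on a dedicated truecasing model instead.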
Dataset Structure
The dataset is divided into four splits for:
- Supervised fine‑tuning (sft).
- Generation ranking (gen), usable with rejection sampling or PPO techniques.
Sample counts per split:
| train_sft | test_sft | train_gen | test_gen |
|---|---|---|---|
| 207,865 | 23,110 | 256,032 | 28,304 |
The data is stored in Parquet format. Each entry has the following structure, shown here as an example record:
{
"prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
"messages":[
{"content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...", "role": "user"},
{"content": "Name: Ava\n\n Ava was just 16 years old ...", "role": "assistant"},
{"content": "Wow, Ava's story is so intense ...", "role": "user"},
{"content": "Certainly! ...", "role": "assistant"},
{"content": "That's really interesting! ...", "role": "user"},
{"content": "Certainly! ...", "role": "assistant"}
],
"prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
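A record of this shape can be checked and unpacked with plain Python. The sketch below assumes two properties inferred only from the example above, not from any stated guarantee: roles alternate strictly between user and assistant, and the first message repeats the `prompt` field. The helper names `validate_record` and `iter_turns` are hypothetical.

```python
from typing import Iterator, Tuple

REQUIRED_KEYS = {"prompt", "messages", "prompt_id"}


def validate_record(record: dict) -> None:
    """Raise ValueError if the record does not match the layout shown above."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    messages = record["messages"]
    # Assumption: the first message repeats the prompt, as in the example.
    if not messages or messages[0]["content"] != record["prompt"]:
        raise ValueError("first message should repeat the prompt")
    # Assumption: roles alternate user, assistant, user, ...
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(f"message {i}: expected role {expected!r}")


def iter_turns(record: dict) -> Iterator[Tuple[str, str]]:
    """Yield (user_text, assistant_text) pairs in conversation order."""
    messages = record["messages"]
    for user, assistant in zip(messages[0::2], messages[1::2]):
        yield user["content"], assistant["content"]
```

If the Parquet files are read with a library such as Hugging Face `datasets` or pandas, each row should arrive as a dict-like object that these helpers can consume, although that has not been verified against this particular copy of the dataset.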
Citation
If you find this dataset useful in your work, please cite the original UltraChat dataset:
@misc{ding2023enhancing,
title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
year={2023},
eprint={2305.14233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
You may also cite the Zephyr‑7B technical report:
@misc{tunstall2023zephyr,
title={Zephyr: Direct Distillation of LM Alignment},
author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
year={2023},
eprint={2310.16944},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Source
Organization: hugging_face
Created: Unknown