zhengr/ultrachat_200k

UltraChat 200k is a heavily filtered subset of UltraChat, a collection of 1.4 M ChatGPT‑generated dialogues covering a wide range of topics. Processing consisted of selecting a subset of the data, correcting capitalization, and removing dialogues that contain certain phrases. The result is suitable for supervised fine‑tuning and generation ranking.

Updated 11/8/2023
hugging_face

Description

UltraChat 200k Dataset Overview

Dataset Description

UltraChat 200k is a filtered version of the UltraChat dataset used to train Zephyr‑7B‑β, an advanced 7B chat model. The original dataset contains 1.4 M dialogues generated by ChatGPT across diverse topics. To create UltraChat 200k, the following processing steps were applied:

  • Selected a subset of data to accelerate supervised fine‑tuning.
  • Corrected capitalization (truecasing), because about 5 % of entries contained errors such as "Hello. how are you?" instead of "Hello. How are you?".
  • Removed dialogues in which the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", including cases where the prompt was purely factual and involved neither emotions nor opinions.
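
The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used to build the dataset. The phrase list holds only the two phrases quoted above, the capitalization fix is a naive regex stand-in for a real truecasing model, and the names `fix_sentence_case` and `keep_dialogue` are hypothetical.

```python
import re

# Only the two phrases quoted in the description; the full list used for
# the real dataset is not given here.
BLOCKED_PHRASES = ("i do not have emotions", "i don't have opinions")


def fix_sentence_case(text: str) -> str:
    """Capitalize a lowercase letter that follows sentence-ending punctuation.

    A naive stand-in for truecasing, for illustration only. It will also
    capitalize after abbreviations such as "e.g.".
    """
    return re.sub(
        r"([.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def keep_dialogue(messages: list[dict]) -> bool:
    """Return False if any assistant reply contains a blocked phrase."""
    for message in messages:
        if message["role"] != "assistant":
            continue  # user turns are never grounds for removal
        reply = message["content"].lower()
        if any(phrase in reply for phrase in BLOCKED_PHRASES):
            return False
    return True
```

Only assistant turns are checked, so a dialogue in which the user happens to type one of the phrases is still kept.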

Dataset Structure

The dataset is divided into four splits: a train and a test split for each of two uses:

  • Supervised fine‑tuning (sft).
  • Generation ranking (gen), usable with rejection sampling or PPO techniques.

Sample counts per split:

train_sft    test_sft    train_gen    test_gen
207,865      23,110      256,032      28,304
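
As a quick sanity check on the counts above, the snippet below computes the total number of samples and the held-out share for each use. The counts are copied from the table; the helper name `held_out_fraction` is hypothetical.

```python
# Sample counts copied from the split table above.
SPLIT_SIZES = {
    "train_sft": 207_865,
    "test_sft": 23_110,
    "train_gen": 256_032,
    "test_gen": 28_304,
}


def held_out_fraction(task: str) -> float:
    """Share of a task's samples that fall in its test split."""
    train = SPLIT_SIZES[f"train_{task}"]
    test = SPLIT_SIZES[f"test_{task}"]
    return test / (train + test)


total = sum(SPLIT_SIZES.values())

if __name__ == "__main__":
    print(f"total samples: {total:,}")
    for task in ("sft", "gen"):
        print(f"{task}: {held_out_fraction(task):.1%} held out for testing")
```

Both uses hold out roughly one tenth of their samples for testing.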

The data is stored in Parquet format, with each entry following this schema:

{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {"content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...", "role": "user"},
        {"content": "Name: Ava\n\n Ava was just 16 years old ...", "role": "assistant"},
        {"content": "Wow, Ava's story is so intense ...", "role": "user"},
        {"content": "Certainly! ...", "role": "assistant"},
        {"content": "That's really interesting! ...", "role": "user"},
        {"content": "Certainly! ...", "role": "assistant"}
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
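
A small checker can make the shape of an entry concrete. The sketch below tests for the three keys shown above. It also assumes, based only on this one sample, that roles alternate starting with "user" and that the first message repeats the prompt; neither rule is stated as a guarantee of the dataset. The function name `check_entry` is hypothetical.

```python
def check_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []

    # The three top-level keys shown in the sample entry.
    for key in ("prompt", "messages", "prompt_id"):
        if key not in entry:
            problems.append(f"missing key: {key}")

    messages = entry.get("messages", [])

    # Assumption: turns alternate user, assistant, user, ...
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected:
            problems.append(f"message {i}: expected role {expected!r}")

    # Assumption: the first user message is the same text as "prompt".
    if messages and messages[0].get("content") != entry.get("prompt"):
        problems.append("first message does not repeat the prompt")

    return problems
```

Calling `check_entry` on a well-formed entry returns an empty list, so `not check_entry(entry)` can serve as a filter predicate.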

Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

You may also cite the Zephyr‑7B technical report:

@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment},
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Topics

Chatbot
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown