JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Sort:

Browse by Category

dim/lmsys_chatbot_arena_conversations

Chatbot
Dialogue Analysis

The dataset includes multiple features such as question ID, dialogue content from two models, winner label, judge, turn count, anonymity flag, language, timestamp, and OpenAI moderation results. Each dialogue records content and role, turn number, and anonymity status. Toxicity detection results from two large models are also captured, including binary flags and probabilities. The dataset is provided as a training split with 33,000 samples.

hugging_face
View Details

NemoSheng/codefuse_fc_v1_sharegpt

Dialogue Data
Chatbot

The dataset contains dialogues and tool information, primarily for training and testing models. Dialogue information is stored as a list, each dialogue having a source and content field. Tool information is stored as a string. The dataset is split into training and test sets, with 72,032 training examples and 1,250 test examples. Download size 193,720,278 bytes, total size 1,002,393,963 bytes.

hugging_face
View Details

audichandra/bitext_customer_support_llm_dataset_indonesian

Customer Support
Chatbot

This dataset is the Bitext dataset translated into Indonesian using the Helsinki‑NLP/opus‑mt‑en‑id model. The original Bitext dataset is primarily used for training customer‑support LLM chatbots.

hugging_face
View Details

zhengr/ultrachat_200k

Chatbot
Natural Language Processing

UltraChat 200k is a rigorously filtered dialogue dataset containing 1.4 M ChatGPT‑generated conversations covering a wide range of topics. The dataset has been processed, including selecting a subset of data, correcting case, and removing dialogues containing certain phrases, making it suitable for supervised fine‑tuning and generation ranking tasks.

hugging_face
View Details

shareAI/DPO-zh-en-emoji

Chatbot
Instruction Fine‑tuning

--- license: apache-2.0 task_categories: - question-answering language: - zh - en pretty_name: dpo-llama3 size_categories: - 1K<n<10K --- A chatbot dialogue dataset with textual emojis, available in both Chinese and English versions, suitable for SFT/DPO training. We have carefully selected some questions originating from Zhihu, logic reasoning, and Weichi Bar as Queries. These were generated using the llama3 70b instruct version, with each query producing a Chinese version of the answer and an English version of the answer. This can be used for aligning language model "language type" and "language style" tasks. Github link: https://github.com/CrazyBoyM/llama3-Chinese-chat Modelscope link: https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary The data can also be used for traditional training methods such as SFT/ORPO, improving the model's logical reasoning and complex question answering capabilities while aligning language styles. 一个带有趣味文字表情的机器人聊天对话数据集,包含中文和英文版本,可用于SFT/DPO训练。 我们精心选出了一些源于知乎、逻辑推理、弱智吧的问题作为Query, 使用llama3 70b instruct版本采样生成, 对每个query生成一个中文版本的answer和一个英文版本的answer, 用于对齐语言模型的“语种”、“语言风格”任务。 Github地址:https://github.com/CrazyBoyM/llama3-Chinese-chat modelscope地址:https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary 该数据亦可用于SFT/ORPO等传统训练方式,可在对齐语言风格的同时提升模型的推理逻辑、复杂问题问答能力。 如果您的工作成果使用到了该项目,请按如下方式进行引用: If your work results use this project, please cite it as follows: ``` @misc{DPO-zh-en-emoji2024, author = {Xinlu Lai, shareAI}, title = {The DPO Dataset for Chinese and English with emoji}, year = {2024}, publisher = {huggingface}, journal = {huggingface repository}, howpublished = {\url{https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji}} } ```

hugging_face
View Details