Back to datasets
Dataset assetOpen Source CommunityChatbotInstruction Fine‑tuning

shareAI/DPO-zh-en-emoji

--- license: apache-2.0 task_categories: - question-answering language: - zh - en pretty_name: dpo-llama3 size_categories: - 1K<n<10K --- A chatbot dialogue dataset with textual emojis, available in both Chinese and English versions, suitable for SFT/DPO training. We have carefully selected some questions originating from Zhihu, logic reasoning, and Weichi Bar as Queries. These were generated using the llama3 70b instruct version, with each query producing a Chinese version of the answer and an English version of the answer. This can be used for aligning language model "language type" and "language style" tasks. Github link: https://github.com/CrazyBoyM/llama3-Chinese-chat Modelscope link: https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary The data can also be used for traditional training methods such as SFT/ORPO, improving the model's logical reasoning and complex question answering capabilities while aligning language styles. 一个带有趣味文字表情的机器人聊天对话数据集,包含中文和英文版本,可用于SFT/DPO训练。 我们精心选出了一些源于知乎、逻辑推理、弱智吧的问题作为Query, 使用llama3 70b instruct版本采样生成, 对每个query生成一个中文版本的answer和一个英文版本的answer, 用于对齐语言模型的“语种”、“语言风格”任务。 Github地址:https://github.com/CrazyBoyM/llama3-Chinese-chat modelscope地址:https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary 该数据亦可用于SFT/ORPO等传统训练方式,可在对齐语言风格的同时提升模型的推理逻辑、复杂问题问答能力。 如果您的工作成果使用到了该项目,请按如下方式进行引用: If your work results use this project, please cite it as follows: ``` @misc{DPO-zh-en-emoji2024, author = {Xinlu Lai, shareAI}, title = {The DPO Dataset for Chinese and English with emoji}, year = {2024}, publisher = {huggingface}, journal = {huggingface repository}, howpublished = {\url{https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji}} } ```

Source
hugging_face
Created
Nov 28, 2025
Updated
Jun 4, 2024
Signals
682 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Basic Information

  • License: Apache-2.0
  • Task Category: Question Answering
  • Languages: Chinese, English
  • Dataset Name: dpo-llama3
  • Size Category: 1K<n<10K

Dataset Description

  • Content: A chatbot dialogue dataset that includes textual emojis and provides both Chinese and English versions.
  • Purpose: Suitable for SFT/DPO training to align language models' "language type" and "language style".
  • Source: Queries were carefully selected from Zhihu, logic‑reasoning forums, and Weichi Bar, then generated with the llama3 70B instruct model.
  • Structure: Each query is paired with a Chinese answer and an English answer.

Application Scenarios

  • Training Methods: Can be used for traditional approaches such as SFT or ORPO, enhancing logical reasoning and complex QA capabilities while aligning language styles.

Citation Information

  • Authors: Xinlu Lai, shareAI
  • Title: The DPO Dataset for Chinese and English with emoji
  • Year: 2024
  • Publisher: huggingface
  • Repository: huggingface repository
  • Citation:
@misc{DPO-zh-en-emoji2024,
  author = {Xinlu Lai, shareAI},
  title = {The DPO Dataset for Chinese and English with emoji},
  year = {2024},
  publisher = {huggingface},
  journal = {huggingface repository},
  howpublished = {\url{https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji}}
}
Need downstream help?

Pair the dataset with AI analysis and content workflows.

Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.

Explore AI studio