JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.



alpaca-chinese-dataset

Instruction Fine‑tuning
Machine Translation

This dataset is a mixed Chinese‑English corpus for bilingual instruction fine‑tuning, with ongoing data correction. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset fixes those problems, translates the corrected samples into Chinese, and manually rewrites instructions where a literal translation loses rhyme, breaks tense consistency, or drops other nuances. Its scope: (1) fixing problems in the original English data; (2) translating into Chinese; (3) adjusting samples degraded by direct translation; (4) leaving code and special outputs unchanged; and (5) aligning special tags and refusal outputs.
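Alpaca-style instruction data follows a simple record schema. A minimal sketch of the kind of sanity check the cleanup effort describes (the field names follow the standard Alpaca format; the validation rules here are illustrative, not the project's actual pipeline):

```python
# A standard Alpaca-format sample: instruction, optional input, output.
sample = {
    "instruction": "将下面的句子翻译成英文。",  # "Translate the sentence below into English."
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}

def is_well_formed(record):
    """Basic checks of the kind the cleanup describes:
    required fields present and a non-empty output."""
    required = {"instruction", "output"}
    if not required.issubset(record):
        return False
    if not record["output"].strip():
        return False
    return True

print(is_well_formed(sample))  # True
```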

Source: GitHub

TigerResearch/sft_zh

Chinese QA
Instruction Fine‑tuning

A Chinese SFT data collection from the TigerBot open‑source project, encompassing multiple Chinese datasets: Alpaca‑Chinese, encyclopedia QA, classic‑literature QA, riddles, reading comprehension, general QA, and Zhihu QA. The collection can be used directly, without downloading each source separately.
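Records in SFT collections like this are typically simple instruction triples. A minimal sketch (field names are assumed, following the common instruction-tuning convention, and the sample content is illustrative) of turning one record into a chat-style message list:

```python
def to_messages(record):
    """Convert an instruction/input/output triple into chat messages."""
    user_turn = record["instruction"]
    if record.get("input"):
        user_turn += "\n" + record["input"]
    return [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": record["output"]},
    ]

example = {
    "instruction": "回答下面的问题。",        # "Answer the question below."
    "input": "天空为什么是蓝色的?",          # "Why is the sky blue?"
    "output": "因为大气对阳光的瑞利散射。",   # "Because of Rayleigh scattering."
}
print(to_messages(example))
```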

Source: Hugging Face

shareAI/DPO-zh-en-emoji

Chatbot
Instruction Fine‑tuning

License: apache-2.0 · Task: question-answering · Languages: zh, en · Size: 1K–10K samples

A chatbot dialogue dataset with textual emojis, available in both Chinese and English versions, suitable for SFT/DPO training. The queries are carefully selected questions from Zhihu, logic-reasoning collections, and the Ruozhiba forum; answers were sampled with the Llama 3 70B Instruct model, producing one Chinese answer and one English answer per query. The data is intended for aligning a language model's output language and language style, and can also be used with traditional training methods such as SFT or ORPO, improving the model's logical reasoning and complex question answering while aligning style.

GitHub: https://github.com/CrazyBoyM/llama3-Chinese-chat
ModelScope: https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary

If your work uses this project, please cite it as follows:

```
@misc{DPO-zh-en-emoji2024,
  author = {Xinlu Lai, shareAI},
  title = {The DPO Dataset for Chinese and English with emoji},
  year = {2024},
  publisher = {huggingface},
  journal = {huggingface repository},
  howpublished = {\url{https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji}}
}
```
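DPO training consumes preference pairs: one prompt with a preferred and a dispreferred response. A minimal sketch of such a record and a basic validity check (the field names and sample content are illustrative, not necessarily this dataset's exact schema):

```python
# Hypothetical DPO preference record: one prompt, a preferred ("chosen")
# answer and a dispreferred ("rejected") one.
record = {
    "prompt": "为什么天空是蓝色的?",        # "Why is the sky blue?"
    "chosen": "因为瑞利散射 🌤️,短波长的蓝光被大气散射得更多。",
    "rejected": "天空是蓝色的,因为海洋是蓝色的。",
}

def is_valid_dpo_pair(r):
    """A usable DPO pair needs a prompt and two distinct responses."""
    return (
        all(k in r for k in ("prompt", "chosen", "rejected"))
        and r["chosen"] != r["rejected"]
    )

print(is_valid_dpo_pair(record))  # True
```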

Source: Hugging Face

shareAI/ShareGPT-Chinese-English-90k

Natural Language Processing
Instruction Fine‑tuning

# ShareGPT-Chinese-English-90k Bilingual Human-Machine QA Dataset

License: apache-2.0 · Tasks: question-answering, text-generation · Languages: en, zh · Size: 10K–100K samples

A high‑quality Chinese‑English parallel bilingual human‑machine QA dataset covering user questions in real, complex scenarios. It is used for training high‑quality dialogue models, and is more robust in instruction distribution than datasets generated by repeatedly polling API interfaces to simulate machine‑generated Q&A (such as Moss).

**Features**

- Provides a fully semantically equivalent Chinese‑English parallel corpus, enabling bilingual dialogue-model training.
- All questions are genuine user inquiries, not fabricated by imagination or API polling (as in Moss), so they align more closely with the real distribution of user scenarios and phrasings.
- The ShareGPT data was collected through voluntary sharing by netizens, which acts as a natural human filter that screens out most low-quality dialogues.

The Firefly framework is recommended for quick, out‑of‑the‑box loading of this data format: https://github.com/yangjianxin1/Firefly

**Note**: This dataset was collected before ChatGPT showed signs of significant capability decline. (It is speculated that this may be partly because the provider replaced the 150B GPT‑3.5 with a distilled version of about 10B to reduce expenses, and partly because the introduction of more refusal responses degraded the model's ability to connect knowledge and logic.) Training an excellent dialogue LLM requires a high‑quality multi‑turn dialogue dataset; volunteers are welcome to join the dataset QQ group (130920969) to exchange, collect, and build high‑quality datasets.
The dataset ships in Firefly format; the script below (provided in the repository) converts it to the more widely used ShareGPT multi‑turn dialogue format.

```python
import json

def convert_jsonl(input_file, output_file):
    """Convert Firefly-format JSONL to ShareGPT-format JSONL."""
    with open(input_file, 'r', encoding='utf-8') as f, \
         open(output_file, 'w', encoding='utf-8') as fout:
        for line in f:
            data = json.loads(line.strip())
            new_conversations = []
            for conv in data['conversation']:
                for key, value in conv.items():
                    # Map speaker keys: 'assistant' -> 'gpt', others -> 'human'.
                    role = 'gpt' if key == 'assistant' else 'human'
                    new_conversations.append({'from': role, 'value': value})
            fout.write(json.dumps({'conversations': new_conversations},
                                  ensure_ascii=False) + '\n')

# Replace with your input and output file paths.
input_file = 'input_firefly.jsonl'
output_file = 'output_sharegpt.jsonl'
convert_jsonl(input_file, output_file)
```

Special thanks to Huaibei Ai'a Network Technology Co., Ltd. for sponsoring the translation costs.

If your work uses this project, please cite it as follows:

```
@misc{ShareGPT-Chinese-English-90k,
  author = {shareAI},
  title = {ShareGPT-Chinese-English-90k Bilingual Human-Machine QA Dataset},
  year = {2023},
  publisher = {huggingface},
  journal = {huggingface repository},
  howpublished = {\url{https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k}}
}
```
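The effect of the Firefly-to-ShareGPT conversion can be seen on a single record. A minimal in-memory sketch of the same key mapping, without file I/O (the sample content is illustrative):

```python
# One Firefly-format record: a list of turns keyed by speaker.
firefly_record = {
    "conversation": [
        {"human": "你好!", "assistant": "你好,有什么可以帮你?"}
    ]
}

# Apply the same mapping the conversion script uses:
# "assistant" -> "gpt", any other speaker key -> "human".
conversations = []
for turn in firefly_record["conversation"]:
    for key, value in turn.items():
        role = "gpt" if key == "assistant" else "human"
        conversations.append({"from": role, "value": value})

sharegpt_record = {"conversations": conversations}
print(sharegpt_record)
```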

Source: Hugging Face