DATASET
Open Source Community
zhihu_rlhf_3k
More than 3k human preference records derived from Zhihu Q&A. Each question comes with a pair of answers that received different upvote counts.
Updated 4/10/2024
Description
Preference Data
| Name | License | Description | Count |
|---|---|---|---|
| zhihu_rlhf_3k | cc‑by‑2.0 | More than 3k human preference records derived from Zhihu Q&A; each question comes with a pair of answers that received different upvote counts | 3k |
| huozi_rlhf_data | Apache 2.0 | 16.9k manually annotated preferences (responses from huozi‑1.0) | 16.9k |
| chatbot_arena_conversations | cc | 33k cleaned dialogues with paired human preferences, collected from 13k unique IP addresses | 33k |
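
As a rough illustration of how a record such as those described for zhihu_rlhf_3k (one question, two answers with different upvote counts) might be turned into a chosen/rejected pair for preference training, consider the sketch below. The `PreferencePair` fields and the `to_preference_pair` helper are hypothetical names chosen for this example; they are not taken from the dataset's actual schema, which should be checked before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    # Hypothetical field names; the real dataset may use different keys.
    prompt: str
    chosen: str
    rejected: str


def to_preference_pair(
    question: str,
    answer_a: str,
    votes_a: int,
    answer_b: str,
    votes_b: int,
) -> Optional[PreferencePair]:
    """Treat the answer with more upvotes as preferred; return None on a tie."""
    if votes_a == votes_b:
        return None  # equal votes carry no preference signal
    if votes_a > votes_b:
        return PreferencePair(question, answer_a, answer_b)
    return PreferencePair(question, answer_b, answer_a)


# Made-up sample values, for illustration only.
pair = to_preference_pair(
    "How do I start learning to cook?",
    "Begin with a few simple recipes.", 120,
    "Just order takeout.", 3,
)
print(pair.chosen)
```

Using upvotes as the preference signal is an assumption that follows from the description above; upvote counts can also reflect answer age or author popularity, so a minimum vote gap may be worth considering.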
Manual Annotations
| Name | License | Description | Count |
|---|---|---|---|
| ruozhiba | Apache 2.0 | Inspired by COIG‑CQIA; a similar dataset built with a more concise answer style | 1.5k |
| COIG‑CQIA | — | Open‑source, high‑quality instruction‑tuning dataset intended to provide the Chinese NLP community with high‑quality instruction data | 46k |
| OL‑CC | Apache 2.0 | Crowdsourced, human‑generated open‑source Chinese dialogue instruction set, containing 10k+ “instruction‑answer” pairs | 11.6k |
NLP Task Data Transformations
| Name | License | Description | Count |
|---|---|---|---|
| firefly‑train‑1.1M | none | Built from 23 common Chinese datasets using manually written instruction templates | 1.1M |
| pCLUE | none | Derived from 9 datasets (tnews, ocnli, etc.) with 73 prompts | 1.2M |
| xP3mt_zh | apache‑2.0 | Chinese version obtained by translating the original English xP3 dataset | 3,571,636 |
LLM‑Generated Data
| Name | License | Description | Count |
|---|---|---|---|
| alpaca_gpt4_data_zh_52k | Apache 2.0 | Data generated by GPT‑4 using Chinese prompts | 52k |
| alpaca_data_zh_51k | Apache 2.0 | Chinese Alpaca data containing 51k instruction samples scraped from ChatGPT (gpt‑3.5‑turbo) | 51k |
| BELLE | gpl‑3.0 | Chinese dataset generated following Stanford Alpaca methodology | 0.5M/1M/2M/10M |
| alpaca_chinese_dataset | MIT | About 21k manually verified translated Alpaca samples, supplemented with many Chinese‑specific samples | >21k |
| COIG | Apache 2.0/MIT/CC‑BY‑SA‑4.0 | Multiple sub‑datasets totaling 191,191 instruction samples | 191,191 |
| MOSS | cc‑by‑4.0 | moss‑002‑sft‑data (~590k Chinese dialogues), moss‑003‑sft‑data (~1.1M dialogues) | 590k/1.1M |
| HC3‑Chinese | cc‑by‑sa‑4.0 | Human‑ChatGPT comparison corpus | 12,853 |
| RefGPT‑Fact‑zh | Apache 2.0 | Multi‑turn dialogue dataset containing 50k Chinese factual knowledge turns | 50k |
| Safety‑Prompts | Apache 2.0 | 100k Chinese safety‑scenario prompts and ChatGPT responses | 100k |
Topics
Social QA
Preference Datasets
Source
Organization: github
Created: 4/25/2023