Name	License	Description	Count
zhihu_rlhf_3k	cc-by-2.0	Over 3k human preference records derived from Zhihu Q&A; each question comes with a pair of answers that received different numbers of up-votes	3k
huozi_rlhf_data	Apache 2.0	16.9k manually annotated preferences (responses from huozi-1.0)	16.9k
chatbot_arena_conversations	cc	33k cleaned conversations with pairwise human preferences, collected from 13k unique IP addresses	33k
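
As a rough illustration of how a vote-based record like those in zhihu_rlhf_3k could be turned into a preference pair, here is a minimal Python sketch. The field names `prompt`, `chosen` and `rejected` are a common convention in preference-tuning tooling, not the confirmed schema of any dataset listed above, and treating the more up-voted answer as the preferred one is an assumption. `build_pair` and the sample strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str    # the question text
    chosen: str    # answer assumed to be preferred (more up-votes)
    rejected: str  # answer assumed to be less preferred (fewer up-votes)


def build_pair(question: str, answer_a: str, votes_a: int,
               answer_b: str, votes_b: int) -> Optional[PreferencePair]:
    """Return a PreferencePair, or None when the vote counts tie."""
    if votes_a == votes_b:
        return None  # equal votes carry no preference signal
    if votes_a > votes_b:
        return PreferencePair(question, answer_a, answer_b)
    return PreferencePair(question, answer_b, answer_a)


pair = build_pair("How do I learn Go?", "Read the tour.", 120, "Just try it.", 8)
print(pair.chosen)  # prints "Read the tour."
```

Returning `None` on ties keeps ambiguous pairs out of the training set. A real pipeline might also require a minimum vote gap, which this sketch does not attempt.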

Name	License	Description	Count
ruozhiba	Apache 2.0	Inspired by COIG-CQIA; a similar dataset built with a more concise answer style	1.5k
COIG-CQIA	—	Open-source instruction-tuning dataset intended to give the Chinese NLP community high-quality instruction data	46k
OL-CC	Apache 2.0	Crowdsourced, human-written open-source Chinese dialogue instruction set containing 10k+ “instruction-answer” pairs	11.6k

Name	License	Description	Count
firefly-train-1.1M	none	Built from 23 common Chinese datasets using manually written instruction templates	1.1M
pCLUE	none	Derived from 9 datasets (tnews, ocnli, etc.) using 73 prompts	1.2M
xP3mt_zh	apache-2.0	Chinese version obtained by translating the original English xP3 dataset	3,571,636
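
The template-based construction described for firefly-train-1.1M and pCLUE can be sketched roughly as follows: an existing labeled example is slotted into one of several hand-written prompt templates to yield an instruction sample. The templates, the `to_instruction` helper and the `input`/`target` field names below are all hypothetical and are not taken from either dataset.

```python
import random

# Hypothetical hand-written templates for a headline-classification task.
TEMPLATES = [
    "Classify the topic of this news headline.\nHeadline: {text}\nOptions: {options}\nAnswer:",
    'Which of the following categories fits the headline "{text}"? {options}',
]


def to_instruction(example: dict, labels: list, rng: random.Random) -> dict:
    """Wrap one labeled example in a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    prompt = template.format(text=example["text"], options=", ".join(labels))
    return {"input": prompt, "target": example["label"]}


rng = random.Random(0)  # fixed seed so the template choice is reproducible
sample = to_instruction(
    {"text": "Central bank cuts rates", "label": "finance"},
    ["finance", "sports", "tech"],
    rng,
)
print(sample["input"])
print(sample["target"])
```

Using several templates per task varies the surface form of the instruction while the underlying label stays the same.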

Name	License	Description	Count
alpaca_gpt4_data_zh_52k	Apache 2.0	Data generated by GPT-4 using Chinese prompts	52k
alpaca_data_zh_51k	Apache 2.0	Chinese Alpaca data containing 51k instruction samples generated by querying ChatGPT (gpt-3.5-turbo)	51k
BELLE	gpl-3.0	Chinese dataset generated following the Stanford Alpaca methodology	0.5M/1M/2M/10M
alpaca_chinese_dataset	MIT	About 21k manually verified Alpaca translations, supplemented with many Chinese-specific samples	>21k
COIG	Apache 2.0/MIT/CC-BY-SA-4.0	Multiple sub-datasets totaling 191,191 instruction samples	191,191
MOSS	cc-by-4.0	moss-002-sft-data (~590k Chinese dialogues), moss-003-sft-data (~1.1M dialogues)	590k/1.1M
HC3-Chinese	cc-by-sa-4.0	Human-ChatGPT comparison corpus	12,853
RefGPT-Fact-zh	Apache 2.0	50k Chinese multi-turn dialogues grounded in factual knowledge	50k
Safety-Prompts	Apache 2.0	100k Chinese safety-scenario prompts and ChatGPT responses	100k
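
The Count column mixes several notations (`16.9k`, `1.1M`, `3,571,636`, `>21k`, `0.5M/1M/2M/10M`). For sorting or summing the tables programmatically, a small normalizer along the following lines might help. This is a sketch only: `parse_count` is a hypothetical helper, and it assumes `k` means thousand, `M` means million, and `/` separates size variants of the same dataset.

```python
import re

_SUFFIX = {"k": 1_000, "m": 1_000_000}


def parse_count(cell: str) -> list:
    """Parse one Count cell into a list of integers, one per listed variant.

    A leading '>' or '~' is accepted but dropped, so '>21k' becomes 21000.
    """
    values = []
    for part in cell.split("/"):
        match = re.fullmatch(r"[>~]?\s*([\d.,]+)\s*([kKmM]?)", part.strip())
        if match is None:
            raise ValueError(f"unrecognized count: {part!r}")
        number = float(match.group(1).replace(",", ""))
        scale = _SUFFIX.get(match.group(2).lower(), 1)
        values.append(round(number * scale))
    return values


print(parse_count("16.9k"))
print(parse_count("0.5M/1M/2M/10M"))
print(parse_count("3,571,636"))
```

Since the `>` marker is discarded, values parsed from cells such as `>21k` are best treated as lower bounds.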