Dataset asset | Open Source Community | Social QA | Preference Datasets

zhihu_rlhf_3k

Over 3k human preference records derived from Zhihu Q&A. Each question comes with a pair of answers that have different upvote counts.
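As a rough illustration of how such upvote pairs could be turned into preference records, here is a minimal Python sketch. The field names (`question`, `answer_a`, `upvotes_a`, and so on) and the `prompt`/`chosen`/`rejected` output layout are assumptions made for this example, not the dataset's documented schema. The sketch also assumes that the answer with more upvotes is the preferred one.

```python
from dataclasses import dataclass


@dataclass
class AnswerPair:
    """One record: a question plus two answers and their upvote counts.

    Field names are hypothetical; the real dataset schema may differ.
    """
    question: str
    answer_a: str
    upvotes_a: int
    answer_b: str
    upvotes_b: int


def to_preference(pair: AnswerPair) -> dict:
    """Label the answer with more upvotes as 'chosen', the other as 'rejected'."""
    if pair.upvotes_a == pair.upvotes_b:
        # Equal counts carry no preference signal under this assumption.
        raise ValueError("upvote counts must differ to define a preference")
    if pair.upvotes_a > pair.upvotes_b:
        chosen, rejected = pair.answer_a, pair.answer_b
    else:
        chosen, rejected = pair.answer_b, pair.answer_a
    return {"prompt": pair.question, "chosen": chosen, "rejected": rejected}


# Made-up example record, for illustration only.
example = AnswerPair(
    question="How do I start learning Python?",
    answer_a="Pick a small project and read the official tutorial.",
    upvotes_a=120,
    answer_b="Just memorize the syntax.",
    upvotes_b=7,
)
record = to_preference(example)
```

Check the actual file format in the linked source before adapting this. The column names and any tie handling would need to follow the real data.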

Source: GitHub
Created: Apr 25, 2023
Updated: Apr 10, 2024
Overview

Dataset description and usage context

Preference Data

| Name | License | Description | Count |
| --- | --- | --- | --- |
| zhihu_rlhf_3k | cc-by-2.0 | Over 3k human preference records derived from Zhihu Q&A; each question comes with a pair of answers that have different upvote counts | 3k |
| huozi_rlhf_data | Apache 2.0 | 16.9k manually annotated preferences (responses from huozi-1.0) | 16.9k |
| chatbot_arena_conversations | cc | 33K cleaned dialogues containing paired human preferences, collected from 13K unique IP addresses | 33k |

Manual Annotations

| Name | License | Description | Count |
| --- | --- | --- | --- |
| ruozhiba | Apache 2.0 | Inspired by COIG-CQIA; a similar dataset with a more concise answer style | 1.5k |
| COIG-CQIA | | Open-source, high-quality instruction-tuning dataset intended to provide the Chinese NLP community with high-quality instruction data | 46k |
| OL-CC | Apache 2.0 | Crowdsourced, human-generated open-source Chinese dialogue instruction set containing 10k+ "instruction-answer" pairs | 11.6k |

NLP Task Data Transformations

| Name | License | Description | Count |
| --- | --- | --- | --- |
| firefly-train-1.1M | none | Built from 23 common Chinese datasets using manually written instruction templates | 1.1M |
| pCLUE | none | Derived from 9 datasets (tnews, ocnli, etc.) with 73 prompts | 1.2M |
| xP3mt_zh | apache-2.0 | Chinese version obtained by translating the original English xP3 dataset | 3,571,636 |

LLM‑Generated Data

| Name | License | Description | Count |
| --- | --- | --- | --- |
| alpaca_gpt4_data_zh_52k | Apache 2.0 | Data generated by GPT-4 using Chinese prompts | 52k |
| alpaca_data_zh_51k | Apache 2.0 | Chinese Alpaca data containing 51k instruction samples collected from ChatGPT (gpt-3.5-turbo) | 51k |
| BELLE | gpl-3.0 | Chinese dataset generated following the Stanford Alpaca methodology | 0.5M/1M/2M/10M |
| alpaca_chinese_dataset | MIT | ~21k manually verified translated Alpaca samples, enriched with many Chinese-specific samples | >21k |
| COIG | Apache 2.0/MIT/CC-BY-SA-4.0 | Multiple sub-datasets totaling 191,191 instruction samples | 191,191 |
| MOSS | cc-by-4.0 | moss-002-sft-data (~590k Chinese dialogues), moss-003-sft-data (~1.1M dialogues) | 590k/1.1M |
| HC3-Chinese | cc-by-sa-4.0 | Human-ChatGPT comparison corpus | 12,853 |
| RefGPT-Fact-zh | Apache 2.0 | Multi-turn dialogue dataset containing 50k Chinese factual knowledge turns | 50k |
| Safety-Prompts | Apache 2.0 | 100k Chinese safety-scenario prompts and ChatGPT responses | 100k |