zhihu_rlhf_3k

Over 3k human preference records derived from Zhihu Q&A. Each question comes with a pair of answers that received different upvote counts.

Updated 4/10/2024
github

Description

Preference Data

Name | License | Description | Count
zhihu_rlhf_3k | cc-by-2.0 | Over 3k human preference records derived from Zhihu Q&A; each question comes with a pair of answers that received different upvote counts | 3k
huozi_rlhf_data | Apache 2.0 | 16.9k manually annotated preferences (responses from huozi-1.0) | 16.9k
chatbot_arena_conversations | cc | 33K cleaned dialogues containing paired human preferences, collected from 13K unique IP addresses | 33k
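
For preference data such as zhihu_rlhf_3k, where each question has a pair of answers with different upvote counts, the upvote gap can be turned into a chosen/rejected record. The following is a minimal sketch under stated assumptions, not the dataset's actual schema: the `Answer` class, the `to_preference_pair` helper, and the `prompt`/`chosen`/`rejected` field names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    # Hypothetical container; the real dataset's field names may differ.
    text: str
    upvotes: int


def to_preference_pair(question: str, a: Answer, b: Answer) -> dict:
    """Build a prompt/chosen/rejected record, preferring the higher-upvoted answer."""
    if a.upvotes == b.upvotes:
        # Equal counts carry no preference signal, so reject them explicitly.
        raise ValueError("answers must have differing upvote counts")
    chosen, rejected = (a, b) if a.upvotes > b.upvotes else (b, a)
    return {"prompt": question, "chosen": chosen.text, "rejected": rejected.text}


record = to_preference_pair(
    "How do I learn to cook?",
    Answer("Start with simple recipes.", upvotes=120),
    Answer("Just order takeout.", upvotes=3),
)
print(record)
```

Treating the higher upvote count as the preferred answer is itself an assumption: upvotes also reflect posting time and author visibility, so a pipeline built on this idea may want a minimum gap between the two counts.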

Manual Annotations

Name | License | Description | Count
ruozhiba | Apache 2.0 | Inspired by COIG-CQIA; a similar dataset built with a more concise answer style | 1.5k
COIG-CQIA | (not listed) | Open-source, high-quality instruction-tuning dataset aimed at providing the Chinese NLP community with high-quality instruction data | 46k
OL-CC | Apache 2.0 | Crowdsourced, human-generated open-source Chinese dialogue instruction set containing 10k+ "instruction-answer" pairs | 11.6k

NLP Task Data Transformations

Name | License | Description | Count
firefly-train-1.1M | none | Constructed on 23 common Chinese datasets by manually writing various instruction templates | 1.1M
pCLUE | none | Derived from 9 datasets (tnews, ocnli, etc.) with 73 prompts | 1.2M
xP3mt_zh | apache-2.0 | Chinese version obtained by translating the original English xP3 dataset | 3,571,636
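
The datasets in this group convert existing NLP task examples into instruction data by filling hand-written templates. The sketch below illustrates that general idea for a topic classification example. It is an assumption-laden illustration only: the `TEMPLATES` list, the `to_instruction` helper, and the `instruction`/`output` field names are hypothetical and are not taken from firefly-train-1.1M or pCLUE.

```python
import random

# Hypothetical hand-written templates; real projects use many more per task.
TEMPLATES = [
    "Classify the topic of this headline: {text}\nOptions: {labels}",
    "Which of these categories fits best: {labels}?\nHeadline: {text}",
]


def to_instruction(text: str, label: str, labels: list[str], rng: random.Random) -> dict:
    """Turn one labeled classification example into an instruction/output pair."""
    template = rng.choice(TEMPLATES)  # vary the phrasing across examples
    instruction = template.format(text=text, labels=", ".join(labels))
    return {"instruction": instruction, "output": label}


rng = random.Random(0)  # seeded so the template choice is reproducible
sample = to_instruction(
    "Central bank cuts rates", "finance", ["finance", "sports", "tech"], rng
)
print(sample["instruction"])
print(sample["output"])
```

Sampling among several templates per task is one way to keep a model from overfitting to a single phrasing of the instruction.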

LLM‑Generated Data

Name | License | Description | Count
alpaca_gpt4_data_zh_52k | Apache 2.0 | Data generated by GPT-4 using Chinese prompts | 52k
alpaca_data_zh_51k | Apache 2.0 | Chinese Alpaca data containing 51k instruction samples scraped from ChatGPT (gpt-3.5-turbo) | 51k
BELLE | gpl-3.0 | Chinese dataset generated following Stanford Alpaca methodology | 0.5M/1M/2M/10M
alpaca_chinese_dataset | MIT | Manually verified ~21k Alpaca translation data, enriched with many Chinese-specific samples | >21k
COIG | Apache 2.0/MIT/CC-BY-SA-4.0 | Multiple sub-datasets totaling 191,191 instruction samples | 191,191
MOSS | cc-by-4.0 | moss-002-sft-data (~590k Chinese dialogues), moss-003-sft-data (~1.1M dialogues) | 590k/1.1M
HC3-Chinese | cc-by-sa-4.0 | Human-ChatGPT comparison corpus | 12,853
RefGPT-Fact-zh | Apache 2.0 | Multi-turn dialogue dataset containing 50k Chinese factual knowledge turns | 50k
Safety-Prompts | Apache 2.0 | 100k Chinese safety-scenario prompts and ChatGPT responses | 100k

Topics

Social QA
Preference Datasets

Source

Organization: github

Created: 4/25/2023
