Dataset asset · Open Source Community · Machine Learning · Natural Language Processing

Starlento/DPO-En-Zh-20k-handbook

This dataset is a rearranged version of the original DPO-En-Zh-20k dataset, split into 9,900 English + 9,900 Chinese samples for training and 100 + 100 for testing. Each record contains a language field, a prompt, and chosen and rejected responses (each with content and role fields), making it suitable for text-generation and question-answering tasks in both Chinese and English.

Source
Hugging Face
Created
Nov 28, 2025
Updated
May 2, 2024
Signals
86 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • DPO-En-Zh-20k-handbook

Dataset Features

  • language: string type
  • prompt: string type
  • rejected: list type, includes
    • content: string type
    • role: string type
  • chosen: list type, includes
    • content: string type
    • role: string type
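The feature list above can be illustrated with a minimal sketch of one record's shape. The field values below are invented placeholders (not actual dataset samples); only the field names and types follow the listed schema.

```python
# A hypothetical record matching the listed features: language and prompt
# are strings; chosen and rejected are lists of {content, role} dicts.
record = {
    "language": "en",
    "prompt": "What is preference optimization?",
    "chosen": [
        {"content": "A helpful, detailed answer.", "role": "assistant"},
    ],
    "rejected": [
        {"content": "A terse, low-quality answer.", "role": "assistant"},
    ],
}

def validate(rec):
    """Check that a record conforms to the schema above."""
    assert isinstance(rec["language"], str)
    assert isinstance(rec["prompt"], str)
    for key in ("chosen", "rejected"):
        assert isinstance(rec[key], list)
        for turn in rec[key]:
            assert isinstance(turn["content"], str)
            assert isinstance(turn["role"], str)
    return True

print(validate(record))  # True
```

This pairing of a chosen and a rejected response per prompt is the standard layout consumed by DPO-style preference-tuning pipelines.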

Dataset Splits

  • test: 200 samples, 1,354,176 bytes
  • train: 19,800 samples, 107,311,936 bytes

Dataset Size

  • Download size: 60,064,620 bytes
  • Dataset size: 108,666,112 bytes

Configuration Information

  • config_name: default
  • data_files:
    • test: path data/test-*
    • train: path data/train-*
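The data_files entries above are glob patterns that map shard files to splits. As a rough sketch (the shard filenames below are hypothetical examples, not taken from the dataset), pattern matching resolves a file path to its split like this:

```python
from fnmatch import fnmatch

# The split-to-pattern mapping from the default configuration.
data_files = {"test": "data/test-*", "train": "data/train-*"}

def split_for(path, patterns=data_files):
    """Return the split whose glob pattern matches the given file path."""
    for split, pattern in patterns.items():
        if fnmatch(path, pattern):
            return split
    return None

print(split_for("data/train-00000-of-00001.parquet"))  # train
print(split_for("data/test-00000-of-00001.parquet"))   # test
```

Loaders such as the Hugging Face `datasets` library apply this kind of pattern resolution automatically when the configuration is read.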

Task Categories

  • Text Generation
  • Question Answering

Languages

  • Chinese
  • English

Tags

  • dpo

Size Category

  • 10K<n<100K