Dataset asset | Open Source Community | Machine Translation | Instruction Fine-tuning

alpaca-chinese-dataset

This dataset is a mixed Chinese-English corpus intended for bilingual fine-tuning, with ongoing data correction. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset corrects those problems, translates the corrected samples into Chinese, and manually rewrites instructions where a literal translation would lose rhyme, break tense consistency, or drop other nuances. It focuses on: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.

Source: GitHub
Created: Mar 19, 2023
Updated: Mar 27, 2023
alpaca‑chinese‑dataset Overview

Dataset Description

  • Name: Alpaca Chinese instruction fine-tuning dataset

Data Generation Methods

  • Machine Translation
  • Self-Instruct

Data Format

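  • Same JSON format as the original Alpaca data: an array of objects, each with the string fields "instruction", "input" (may be empty), and "output"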
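A minimal sketch of how records in this format could be checked before fine-tuning. The function name `validate_records` and the specific checks are assumptions made for illustration; the dataset itself does not ship a validator, and the second record below is a deliberately broken, hypothetical example.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return (index, problem) tuples for records that break the Alpaca-style shape."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "record is not an object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append((i, f"missing field: {key}"))
            elif not isinstance(rec[key], str):
                problems.append((i, f"field is not a string: {key}"))
        # "input" may legitimately be empty, but an empty instruction is suspicious.
        instruction = rec.get("instruction")
        if isinstance(instruction, str) and not instruction.strip():
            problems.append((i, "empty instruction"))
    return problems


# One well-formed record and one hypothetical record with a missing field.
good = {"instruction": "Output a list of different kinds of fruit",
        "input": "", "output": "1. Apple\n2. Banana"}
bad = {"instruction": "Find the hidden message in the string", "output": "..."}

for index, problem in validate_records([good, bad]):
    print(f"record {index}: {problem}")
```

A check like this is cheap to run over the whole file and catches the kind of mislabeled or missing fields that the dataset description says were present in the original English data.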

Data Samples

```json
[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  },
  {
    "instruction": "Output a list of different kinds of fruit",
    "input": "",
    "output": "1. Apple\n2. Banana\n3. Orange\n4. Mango\n5. Strawberry\n6. Grape\n7. Blueberry\n8. Cherry\n9. Kiwi\n10. Melon\n11. Pineapple\n12. Plum\n13. Peach"
  },
  {
    "instruction": "Find the hidden message in the string",
    "input": "业余咖啡",
    "output": "The hidden message is \"Coffee‑Lover\"."
  }
]
```
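The samples above can be turned into training prompts along these lines. This is a sketch under assumptions: the source does not say which prompt template to use, so the code borrows the two templates published with the original Stanford Alpaca project, and `build_prompt` is a hypothetical helper name.

```python
# Templates as published with the original Stanford Alpaca project (assumed here;
# this dataset's page does not prescribe a template).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(record):
    """Pick the template based on whether the record has a non-empty input."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


airport = {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City.",
}
fruit = {
    "instruction": "Output a list of different kinds of fruit",
    "input": "",
    "output": "1. Apple\n2. Banana",
}

print(build_prompt(airport))
print(build_prompt(fruit))
```

During fine-tuning, the "output" field would typically be appended after "### Response:" as the target text; whether a Chinese-language template works better for the translated samples is left open here.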
