alpaca-chinese-dataset
This dataset is a mixed Chinese‑English corpus for bilingual fine‑tuning, and its data is still being corrected. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset fixes those problems, translates the corrected samples into Chinese, and manually rewrites instructions where literal translation loses rhyme, breaks tense consistency, or drops other nuances. The work focuses on: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.
Dataset description and usage context
alpaca‑chinese‑dataset Overview
Dataset Description
- Name: Alpaca Chinese instruction fine‑tuning dataset
Data Generation Methods
- Machine Translation
- self‑instruct
Data Format
- Same JSON format as the original Alpaca data
Data Samples
```json
[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  },
  {
    "instruction": "Output a list of different kinds of fruit",
    "input": "",
    "output": "1. Apple\n2. Banana\n3. Orange\n4. Mango\n5. Strawberry\n6. Grape\n7. Blueberry\n8. Cherry\n9. Kiwi\n10. Melon\n11. Pineapple\n12. Plum\n13. Peach"
  },
  {
    "instruction": "Find the hidden message in the string",
    "input": "业余咖啡",
    "output": "The hidden message is \"Coffee‑Lover\"."
  }
]
```
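As a rough illustration of how records in this format might be consumed, the sketch below parses two of the samples above with Python's standard `json` module, checks that each record carries the three string fields, and joins `instruction` and `input` into a single prompt. The helper names `validate_records` and `build_prompt`, and the prompt layout itself, are assumptions made for this example and are not part of the dataset.

```python
import json

# Field names taken from the samples above.
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_records(records):
    """Return the records unchanged; raise ValueError on a malformed one."""
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        if not all(isinstance(record[field], str) for field in REQUIRED_FIELDS):
            raise ValueError(f"record {index} has a non-string field")
    return records


def build_prompt(record):
    """Join instruction and optional input (hypothetical layout, not prescribed by the dataset)."""
    if record["input"]:
        return f"{record['instruction']}\n\n{record['input']}"
    return record["instruction"]


# Two of the samples shown above, kept as raw JSON text so the escapes stay visible.
raw = r"""
[
  {"instruction": "Output a list of different kinds of fruit", "input": "", "output": "1. Apple\n2. Banana\n3. Orange\n4. Mango\n5. Strawberry\n6. Grape\n7. Blueberry\n8. Cherry\n9. Kiwi\n10. Melon\n11. Pineapple\n12. Plum\n13. Peach"},
  {"instruction": "Find the hidden message in the string", "input": "业余咖啡", "output": "The hidden message is \"Coffee‑Lover\"."}
]
"""

records = validate_records(json.loads(raw))
prompts = [build_prompt(record) for record in records]
```

Quotation marks inside a JSON string value must be escaped as `\"`, as in the last record; otherwise `json.loads` rejects the file.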