alpaca-chinese-dataset
This dataset is a mixed Chinese‑English corpus intended for bilingual fine‑tuning, and it continues to receive data corrections. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset corrects those problems, translates the corrected samples into Chinese, and manually rewrites instructions where a literal translation would lose rhyme, introduce tense inconsistencies, or drop other nuances. The work covers five areas: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.
Description
alpaca‑chinese‑dataset Overview
Dataset Description
- Name: Alpaca Chinese instruction fine‑tuning dataset
Data Generation Methods
- Machine Translation
- self‑instruct
Data Format
- Same JSON format as the original Alpaca data
Data Samples
[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  },
  {
    "instruction": "Output a list of different kinds of fruit",
    "input": "",
    "output": "1. Apple\n2. Banana\n3. Orange\n4. Mango\n5. Strawberry\n6. Grape\n7. Blueberry\n8. Cherry\n9. Kiwi\n10. Melon\n11. Pineapple\n12. Plum\n13. Peach"
  },
  {
    "instruction": "Find the hidden message in the string",
    "input": "业余咖啡",
    "output": "The hidden message is \"Coffee‑Lover\"."
  }
]
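Records in this format can be sanity-checked before fine‑tuning. The following is a minimal sketch, not part of the dataset's own tooling: it assumes each record is a JSON object with string-valued `instruction`, `input`, and `output` fields, as in the samples above. The names `REQUIRED_KEYS`, `validate_records`, and `SAMPLE` are hypothetical and introduced here only for illustration.

```python
import json

# Assumption: every record carries these three string fields,
# matching the Alpaca-style samples shown above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of (index, problem) tuples; an empty list means no problems found."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append((i, f"missing field: {key}"))
            elif not isinstance(rec[key], str):
                problems.append((i, f"field is not a string: {key}"))
        # "input" may legitimately be empty, but an empty instruction is suspicious.
        instruction = rec.get("instruction")
        if isinstance(instruction, str) and not instruction.strip():
            problems.append((i, "empty instruction"))
    return problems


# A single record copied from the samples above, used as illustrative input.
SAMPLE = """[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  }
]"""

if __name__ == "__main__":
    records = json.loads(SAMPLE)
    for index, problem in validate_records(records):
        print(f"record {index}: {problem}")
```

A check like this would also surface a record whose `output` contains unescaped double quotes, since `json.loads` raises `json.JSONDecodeError` on such input before validation starts.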
Source
Organization: github
Created: 3/19/2023