
alpaca-chinese-dataset

This dataset comprises a mixed Chinese‑English corpus designed for bilingual fine‑tuning and ongoing data correction. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset rectifies those problems, translates the corrected samples into Chinese, and manually rewrites instructions where literal translation leads to loss of rhyme, tense inconsistencies, or other nuances. It focuses on: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.

Updated 3/27/2023
github

Description

alpaca‑chinese‑dataset Overview

Dataset Description

  • Name: Alpaca Chinese instruction fine‑tuning dataset

Data Generation Methods

  • Machine Translation
  • self‑instruct

Data Format

  • Same JSON format as the original Alpaca data

Data Samples

[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  },
  {
    "instruction": "Output a list of different kinds of fruit",
    "input": "",
    "output": "1. Apple\n2. Banana\n3. Orange\n4. Mango\n5. Strawberry\n6. Grape\n7. Blueberry\n8. Cherry\n9. Kiwi\n10. Melon\n11. Pineapple\n12. Plum\n13. Peach"
  },
  {
    "instruction": "Find the hidden message in the string",
    "input": "业余咖啡",
    "output": "The hidden message is \"Coffee‑Lover\"."
  }
]
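Because the records follow the Alpaca layout (a JSON array of objects with "instruction", "input", and "output" string fields, where "input" may be empty), a quick structural check can be run before fine-tuning. The following is a minimal sketch, not part of the dataset's tooling: the helper names load_records and validate_records are hypothetical, and the inline SAMPLE string stands in for whatever file the dataset is actually distributed as.

```python
import json

# Field names taken from the samples shown above.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_records(text):
    """Parse a JSON array of Alpaca-style records (hypothetical helper)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


def validate_records(records):
    """Return (index, problem) tuples for records that break the format."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not an object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append((i, f"missing key: {key}"))
            elif not isinstance(rec[key], str):
                problems.append((i, f"non-string value: {key}"))
        # "input" may legitimately be empty; "instruction" should not be.
        instruction = rec.get("instruction")
        if isinstance(instruction, str) and not instruction.strip():
            problems.append((i, "empty instruction"))
    return problems


# Two of the samples above, embedded as a raw string so the JSON
# escape sequences (\") reach the parser unchanged.
SAMPLE = r'''
[
  {
    "instruction": "Determine the nearest airport given specific coordinates.",
    "input": "40.728157, -73.794853",
    "output": "The nearest airport to the given coordinates is LaGuardia Airport (LGA) in New York City."
  },
  {
    "instruction": "Find the hidden message in the string",
    "input": "业余咖啡",
    "output": "The hidden message is \"Coffee‑Lover\"."
  }
]
'''

if __name__ == "__main__":
    records = load_records(SAMPLE)
    print(len(records), "records,", len(validate_records(records)), "problems")
```

Note that inner double quotes in a field must be escaped as \" for the file to parse; an unescaped quote inside "output" makes json.loads raise an error before any per-record check can run.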



Topics

Instruction Fine‑tuning
Machine Translation

Source

Organization: github

Created: 3/19/2023
