Dataset asset · Open Source Community · Multi-domain Data · Machine Learning Fine-tuning

infCapital/viet-llama2-ft

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2088825146
    num_examples: 1932833
  download_size: 874832201
  dataset_size: 2088825146
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset mix from:
+ databricks/databricks-dolly-15k
+ ewof/alpaca-instruct-unfiltered
+ garage/bAInd_Open-Platypus
+ gbharti/finance-alpaca
+ Honkware/oasst1-alpaca
+ medical/chat
+ pankajmathur/WizardLM_Orca
+ teknium/GPTeacher-General-Instruct
+ LIMA
+ Chain-of-Thought
+ Dynosaur/dynosaur-full
+ nam194_vietnews
+ quora_chat
+ stackoverflow_chat

# Dataset Creation:
+ The source language dataset was translated into Vietnamese using the OpenAI GPT-3.5 API.
+ 2% of the translations got translation errors. These translations were skipped.
+ The remaining translations were merged into 1 main dataset for Fine-Tuning

# Important Notes:
+ This dataset was translated by a machine learning model, and may contain errors or inaccuracies.
+ 2% of the original dataset could not be processed automatically and were skipped.

Source

hugging_face

Created

Nov 28, 2025

Updated

Sep 28, 2023

Signals

42 views

Availability

Linked source ready

Overview

Dataset description and usage context

Dataset Information

Features

instruction: string
input: string
output: string
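
As a rough illustration of how the three string fields might be combined for fine-tuning, the sketch below renders one record into a single training string. The Alpaca-style template, the name `format_example`, and the sample row are assumptions made up for illustration; the dataset card does not specify a prompt format.

```python
from typing import TypedDict


class Record(TypedDict):
    """One row, mirroring the three string features listed above."""
    instruction: str
    input: str
    output: str


def format_example(rec: Record) -> str:
    """Render a record as one training string (hypothetical template)."""
    parts = [f"### Instruction:\n{rec['instruction'].strip()}"]
    # Assumption: `input` may be empty when the instruction needs no extra context.
    if rec["input"].strip():
        parts.append(f"### Input:\n{rec['input'].strip()}")
    parts.append(f"### Response:\n{rec['output'].strip()}")
    return "\n\n".join(parts)


# Made-up row for illustration; not taken from the dataset.
example: Record = {
    "instruction": "Dịch câu sau sang tiếng Anh.",
    "input": "Xin chào.",
    "output": "Hello.",
}
print(format_example(example))
```

Skipping the input section when the field is empty keeps prompts short for instruction-only rows; whether the dataset actually contains such rows would need to be checked against the data.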

Splits

train: 2,088,825,146 bytes, 1,932,833 samples

Size

Download size: 874,832,201 bytes
Dataset size: 2,088,825,146 bytes
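
The figures above allow a quick back-of-the-envelope check before downloading. The sketch below derives the average size per example and the download-to-dataset ratio from the numbers on the card. It assumes the two sizes follow the usual Hugging Face convention (downloaded files versus the prepared dataset), which the page itself does not state.

```python
# Values copied from the dataset card.
DATASET_BYTES = 2_088_825_146   # dataset_size
DOWNLOAD_BYTES = 874_832_201    # download_size
NUM_EXAMPLES = 1_932_833        # examples in the train split

# Average size of one instruction/input/output row.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the prepared size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"dataset size:   {DATASET_BYTES / 1e9:.2f} GB")
print(f"download size:  {DOWNLOAD_BYTES / 1e9:.2f} GB")
print(f"avg bytes/row:  {avg_bytes_per_example:.1f}")
print(f"download ratio: {download_ratio:.3f}")
```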

Configuration

default: data file path is data/train-*
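
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the ID shown at the top of this page. The helper names and the shard filename are hypothetical, and `fnmatch` only approximates how the Hub resolves the `data/train-*` pattern.

```python
from fnmatch import fnmatch

REPO_ID = "infCapital/viet-llama2-ft"
TRAIN_PATTERN = "data/train-*"  # path of the default config's train split


def load_train(streaming: bool = True):
    """Load the train split with the Hugging Face `datasets` library.

    The import is done lazily so the rest of this file runs without
    `datasets` installed. Streaming avoids downloading everything up front.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


def is_train_shard(path: str) -> bool:
    """Check whether a repository file path falls under the train glob."""
    return fnmatch(path, TRAIN_PATTERN)


# Hypothetical shard name, for illustration only.
print(is_train_shard("data/train-00000-of-00005.parquet"))
```

With `streaming=True`, `load_train()` returns an iterable dataset, so the first few rows can be inspected with a plain `for` loop before committing to a full download.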

Dataset Creation

The source-language datasets were translated into Vietnamese via the OpenAI GPT-3.5 API.
2% of the translations failed with errors and were skipped.
The remaining translations were merged into a single main dataset for fine-tuning.
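
The three steps above (translate, skip failures, merge) can be sketched as follows. This is not the authors' pipeline: `translate_and_merge` and `fake_translate` are hypothetical names, the translator is a local stand-in rather than a real OpenAI API call, and the sample rows are made up.

```python
from typing import Callable, Iterable

Record = dict[str, str]


def translate_and_merge(
    sources: Iterable[Iterable[Record]],
    translate: Callable[[str], str],
) -> tuple[list[Record], int]:
    """Translate every field of every record, skip failures, merge the rest."""
    merged: list[Record] = []
    skipped = 0
    for dataset in sources:
        for rec in dataset:
            try:
                # Empty fields are passed through without calling the translator.
                translated = {k: translate(v) if v else v for k, v in rec.items()}
            except Exception:
                # A failed translation drops the whole record.
                skipped += 1
                continue
            merged.append(translated)
    return merged, skipped


def fake_translate(text: str) -> str:
    """Stand-in translator: fails on a marker string, otherwise tags the text."""
    if "FAIL" in text:
        raise RuntimeError("translation error")
    return f"[vi] {text}"


# Two made-up source datasets; one record is designed to fail.
sources = [
    [{"instruction": "Say hi", "input": "", "output": "Hi"}],
    [
        {"instruction": "FAIL me", "input": "", "output": "x"},
        {"instruction": "Add", "input": "1+1", "output": "2"},
    ],
]
merged, skipped = translate_and_merge(sources, fake_translate)
print(len(merged), skipped)
```

Dropping the whole record when any field fails keeps instruction, input, and output in the same language, at the cost of losing rows that were only partly affected.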

Important Notes

The dataset was translated by a machine‑learning model and may contain errors or inaccuracies.
2% of the original data could not be processed automatically and was omitted.
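
Because the translations may contain errors, a spot check before training can be useful. The sketch below is a crude heuristic, not something the dataset card describes: it flags rows whose `output` contains no diacritics and no `đ`, which may indicate text that was left untranslated. It will also flag legitimate outputs such as code or numbers, and it cannot tell Vietnamese from other accented languages. The function names and sample rows are hypothetical.

```python
import unicodedata


def looks_vietnamese(text: str) -> bool:
    """Return True if the text contains đ/Đ or any letter with a diacritic."""
    for ch in text:
        if ch in "đĐ":
            return True
        # NFD splits an accented letter into a base letter plus combining marks.
        if any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch)):
            return True
    return False


def flag_suspect_rows(rows: list[dict[str, str]]) -> list[int]:
    """Indices of rows with a non-empty output that shows no Vietnamese marks."""
    return [
        i
        for i, row in enumerate(rows)
        if row["output"].strip() and not looks_vietnamese(row["output"])
    ]


# Made-up rows for illustration; not taken from the dataset.
rows = [
    {"instruction": "Chào hỏi", "input": "", "output": "Xin chào bạn."},
    {"instruction": "Chào hỏi", "input": "", "output": "Hello, friend."},
    {"instruction": "Đi đâu?", "input": "", "output": "đi ra ngoài"},
]
print(flag_suspect_rows(rows))
```

Flagged rows are candidates for manual review rather than automatic removal.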
