Dataset asset, Open Source Community, Multi-domain Data, Machine Learning Fine-tuning

infCapital/viet-llama2-ft

Dataset mix from

  • databricks/databricks-dolly-15k
  • ewof/alpaca-instruct-unfiltered
  • garage/bAInd_Open-Platypus
  • gbharti/finance-alpaca
  • Honkware/oasst1-alpaca
  • medical/chat
  • pankajmathur/WizardLM_Orca
  • teknium/GPTeacher-General-Instruct
  • LIMA
  • Chain-of-Thought
  • Dynosaur/dynosaur-full
  • nam194_vietnews
  • quora_chat
  • stackoverflow_chat

Source: hugging_face
Created: Nov 28, 2025
Updated: Sep 28, 2023
Overview

Dataset Information

Features

  • instruction: string
  • input: string
  • output: string
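
The three string fields follow the common instruction-tuning layout, so each record can be flattened into a single training text. The sketch below shows one way to do that. The Alpaca-style template and the sample record are assumptions made for illustration; the dataset card does not prescribe a prompt format.

```python
def build_prompt(record: dict) -> str:
    """Format one record as an Alpaca-style prompt (the template is an assumption)."""
    instruction = record["instruction"].strip()
    # `input` is optional context and may be an empty string.
    context = (record.get("input") or "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {"instruction": "Dịch sang tiếng Anh.", "input": "Xin chào", "output": "Hello"}
prompt = build_prompt(example)
# Supervised fine-tuning text: the prompt followed by the target output.
full_text = prompt + example["output"]
```

Dropping the "Input" section when the field is empty keeps prompts short for records that carry no extra context.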

Splits

  • train: 2,088,825,146 bytes, 1,932,833 samples

Size

  • Download size: 874,832,201 bytes
  • Dataset size: 2,088,825,146 bytes
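
As a rough sanity check, the figures above can be turned into per-example and compression estimates. This is plain arithmetic on the listed numbers; the variable names are illustrative.

```python
# Figures copied from the dataset metadata above.
NUM_BYTES = 2_088_825_146       # dataset size (train split)
DOWNLOAD_BYTES = 874_832_201    # download size
NUM_EXAMPLES = 1_932_833        # train examples

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES
dataset_gib = NUM_BYTES / 2**30

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download / dataset size:   {download_ratio:.3f}")
print(f"dataset size in GiB:       {dataset_gib:.2f}")
```

This works out to about 1.08 KB per example, with the download at roughly 42% of the dataset size.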

Configuration

  • default: data file path is data/train-*
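
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is publicly reachable. The helper name `load_train_stream` is hypothetical. Streaming is used here only because the train split is about 2 GB.

```python
def load_train_stream(repo_id: str = "infCapital/viet-llama2-ft"):
    """Return a streaming iterator over the train split (requires network access)."""
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True reads records on demand instead of downloading every shard first.
    return load_dataset(repo_id, name="default", split="train", streaming=True)
```

Iterating over the returned object should yield dictionaries with the `instruction`, `input`, and `output` keys listed under Features.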

Dataset Creation

  • The source-language datasets were translated into Vietnamese using the OpenAI GPT-3.5 API.
  • About 2% of the translations failed with errors and were skipped.
  • The remaining translations were merged into a single dataset for fine-tuning.
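
The translate, skip-on-error, and merge steps above can be sketched as follows. This is not the authors' pipeline: the `translate` callable stands in for whatever GPT-3.5 API call was actually used, and `fake_translate` is a stub invented so the example runs offline.

```python
from typing import Callable, Iterable

FIELDS = ("instruction", "input", "output")


def translate_and_merge(
    sources: Iterable[Iterable[dict]],
    translate: Callable[[str], str],
) -> tuple[list[dict], int]:
    """Translate every record, skip the ones that fail, and merge the rest."""
    merged, skipped = [], 0
    for source in sources:
        for record in source:
            try:
                # Empty fields are kept empty rather than sent to the translator.
                translated = {
                    f: translate(record[f]) if record.get(f) else "" for f in FIELDS
                }
            except Exception:
                # A failed translation drops the whole record.
                skipped += 1
                continue
            merged.append(translated)
    return merged, skipped


def fake_translate(text: str) -> str:
    """Stub translator (hypothetical): fails on text containing 'FAIL'."""
    if "FAIL" in text:
        raise RuntimeError("simulated translation error")
    return f"[vi] {text}"


source_a = [{"instruction": "Say hi", "input": "", "output": "Hi"}]
source_b = [{"instruction": "FAIL here", "input": "", "output": "x"}]
merged, skipped = translate_and_merge([source_a, source_b], fake_translate)
```

Passing the translator in as a parameter keeps the skip-and-merge logic testable without network access or API credentials.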

Important Notes

  • The dataset was translated by a machine-learning model and may contain errors or inaccuracies.
  • About 2% of the original data could not be processed automatically and was skipped.