infCapital/viet-llama2-ft
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2088825146
    num_examples: 1932833
  download_size: 874832201
  dataset_size: 2088825146
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset mix from:
+ databricks/databricks-dolly-15k
+ ewof/alpaca-instruct-unfiltered
+ garage/bAInd_Open-Platypus
+ gbharti/finance-alpaca
+ Honkware/oasst1-alpaca
+ medical/chat
+ pankajmathur/WizardLM_Orca
+ teknium/GPTeacher-General-Instruct
+ LIMA
+ Chain-of-Thought
+ Dynosaur/dynosaur-full
+ nam194_vietnews
+ quora_chat
+ stackoverflow_chat

# Dataset Creation:
+ The source-language datasets were translated into Vietnamese using the OpenAI GPT-3.5 API.
+ 2% of the translations failed with translation errors and were skipped.
+ The remaining translations were merged into one main dataset for fine-tuning.

# Important Notes:
+ This dataset was translated by a machine learning model and may contain errors or inaccuracies.
+ 2% of the original dataset could not be processed automatically and was skipped.
Dataset Information
Features
- instruction: string
- input: string
- output: string
Splits
- train: 2,088,825,146 bytes, 1,932,833 examples
Size
- Download size: 874,832,201 bytes
- Dataset size: 2,088,825,146 bytes
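As a quick sanity check on the figures above, the following sketch derives the average example size and the ratio of download size to dataset size. The constants are copied from the card; the variable names are illustrative only.

```python
# Figures copied from the dataset card.
NUM_BYTES = 2_088_825_146      # dataset size of the train split, in bytes
NUM_EXAMPLES = 1_932_833       # number of examples in the train split
DOWNLOAD_SIZE = 874_832_201    # download size, in bytes

# Average size of one instruction/input/output record.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that actually has to be downloaded.
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size / dataset size: {download_ratio:.3f}")
```

This works out to roughly 1 KB per example, with the download taking a bit over 40% of the dataset size.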
Configuration
- default: train split, data file path is data/train-*
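A minimal sketch of how the three string features might be consumed, assuming the dataset is published on the Hugging Face Hub under the ID `infCapital/viet-llama2-ft` and that the `datasets` library is installed. The prompt template in `build_prompt` is a hypothetical Alpaca-style layout chosen for illustration; the card does not prescribe one.

```python
from typing import Dict


def build_prompt(record: Dict[str, str]) -> str:
    """Join `instruction` and the optional `input` into one prompt.

    The section markers are an assumed template, not part of the dataset.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def stream_train_split(repo_id: str = "infCapital/viet-llama2-ft"):
    """Stream the train split lazily instead of downloading every file first.

    The repo ID is an assumption taken from the page title.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset(repo_id, split="train", streaming=True)
```

With network access, `next(iter(stream_train_split()))` should yield one record as a dict, which can be passed to `build_prompt`; the `output` field then serves as the training target.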
Dataset Creation
- The source-language datasets were translated into Vietnamese via the OpenAI GPT-3.5 API.
- 2% of the translations failed with errors and were omitted.
- The remaining translations were merged into one main dataset for fine-tuning.
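The three steps above (translate, skip failures, merge) can be sketched as follows. This is not the authors' pipeline: `translate` is a hypothetical callable standing in for a call to a translation model, and the function and type names are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
FIELDS = ("instruction", "input", "output")


def translate_and_merge(
    sources: Iterable[Iterable[Record]],
    translate: Callable[[str], str],
) -> Tuple[List[Record], int]:
    """Translate every record, skip the ones that fail, merge the rest.

    Returns the merged records and the number of skipped records.
    """
    merged: List[Record] = []
    skipped = 0
    for source in sources:
        for record in source:
            try:
                # Empty or missing fields are kept as empty strings
                # rather than sent to the translator.
                translated = {
                    field: translate(record[field]) if record.get(field) else ""
                    for field in FIELDS
                }
            except Exception:
                # A failed translation drops the whole record.
                skipped += 1
                continue
            merged.append(translated)
    return merged, skipped
```

Dropping the whole record when any field fails keeps instruction, input, and output in the same language, at the cost of losing some data.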
Important Notes
- The dataset was translated by a machine-learning model and may contain errors or inaccuracies.
- 2% of the original data could not be processed automatically and was omitted.