Back to datasets
Dataset assetOpen Source CommunityNatural Language ProcessingDialogue Generation

g-ronimo/oasst2_top4k_en

The dataset contains two primary features: messages, each comprising the sub‑features content and role. It is split into a training set with 4,000 samples and a test set with 400 samples. The data were selected from top‑ranked dialogues in OpenAssistant/oasst2, followed by deduplication and similarity filtering (long answers with similarity > 0.8 were excluded). The dataset includes only English content and was generated using a specific script.

Source
hugging_face
Created
Nov 28, 2025
Updated
Mar 5, 2024
Signals
115 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Information

  • Features:
    • messages:
      • content: data type is string
      • role: data type is string
  • Splits:
    • train:
      • Bytes: 7,744,472.411884111
      • Samples: 4,000
    • test:
      • Bytes: 774,447.2411884111
      • Samples: 400
  • Download size: 4,492,003 bytes
  • Dataset size: 8,518,919.653072523 bytes

Configuration

  • Default configuration:
    • data_files:
      • train: path is data/train-*
      • test: path is data/test-*
Need downstream help?

Pair the dataset with AI analysis and content workflows.

Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.

Explore AI studio