g-ronimo/oasst2_top4k_en

The dataset contains two primary features: messages, each comprising the sub‑features content and role. It is split into a training set with 4,000 samples and a test set with 400 samples. The data were selected from top‑ranked dialogues in OpenAssistant/oasst2, followed by deduplication and similarity filtering (long answers with similarity > 0.8 were excluded). The dataset includes only English content and was generated using a specific script.

Updated 3/5/2024

hugging_face

Description

Dataset Overview

Dataset Information

Features:
- messages:
  - content: data type is string
  - role: data type is string
Splits:
- train:
  - Bytes: 7,744,472.411884111
  - Samples: 4,000
- test:
  - Bytes: 774,447.2411884111
  - Samples: 400
Download size: 4,492,003 bytes
Dataset size: 8,518,919.653072523 bytes

Configuration

Default configuration:
- data_files:
  - train: path is data/train-*
  - test: path is data/test-*

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Please login to view download links and access full dataset details.

Topics

Dialogue Generation

Natural Language Processing

Source

Organization: hugging_face

Created: Unknown

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.

Check Prices →