JUHE API Marketplace
DATASET
Open Source Community

OpenAssistant/oasst2

The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.

Updated 1/11/2024
hugging_face

Description

Dataset Overview

Dataset Name

  • Name: Open Assistant Conversations Release 2 (OASST2)
  • Alias: OASST2

Dataset Content

  • Type: Dialogue dataset
  • Structure: Contains message trees; each tree is rooted by an initial prompt message and may have multiple reply levels.
  • Roles: Message roles are "assistant" and "prompter", alternating strictly throughout the dialogue.

Dataset Features

  • Feature List:
    • message_id (string)
    • parent_id (string)
    • user_id (string)
    • created_date (string)
    • text (string)
    • role (string)
    • lang (string)
    • review_count (integer)
    • review_result (boolean)
    • deleted (boolean)
    • rank (integer)
    • synthetic (boolean)
    • model_name (string)
    • detoxify (struct with multiple toxicity scores, all float)
    • message_tree_id (string)
    • tree_state (string)
    • emojis (sequence of name and count, string and integer)
    • labels (sequence of name, value, count; string, float, integer)

Dataset Splits

  • Training Set: 128,575 samples, size 158,850,455 bytes
  • Validation Set: 6,599 samples, size 7,963,122 bytes

Dataset Size

  • Download Size: 66,674,129 bytes
  • Dataset Size: 166,813,577 bytes

Supported Languages

  • Supports multiple languages, including but not limited to English, Spanish, Russian, German, Polish, Thai, etc.

Dataset Files

  • Message Tree File: .trees.jsonl.gz
  • Flattened Message File: .messages.jsonl.gz

Dataset Status

  • Ready‑to‑Export: 13,854 trees, total 135,174 messages
  • Full: 70,642 trees, total 208,584 messages

Additional Exports

  • Garbage Messages: 19,296 matching messages
  • Prompt Messages: 64,592 matching messages

Using HuggingFace Datasets

  • Provides training and validation splits that can be directly loaded via HuggingFace Datasets.

Data Visualization

  • Data visualized using Bunka technology, offering interactive map exploration of the content.

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Login to Access

Please login to view download links and access full dataset details.

Topics

Dialogue Systems
Text Generation

Source

Organization: hugging_face

Created: Unknown

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.