OpenAssistant/oasst2
The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.
Description
Dataset Overview
Dataset Name
- Name: Open Assistant Conversations Release 2 (OASST2)
- Alias: OASST2
Dataset Content
- Type: Dialogue dataset
- Structure: Contains message trees; each tree is rooted by an initial prompt message and may have multiple reply levels.
- Roles: Message roles are "assistant" and "prompter", alternating strictly throughout the dialogue.
Dataset Features
- Feature List:
- message_id (string)
- parent_id (string)
- user_id (string)
- created_date (string)
- text (string)
- role (string)
- lang (string)
- review_count (integer)
- review_result (boolean)
- deleted (boolean)
- rank (integer)
- synthetic (boolean)
- model_name (string)
- detoxify (struct with multiple toxicity scores, all float)
- message_tree_id (string)
- tree_state (string)
- emojis (sequence of name and count, string and integer)
- labels (sequence of name, value, count; string, float, integer)
Dataset Splits
- Training Set: 128,575 samples, size 158,850,455 bytes
- Validation Set: 6,599 samples, size 7,963,122 bytes
Dataset Size
- Download Size: 66,674,129 bytes
- Dataset Size: 166,813,577 bytes
Supported Languages
- Supports multiple languages, including but not limited to English, Spanish, Russian, German, Polish, Thai, etc.
Dataset Files
- Message Tree File:
.trees.jsonl.gz - Flattened Message File:
.messages.jsonl.gz
Dataset Status
- Ready‑to‑Export: 13,854 trees, total 135,174 messages
- Full: 70,642 trees, total 208,584 messages
Additional Exports
- Garbage Messages: 19,296 matching messages
- Prompt Messages: 64,592 matching messages
Using HuggingFace Datasets
- Provides training and validation splits that can be directly loaded via HuggingFace Datasets.
Data Visualization
- Data visualized using Bunka technology, offering interactive map exploration of the content.
AI studio
Generate PPTs instantly with Nano Banana Pro.
Generate PPT NowAccess Dataset
Please login to view download links and access full dataset details.
Topics
Source
Organization: hugging_face
Created: Unknown
Power Your Data Analysis with Premium AI Models
Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.
Enjoy a free trial and save 20%+ compared to official pricing.