Dataset asset | Open Source Community | Natural Language Processing | Dialogue Systems

pfb30/multi_woz_v22

The Multi‑Domain Wizard‑of‑Oz (MultiWOZ) dataset is a fully annotated collection of written human‑human dialogues spanning multiple domains and topics. Version 2.1 fixes numerous annotation errors from the original release, while version 2.2 further corrects dialogue state errors, redefines the ontology, and introduces standardized slot‑span annotations. The dataset supports tasks such as dialogue modeling, intent‑state tracking, and dialogue act prediction. It is split into training, validation, and test sets containing 8,437, 1,000, and 1,000 dialogues respectively.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 18, 2024
Signals: 186 views
Availability: Linked source ready

Dataset Overview

Name: Multi‑domain Wizard‑of‑Oz (MultiWOZ)

Version: v2.2

Language: English (en)

License: Apache‑2.0

Multilinguality: Monolingual

Size: 10K < n < 100K

Source: Original data

Task Categories:

  • Text Generation (text-generation)
  • Fill‑Mask (fill-mask)
  • Token Classification (token-classification)
  • Text Classification (text-classification)

Specific Tasks:

  • Dialogue Modeling (dialogue-modeling)
  • Multi‑class Classification (multi-class-classification)
  • Parsing (parsing)

Dataset Information:

  • Config Name: v2.2
  • Features:
    • dialogue_id: unique identifier (string).
    • services: list of services mentioned (string sequence).
    • turns: sequence of dialogue turns, each containing:
      • turn_id: unique turn ID (string).
      • speaker: USER or SYSTEM (categorical).
      • utterance: text of the turn (string).
      • frames: intent and belief state (structured).
      • dialogue_acts: dialogue acts (structured).
  • Splits:
    • train: 8,437 examples, 68,222,649 bytes.
    • validation: 1,000 examples, 8,990,945 bytes.
    • test: 1,000 examples, 9,027,095 bytes.
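
The nesting of the features above can be sketched in Python. This is a minimal illustration, not the dataset's official loader: the sample record is invented for the example (and omits frames and dialogue_acts for brevity), and the helpers `iter_turns` and `speaker_name` are hypothetical names. It assumes that turns may arrive either as a list of per-turn dicts or as a dict of parallel lists (the layout the Hugging Face `datasets` library typically uses for sequence features), and that the categorical speaker is encoded as 0 for USER and 1 for SYSTEM, following the order listed above.

```python
from typing import Any, Dict, Iterator, List, Union

# Assumption: speaker labels are encoded in the order listed (USER, SYSTEM).
SPEAKER_NAMES = ["USER", "SYSTEM"]

Turns = Union[List[Dict[str, Any]], Dict[str, List[Any]]]


def iter_turns(turns: Turns) -> Iterator[Dict[str, Any]]:
    """Yield one dict per turn from either supported layout."""
    if isinstance(turns, dict):
        # Column-wise layout: {"turn_id": [...], "speaker": [...], ...}
        keys = list(turns)
        for values in zip(*(turns[k] for k in keys)):
            yield dict(zip(keys, values))
    else:
        # Row-wise layout: [{"turn_id": ..., "speaker": ...}, ...]
        yield from turns


def speaker_name(speaker: Union[int, str]) -> str:
    """Map an integer class label to its name; pass strings through."""
    return SPEAKER_NAMES[speaker] if isinstance(speaker, int) else speaker


# Invented record for illustration only; not taken from the dataset.
sample = {
    "dialogue_id": "EXAMPLE.json",
    "services": ["hotel"],
    "turns": {
        "turn_id": ["0", "1"],
        "speaker": [0, 1],
        "utterance": ["I need a hotel.", "Which area do you prefer?"],
    },
}

transcript = [
    f"{speaker_name(t['speaker'])}: {t['utterance']}"
    for t in iter_turns(sample["turns"])
]
print("\n".join(transcript))
```

Normalizing both layouts to one dict per turn keeps downstream code (for example, pairing each USER utterance with the following SYSTEM reply) independent of how the records were loaded.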

Dataset Size: 86,240,689 bytes

Download Size: 276,592,909 bytes
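
As a quick consistency check on the figures above, the per-split byte counts sum exactly to the reported dataset size, and the three splits together hold 10,437 dialogues. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Figures copied from the split listing and the Dataset Size line above.
splits = {
    "train": {"examples": 8_437, "bytes": 68_222_649},
    "validation": {"examples": 1_000, "bytes": 8_990_945},
    "test": {"examples": 1_000, "bytes": 9_027_095},
}
reported_dataset_size = 86_240_689

total_examples = sum(s["examples"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

print(total_examples)                        # 10437
print(total_bytes == reported_dataset_size)  # True
```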

Structure

Data Instances: Complete multi‑turn dialogues with annotations per turn.

Fields: dialogue_id, services, turns (including turn_id, speaker, utterance, frames, dialogue_acts).

Splits: train, validation, test.
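
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID `pfb30/multi_woz_v22` and the config name `v2.2` shown on this page resolve on the Hub. Depending on the library version and how the repository is hosted, the call may need extra arguments, so treat this as a starting point rather than a verified recipe. `load_split` is a hypothetical helper name.

```python
REPO_ID = "pfb30/multi_woz_v22"  # as listed on this page
CONFIG_NAME = "v2.2"             # as listed under Dataset Information
SPLITS = ("train", "validation", "test")


def load_split(split: str):
    """Download (or reuse the cached copy of) one split of the dataset."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split=split)


# Usage (requires network access on first call):
#   validation = load_split("validation")
#   print(len(validation))                # the page reports 1,000 dialogues
#   print(validation[0]["dialogue_id"])
```

Validating the split name before the import keeps typos such as "dev" from triggering a download attempt.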
