Dataset asset, Open Source Community, Dialogue Systems, Question Answering

IBM/doc2dial

Doc2dial is a document‑grounded, goal‑oriented dialogue dataset containing more than 4,500 annotated dialogues (about 14 turns per dialogue on average) grounded in over 450 documents from four domains. Compared with earlier document‑grounded dialogue corpora, Doc2dial covers a wider range of information‑seeking scenarios. Supported tasks include question answering, and the dataset is monolingual (English). It is organized into three subsets, dialogue_domain, document_domain, and doc2dial_rc (reading comprehension), each with detailed field descriptions.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 18, 2024

Dataset Overview

Dataset Name: doc2dial

Language: English

License: CC‑BY‑3.0

Multilinguality: Monolingual

Size: 1K < n < 10K

Source Dataset: Original

Task Category: Question Answering

Task ID: closed-domain-qa

Dataset Structure

Data Instances

  • dialogue_domain: contains fields such as dialogue ID, document ID, domain, and dialogue turns.
  • document_domain: contains document ID, domain, HTML content, and plain‑text content.
  • doc2dial_rc: contains ID, title, context, question, answers, and domain.
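
Because dialogue_domain records carry a document ID, a dialogue can be linked back to the document it is grounded in. The following is a minimal sketch of that lookup, assuming doc_id uniquely identifies a document; the records are hand-written placeholders (not real dataset rows), and the helper names index_documents and grounding_document are hypothetical.

```python
# Hand-written placeholder records; only the field names follow the listing above.
documents = [
    {"doc_id": "doc-a", "domain": "domain-1", "doc_text": "Renewals can be filed online."},
    {"doc_id": "doc-b", "domain": "domain-2", "doc_text": "Benefits start after approval."},
]
dialogues = [
    {"dial_id": "dial-1", "doc_id": "doc-b", "domain": "domain-2", "turns": []},
]


def index_documents(docs):
    """Map each doc_id to its document record (assumes doc_id is unique)."""
    return {doc["doc_id"]: doc for doc in docs}


def grounding_document(dialogue, doc_index):
    """Return the document a dialogue is grounded in, or None if it is missing."""
    return doc_index.get(dialogue["doc_id"])


doc_index = index_documents(documents)
doc = grounding_document(dialogues[0], doc_index)
print(doc["doc_text"])
```

Building the index once keeps each lookup constant-time, which matters when several thousand dialogues share a few hundred documents.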

Data Fields

  • dialogue_domain:

    • dial_id: dialogue ID
    • doc_id: associated document ID
    • domain: domain
    • turns: list of dialogue turns, each with turn_id, role, da, references, utterance, etc.
  • document_domain:

    • doc_id: document ID
    • title: document title
    • domain: domain
    • doc_text: plain‑text content
    • doc_html_ts: HTML content with annotated spans
    • doc_html_raw: raw HTML content
    • spans: all spans in the document, each with IDs, start/end offsets, text, section information, titles, and parent titles.
  • doc2dial_rc:

    • id: ID
    • title: title
    • context: context
    • question: question
    • answers: list of answers with text and answer_start
    • domain: domain
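
The doc2dial_rc fields resemble an extractive reading-comprehension layout, so one useful sanity check is whether each answer text really occurs in the context at its stated offset. The sketch below assumes that answer_start is a character offset into context and that answers is a list of dictionaries, as described above; a particular loader may expose the same data in a different shape. The sample record and the function name answer_matches_context are hypothetical.

```python
def answer_matches_context(record):
    """Return True if every answer text occurs in context at its answer_start."""
    context = record["context"]
    for answer in record["answers"]:
        start = answer["answer_start"]  # assumed: character offset into context
        text = answer["text"]
        if context[start:start + len(text)] != text:
            return False
    return True


# Invented example record; only the field names come from the listing above.
CONTEXT = "You can renew your licence online or by mail."
ANSWER = "online or by mail"

sample = {
    "id": "example-1",
    "title": "example-doc",
    "context": CONTEXT,
    "question": "How can I renew my licence?",
    "answers": [{"text": ANSWER, "answer_start": CONTEXT.index(ANSWER)}],
    "domain": "example-domain",
}

print(answer_matches_context(sample))
```

Running a check like this over a split before training helps catch offset drift introduced by text normalization or re-encoding.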

Data Splits

  • dialogue_domain:

    • train: 3,474 instances (size: 6,924,209 bytes)
    • validation: 661 instances (size: 1,315,815 bytes)
  • document_domain:

    • train: 3,416 instances (size: 204,874,908 bytes)
  • doc2dial_rc:

    • train: 20,431 instances (size: 114,778,994 bytes)
    • validation: 3,972 instances (size: 22,705,288 bytes)
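
To put the split sizes in proportion, the instance counts listed above can be turned into per-split shares. This is a small illustrative calculation using only those counts; the names SPLIT_SIZES and split_shares are hypothetical.

```python
# Instance counts copied from the Data Splits listing above.
SPLIT_SIZES = {
    "dialogue_domain": {"train": 3474, "validation": 661},
    "document_domain": {"train": 3416},
    "doc2dial_rc": {"train": 20431, "validation": 3972},
}


def split_shares(sizes):
    """Return each split's fraction of the total instance count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for config, sizes in SPLIT_SIZES.items():
    shares = split_shares(sizes)
    summary = ", ".join(f"{name}: {share:.1%}" for name, share in shares.items())
    print(f"{config}: {summary}")
```

For both dialogue_domain and doc2dial_rc this works out to roughly an 84/16 train/validation split, while document_domain ships as a single train split.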

Dataset Creation

  • Annotation Creators: expert generated
  • Language Creators: discovery

Usage Notes

  • The dataset contains personal and sensitive information; use with caution.
  • The dataset may contain biases and has known limitations; consider the social impact of any use.