JUHE API Marketplace
DATASET
Open Source Community

IBM/doc2dial

Doc2dial is a document‑grounded goal‑oriented dialogue dataset containing more than 4,500 annotated dialogues (approximately 14 turns per dialogue) based on over 450 documents from four domains. Compared with earlier document‑based dialogue corpora, Doc2dial covers a wider range of information‑seeking scenarios. Supported tasks include question answering, and the dataset is monolingual (English). Its structure comprises dialogue, document, and reading‑comprehension domains, each with detailed field descriptions.

Updated 1/18/2024
hugging_face

Description

Dataset Overview

Dataset Name: doc2dial

Language: English

License: CC‑BY‑3.0

Multilinguality: Monolingual

Size: 1K < n < 10K

Source Dataset: Original

Task Category: Question Answering

Task ID: closed-domain-qa

Dataset Structure

Data Instances

  • dialogue_domain: contains fields such as dialogue ID, document ID, domain, and dialogue turns.
  • document_domain: contains document ID, domain, HTML content, and plain‑text content.
  • doc2dial_rc: contains ID, title, context, question, answer, and domain.

Data Fields

  • dialogue_domain:

    • dial_id: dialogue ID
    • doc_id: associated document ID
    • domain: domain
    • turns: list of dialogue turns, each with turn_id, role, da, references, utterance, etc.
  • document_domain:

    • doc_id: document ID
    • title: document title
    • domain: domain
    • doc_text: plain‑text content
    • doc_html_ts: HTML content with annotated spans
    • doc_html_raw: raw HTML content
    • spans: all spans in the document, each with IDs, start/end offsets, text, section information, titles, and parent titles.
  • doc2dial_rc:

    • id: ID
    • title: title
    • context: context
    • question: question
    • answers: list of answers with text and answer_start
    • domain: domain

Data Splits

  • dialogue_domain:

    • train: 3,474 instances (size: 6,924,209 bytes)
    • validation: 661 instances (size: 1,315,815 bytes)
  • document_domain:

    • train: 3,416 instances (size: 204,874,908 bytes)
  • doc2dial_rc:

    • validation: 3,972 instances (size: 22,705,288 bytes)
    • train: 20,431 instances (size: 114,778,994 bytes)

Dataset Creation

  • Annotation Creators: expert generated
  • Language Creators: discovery

Usage Notes

  • The dataset contains personal and sensitive information; use with caution.
  • Potential bias and known limitations exist; consider social impact when using.

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Login to Access

Please login to view download links and access full dataset details.

Topics

Dialogue Systems
Question Answering

Source

Organization: hugging_face

Created: Unknown

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.