IBM/doc2dial
Doc2dial is a document‑grounded goal‑oriented dialogue dataset containing more than 4,500 annotated dialogues (approximately 14 turns per dialogue) based on over 450 documents from four domains. Compared with earlier document‑based dialogue corpora, Doc2dial covers a wider range of information‑seeking scenarios. Supported tasks include question answering, and the dataset is monolingual (English). Its structure comprises dialogue, document, and reading‑comprehension domains, each with detailed field descriptions.
Description
Dataset Overview
Dataset Name: doc2dial
Language: English
License: CC‑BY‑3.0
Multilinguality: Monolingual
Size: 1K < n < 10K
Source Dataset: Original
Task Category: Question Answering
Task ID: closed-domain-qa
Dataset Structure
Data Instances
- dialogue_domain: contains fields such as dialogue ID, document ID, domain, and dialogue turns.
- document_domain: contains document ID, domain, HTML content, and plain‑text content.
- doc2dial_rc: contains ID, title, context, question, answer, and domain.
Data Fields
-
dialogue_domain:
dial_id: dialogue IDdoc_id: associated document IDdomain: domainturns: list of dialogue turns, each withturn_id,role,da,references,utterance, etc.
-
document_domain:
doc_id: document IDtitle: document titledomain: domaindoc_text: plain‑text contentdoc_html_ts: HTML content with annotated spansdoc_html_raw: raw HTML contentspans: all spans in the document, each with IDs, start/end offsets, text, section information, titles, and parent titles.
-
doc2dial_rc:
id: IDtitle: titlecontext: contextquestion: questionanswers: list of answers withtextandanswer_startdomain: domain
Data Splits
-
dialogue_domain:
train: 3,474 instances (size: 6,924,209 bytes)validation: 661 instances (size: 1,315,815 bytes)
-
document_domain:
train: 3,416 instances (size: 204,874,908 bytes)
-
doc2dial_rc:
validation: 3,972 instances (size: 22,705,288 bytes)train: 20,431 instances (size: 114,778,994 bytes)
Dataset Creation
- Annotation Creators: expert generated
- Language Creators: discovery
Usage Notes
- The dataset contains personal and sensitive information; use with caution.
- Potential bias and known limitations exist; consider social impact when using.
AI studio
Generate PPTs instantly with Nano Banana Pro.
Generate PPT NowAccess Dataset
Please login to view download links and access full dataset details.
Topics
Source
Organization: hugging_face
Created: Unknown
Power Your Data Analysis with Premium AI Models
Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.
Enjoy a free trial and save 20%+ compared to official pricing.