maveriq/DocBank
DocBank is a new large‑scale dataset constructed using weak supervision, designed to provide integrated text and layout information for downstream tasks. Currently, the DocBank dataset contains 500,000 pages of documents, with 400,000 for training, 50,000 for validation, and 50,000 for testing. Annotations are machine‑generated, the language is English, and the dataset is monolingual. Fields include image, token, bounding box, color, font, and label.
Description
Dataset Overview
Dataset Name
- Name: DocBank
Dataset Summary
- Summary: DocBank is a large‑scale dataset built with weak supervision to provide integrated text and layout information for downstream tasks. It contains 500,000 document pages, with 400,000 for training, 50,000 for validation, and 50,000 for testing.
Supported Tasks
- Task: Document AI (text and layout)
Language
- Language: English
Dataset Structure
- Data Instances: Information to be added.
- Data Fields:
  - image
  - token
  - bounding_box
  - color
  - font
  - label
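To make the field list more concrete, here is a minimal sketch of how one token-level record could be modeled and parsed. The class name `TokenRecord`, the helper `parse_token_line`, the tab-separated column layout, the coordinate order `(x0, y0, x1, y1)`, and the color order `(R, G, B)` are all assumptions made for illustration; the card does not specify them, and the sample values are invented. The `image` field is left out because it belongs to the page rather than to a single token.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TokenRecord:
    """One annotated token on a page (hypothetical container, not an official schema)."""
    token: str
    bounding_box: Tuple[int, int, int, int]  # assumed order: (x0, y0, x1, y1)
    color: Tuple[int, int, int]              # assumed order: (R, G, B)
    font: str
    label: str


def parse_token_line(line: str) -> TokenRecord:
    """Parse one tab-separated annotation line (assumed 10-column layout)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(parts)}")
    token, x0, y0, x1, y1, r, g, b, font, label = parts
    return TokenRecord(
        token=token,
        bounding_box=(int(x0), int(y0), int(x1), int(y1)),
        color=(int(r), int(g), int(b)),
        font=font,
        label=label,
    )


# Invented sample line, used only to exercise the parser.
example = parse_token_line("Abstract\t120\t85\t210\t102\t0\t0\t0\tCMBX12\tsection")
print(example.token, example.bounding_box, example.label)
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: machine-generated annotations are usually read, not edited, so an accidental write is more likely a bug than an intent.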
Data Splits
- Training Set: 400,000 instances
- Validation Set: 50,000 instances
- Test Set: 50,000 instances
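As a quick sanity check on the numbers above, the three splits sum to the stated 500,000 pages and correspond to an 80/10/10 partition. The short sketch below only does that arithmetic; the dictionary keys `train`, `validation`, and `test` are labels chosen for this example, not confirmed split identifiers.

```python
# Split sizes as listed in the card; the key names are illustrative labels.
splits = {"train": 400_000, "validation": 50_000, "test": 50_000}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count:,} pages ({fractions[name]:.0%} of {total:,})")
```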
Dataset Creation
- License: Apache 2.0
- Contributors: @doc-analysis
Citation
@misc{li2020docbank,
  title={DocBank: A Benchmark Dataset for Document Layout Analysis},
  author={Minghao Li and Yiheng Xu and Lei Cui and Shaohan Huang and Furu Wei and Zhoujun Li and Ming Zhou},
  year={2020},
  eprint={2006.01038},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Source
Organization: hugging_face