Dataset asset, Open Source Community, Artificial Intelligence, Document Analysis

maveriq/DocBank

DocBank is a new large‑scale dataset constructed using weak supervision, designed to provide integrated text and layout information for downstream tasks. Currently, the DocBank dataset contains 500,000 pages of documents, with 400,000 for training, 50,000 for validation, and 50,000 for testing. Annotations are machine‑generated, the language is English, and the dataset is monolingual. Fields include image, token, bounding box, color, font, and label.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 5, 2023
Signals: 397 views
Availability: Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • Name: DocBank

Dataset Summary

  • Summary: DocBank is a large‑scale dataset built with weak supervision to provide integrated text and layout information for downstream tasks. It contains 500,000 document pages, with 400,000 for training, 50,000 for validation, and 50,000 for testing.

Supported Tasks

  • Task: Document AI (text and layout)

Language

  • Language: English

Dataset Structure

  • Data Instances: Information to be added.
  • Data Fields:
    • image
    • token
    • bounding_box
    • color
    • font
    • label
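
The token-level fields listed above can be pictured as one record per token. The following is a minimal sketch only: the field names come from the list, but the Python types, the (x0, y0, x1, y1) box convention, the (R, G, B) color convention, and the sample values are assumptions, since the listing does not specify them. The image field is left out because its storage format is not described here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TokenAnnotation:
    """One token-level record. Field names follow the dataset card; types are assumed."""
    token: str
    bounding_box: Tuple[int, int, int, int]  # assumed order: (x0, y0, x1, y1)
    color: Tuple[int, int, int]              # assumed order: (R, G, B)
    font: str
    label: str


def bbox_area(ann: TokenAnnotation) -> int:
    """Return the box area, treating inverted or empty boxes as zero."""
    x0, y0, x1, y1 = ann.bounding_box
    return max(0, x1 - x0) * max(0, y1 - y0)


# Illustrative values only; they are not taken from the dataset.
example = TokenAnnotation(
    token="Abstract",
    bounding_box=(100, 200, 180, 220),
    color=(0, 0, 0),
    font="ExampleFont",  # hypothetical font name
    label="abstract",    # hypothetical label value
)
print(bbox_area(example))
```

Keeping the record immutable (frozen=True) is a design choice for this sketch, so that annotations read from disk cannot be modified by accident during preprocessing.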

Data Splits

  • Training Set: 400,000 instances
  • Validation Set: 50,000 instances
  • Test Set: 50,000 instances
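
As a quick sanity check, the split sizes above can be summed and turned into fractions. This is a sketch under stated assumptions: the counts are taken from the list, while the key names "train", "validation", and "test" are chosen here for illustration and may not match the names used by the hosted dataset.

```python
# Split sizes as stated on the dataset card; key names are assumed.
SPLIT_SIZES = {"train": 400_000, "validation": 50_000, "test": 50_000}


def split_fractions(sizes):
    """Return each split's share of the total number of pages."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_pages = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(total_pages)  # matches the 500,000 pages stated in the summary
print(fractions)    # an 80/10/10 division across the three splits
```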

Dataset Creation

Citation

@misc{li2020docbank,
  title={DocBank: A Benchmark Dataset for Document Layout Analysis},
  author={Minghao Li and Yiheng Xu and Lei Cui and Shaohan Huang and Furu Wei and Zhoujun Li and Ming Zhou},
  year={2020},
  eprint={2006.01038},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}