jordanparker6/publaynet

PubLayNet is a large dataset of document images whose layouts are annotated with both bounding boxes and polygon segmentations. The source documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML representations of the same documents. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis", presents it as the largest dataset for document layout analysis and gives further details.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jul 19, 2022

Overview

PubLayNet Dataset Overview

Basic Information

Title: PubLayNet
License: Community Data License Agreement – Permissive – Version 1.0
Language: English (en)
Size Category: 100B < size < 1T
Task Category: image-to-text

Dataset Description

PubLayNet is a large dataset of document images whose layouts are annotated with both bounding boxes and polygon segmentations. The source documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML representations of the same documents. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis", presents it as the largest dataset for document layout analysis.
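
To make the two annotation types concrete, the following is a minimal sketch of how a bounding box relates to a polygon segmentation. It assumes a COCO-style record (a `bbox` stored as `[x, y, width, height]` and a `segmentation` stored as a flat list of `x, y` coordinates); the record, its values, and the helper names `polygon_to_bbox` and `polygon_area` are hypothetical and only illustrate the idea. The exact field layout of this Hugging Face copy may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation record (assumption: COCO-style field names).
# The values are made up for illustration and are not taken from the dataset.
annotation: Dict = {
    "image_id": 1,
    "category_id": 1,  # assumed to identify a layout class
    "bbox": [50.0, 100.0, 200.0, 40.0],  # [x, y, width, height]
    "segmentation": [
        [50.0, 100.0, 250.0, 100.0, 250.0, 140.0, 50.0, 140.0]
    ],  # one polygon as a flat list: x0, y0, x1, y1, ...
}


def polygon_to_bbox(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Return the tightest axis-aligned box (x, y, width, height) around a polygon."""
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


def polygon_area(polygon: List[float]) -> float:
    """Return the polygon area using the shoelace formula."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    n = len(xs)
    total = 0.0
    for i in range(n):
        j = (i + 1) % n  # next vertex, wrapping around to the first
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(total) / 2.0


polygon = annotation["segmentation"][0]
derived_bbox = polygon_to_bbox(polygon)
area = polygon_area(polygon)
```

For a rectangular polygon such as the one above, the derived box matches the stored `bbox` and the polygon area equals width times height; for non-rectangular regions the polygon carries strictly more shape information than the box.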

Dataset Source

Original File Location: https://developer.ibm.com/exchanges/data/all/publaynet/

Related Publications

Paper: "PubLayNet: largest dataset ever for document layout analysis."
Authors: Zhong, Xu; Tang, Jianbin; Yepes, Antonio Jimeno
Year: 2019
