Dataset asset, Open Source Community, Document Layout Analysis, Automated Annotation

jordanparker6/publaynet

PubLayNet is a large dataset of document images whose layout annotations consist of bounding boxes and polygon segmentations. The documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis," presents it as the largest dataset in the field of document layout analysis; see that paper for details.

  • Source: hugging_face
  • Created: Nov 28, 2025
  • Updated: Jul 19, 2022
  • Signals: 549 views
  • Availability: Linked source ready
Overview

Dataset description and usage context

PubLayNet Dataset Overview

Basic Information

  • Title: PubLayNet
  • License: Community Data License Agreement – Permissive – Version 1.0
  • Language: English (en)
  • Size Category: 100B < size < 1T
  • Task Category: image-to-text

Dataset Description

PubLayNet is a large dataset of document images whose layout is annotated with bounding boxes and polygon segmentations. The documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis," presents it as the largest dataset in the field of document layout analysis.
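
To make the two annotation types concrete, here is a minimal sketch of how a bounding box relates to a polygon segmentation. It assumes a COCO-style record (a `bbox` given as `[x, y, width, height]` and a `segmentation` given as a flat list of alternating x and y coordinates). The record below, its field names, and its values are invented for illustration and may not match the exact schema of this Hugging Face copy; `polygon_to_bbox` and `polygon_area` are hypothetical helper names.

```python
from typing import List, Tuple

# Hypothetical COCO-style annotation, invented for illustration only.
# Field names and values are assumptions, not taken from the dataset.
annotation = {
    "category": "text",
    "bbox": [50.0, 100.0, 200.0, 40.0],  # [x, y, width, height]
    "segmentation": [[50.0, 100.0, 250.0, 100.0, 250.0, 140.0, 50.0, 140.0]],
}


def polygon_to_bbox(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Return the tight (x, y, width, height) box around a flat x,y polygon."""
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


def polygon_area(polygon: List[float]) -> float:
    """Return the polygon area using the shoelace formula."""
    points = list(zip(polygon[0::2], polygon[1::2]))
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


polygon = annotation["segmentation"][0]
derived_bbox = polygon_to_bbox(polygon)
region_area = polygon_area(polygon)
```

For a rectangular region such as this one, the derived box matches the stored `bbox` and the polygon area equals width times height. For non-rectangular regions the polygon area is smaller than the box area, which is the extra information a polygon segmentation carries over a bounding box.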

Dataset Source

Related Publications
