jordanparker6/publaynet
PubLayNet is a large dataset of document images whose layouts are annotated with bounding boxes and polygon segmentations. The source documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The authors present it as the largest dataset for document layout analysis; see the paper "PubLayNet: largest dataset ever for document layout analysis" for details.
PubLayNet Dataset Overview
Basic Information
- Title: PubLayNet
- License: Community Data License Agreement – Permissive – Version 1.0
- Language: English (en)
- Size Category: 100B < size < 1T
- Task Category: image-to-text
Dataset Description
PubLayNet is a large dataset of document images whose layouts are annotated with bounding boxes and polygon segmentations. The source documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The authors present it as the largest dataset for document layout analysis; refer to the paper "PubLayNet: largest dataset ever for document layout analysis."
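Because each layout region carries both a bounding box and a polygon segmentation, it can be useful to check that the two agree. The following is a minimal sketch, assuming COCO-style annotation records in which `bbox` is `[x, y, width, height]` and `segmentation` is a list of flat `[x1, y1, x2, y2, ...]` polygons. The field names in this particular copy of the dataset may differ, and the sample record and the helper `polygon_to_bbox` are hypothetical, invented for illustration.

```python
from typing import List, Tuple


def polygon_to_bbox(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Return the (x, y, width, height) box enclosing a flat polygon.

    The polygon is assumed to be a flat list [x1, y1, x2, y2, ...]
    (COCO-style); this layout is an assumption, not confirmed by the card.
    """
    if len(polygon) < 6 or len(polygon) % 2 != 0:
        raise ValueError("polygon needs at least 3 (x, y) pairs")
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


# Hypothetical annotation record, made up for this example.
annotation = {
    "category_id": 1,
    "bbox": [50.0, 100.0, 200.0, 40.0],
    "segmentation": [[50.0, 100.0, 250.0, 100.0, 250.0, 140.0, 50.0, 140.0]],
}

derived = polygon_to_bbox(annotation["segmentation"][0])
matches = list(derived) == annotation["bbox"]
print(derived, matches)
```

A check like this only compares the enclosing rectangle; it says nothing about whether the polygon itself traces the region accurately.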
Dataset Source
- Original File Location: https://developer.ibm.com/exchanges/data/all/publaynet/
Related Publications
- Paper: "PubLayNet: largest dataset ever for document layout analysis."
- Authors: Zhong, Xu; Tang, Jianbin; Yepes, Antonio Jimeno
- Year: 2019