Doraemon-AI/pdf-layout-chinese

pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.

Source

hugging_face

Created

Nov 28, 2025

Updated

Apr 18, 2024

Overview

Dataset description and usage context

pdf-layout-chinese Dataset Overview

Basic Information

Name: pdf-layout-chinese
License: AFL-3.0
Task Type: Feature Extraction
Languages: English, Chinese
Size Category: 100M < n < 1B

Content

Description: pdf-layout-chinese is a Chinese document layout analysis dataset targeting scholarly paper scenarios.
Labels: 10 classes – text, title, image, image title, table, table title, header, footer, caption, formula.
Splits: 5,000 training images and 1,000 validation images, stored in train and val directories respectively.
Annotation Files: Each image has a same‑named JSON annotation file generated with labelme.
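
Because each image is paired with a same-named JSON file, a quick integrity check over a split directory can be sketched as below. This is a minimal sketch, not part of the dataset's tooling: the function name pair_images_with_annotations is hypothetical, and the set of image extensions is an assumption, since the description does not state the image format.

```python
from pathlib import Path

# Assumption: the description does not state the image format,
# so a few common extensions are accepted here.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(split_dir):
    """Pair every image in a split directory with its same-named JSON file.

    Returns (pairs, missing):
      pairs   -- list of (image_path, json_path) tuples, sorted by file name
      missing -- list of image paths that have no same-named JSON file
    """
    split_dir = Path(split_dir)
    pairs, missing = [], []
    for image_path in sorted(split_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTS:
            continue  # skip JSON files and anything that is not an image
        json_path = image_path.with_suffix(".json")
        if json_path.is_file():
            pairs.append((image_path, json_path))
        else:
            missing.append(image_path)
    return pairs, missing
```

Run against the train and val directories, the number of pairs can be compared with the stated 5,000 and 1,000 images, and a non-empty missing list points to images without annotations.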

Annotation Format

Tool: labelme
Structure: Follows the labelme JSON format. Key fields:
- shapes: list of annotation instances; each instance carries the label, points, and shape_type fields below.
- label: class label of the instance.
- points: polygon vertex coordinates.
- shape_type: polygon.
- imagePath: image path/name.
- imageHeight: image height.
- imageWidth: image width.
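
The fields above can be read with the standard json module. The following is a minimal sketch under the assumption that the files follow the usual labelme layout, in which each entry of shapes stores its class under the key label, alongside points and shape_type. The helper names load_shapes and count_labels are hypothetical.

```python
import json
from collections import Counter


def load_shapes(json_path):
    """Read one labelme annotation file.

    Returns (image_path, (width, height), shapes), where shapes is a list of
    (label, shape_type, points) tuples and points is a list of (x, y) pairs.
    """
    with open(json_path, encoding="utf-8") as f:
        ann = json.load(f)
    size = (ann["imageWidth"], ann["imageHeight"])
    shapes = [
        (
            shape["label"],
            # Assumption: fall back to "polygon" if shape_type is absent.
            shape.get("shape_type", "polygon"),
            [tuple(point) for point in shape["points"]],
        )
        for shape in ann["shapes"]
    ]
    return ann["imagePath"], size, shapes


def count_labels(shapes):
    """Count how many instances of each class appear in one file."""
    return Counter(label for label, _, _ in shapes)
```

Summing the counters over all files in a split gives a per-class instance count, which is a simple way to check that only the 10 expected class names occur.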

Conversion

Conversion Tool: labelme2coco.py
Commands:
- Training set: python3 labelme2coco.py train train_save_path --labels labels.txt
- Validation set: python3 labelme2coco.py val val_save_path --labels labels.txt
Output Location: The converted COCO-format output is saved under the train_save_path and val_save_path directories, respectively.
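
To illustrate the kind of geometry a labelme-to-COCO conversion involves, the sketch below derives a COCO-style bounding box ([x, y, width, height]) and an area from one polygon's points. It is an illustrative sketch only and is not taken from labelme2coco.py, which may compute these values differently; the function names are hypothetical.

```python
def polygon_to_coco_bbox(points):
    """Return the axis-aligned box [x_min, y_min, width, height] of a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def polygon_area(points):
    """Return the area of a simple polygon using the shoelace formula."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0
```

For a rectangular region the two results agree (area equals width times height); for irregular polygons the bounding box is larger than the polygon area.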
