Dataset asset · Open Source Community · Computer Vision · Document Layout Analysis
Doraemon-AI/pdf-layout-chinese
pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.
Source
hugging_face
Created
Nov 28, 2025
Updated
Apr 18, 2024
Overview
Dataset description and usage context
pdf-layout-chinese Dataset Overview
Basic Information
- Name: pdf-layout-chinese
- License: AFL-3.0
- Task Type: Feature Extraction
- Languages: English, Chinese
- Size Category: 100M < n < 1B
Content
- Description: pdf-layout-chinese is a Chinese document layout analysis dataset aimed at scholarly papers.
- Labels: 10 classes – text, title, image, image title, table, table title, header, footer, caption, formula.
- Splits: 5,000 training images and 1,000 validation images, stored in the train and val directories respectively.
- Annotation Files: Each image has a same-named JSON annotation file generated with labelme.
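Because each image and its annotation share a file name, the two can be matched by stem. The sketch below is one minimal way to do that; the function name `pair_by_stem` and the sample file names and extensions are illustrative assumptions, not part of the dataset.

```python
from pathlib import PurePath


def pair_by_stem(filenames):
    """Pair each image file name with the JSON file name sharing its stem."""
    names = [PurePath(n) for n in filenames]
    # Index annotation files by stem, e.g. "page_0001" for "page_0001.json".
    annotations = {p.stem: p.name for p in names if p.suffix.lower() == ".json"}
    pairs = []
    for p in sorted(names):
        # Any non-JSON file with a matching stem is treated as the image.
        if p.suffix.lower() != ".json" and p.stem in annotations:
            pairs.append((p.name, annotations[p.stem]))
    return pairs


# Hypothetical listing of a split directory; real names and extensions may differ.
listing = ["page_0001.jpg", "page_0001.json", "page_0002.jpg"]
print(pair_by_stem(listing))
```

In practice the listing could come from `os.listdir("train")` or `os.listdir("val")`. Files without a matching annotation are skipped rather than reported.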
Annotation Format
- Tool: labelme
- Structure: Follows the labelme format and includes these key fields:
  - shapes: list of annotation instances.
  - labels: class labels.
  - points: polygon coordinates.
  - shape_type: polygon
  - imagePath: image path/name
  - imageHeight: image height
  - imageWidth: image width
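As a rough illustration of how these fields fit together, the sketch below counts shapes per class in a single annotation record. The sample record is invented for the example. labelme itself writes the class name under the key `label`, so the code reads that key first and falls back to `labels` as listed above. Which key the dataset's files actually use is an assumption to verify against a real file.

```python
import json
from collections import Counter


def summarize_annotation(record):
    """Count shapes per class label in one labelme-style annotation dict."""
    counts = Counter()
    for shape in record.get("shapes", []):
        # labelme writes "label"; fall back to "labels" as listed above.
        label = shape.get("label", shape.get("labels"))
        counts[label] += 1
    return counts


# Hypothetical record using the fields listed above; all values are invented.
sample = json.loads("""
{
  "imagePath": "page_0001.jpg",
  "imageHeight": 1000,
  "imageWidth": 800,
  "shapes": [
    {"label": "title", "shape_type": "polygon",
     "points": [[100, 50], [700, 50], [700, 120], [100, 120]]},
    {"label": "text", "shape_type": "polygon",
     "points": [[100, 150], [700, 150], [700, 600], [100, 600]]}
  ]
}
""")
print(summarize_annotation(sample))
```

For a real file, `json.load(open(path, encoding="utf-8"))` would replace the inline sample.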
Conversion
- Conversion Tool: labelme2coco.py
- Commands:
  - Training set: python3 labelme2coco.py train train_save_path --labels labels.txt
  - Validation set: python3 labelme2coco.py val val_save_path --labels labels.txt
- Output Location: Saved under the train_save_path and val_save_path directories.
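The internals of labelme2coco.py are not described here. As a hedged illustration of one step that a labelme-to-COCO conversion typically involves, the sketch below reduces a polygon to a COCO-style bounding box in `[x, y, width, height]` form. The function name `polygon_to_coco_bbox` and the sample polygon are assumptions for the example and are not taken from the conversion script.

```python
def polygon_to_coco_bbox(points):
    """Return the axis-aligned [x, y, width, height] box enclosing a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # COCO boxes store the top-left corner plus width and height.
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


# Hypothetical polygon in the labelme "points" layout: a list of [x, y] pairs.
polygon = [[100, 50], [700, 60], [690, 120], [110, 115]]
print(polygon_to_coco_bbox(polygon))  # [100, 50, 600, 70]
```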