Doraemon-AI/pdf-layout-chinese
pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.
Description
pdf-layout-chinese Dataset Overview
Basic Information
- Name: pdf-layout-chinese
- License: AFL-3.0
- Task Type: Feature Extraction
- Languages: English, Chinese
- Size Category: 100M < n < 1B
Content
- Description: pdf-layout-chinese is a Chinese document layout analysis dataset targeting scholarly paper scenarios.
- Labels: 10 classes – text, title, image, image title, table, table title, header, footer, caption, formula.
- Splits: 5,000 training images and 1,000 validation images, stored in the train and val directories respectively.
- Annotation Files: Each image has a same‑named JSON annotation file generated with labelme.
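As a minimal sketch of how the same-named image/JSON layout described above could be traversed, the following pairs each image in a split directory with its annotation file. The function name `pair_images_with_annotations` and the set of image extensions are assumptions made for illustration; the dataset card does not state the image file format.

```python
from pathlib import Path

# Assumed image extensions; the dataset card does not specify the format.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(split_dir):
    """Return (image_path, json_path) pairs for images with a same-named JSON file.

    Hypothetical helper: `split_dir` would be the `train` or `val` directory.
    Images without a matching JSON file are skipped.
    """
    split_dir = Path(split_dir)
    pairs = []
    for image_path in sorted(split_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTS:
            continue
        # The annotation shares the image's stem and uses a .json suffix.
        json_path = image_path.with_suffix(".json")
        if json_path.is_file():
            pairs.append((image_path, json_path))
    return pairs
```

Calling `pair_images_with_annotations("train")` would then yield one pair per annotated training image, assuming the directory layout matches the description.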
Annotation Format
- Tool: labelme
- Structure: Aligns with labelme format and includes key fields:
  - shapes: list of annotation instances.
  - labels: class labels.
  - points: polygon coordinates.
  - shape_type: polygon
  - imagePath: image path/name
  - imageHeight: image height
  - imageWidth: image width
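The fields listed above can be read with the standard `json` module. The sketch below counts polygon shapes per class in one annotation. It assumes the per-shape class key is `label`, as in stock labelme output, and falls back to `labels` as written in the list above; the helper names and the sample values are hypothetical.

```python
import json
from collections import Counter


def load_annotation(path):
    """Load one labelme-style JSON annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_labels(annotation):
    """Count polygon shapes per class label in a labelme-style annotation."""
    counts = Counter()
    for shape in annotation.get("shapes", []):
        if shape.get("shape_type") != "polygon":
            continue
        # Assumption: stock labelme stores the class under "label";
        # "labels" is tried as a fallback to match the field list above.
        label = shape.get("label", shape.get("labels"))
        counts[label] += 1
    return counts


# Hypothetical annotation with made-up values, for illustration only.
example = {
    "shapes": [
        {"label": "title", "shape_type": "polygon",
         "points": [[10, 10], [200, 10], [200, 40], [10, 40]]},
        {"label": "text", "shape_type": "polygon",
         "points": [[10, 60], [200, 60], [200, 300], [10, 300]]},
        {"label": "text", "shape_type": "polygon",
         "points": [[10, 320], [200, 320], [200, 500], [10, 500]]},
    ],
    "imagePath": "page_0001.jpg",
    "imageHeight": 842,
    "imageWidth": 595,
}

label_counts = count_labels(example)
```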
Conversion
- Conversion Tool: labelme2coco.py
- Commands:
  - Training set: python3 labelme2coco.py train train_save_path --labels labels.txt
  - Validation set: python3 labelme2coco.py val val_save_path --labels labels.txt
- Output Location: Saved under the train_save_path / val_save_path directories.
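To illustrate what a labelme-to-COCO conversion has to compute for each polygon, here is a minimal sketch that derives the COCO `bbox`, `segmentation`, and `area` fields from a list of polygon points. It is not the code of `labelme2coco.py`, whose internals are not described here; the function name `polygon_to_coco_fields` is hypothetical.

```python
def polygon_to_coco_fields(points):
    """Derive COCO-style fields from a polygon given as [[x, y], ...].

    Returns a dict with:
      bbox:         [x_min, y_min, width, height]
      segmentation: [[x1, y1, x2, y2, ...]] (one flat list per polygon)
      area:         polygon area via the shoelace formula
    """
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    x_min, y_min = min(xs), min(ys)
    bbox = [x_min, y_min, max(xs) - x_min, max(ys) - y_min]

    # Flatten the points into COCO's [x1, y1, x2, y2, ...] layout.
    segmentation = [[coord for x, y in zip(xs, ys) for coord in (x, y)]]

    # Shoelace formula: half the absolute sum of cross products of
    # consecutive vertices, wrapping around to the first vertex.
    n = len(points)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return {"bbox": bbox, "segmentation": segmentation, "area": abs(twice_area) / 2.0}


fields = polygon_to_coco_fields([[0, 0], [4, 0], [4, 3], [0, 3]])
```

For the 4-by-3 rectangle in the usage line, the bounding box coincides with the polygon itself, which makes the result easy to check by hand.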
Source
Organization: hugging_face
Created: Unknown