Dataset asset, Open Source Community, Computer Vision, Document Layout Analysis

Doraemon-AI/pdf-layout-chinese

pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.

Source
hugging_face
Created
Nov 28, 2025
Updated
Apr 18, 2024
Overview

Dataset description and usage context

pdf-layout-chinese Dataset Overview

Basic Information

  • Name: pdf-layout-chinese
  • License: AFL-3.0
  • Task Type: Feature Extraction
  • Languages: English, Chinese
  • Size Category: 100M < n < 1B

Content

  • Description: pdf-layout-chinese is a Chinese document layout analysis dataset targeting scholarly paper scenarios.
  • Labels: 10 classes – text, title, image, image title, table, table title, header, footer, caption, formula.
  • Splits: 5,000 training images and 1,000 validation images, stored in train and val directories respectively.
  • Annotation Files: Each image has a same‑named JSON annotation file generated with labelme.
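
Since every image is expected to have a same-named JSON file, a quick pairing check can catch incomplete downloads before training. The following is a minimal sketch, not part of the dataset's tooling. The function name find_unpaired and the list of image extensions are assumptions, because the dataset card does not state the image format.

```python
from pathlib import Path

# Assumption: images use one of these extensions; the dataset card does not say which.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def find_unpaired(split_dir):
    """Return (images_without_json, jsons_without_image) as sorted lists of file stems."""
    split_dir = Path(split_dir)
    files = [p for p in split_dir.iterdir() if p.is_file()]
    image_stems = {p.stem for p in files if p.suffix.lower() in IMAGE_EXTENSIONS}
    json_stems = {p.stem for p in files if p.suffix.lower() == ".json"}
    # An image and its annotation share the same stem, so set differences expose gaps.
    return sorted(image_stems - json_stems), sorted(json_stems - image_stems)
```

Running find_unpaired on the train and val directories should return two empty lists if the splits are complete.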

Annotation Format

  • Tool: labelme
  • Structure: Aligns with labelme format and includes key fields:
    • shapes: list of annotation instances, each with the fields label, points, and shape_type.
    • label: class label of the instance.
    • points: polygon vertices as [x, y] pixel coordinates.
    • shape_type: polygon
    • imagePath: image path/name
    • imageHeight: image height
    • imageWidth: image width
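
To make the field layout concrete, here is a hedged sketch that reads one annotation and counts polygon instances per class. The helper names load_annotation and summarize_annotation are hypothetical, and the sketch assumes the standard labelme layout in which each entry of shapes stores its class under the key label.

```python
import json
from collections import Counter


def load_annotation(json_path):
    """Load one labelme JSON file into a dict."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def summarize_annotation(annotation):
    """Summarize one labelme annotation: image name, size, and polygon count per class."""
    counts = Counter()
    for shape in annotation.get("shapes", []):
        # Only polygon shapes are counted, matching the dataset description.
        if shape.get("shape_type") == "polygon":
            counts[shape["label"]] += 1
    return {
        "image": annotation.get("imagePath"),
        "size": (annotation.get("imageWidth"), annotation.get("imageHeight")),
        "counts": dict(counts),
    }
```

Aggregating the counts over all files in train and val gives a quick view of class balance across the 10 layout classes.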

Conversion

  • Conversion Tool: labelme2coco.py
  • Commands:
    • Training set: python3 labelme2coco.py train train_save_path --labels labels.txt
    • Validation set: python3 labelme2coco.py val val_save_path --labels labels.txt
  • Output Location: Saved under train_save_path / val_save_path directories.
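
For readers unfamiliar with the target format, the sketch below shows how a single labelme polygon maps onto the bbox, segmentation, and area fields of a COCO annotation. It illustrates the format only and is not the implementation of labelme2coco.py; the function name polygon_to_coco is hypothetical.

```python
def polygon_to_coco(points):
    """Convert labelme polygon vertices [[x, y], ...] into COCO-style geometry fields."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # COCO boxes are [x_min, y_min, width, height], not corner pairs.
    bbox = [x_min, y_min, max(xs) - x_min, max(ys) - y_min]
    # COCO polygons are flat lists [x1, y1, x2, y2, ...], one list per polygon.
    segmentation = [[coord for point in points for coord in point]]
    # Polygon area via the shoelace formula.
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        twice_area += x1 * y2 - x2 * y1
    return {"bbox": bbox, "segmentation": segmentation, "area": abs(twice_area) / 2.0}
```

A full COCO file additionally needs images and categories entries and per-annotation IDs, which the conversion script is expected to generate from the JSON files and labels.txt.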