High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

jordanparker6/publaynet

PubLayNet is a large dataset of document images whose layout annotations consist of bounding boxes and polygon segmentations. The documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis", presents it as the largest dataset in the field of document layout analysis at the time of publication; see the paper for details.

Source: Hugging Face

Doraemon-AI/pdf-layout-chinese

Document Layout Analysis

Computer Vision

pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.
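Because the annotations are labelme JSON files with polygon shapes, reading one might look like the minimal sketch below. It assumes the standard labelme layout (a top-level `shapes` list whose entries carry `label` and `points`); the sample annotation, the file name `page_0001.jpg`, and the helper names `polygon_to_bbox` and `load_regions` are hypothetical and are not taken from the dataset itself.

```python
import json

# Hypothetical annotation written in labelme's JSON layout.
# The values are made up for illustration; real files in the dataset may
# contain additional keys.
SAMPLE = """
{
  "imagePath": "page_0001.jpg",
  "imageHeight": 1000,
  "imageWidth": 800,
  "shapes": [
    {"label": "title", "shape_type": "polygon",
     "points": [[100, 50], [700, 50], [700, 120], [100, 120]]},
    {"label": "text", "shape_type": "polygon",
     "points": [[100, 150], [700, 160], [690, 400], [110, 390]]}
  ]
}
"""


def polygon_to_bbox(points):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) around a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def load_regions(annotation_text):
    """Parse one labelme annotation and return (label, bbox) pairs."""
    data = json.loads(annotation_text)
    return [
        (shape["label"], polygon_to_bbox(shape["points"]))
        for shape in data.get("shapes", [])
    ]


regions = load_regions(SAMPLE)
for label, bbox in regions:
    print(label, bbox)
```

Converting each polygon to its enclosing box is only one possible choice; models that consume segmentation masks can use the `points` lists directly.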

Source: Hugging Face
