High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

jordanparker6/publaynet

PubLayNet is a large dataset of document images with layout annotations given as both bounding boxes and polygonal segmentations. The source documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML representations of each article. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis", presents it as the largest dataset for document layout analysis at the time of publication; see the paper for details.
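Since each layout region carries both a polygon and a bounding box, it can help to see how the two relate. The sketch below derives an enclosing box from a polygon. It assumes a COCO-style layout (a flat `[x1, y1, x2, y2, ...]` polygon and an `[x, y, width, height]` box); the `annotation` dictionary, its field names, and the `polygon_to_bbox` helper are hypothetical and are not taken from this listing, so check the dataset's actual schema before relying on them.

```python
from typing import List, Tuple


def polygon_to_bbox(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Return the (x, y, width, height) box that tightly encloses a flat polygon."""
    # A polygon needs at least three (x, y) points, stored as a flat list.
    if len(polygon) < 6 or len(polygon) % 2 != 0:
        raise ValueError("expected a flat [x1, y1, x2, y2, ...] list with at least 3 points")
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


# Hypothetical annotation record; the field names are assumptions for illustration.
annotation = {
    "category": "text",
    "segmentation": [[10.0, 20.0, 110.0, 20.0, 110.0, 70.0, 10.0, 70.0]],
}

bbox = polygon_to_bbox(annotation["segmentation"][0])
print(bbox)  # (10.0, 20.0, 100.0, 50.0)
```

A check like this is one way to confirm that the boxes and polygons in a sample agree before training a layout model on them.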

Source: Hugging Face
