PubLayNet is a large dataset of document images with layout annotations in the form of bounding boxes and polygon segmentations. The documents come from the PubMed Central Open Access Subset, and the annotations are generated automatically by matching the PDF and XML versions of each article. The accompanying paper, "PubLayNet: largest dataset ever for document layout analysis", presents it as the largest dataset for document layout analysis at the time of its publication; see the paper for details.
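To make the two annotation types concrete, here is a minimal sketch of how a bounding box relates to a polygon segmentation. It assumes a COCO-style layout, where a box is `[x, y, width, height]` and a polygon is a flat list `[x1, y1, x2, y2, ...]`; the description above does not specify the file format, so check it against the files you download. The `annotation` record and the `polygon_to_bbox` helper are hypothetical and exist only for illustration.

```python
from typing import List, Tuple


def polygon_to_bbox(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Return the tight (x, y, width, height) box around a flat polygon.

    The polygon is assumed to be a flat list [x1, y1, x2, y2, ...].
    """
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


# Hypothetical annotation record (assumed COCO-style, not taken from the dataset).
annotation = {
    "category": "text",
    "bbox": [50.0, 100.0, 200.0, 40.0],
    "segmentation": [[50.0, 100.0, 250.0, 100.0, 250.0, 140.0, 50.0, 140.0]],
}

# Derive the box from the polygon and compare it with the stored box.
derived = polygon_to_bbox(annotation["segmentation"][0])
matches = list(derived) == annotation["bbox"]
```

For a rectangular region like this one, the derived box and the stored box coincide; for a non-rectangular polygon, the box is only the smallest axis-aligned rectangle that contains it.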