DocBank is a new large-scale dataset constructed using weak supervision, designed to provide integrated text and layout information for downstream tasks. The dataset currently contains 500,000 document pages: 400,000 for training, 50,000 for validation, and 50,000 for testing (an 80/10/10 split). Annotations are machine-generated, and the dataset is monolingual (English). Each example includes the following fields: image, token, bounding box, color, font, and label.
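To make the field list concrete, here is a minimal Python sketch of how one token-level annotation could be represented and parsed. The on-disk layout is an assumption, not something stated in this description: the sketch supposes one tab-separated line per token, in the order token, four bounding-box coordinates, three RGB color values, font name, and label. The names `DocBankToken`, `parse_token_line`, and `split_fractions` are hypothetical helpers introduced for illustration, and the sample line is made up rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DocBankToken:
    """One annotated token (hypothetical container, not an official API)."""
    token: str
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1)
    color: Tuple[int, int, int]       # (R, G, B)
    font: str
    label: str


def parse_token_line(line: str) -> DocBankToken:
    """Parse one annotation line.

    ASSUMPTION: fields are tab-separated in the order
    token, x0, y0, x1, y1, R, G, B, font, label.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 10:
        raise ValueError(f"expected 10 tab-separated fields, got {len(parts)}")
    x0, y0, x1, y1 = (int(v) for v in parts[1:5])
    r, g, b = (int(v) for v in parts[5:8])
    return DocBankToken(parts[0], (x0, y0, x1, y1), (r, g, b), parts[8], parts[9])


# Split sizes as stated in the dataset description.
SPLIT_SIZES: Dict[str, int] = {
    "train": 400_000,
    "validation": 50_000,
    "test": 50_000,
}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total page count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = "Abstract\t100\t120\t180\t135\t0\t0\t0\tTimes-Bold\tabstract"
    print(parse_token_line(sample))
    print(split_fractions(SPLIT_SIZES))
```

If the actual files use a different separator or field order, only `parse_token_line` needs to change; the page image would be loaded separately and paired with its token list.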