Explore high-quality datasets for your AI and machine learning projects.
This dataset converts the original SynthTabNet tables into OTSL format for table‑structure recognition tasks. It comprises four parts of 150k tables each (600k in total). The parts differ in table appearance, size, structure, and content, and each part is split into training, test, and validation sets. Each record includes the cell content, OTSL tokens, HTML structure, restored HTML, column count, row count, and the table image. An OTSL vocabulary defines the cell token types. The dataset was converted and is maintained by IBM Research's Deep Search team.
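To make the relationship between the OTSL tokens and the row and column counts concrete, here is a minimal sketch of turning a flat OTSL token sequence into a rectangular grid. It assumes the token names given in the OTSL paper (`fcel`, `ecel`, `lcel`, `ucel`, `xcel`, and `nl` as the row terminator); the function names and the example sequence are hypothetical and are not taken from the dataset itself.

```python
# Token names follow the OTSL paper; this is an assumption, so check the
# dataset's own vocabulary before relying on it.
OTSL_TOKENS = {"fcel", "ecel", "lcel", "ucel", "xcel", "nl"}


def otsl_to_grid(tokens):
    """Split a flat OTSL token sequence into rows, using "nl" as the row terminator."""
    rows, current = [], []
    for tok in tokens:
        if tok not in OTSL_TOKENS:
            raise ValueError(f"unknown OTSL token: {tok!r}")
        if tok == "nl":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # tolerate a sequence that lacks the final "nl"
        rows.append(current)
    # OTSL describes a rectangular grid, so every row must have the same width.
    if len({len(row) for row in rows}) > 1:
        raise ValueError("OTSL rows must all have the same length")
    return rows


def grid_shape(rows):
    """Return (row count, column count) for a grid produced by otsl_to_grid."""
    return len(rows), (len(rows[0]) if rows else 0)


# Hypothetical 2x3 table: the first cell of row 1 spans two columns ("lcel"
# extends the cell to its left), and the last cell of row 2 is empty.
example = ["fcel", "lcel", "fcel", "nl", "fcel", "fcel", "ecel", "nl"]
grid = otsl_to_grid(example)
n_rows, n_cols = grid_shape(grid)
```

A shape computed this way can serve as a sanity check against the row count and column count stored with each record, assuming those fields count grid positions rather than merged cells.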
The ICDAR‑2013.c dataset, released in 2023, is a variant of the original ICDAR‑2013 dataset, modified by a different group of authors. It includes minor corrections to the original data, along with automated fixes (e.g., normalization) that address over‑segmentation and make the dataset more consistent with other table structure recognition (TSR) datasets such as PubTables‑1M. For more details on this version and its manual corrections, refer to the associated paper.