The Docmatix-IR dataset is derived from the original Docmatix collection and is intended for training document visual embedding models for open-domain visual question answering. Its training data is built in two steps: filtering out unsuitable questions and mining hard negatives. Concretely, the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and the retrieval results are used to select which questions to keep. After roughly 4M questions are filtered out, 5.61M training samples remain.
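The filter-and-mine procedure described above can be illustrated with a minimal sketch. It is not the dataset authors' pipeline: the keep/drop rule (discard a question when its source document is missing from the top-k retrieved documents), the use of the remaining top-ranked documents as hard negatives, and every name here (`rank_documents`, `build_training_sample`, `top_k`, `num_negatives`) are assumptions made for illustration. Plain Python lists stand in for DSE embeddings.

```python
from typing import Dict, List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[str]:
    """Return the ids of the top_k documents, most similar first."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


def build_training_sample(
    query_vec: Sequence[float],
    positive_id: str,
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 100,          # assumed cutoff, not taken from the source
    num_negatives: int = 2,    # assumed number of hard negatives
) -> Optional[dict]:
    """Filter one question and mine its hard negatives.

    Assumed rule: if the question's source document is not retrieved
    within top_k, the question is dropped (returns None). Otherwise the
    highest-ranked non-positive documents become hard negatives.
    """
    ranked = rank_documents(query_vec, doc_vecs, top_k)
    if positive_id not in ranked:
        return None
    negatives = [d for d in ranked if d != positive_id][:num_negatives]
    return {"positive": positive_id, "negatives": negatives}


# Toy corpus of four "documents" with 2-dimensional embeddings.
corpus = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

kept = build_training_sample([1.0, 0.0], "a", corpus, top_k=2, num_negatives=1)
dropped = build_training_sample([1.0, 0.0], "c", corpus, top_k=2, num_negatives=1)
```

In this toy run, the first question is kept because its source document `a` is retrieved, and the near-duplicate `b` becomes its hard negative. The second question is dropped because its source document `c` does not appear in the top two results.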