docmatix-ir
Docmatix‑IR is derived from the original Docmatix collection and is intended for training document visual embedding models for open‑domain visual question answering. Unsuitable questions are filtered out and hard negatives are mined to produce high‑quality training data: the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and the retrieval results determine which questions to keep. The final dataset contains 5.61 M high‑quality training samples, after filtering out roughly 4 M questions.
Dataset Description
Docmatix‑IR is derived from the original Docmatix dataset and is specifically intended for training document visual embedding models for open‑domain visual question answering tasks. The original Docmatix dataset contains a large number of PDF images (2.4 M) and associated questions (9.5 M), but many questions are unsuitable for open‑domain QA.
Data Processing Steps
- Question Filtering: remove questions that only make sense with the source document already in hand and are therefore unsuitable for open‑domain QA, such as “What is the summary of the text?”
- Hard Negative Mining: Identify challenging negative samples for each question to create high‑quality training data.
The concrete processing method involves encoding the entire Docmatix corpus with the Document Screenshot Embedding (DSE) model and retrieving 100 candidate documents for each question. If the original paired PDF image (the positive document) does not appear among the top‑100 results, the query is considered unsuitable for open‑domain retrieval and is filtered out. If the positive document appears in the top‑100, non‑positive documents are treated as hard negatives for that question.
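The filtering and hard‑negative mining step above can be sketched as follows. This is a minimal illustration using dot‑product scores over toy vectors; the function and variable names are hypothetical, and the actual pipeline runs DSE embeddings over the full 2.4 M‑image corpus with k = 100:

```python
import numpy as np

def filter_and_mine(query_vecs, doc_vecs, positives, k=100):
    """For each query, retrieve the top-k documents by dot-product score.
    A query whose positive document falls outside the top-k is dropped as
    unsuitable for open-domain retrieval; otherwise the remaining retrieved
    documents become its hard negatives."""
    kept = {}
    scores = query_vecs @ doc_vecs.T              # (num_queries, num_docs)
    for qi, pos in enumerate(positives):
        topk = np.argsort(-scores[qi])[:k]        # indices of top-k documents
        if pos not in topk:
            continue                              # positive not retrieved: filter out
        kept[qi] = [d for d in topk if d != pos]  # hard negatives for this query
    return kept
```

With DSE‑style embeddings in place of the toy vectors, `kept` maps each surviving query to its mined hard negatives, which is exactly the structure the training samples need.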
Dataset Scale
After filtering and hard‑negative mining, the final dataset contains 5.61 M high‑quality training samples, having filtered out roughly 4 M questions.
Dataset Usage
The dataset is used together with the original Docmatix dataset; the original serves as the corpus for retrieving the corresponding image data. In Docmatix‑IR, the format of query IDs and document IDs is as follows:
- Document ID: `{example_idx}_{image_idx}`
- Query ID: `{example_idx}_{question_idx}`
where `{example_idx}` is the example index in the original Docmatix dataset, `{image_idx}` indexes the images of that example, and `{question_idx}` indexes its questions.
For example, to obtain the image for document ID `"123_1"`:
```python
from datasets import load_dataset

# The original Docmatix dataset serves as the image corpus.
corpus = load_dataset("HuggingFaceM4/Docmatix", split="train")

docid = "123_1"
example_idx, image_idx = docid.split("_")
target_image = corpus[int(example_idx)]["images"][int(image_idx)]
```
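Query IDs follow the same `{example_idx}_{question_idx}` pattern, so both ID kinds can be parsed with one small helper (the `parse_id` name is ours, not part of the dataset):

```python
def parse_id(docmatix_id: str) -> tuple[int, int]:
    """Split a Docmatix-IR ID of the form "{example_idx}_{sub_idx}" into
    integer indices. sub_idx is an image index for document IDs and a
    question index for query IDs."""
    example_idx, sub_idx = docmatix_id.split("_")
    return int(example_idx), int(sub_idx)
```

For instance, `parse_id("123_1")` returns `(123, 1)`, and the first value can then be used to index into the original Docmatix corpus as shown above.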