docmatix-ir
Docmatix‑IR is derived from the original Docmatix collection and is intended for training document visual embedding models for open‑domain visual question answering. Unsuitable questions are filtered out and hard negatives are mined to produce high‑quality training data: the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and the retrieval results determine which questions to keep. The final dataset contains 5.61 M high‑quality training samples, after filtering out roughly 4 M questions.
Dataset Description
Docmatix‑IR is derived from the original Docmatix dataset and is specifically intended for training document visual embedding models for open‑domain visual question answering tasks. The original Docmatix dataset contains a large number of PDF images (2.4 M) and associated questions (9.5 M), but many questions are unsuitable for open‑domain QA.
Data Processing Steps
- Question Filtering: remove questions that only make sense with the source document already in hand and are therefore unsuitable for open‑domain QA, such as “What is the summary of the text?”
- Hard Negative Mining: Identify challenging negative samples for each question to create high‑quality training data.
The concrete processing method involves encoding the entire Docmatix corpus with the Document Screenshot Embedding (DSE) model and retrieving 100 candidate documents for each question. If the original paired PDF image (the positive document) does not appear among the top‑100 results, the query is considered unsuitable for open‑domain retrieval and is filtered out. If the positive document appears in the top‑100, non‑positive documents are treated as hard negatives for that question.
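The filtering and hard‑negative mining step above can be sketched as follows. This is a minimal illustration using dot‑product scores over toy vectors; the function and variable names are hypothetical, and the actual pipeline runs DSE embeddings over the full 2.4 M‑image corpus with k = 100:

```python
import numpy as np

def filter_and_mine(query_vecs, doc_vecs, positives, k=100):
    """For each query, retrieve the top-k documents by dot-product score.
    A query whose positive document falls outside the top-k is dropped as
    unsuitable for open-domain retrieval; otherwise the remaining retrieved
    documents become its hard negatives."""
    kept = {}
    scores = query_vecs @ doc_vecs.T              # (num_queries, num_docs)
    for qi, pos in enumerate(positives):
        topk = np.argsort(-scores[qi])[:k]        # indices of top-k documents
        if pos not in topk:
            continue                              # positive not retrieved: filter out
        kept[qi] = [d for d in topk if d != pos]  # hard negatives for this query
    return kept
```

With DSE‑style embeddings in place of the toy vectors, `kept` maps each surviving query to its mined hard negatives, which is exactly the structure the training samples need.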
Dataset Scale
After filtering and hard‑negative mining, the final dataset contains 5.61 M high‑quality training samples, having filtered out roughly 4 M questions.
Dataset Usage
The dataset is used together with the original Docmatix dataset; the original serves as the corpus for retrieving the corresponding image data. In Docmatix‑IR, the format of query IDs and document IDs is as follows:
- Document ID: `{example_idx}_{image_idx}`
- Query ID: `{example_idx}_{question_idx}`
where `{example_idx}` is the example index in the original Docmatix dataset, `{image_idx}` indexes the images of that example, and `{question_idx}` indexes its questions.
For example, to obtain the image for document ID `"123_1"`:
```python
from datasets import load_dataset

# The original Docmatix dataset serves as the image corpus.
corpus = load_dataset("HuggingFaceM4/Docmatix", split="train")

docid = "123_1"
example_idx, image_idx = docid.split("_")
target_image = corpus[int(example_idx)]["images"][int(image_idx)]
```
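Query IDs follow the same `{example_idx}_{question_idx}` pattern, so both ID kinds can be parsed with one small helper (the `parse_id` name is ours, not part of the dataset):

```python
def parse_id(docmatix_id: str) -> tuple[int, int]:
    """Split a Docmatix-IR ID of the form "{example_idx}_{sub_idx}" into
    integer indices. sub_idx is an image index for document IDs and a
    question index for query IDs."""
    example_idx, sub_idx = docmatix_id.split("_")
    return int(example_idx), int(sub_idx)
```

For instance, `parse_id("123_1")` returns `(123, 1)`, and the first value can then be used to index into the original Docmatix corpus as shown above.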