docmatix-ir

The Docmatix‑IR dataset is derived from the original Docmatix collection and is intended for training document visual embedding models for open‑domain visual question answering. Unsuitable questions are filtered out and hard negatives are mined to produce high‑quality training data: the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and its retrieval results determine which questions are kept. The result is 5.61 M training samples, with roughly 4 M questions filtered out.

Updated 7/24/2024
huggingface

Description

Docmatix‑IR Dataset Overview

Dataset Description

Docmatix‑IR is derived from the original Docmatix dataset and is specifically intended for training document visual embedding models for open‑domain visual question answering tasks. The original Docmatix dataset contains a large number of PDF images (2.4 M) and associated questions (9.5 M), but many questions are unsuitable for open‑domain QA.

Data Processing Steps

  1. Question Filtering: Remove questions that only make sense with a specific document in hand and so are not suitable for open‑domain QA, such as “What is the summary of the text?”
  2. Hard Negative Mining: For each remaining question, identify challenging negative documents to create high‑quality training data.

Both steps rely on a single retrieval run: the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and the top 100 documents are retrieved for each question. If the original paired PDF image (the positive document) does not appear among the top‑100 results, the query is considered unsuitable for open‑domain retrieval and is filtered out. If the positive document does appear in the top‑100, the remaining retrieved documents are treated as hard negatives for that question.
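
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' code: the function name `filter_and_mine`, the toy IDs, and the assumption that each question has exactly one positive document are all hypothetical, and the retrieval step itself (DSE encoding and search) is replaced by hand-written ranked lists.

```python
from typing import Dict, List, Optional


def filter_and_mine(
    ranked_docids: List[str],
    positive_docid: str,
    top_k: int = 100,
) -> Optional[List[str]]:
    """Return hard negatives for one query, or None if the query is filtered out.

    ranked_docids: document IDs retrieved for the query, best first.
    positive_docid: the document originally paired with the question.
    """
    candidates = ranked_docids[:top_k]
    if positive_docid not in candidates:
        # Positive not retrieved: the query is treated as unsuitable.
        return None
    # Positive retrieved: every other candidate becomes a hard negative.
    return [d for d in candidates if d != positive_docid]


# Toy run with made-up IDs and a top-3 cutoff instead of top-100.
runs: Dict[str, List[str]] = {
    "7_0": ["12_1", "7_0", "3_0"],  # positive "7_0" is retrieved
    "9_2": ["4_0", "5_1", "6_0"],   # positive "9_0" is missing
}
positives = {"7_0": "7_0", "9_2": "9_0"}

training = {}
for qid, ranked in runs.items():
    negatives = filter_and_mine(ranked, positives[qid], top_k=3)
    if negatives is not None:
        training[qid] = {"positive": positives[qid], "negatives": negatives}
```

In this toy run only query `7_0` survives, with `12_1` and `3_0` as its hard negatives; query `9_2` is dropped because its positive never appears in the candidate list.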

Dataset Scale

After filtering and hard‑negative mining, the final dataset contains 5.61 M high‑quality training samples, having filtered out roughly 4 M questions.
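
As a quick consistency check, the "roughly 4 M" figure can be recovered from the two rounded counts quoted on this page (9.5 M questions in Docmatix, 5.61 M samples in Docmatix‑IR). The sketch below assumes that each training sample corresponds to one retained question, which the page implies but does not state outright.

```python
total_questions = 9_500_000  # questions in the original Docmatix (rounded figure)
kept_questions = 5_610_000   # training samples in Docmatix-IR (rounded figure)

removed = total_questions - kept_questions
removed_share = removed / total_questions

print(f"removed: {removed / 1e6:.2f} M ({removed_share:.0%} of all questions)")
# prints "removed: 3.89 M (41% of all questions)"
```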

Dataset Usage

Docmatix‑IR is used together with the original Docmatix dataset, which serves as the corpus from which the corresponding image data is looked up. Query IDs and document IDs in Docmatix‑IR have the following format:

  • Document ID: {example_idx}_{image_idx}
  • Query ID: {example_idx}_{question_idx}

where {example_idx} corresponds to the example index in the original Docmatix dataset.

For example, to obtain the image data for document ID 123_1 (example 123, image 1):

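from datasets import load_dataset

# Load the original Docmatix corpus, which holds the page images.
# If the Hub repo requires a config name, pass it as the second argument.
corpus = load_dataset("HuggingFaceM4/Docmatix", split="train")

# A document ID is "{example_idx}_{image_idx}".
docid = "123_1"
example_idx, image_idx = docid.split("_")
target_image = corpus[int(example_idx)]["images"][int(image_idx)]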
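
Query IDs follow the same two-part pattern, so both kinds of ID can be parsed with one small helper. The sketch below is illustrative only: `DocRef`, `QueryRef`, `parse_docid`, `parse_qid`, and `same_example` are hypothetical names, not part of the dataset or of the `datasets` library. It needs no download, since it works on the ID strings alone.

```python
from typing import NamedTuple, Tuple


class DocRef(NamedTuple):
    example_idx: int
    image_idx: int


class QueryRef(NamedTuple):
    example_idx: int
    question_idx: int


def _split_id(raw: str) -> Tuple[int, int]:
    """Split '<int>_<int>' into two integers, rejecting anything else."""
    left, sep, right = raw.partition("_")
    if not sep or not left.isdigit() or not right.isdigit():
        raise ValueError(f"expected '<int>_<int>', got {raw!r}")
    return int(left), int(right)


def parse_docid(docid: str) -> DocRef:
    """Parse a document ID of the form {example_idx}_{image_idx}."""
    return DocRef(*_split_id(docid))


def parse_qid(qid: str) -> QueryRef:
    """Parse a query ID of the form {example_idx}_{question_idx}."""
    return QueryRef(*_split_id(qid))


def same_example(qid: str, docid: str) -> bool:
    """True when the query and the document come from the same Docmatix example."""
    return parse_qid(qid).example_idx == parse_docid(docid).example_idx


doc = parse_docid("123_1")
print(doc.example_idx, doc.image_idx)
```

Validating the format up front means a malformed ID fails with a clear `ValueError` rather than with an index error deep inside the corpus lookup.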

Topics

Visual Question Answering
Document Retrieval

Source

Organization: huggingface

Created: 7/24/2024
