docmatix-ir
The Docmatix‑IR dataset is derived from the original Docmatix collection and is specifically intended for training document visual embedding models for open‑domain visual question answering. By filtering unsuitable questions and mining hard negatives, the dataset provides high‑quality training data. Concretely, the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and retrieval results are used to select questions. The final result consists of 5.61 M high‑quality training samples, after filtering out roughly 4 M questions.
Description
Docmatix‑IR Dataset Overview
Dataset Description
Docmatix‑IR is derived from the original Docmatix dataset and is specifically intended for training document visual embedding models for open‑domain visual question answering tasks. The original Docmatix dataset contains a large number of PDF images (2.4 M) and associated questions (9.5 M), but many questions are unsuitable for open‑domain QA.
Data Processing Steps
- Question Filtering: Remove questions that are too tied to a specific document to work for open‑domain QA, such as “What is the summary of the text?”
- Hard Negative Mining: Identify challenging negative samples for each question to create high‑quality training data.
Concretely, the entire Docmatix corpus is encoded with the Document Screenshot Embedding (DSE) model, and the top 100 candidate documents are retrieved for each question. If the original paired PDF image (the positive document) does not appear among the top‑100 results, the question is considered unsuitable for open‑domain retrieval and is filtered out. If the positive document does appear in the top‑100, the remaining retrieved documents are treated as hard negatives for that question.
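The filtering and mining rule described above can be sketched in a few lines. This is a minimal illustration, not the pipeline actually used to build the dataset: the function names (`dot`, `top_k`, `mine_hard_negatives`) are hypothetical, the toy vectors stand in for DSE embeddings, and scoring by inner product is an assumption about how similarity is computed.

```python
from typing import Dict, List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors (assumed similarity score)."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def mine_hard_negatives(query_vec: Sequence[float],
                        positive_id: str,
                        doc_vecs: Dict[str, Sequence[float]],
                        k: int = 100) -> Optional[List[str]]:
    """Return hard negatives for a query, or None if the query is filtered out."""
    candidates = top_k(query_vec, doc_vecs, k)
    if positive_id not in candidates:
        # Positive document not retrieved: drop the question.
        return None
    # Every other retrieved document becomes a hard negative.
    return [d for d in candidates if d != positive_id]


# Toy corpus of three documents with 2-dimensional embeddings (made-up values).
toy_docs = {"0_0": [1.0, 0.0], "1_0": [0.9, 0.1], "2_0": [0.0, 1.0]}

kept = mine_hard_negatives([1.0, 0.0], "0_0", toy_docs, k=2)
dropped = mine_hard_negatives([1.0, 0.0], "2_0", toy_docs, k=2)
print(kept)     # hard negatives for the first query
print(dropped)  # None, because the positive is outside the top-k
```

At the scale of the real corpus, an exhaustive sort like this would be replaced by a vector index; the sketch only shows the keep-or-drop decision and how negatives fall out of the same retrieval pass.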
Dataset Scale
After filtering and hard‑negative mining, the final dataset contains 5.61 M high‑quality training samples; roughly 4 M of the original 9.5 M questions were filtered out.
Dataset Usage
The dataset is used together with the original Docmatix dataset; the original serves as the corpus for retrieving the corresponding image data. In Docmatix‑IR, the format of query IDs and document IDs is as follows:
- Document ID: {example_idx}_{image_idx}
- Query ID: {example_idx}_{question_idx}
where {example_idx} corresponds to the example index in the original Docmatix dataset.
For example, to obtain the image data for document ID 123_1:
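from datasets import load_dataset

# Load the original Docmatix dataset, which serves as the image corpus.
corpus = load_dataset("HuggingFaceM4/Docmatix", split="train")

# Split the document ID into its example index and image index.
docid = "123_1"
example_idx, image_idx = docid.split("_")

# Look up the example, then pick the image within it.
target_image = corpus[int(example_idx)]["images"][int(image_idx)]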
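Because query IDs and document IDs share the same {example_idx} prefix, the split shown above can be wrapped in small helpers. The following is a sketch under the assumption that every ID contains exactly one underscore separating two non-negative integers; `parse_id` and `same_example` are hypothetical names, not part of any library.

```python
from typing import Tuple


def parse_id(item_id: str) -> Tuple[int, int]:
    """Split '{example_idx}_{second_idx}' into a pair of integers.

    Works for both document IDs (second index = image_idx) and
    query IDs (second index = question_idx).
    """
    example_idx, second_idx = item_id.split("_")
    return int(example_idx), int(second_idx)


def same_example(query_id: str, doc_id: str) -> bool:
    """True when a query and a document come from the same Docmatix example."""
    return parse_id(query_id)[0] == parse_id(doc_id)[0]


print(parse_id("123_1"))               # (123, 1)
print(same_example("123_0", "123_1"))  # True
print(same_example("124_0", "123_1"))  # False
```

A malformed ID (no underscore, extra underscores, or non-numeric parts) raises `ValueError` here, which makes bad IDs visible early rather than silently indexing the wrong example.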
Source
Organization: huggingface
Created: 7/24/2024