Dataset asset · Open Source Community · Visual Question Answering · Infographic Question Answering

vidore/infovqa_test_subsampled

This dataset is a subsample of the InfoVQA test split. InfoVQA consists of infographics collected from the internet, with manually annotated questions and answers. For consistency across benchmarks, the original test set was subsampled to 500 pairs and its columns were renamed. Each instance carries features such as questionId, query, and image.

Source: Hugging Face
Created: Nov 28, 2025
Updated: Jun 27, 2024

Dataset Overview

Dataset Description

  • Source: The data comes from the test split of the InfoVQA dataset, whose infographics were collected from the internet using the search query “infographics”. The questions and answers were annotated manually.

Data Structure

  • Features:
    • questionId: Question ID (string)
    • query: Query content (string)
    • answer: Answer (null; not populated in this split)
    • answer_type: Answer type (null; not populated in this split)
    • image: Image (image)
    • image_filename: Image filename (string)
    • operation/reasoning: Operation/Reasoning (null; not populated in this split)
    • ocr: OCR text (string)
    • data_split: Data split (string)
    • source: Data source (string)
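
The feature list above can be turned into a small sanity check for a single record. The sketch below is illustrative only: the mapping from the listed types to Python types is an assumption (empty columns are treated as None, and the image column is accepted as any object), and the sample record contains made-up values, not data from the dataset.

```python
from typing import Any, List, Mapping

# Column names come from the feature list; the Python types are assumptions
# made for illustration (empty columns -> None, image -> any object).
EXPECTED_COLUMNS = {
    "questionId": str,
    "query": str,
    "answer": type(None),
    "answer_type": type(None),
    "image": object,
    "image_filename": str,
    "operation/reasoning": type(None),
    "ocr": str,
    "data_split": str,
    "source": str,
}


def check_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return problems


# Hypothetical record with placeholder values, used only to exercise the check.
sample_record = {
    "questionId": "example-id",
    "query": "What is the title of the infographic?",
    "answer": None,
    "answer_type": None,
    "image": object(),
    "image_filename": "example.jpeg",
    "operation/reasoning": None,
    "ocr": "placeholder OCR text",
    "data_split": "test",
    "source": "infovqa",
}

problems = check_record(sample_record)
print(problems or "record matches the expected columns")
```

Running the same check on a few rows after loading is a cheap way to notice if the column names or types differ from this listing.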

Data Split

  • Test set:
    • test: Contains 500 samples, total size 277,995,931 bytes.

Dataset Size

  • Download size: 218,577,138 bytes.
  • Dataset size: 277,995,931 bytes.
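
For planning disk space and bandwidth, the byte counts above can be converted into more readable units. The following is a minimal sketch using only the three figures stated in this section; the interpretation of the gap between download size and dataset size (downloaded files being smaller than the prepared dataset) is an assumption, not something the source states.

```python
# Figures taken from the Data Split and Dataset Size entries above.
NUM_SAMPLES = 500
DOWNLOAD_BYTES = 218_577_138
DATASET_BYTES = 277_995_931

MIB = 1024 ** 2  # bytes per mebibyte


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


download_mib = to_mib(DOWNLOAD_BYTES)
dataset_mib = to_mib(DATASET_BYTES)

# Average footprint of one sample in the prepared dataset.
avg_bytes_per_sample = DATASET_BYTES / NUM_SAMPLES

# Fraction of the prepared size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"download size:  {download_mib:.1f} MiB")
print(f"dataset size:   {dataset_mib:.1f} MiB")
print(f"avg per sample: {avg_bytes_per_sample / 1024:.0f} KiB")
print(f"download/dataset ratio: {download_ratio:.2f}")
```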

Data Loading

  • Loading method:
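# Requires the Hugging Face `datasets` library
from datasets import load_dataset

# Fetch the "test" split (500 samples) from the Hugging Face Hub
ds = load_dataset("vidore/infovqa_test_subsampled", split="test")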
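
Once the split is loaded, a common first step is to see which queries refer to which infographic. The sketch below groups queries by image_filename. It is written against any iterable of dict-like rows so it can be tried without downloading anything, and the helper name queries_by_image and the example rows are hypothetical. Whether several queries actually share an image in this subsample is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def queries_by_image(rows: Iterable[Mapping[str, object]]) -> Dict[str, List[str]]:
    """Group query strings by the image file they refer to, keeping row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["image_filename"]].append(row["query"])
    return dict(grouped)


# Made-up rows that only mimic the column names of the dataset.
example_rows = [
    {"image_filename": "a.jpeg", "query": "What is the title?"},
    {"image_filename": "b.jpeg", "query": "Which year is shown?"},
    {"image_filename": "a.jpeg", "query": "How many bars are there?"},
]

grouped = queries_by_image(example_rows)
for filename, queries in grouped.items():
    print(filename, len(queries))
```

Because iterating a Hugging Face dataset yields one dict per row, passing the ds object from the loading snippet should work the same way. Dropping the image column first with ds.remove_columns("image") is likely to make the pass faster, since the images then do not need to be decoded.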

Citation Information

  • Citation format:
@misc{mathew_infographicvqa_2021,
  title = {{InfographicVQA}},
  copyright = {arXiv.org perpetual, non-exclusive license},
  url = {https://arxiv.org/abs/2104.12756},
  doi = {10.48550/ARXIV.2104.12756},
  urldate = {2024-06-02},
  publisher = {arXiv},
  author = {Mathew, Minesh and Bagal, Viraj and Tito, Rubèn Pérez and Karatzas, Dimosthenis and Valveny, Ernest and Jawahar, C. V.},
  year = {2021},
  note = {Version Number: 2},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV)},
}