Explore high-quality datasets for your AI and machine learning projects.
DARE (Diverse Visual Question Answering with Robustness Evaluation) is a carefully curated multiple‑choice VQA benchmark. It evaluates vision‑language models across five categories and includes four robustness assessments based on prompt, answer‑option subset, output format, and number of correct answers. The validation split contains images, questions, answer options, and correct answers, while the test split hides correct answers to prevent leakage.
DocVQA is a dataset for visual question answering on document images, containing 50,000 questions based on 12,767 images. It is split 80‑10‑10 into train, validation, and test sets (39,463 questions & 10,194 images for training, 5,349 questions & 1,286 images for validation, 5,188 questions & 1,287 images for testing). Document images originate from the UCSF Industry Documents Library and include printed, typed, and handwritten content such as letters, memos, notes, and reports.
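The stated 80‑10‑10 split can be sanity‑checked directly from the question counts above (a quick arithmetic check using only the numbers in the description, not recomputed from the data itself):

```python
# DocVQA question counts per split, as stated in the dataset description.
splits = {"train": 39463, "validation": 5349, "test": 5188}

total = sum(splits.values())  # should be 50,000 questions overall
fractions = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)      # 50000
print(fractions)  # {'train': 78.9, 'validation': 10.7, 'test': 10.4}
```

The counts sum to exactly 50,000, and the split proportions come out close to, though not exactly, 80‑10‑10.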
This bilingual (Chinese and English) dataset supports visual question answering and general QA tasks. It provides multiple configurations, such as ai2d_train_12k and chartqa_train_18k, each corresponding to a different type of training data file.
This dataset is a test split extracted from the InfoVQA dataset, containing infographics collected from the internet with manually annotated questions and answers. To ensure benchmark consistency, the original test set was sampled to 500 pairs and column names were renamed. Each data instance includes multiple features such as questionId, query, image, etc.
We contributed to the development of the VQA‑RAD dataset by acquiring radiology reports. Our work involved collecting and validating these reports to ensure clear structure and accurate textual information corresponding to each image.
This dataset, named vqa_v2, contains features such as question type, multiple‑choice answer, an answer list (each entry with answer, answer confidence, and answer ID), image ID, answer type, question ID, question, and image. It is split into training, validation, and test sets containing 443,757, 214,354, and 447,793 samples respectively. The download size is 34,818,002,031 bytes (about 34.8 GB), and the total dataset size is about 171,555,262,245 bytes (about 171.6 GB).
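The per‑split counts and byte sizes above can be summarized with a short arithmetic check (using only the figures stated in the description):

```python
# vqa_v2 sample counts and byte sizes, as stated in the description.
num_examples = {"train": 443757, "validation": 214354, "test": 447793}
download_size = 34_818_002_031   # bytes
dataset_size = 171_555_262_245   # bytes (fractional part dropped)

total_samples = sum(num_examples.values())
print(total_samples)                          # 1105904 samples overall
print(round(download_size / 1e9, 1), "GB")    # 34.8 GB
print(round(dataset_size / 1e9, 1), "GB")     # 171.6 GB
```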
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: question_id
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: correct_choice_idx
    dtype: int8
  - name: direct_answers
    dtype: string
  - name: difficult_direct_answer
    dtype: bool
  - name: rationales
    list: string
  splits:
  - name: train
    num_bytes: 929295572.0
    num_examples: 17056
  - name: validation
    num_bytes: 60797340.875
    num_examples: 1145
  - name: test
    num_bytes: 338535925.25
    num_examples: 6702
  download_size: 1323807326
  dataset_size: 1328628838.125
---

# Dataset Card for "A-OKVQA"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
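A single A‑OKVQA record can be pictured as a plain dictionary following the declared features (the field values below are invented for illustration; only the field names and types follow the schema, and the binary `image` field is omitted):

```python
# Hypothetical A-OKVQA record matching the declared feature schema.
# All values are made up for illustration purposes.
record = {
    "question_id": "22MexNkBPpdZGX6sxbxVBH",          # dtype: string (hypothetical id)
    "question": "What is the man holding?",           # dtype: string
    "choices": ["umbrella", "bat", "phone", "cup"],   # list of string
    "correct_choice_idx": 2,                          # dtype: int8, index into choices
    "direct_answers": "['phone', 'phone', 'cellphone']",  # stored as a string
    "difficult_direct_answer": False,                 # dtype: bool
    "rationales": ["He raises it to his ear."],       # list of string
}

# Resolve the multiple-choice answer from the index.
answer = record["choices"][record["correct_choice_idx"]]
print(answer)  # phone
```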
The DriveLM dataset supports perception, prediction, planning, behavior and motion tasks through graph‑structured question‑answer pairs. It consists of two parts: DriveLM‑nuScenes and DriveLM‑CARLA. DriveLM‑nuScenes is built on the nuScenes dataset, while DriveLM‑CARLA is collected from the CARLA simulator. Currently, only the training split of DriveLM‑nuScenes is publicly available. The dataset includes a series of questions and answers together with the associated images.
The Docmatix‑IR dataset is derived from the original Docmatix collection and is specifically intended for training document visual embedding models for open‑domain visual question answering. By filtering unsuitable questions and mining hard negatives, the dataset provides high‑quality training data. Concretely, the Document Screenshot Embedding (DSE) model encodes the entire Docmatix corpus, and retrieval results are used to select questions. The final result consists of 5.61 M high‑quality training samples, after filtering out roughly 4 M questions.
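The retrieval‑based filtering and hard‑negative mining step can be sketched with toy embeddings (a minimal illustration only: the real pipeline embeds document screenshots with the DSE model, and the corpus size, dimensions, and cutoff used here are invented):

```python
import numpy as np

# Toy corpus of 100 "document" embeddings, normalized to unit length.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(100, 64))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

positive_idx = 42  # the document that actually answers the query
# Query embedding close to its positive document (plus a little noise).
query = doc_embs[positive_idx] + 0.05 * rng.normal(size=64)

scores = doc_embs @ query            # similarity of every document to the query
ranking = np.argsort(-scores)        # best match first

# Keep the question only if its positive document ranks highly; use the
# other top-ranked (but wrong) documents as hard negatives.
top_k = ranking[:6]
keep_question = positive_idx in top_k
hard_negatives = [int(i) for i in top_k if i != positive_idx][:5]
print(keep_question, len(hard_negatives))
```

The same idea scales up: questions whose positive document is never retrieved get filtered out, and the highest‑scoring non‑matching documents become hard negatives for training.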
WorldCuisines is a large‑scale multilingual and multicultural visual question answering (VQA) benchmark that focuses on cross‑cultural understanding through global cuisines. The dataset comprises text‑image pairs in 30 languages and dialects, spanning nine language families, and contains over one million data points, making it the largest multicultural VQA benchmark to date. It includes two primary tasks: dish name prediction and location prediction. The construction process involves dish selection, metadata annotation, quality assurance, and data compilation. Two evaluation subsets (12,000 and 60,000 instances) and one training set (1,080,000 instances) are provided.
The VGQA dataset is the first comprehensive benchmark for evaluating large language models (LLMs) on vector graphics processing and generation capabilities.