MVCap-4M
The MVCap‑4M dataset is a large‑scale multi‑view image‑text dataset designed for studying the viewpoint invariance of vision‑language pretraining (VLP) models. It contains over 4.6 million multi‑view image‑text pairs covering more than 100,000 objects. The dataset combines multiple 3D assets with real‑world multi‑view data, renders extensive multi‑view images, and employs visual large language models (VLLMs) to automatically generate semantically rich captions. A class‑guided prompting strategy keeps object categories consistent across viewpoints.
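The class‑guided prompting step described above can be sketched as follows. The prompt template and function name are illustrative assumptions, not the authors' exact implementation; the idea is simply that embedding the object's known category in the captioning prompt steers the VLLM toward category‑consistent descriptions across viewpoints.

```python
def build_caption_prompt(category: str) -> str:
    """Class-guided prompt for VLLM captioning.

    Embedding the object's category in the prompt keeps generated
    captions consistent across rendered viewpoints of the same object.
    (Illustrative template, not the authors' exact wording.)
    """
    return (
        f"This image shows a {category}. "
        f"Describe the {category}'s appearance in one sentence."
    )

# Example: the same prompt is reused for every rendered view of the object.
prompt = build_caption_prompt("rocking chair")
```

Using one prompt per object (rather than per view) is what enforces the category consistency the card describes.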
Dataset Information
- Name: MVCap‑4M
- Language: English
- Task Categories:
  - Zero‑Shot Classification
  - Feature Extraction
- Scale: 1M < n < 10M
- Configuration:
  - Default configuration
  - Data files:
    - Training set: metadata.json
Data File Structure
- metadata.json: stores each image sample's path, caption, object ID, and image ID. Example entry:

  {
    "path": "./views/54cadb86f3db4aa6920f673aeff0d1e3/026.png",
    "caption": "The rocking chair in the image is made of metal and has a green cushion on it.",
    "obj_id": 3177,
    "img_id": 317726
  }

- Source multi‑view images: sampled from three existing 3D datasets.
  - Objaverse‑80k: stored in /views
  - IM3D: stored in /im3d
  - MVImgNet: stored in /mvimgnet
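A minimal sketch of consuming metadata.json records is shown below, assuming each record carries the four fields from the example entry. The `group_by_object` helper is illustrative (not part of any official dataset tooling): it collects all rendered views of one object via the shared `obj_id`, which is the natural grouping for multi‑view training.

```python
import json

# One record in the metadata.json format (example taken from this card).
sample = json.loads("""
{
  "path": "./views/54cadb86f3db4aa6920f673aeff0d1e3/026.png",
  "caption": "The rocking chair in the image is made of metal and has a green cushion on it.",
  "obj_id": 3177,
  "img_id": 317726
}
""")

def group_by_object(records):
    """Group image paths by obj_id so all views of one object sit together.

    Illustrative helper: multi-view sampling typically needs every rendered
    view of the same object, and obj_id is the shared key across views.
    """
    groups = {}
    for rec in records:
        groups.setdefault(rec["obj_id"], []).append(rec["path"])
    return groups

views_per_object = group_by_object([sample])
```

Whether metadata.json is a single JSON array or JSON Lines is not stated on this card, so adapt the file-reading step accordingly.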
Citation
If you use this dataset, please cite:
@inproceedings{Ruan2024Omniview,
  title={Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models},
  author={Ruan, Shouwei and Dong, Yinpeng and Liu, Hanqing and Huang, Yao and Su, Hang and Wei, Xingxing},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}