The MVCap‑4M dataset is a large‑scale multi‑view image‑text pair dataset for studying viewpoint invariance in vision‑language pretraining (VLP) models. It contains over 4.6 million multi‑view image‑text pairs covering more than 100,000 objects. The dataset combines multiple 3D asset collections with real‑world multi‑view data and renders extensive multi‑view images. Captions are generated automatically with visual large language models (VLLMs), yielding semantically rich descriptions, and a class‑guided prompting strategy ensures category consistency across viewpoints.
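To illustrate the class‑guided prompting idea, the sketch below embeds the object's category label into the caption prompt so a VLLM produces descriptions that stay anchored to the same class across viewpoints. The function name, prompt wording, and file names are illustrative assumptions, not the dataset's actual schema or prompts.

```python
# Hypothetical sketch of class-guided prompting: the object's category
# is embedded in the caption prompt so the captioning VLLM keeps its
# descriptions consistent with the class across all rendered viewpoints.
# Prompt wording and field names are assumptions for illustration only.

def class_guided_prompt(category: str) -> str:
    """Build a caption prompt that anchors the VLLM to the object class."""
    return (
        f"This image shows a {category} from one viewpoint. "
        f"Describe the {category}'s appearance in detail, "
        f"keeping the description consistent with its category."
    )

# One multi-view object could then yield several image-text pairs,
# all sharing the same class-guided prompt:
views = ["render_000.png", "render_045.png", "render_090.png"]
category = "office chair"
pairs = [(img, class_guided_prompt(category)) for img in views]
```

Because every viewpoint of an object shares the same category‑anchored prompt, the generated captions agree on what the object is even when its appearance varies strongly between views.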