
MVCap-4M

The MVCap‑4M dataset is a large‑scale multi‑view image‑text pair dataset designed for studying viewpoint invariance in vision‑language pretraining (VLP) models. It contains over 4.6 million multi‑view image‑text pairs covering more than 100,000 objects. The dataset combines multiple 3D assets with real‑world multi‑view data, renders extensive multi‑view images, and employs visual large language models (VLLMs) for automatic caption generation, yielding semantically rich descriptions. A class‑guided prompting strategy ensures category consistency across viewpoints.

Updated: 7/4/2024
Organization: huggingface

Description

MVCap‑4M Dataset Overview

Dataset Information

  • Name: MVCap‑4M
  • Language: English
  • Task Categories:
    • Zero‑Shot Classification
    • Feature Extraction
  • Scale: 1M < n < 10M
  • Configuration:
    • Default configuration
    • Data files:
      • Training set: metadata.json

Dataset Description

MVCap‑4M is a large‑scale dataset expressly designed for research on viewpoint invariance in vision‑language pretraining models. It comprises over 4.6 million multi‑view image‑text pairs involving more than 100,000 objects. The dataset was constructed by integrating various 3D assets and real‑world multi‑view data, using visual large language models (VLLMs) to automatically generate captions with rich semantics. To maintain category consistency across viewpoints, a class‑guided prompting strategy is applied.
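The card does not spell out the class‑guided prompting strategy itself. As a rough illustration only, a prompt of this kind might embed the object's known category into the captioning instruction so the VLLM names the same class from every viewpoint. The template and function below are hypothetical sketches, not the authors' actual prompts:

```python
def build_class_guided_prompt(category: str) -> str:
    """Hypothetical class-guided captioning prompt: the object's category
    is embedded in the instruction so that captions generated from
    different viewpoints stay consistent about what the object is."""
    return (
        f"This image shows a {category} seen from one viewpoint. "
        f"Describe the {category}'s appearance, materials, and colors "
        "in one sentence."
    )

prompt = build_class_guided_prompt("rocking chair")
```

Conditioning every viewpoint's prompt on the same category label is one plausible way to obtain captions like the `"rocking chair"` example shown in the metadata below, where the object class is stable even though the rendered view changes.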

Data File Structure

  • metadata.json: Stores each image sample’s path, caption, object ID, and image ID.

    {
        "path": "./views/54cadb86f3db4aa6920f673aeff0d1e3/026.png",
        "caption": "The rocking chair in the image is made of metal and has a green cushion on it.",
        "obj_id": 3177,
        "img_id": 317726
    }
    
  • Source Multi‑View Images: Sampled from three existing 3D datasets.

    • Objaverse‑80k: stored in /views
    • IM3D: stored in /im3d
    • MVImgNet: stored in /mvimgnet

Citation

If you use this dataset, please cite:

@inproceedings{Ruan2024Omniview,
  title={Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models},
  author={Ruan, Shouwei and Dong, Yinpeng and Liu, Hanqing and Huang, Yao and Su, Hang and Wei, Xingxing},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}



Topics

Vision-Language Pretraining
Multi-View Image Processing

