
MVCap-4M

The MVCap‑4M dataset is a large‑scale multi‑view image‑text pair dataset designed for studying viewpoint invariance in vision‑language pretraining (VLP) models. It contains over 4.6 million multi‑view image‑text pairs covering more than 100,000 objects. The dataset combines multiple 3D assets with real‑world multi‑view data, renders extensive multi‑view images, and employs visual large language models (VLLMs) for automatic caption generation, yielding semantically rich descriptions. A class‑guided prompting strategy ensures category consistency across viewpoints.

Updated: 7/4/2024
Organization: huggingface

Description

MVCap‑4M Dataset Overview

Dataset Information

  • Name: MVCap‑4M
  • Language: English
  • Task Categories:
    • Zero‑Shot Classification
    • Feature Extraction
  • Scale: 1M < n < 10M
  • Configuration:
    • Default configuration
    • Data files:
      • Training set: metadata.json

Dataset Description

MVCap‑4M is a large‑scale dataset expressly designed for research on viewpoint invariance in vision‑language pretraining models. It comprises over 4.6 million multi‑view image‑text pairs involving more than 100,000 objects. The dataset was constructed by integrating various 3D assets and real‑world multi‑view data, using visual large language models (VLLMs) to automatically generate captions with rich semantics. To maintain category consistency across viewpoints, a class‑guided prompting strategy is applied.
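The card does not spell out the class‑guided prompting strategy itself. As a rough illustration only, a prompt of this kind might embed the object's known category into the captioning instruction so the VLLM names the same class from every viewpoint. The template and function below are hypothetical sketches, not the authors' actual prompts:

```python
def build_class_guided_prompt(category: str) -> str:
    """Hypothetical class-guided captioning prompt: the object's category
    is embedded in the instruction so that captions generated from
    different viewpoints stay consistent about what the object is."""
    return (
        f"This image shows a {category} seen from one viewpoint. "
        f"Describe the {category}'s appearance, materials, and colors "
        "in one sentence."
    )

prompt = build_class_guided_prompt("rocking chair")
```

Conditioning every viewpoint's prompt on the same category label is one plausible way to obtain captions like the `"rocking chair"` example shown in the metadata below, where the object class is stable even though the rendered view changes.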

Data File Structure

  • metadata.json: Stores each image sample’s path, caption, object ID, and image ID.

    {
        "path": "./views/54cadb86f3db4aa6920f673aeff0d1e3/026.png",
        "caption": "The rocking chair in the image is made of metal and has a green cushion on it.",
        "obj_id": 3177,
        "img_id": 317726
    }
    
  • Source Multi‑View Images: Sampled from three existing 3D datasets.

    • Objaverse‑80k: stored in /views
    • IM3D: stored in /im3d
    • MVImgNet: stored in /mvimgnet

Citation

If you use this dataset, please cite:

@inproceedings{Ruan2024Omniview,
  title={Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models},
  author={Ruan, Shouwei and Dong, Yinpeng and Liu, Hanqing and Huang, Yao and Su, Hang and Wei, Xingxing},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}



Topics

Vision-Language Pretraining
Multi-View Image Processing

