
PixelProse

PixelProse is a large-scale dataset of roughly 16.9 million synthetically generated image captions created with the Gemini 1.0 Pro Vision model. Each record includes fields such as a unique image identifier, the image URL, the captioning model, and the caption text, and the dataset supports multiple download and usage options.

Source
huggingface
Created
Jun 14, 2024
Updated
Jun 18, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

PixelProse Dataset Overview

Basic Information

  • License: cc-by-4.0
  • Task Categories:
    • Image‑to‑Text
    • Text‑to‑Image
    • Visual Question Answering
  • Language: English
  • Tag: croissant
  • Name: PixelProse
  • Size Category: 10M<n<100M

Configuration

  • Default Config:
    • Training Set: data/vlm_captions_*.parquet
    • CC12M: data/vlm_captions_cc12m_*.parquet
    • CommonPool: data/vlm_captions_common-pool_*.parquet
    • RedCaps: data/vlm_captions_redcaps_*.parquet

Details

  • Total Image‑Text Pairs: 16,896,214 (16.9M)
    • CommonPool: 6,538,898 (6.5M)
    • CC12M: 9,066,455 (9.1M)
    • RedCaps: 1,290,861 (1.3M)
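The three subset counts above sum exactly to the reported total, which can be verified directly:

```python
# Subset sizes as listed on the dataset card
counts = {
    "CommonPool": 6_538_898,
    "CC12M": 9_066_455,
    "RedCaps": 1_290_861,
}

total = sum(counts.values())
print(total)  # 16896214, matching the reported 16,896,214 image-text pairs
```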

Data Download

  • Parquet Files:
    • Via Git LFS:
      git lfs install
      git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
      
    • Via HuggingFace API:
      from datasets import load_dataset
      ds = load_dataset("tomg-group-umd/pixelprose")
      
    • Direct download: browse the repository's data directory and download the required parquet files.
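Because each subset has its own shard-name pattern (see the Configuration section), you can select one subset without touching the others. A minimal sketch of how those glob patterns map shard filenames to subsets, using hypothetical shard names (real shard names in the repository may differ):

```python
from fnmatch import fnmatch

# Glob patterns from the dataset's default configuration
patterns = {
    "cc12m": "data/vlm_captions_cc12m_*.parquet",
    "commonpool": "data/vlm_captions_common-pool_*.parquet",
    "redcaps": "data/vlm_captions_redcaps_*.parquet",
}

# Hypothetical shard names standing in for real repository files
files = [
    "data/vlm_captions_cc12m_0000.parquet",
    "data/vlm_captions_common-pool_0001.parquet",
    "data/vlm_captions_redcaps_0000.parquet",
]

# Group files by the subset whose pattern they match
matches = {name: [f for f in files if fnmatch(f, pat)]
           for name, pat in patterns.items()}
print(matches["cc12m"])  # ['data/vlm_captions_cc12m_0000.parquet']
```

The same glob can be passed to the HuggingFace loader to fetch a single subset, e.g. `load_dataset("tomg-group-umd/pixelprose", data_files=patterns["cc12m"])`, avoiding a download of all 16.9M records.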

Columns

  • uid: unique image identifier
  • url: image URL
  • key: image‑related key
  • status: status returned by vlm_model
  • original_caption: original inherited caption
  • vlm_model: model used for captioning
  • vlm_caption: dense caption from PixelProse
  • toxicity: score for general harmful behavior
  • severe_toxicity: score for extremely harmful or abusive language
  • obscene: score for obscene or inappropriate language
  • identity_attack: score for language targeting individuals or groups based on identity
  • insult: score for language meant to insult or demean
  • threat: score for language conveying threats of harm
  • sexual_explicit: score for language containing explicit sexual content
  • watermark_class_id: watermark class (0 = watermarked image, 1 = non‑watermarked, 2 = non‑watermarked with text)
  • watermark_class_score: prediction scores for each watermark class, range [0, 1]
  • aesthetic_score: aesthetic rating, range [0, 10]
  • error_message: error message returned by vlm_model
  • width / height: image dimensions used for running vlm_model
  • original_width / original_height: original image dimensions
  • exif: EXIF metadata of the image file
  • sha256: SHA256 hash of the image file
  • image_id, author, subreddit, score: attributes inherited from RedCaps (not available in CC12M and CommonPool)
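The toxicity, watermark, and aesthetic columns are typically used to filter the dataset before training. A minimal sketch of such a filter over the columns described above; the sample rows are synthetic stand-ins, not real data:

```python
# Synthetic sample records using a few of the documented columns
rows = [
    {"uid": "a1", "toxicity": 0.02, "watermark_class_id": 1, "aesthetic_score": 6.3},
    {"uid": "b2", "toxicity": 0.91, "watermark_class_id": 0, "aesthetic_score": 4.0},
    {"uid": "c3", "toxicity": 0.05, "watermark_class_id": 0, "aesthetic_score": 7.1},
]

def keep(row, max_toxicity=0.1, exclude_watermarked=True):
    """Keep low-toxicity rows; optionally drop watermarked images (class 0)."""
    if row["toxicity"] > max_toxicity:
        return False
    if exclude_watermarked and row["watermark_class_id"] == 0:
        return False
    return True

clean = [r["uid"] for r in rows if keep(r)]
print(clean)  # ['a1']
```

The thresholds here (0.1 toxicity, excluding watermark class 0) are illustrative defaults, not values recommended by the dataset authors.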