
PixelProse

PixelProse is a large-scale dataset of roughly 16.9 million synthetically generated image captions created with the Gemini 1.0 Pro Vision model. Each record includes fields such as a unique image identifier, the image URL, the captioning model, and the caption text, and the dataset supports multiple download and usage options.

Source
huggingface
Created
Jun 14, 2024
Updated
Jun 18, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

PixelProse Dataset Overview

Basic Information

  • License: cc-by-4.0
  • Task Categories:
    • Image‑to‑Text
    • Text‑to‑Image
    • Visual Question Answering
  • Language: English
  • Tag: croissant
  • Name: PixelProse
  • Size Category: 10M<n<100M

Configuration

  • Default Config:
    • Training Set: data/vlm_captions_*.parquet
    • CC12M: data/vlm_captions_cc12m_*.parquet
    • CommonPool: data/vlm_captions_common-pool_*.parquet
    • RedCaps: data/vlm_captions_redcaps_*.parquet

Details

  • Total Image‑Text Pairs: 16,896,214 (16.9M)
    • CommonPool: 6,538,898 (6.5M)
    • CC12M: 9,066,455 (9.1M)
    • RedCaps: 1,290,861 (1.3M)
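The three subset counts above sum exactly to the reported total, which can be verified directly:

```python
# Subset sizes as listed on the dataset card
counts = {
    "CommonPool": 6_538_898,
    "CC12M": 9_066_455,
    "RedCaps": 1_290_861,
}

total = sum(counts.values())
print(total)  # 16896214, matching the reported 16,896,214 image-text pairs
```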

Data Download

  • Parquet Files:
    • Via Git LFS:
      git lfs install
      git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
      
    • Via HuggingFace API:
      from datasets import load_dataset
      ds = load_dataset("tomg-group-umd/pixelprose")
      
    • Direct download: browse the repository's data directory and download the required parquet files.
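Because each subset has its own shard-name pattern (see the Configuration section), you can select one subset without touching the others. A minimal sketch of how those glob patterns map shard filenames to subsets, using hypothetical shard names (real shard names in the repository may differ):

```python
from fnmatch import fnmatch

# Glob patterns from the dataset's default configuration
patterns = {
    "cc12m": "data/vlm_captions_cc12m_*.parquet",
    "commonpool": "data/vlm_captions_common-pool_*.parquet",
    "redcaps": "data/vlm_captions_redcaps_*.parquet",
}

# Hypothetical shard names standing in for real repository files
files = [
    "data/vlm_captions_cc12m_0000.parquet",
    "data/vlm_captions_common-pool_0001.parquet",
    "data/vlm_captions_redcaps_0000.parquet",
]

# Group files by the subset whose pattern they match
matches = {name: [f for f in files if fnmatch(f, pat)]
           for name, pat in patterns.items()}
print(matches["cc12m"])  # ['data/vlm_captions_cc12m_0000.parquet']
```

The same glob can be passed to the HuggingFace loader to fetch a single subset, e.g. `load_dataset("tomg-group-umd/pixelprose", data_files=patterns["cc12m"])`, avoiding a download of all 16.9M records.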

Columns

  • uid: unique image identifier
  • url: image URL
  • key: image‑related key
  • status: status returned by vlm_model
  • original_caption: original inherited caption
  • vlm_model: model used for captioning
  • vlm_caption: dense caption from PixelProse
  • toxicity: score for general harmful behavior
  • severe_toxicity: score for extremely harmful or abusive language
  • obscene: score for obscene or inappropriate language
  • identity_attack: score for language targeting individuals or groups based on identity
  • insult: score for language meant to insult or demean
  • threat: score for language conveying threats of harm
  • sexual_explicit: score for language containing explicit sexual content
  • watermark_class_id: watermark class (0 = watermarked image, 1 = non‑watermarked, 2 = non‑watermarked with text)
  • watermark_class_score: prediction scores for each watermark class, range [0, 1]
  • aesthetic_score: aesthetic rating, range [0, 10]
  • error_message: error message returned by vlm_model
  • width / height: image dimensions used for running vlm_model
  • original_width / original_height: original image dimensions
  • exif: EXIF metadata of the image file
  • sha256: SHA256 hash of the image file
  • image_id, author, subreddit, score: attributes inherited from RedCaps (not available in CC12M and CommonPool)
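The toxicity, watermark, and aesthetic columns are typically used to filter the dataset before training. A minimal sketch of such a filter over the columns described above; the sample rows are synthetic stand-ins, not real data:

```python
# Synthetic sample records using a few of the documented columns
rows = [
    {"uid": "a1", "toxicity": 0.02, "watermark_class_id": 1, "aesthetic_score": 6.3},
    {"uid": "b2", "toxicity": 0.91, "watermark_class_id": 0, "aesthetic_score": 4.0},
    {"uid": "c3", "toxicity": 0.05, "watermark_class_id": 0, "aesthetic_score": 7.1},
]

def keep(row, max_toxicity=0.1, exclude_watermarked=True):
    """Keep low-toxicity rows; optionally drop watermarked images (class 0)."""
    if row["toxicity"] > max_toxicity:
        return False
    if exclude_watermarked and row["watermark_class_id"] == 0:
        return False
    return True

clean = [r["uid"] for r in rows if keep(r)]
print(clean)  # ['a1']
```

The thresholds here (0.1 toxicity, excluding watermark class 0) are illustrative defaults, not values recommended by the dataset authors.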