Dataset asset · Open Source Community · Natural Language Processing · Image Processing
PixelProse
PixelProse is a comprehensive dataset of 16 million synthetically generated image captions created with the Gemini 1.0 Pro Vision model. Each record carries rich metadata fields such as a unique image identifier, the image URL, the captioning model, and the caption text, and the dataset supports multiple download and usage options.
Source
huggingface
Created
Jun 14, 2024
Updated
Jun 18, 2024
Overview
Dataset description and usage context
PixelProse Dataset Overview
Basic Information
- License: cc-by-4.0
- Task Categories:
- Image‑to‑Text
- Text‑to‑Image
- Visual Question Answering
- Language: English
- Tag: croissant
- Name: PixelProse
- Size Category: 10M<n<100M
Configuration
- Default Config:
- Training Set: data/vlm_captions_*.parquet
- CC12M: data/vlm_captions_cc12m_*.parquet
- CommonPool: data/vlm_captions_common-pool_*.parquet
- RedCaps: data/vlm_captions_redcaps_*.parquet
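The per-config glob patterns above can also be used to route individual shard filenames to their subset. A stdlib-only sketch (the example filename is hypothetical; only the subset-specific patterns are listed, since the default pattern matches every shard):

```python
import fnmatch

# Subset-specific glob patterns from the Configuration section.
CONFIG_PATTERNS = [
    ("cc12m", "data/vlm_captions_cc12m_*.parquet"),
    ("common-pool", "data/vlm_captions_common-pool_*.parquet"),
    ("redcaps", "data/vlm_captions_redcaps_*.parquet"),
]

def subset_of(path):
    """Return the subset name for a shard path, or None if unrecognized."""
    for name, pattern in CONFIG_PATTERNS:
        if fnmatch.fnmatch(path, pattern):
            return name
    return None

print(subset_of("data/vlm_captions_cc12m_0000.parquet"))  # cc12m
```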
Details
- Total Image‑Text Pairs: 16,896,214 (16.9M)
- CommonPool: 6,538,898 (6.5M)
- CC12M: 9,066,455 (9.1M)
- RedCaps: 1,290,861 (1.3M)
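The three subset counts above should sum to the stated total; a quick check:

```python
# Subset sizes from the Details section.
subsets = {"CommonPool": 6_538_898, "CC12M": 9_066_455, "RedCaps": 1_290_861}

total = sum(subsets.values())
print(total)  # 16896214, matching the stated 16,896,214 image-text pairs
```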
Data Download
- Parquet Files:
  - Via Git LFS:

        git lfs install
        git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose

  - Via HuggingFace API:

        from datasets import load_dataset
        ds = load_dataset("tomg-group-umd/pixelprose")

  - Direct Link: browse the data directory and download the required files.
Columns
- uid: unique image identifier
- url: image URL
- key: image-related key
- status: status returned by vlm_model
- original_caption: original inherited caption
- vlm_model: model used for captioning
- vlm_caption: dense caption from PixelProse
- toxicity: score for general harmful behavior
- severe_toxicity: score for extremely harmful or abusive language
- obscene: score for obscene or inappropriate language
- identity_attack: score for language targeting individuals or groups based on identity
- insult: score for language meant to insult or demean
- threat: score for language conveying threats of harm
- sexual_explicit: score for language containing explicit sexual content
- watermark_class_id: watermark class (0 = watermarked image, 1 = non-watermarked, 2 = non-watermarked with text)
- watermark_class_score: prediction scores for each watermark class, range [0, 1]
- aesthetic_score: aesthetic rating, range [0, 10]
- error_message: error message returned by vlm_model
- width / height: image dimensions used for running vlm_model
- original_width / original_height: original image dimensions
- exif: EXIF metadata of the image file
- sha256: SHA256 hash of the image file
- image_id, author, subreddit, score: attributes inherited from RedCaps (not available in CC12M and CommonPool)
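A common use of these columns is filtering for clean, high-quality pairs. A minimal sketch over plain dicts (the thresholds and sample rows are illustrative, not taken from the dataset):

```python
def keep(row, max_toxicity=0.2, min_aesthetic=5.0):
    # Keep non-watermarked images (watermark_class_id 1 or 2) with low
    # toxicity, an acceptable aesthetic score, and no VLM error.
    return (row["watermark_class_id"] in (1, 2)
            and row["toxicity"] <= max_toxicity
            and row["aesthetic_score"] >= min_aesthetic
            and not row["error_message"])

rows = [  # hypothetical records mimicking the column schema
    {"uid": "a", "toxicity": 0.01, "aesthetic_score": 6.3,
     "watermark_class_id": 1, "error_message": ""},
    {"uid": "b", "toxicity": 0.90, "aesthetic_score": 7.0,
     "watermark_class_id": 1, "error_message": ""},
    {"uid": "c", "toxicity": 0.02, "aesthetic_score": 4.1,
     "watermark_class_id": 0, "error_message": ""},
]

kept = [r["uid"] for r in rows if keep(r)]
print(kept)  # ['a']
```

The same predicate can be passed to the HuggingFace datasets filter method once the dataset is loaded.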