pixelprose
PixelProse is a dataset of more than 16 million synthetically generated image captions created with the Gemini 1.0 Pro Vision model. Each record includes fields such as a unique image identifier, the image URL, the captioning model, and the caption text. The data can be downloaded via Git LFS, the HuggingFace API, or a direct link.
Updated 6/18/2024
huggingface
Description
PixelProse Dataset Overview
Basic Information
- License: cc-by-4.0
- Task Categories:
  - Image‑to‑Text
  - Text‑to‑Image
  - Visual Question Answering
- Language: English
- Tag: croissant
- Name: PixelProse
- Size Category: 10M<n<100M
Configuration
- Default Config:
  - Training Set: data/vlm_captions_*.parquet
  - CC12M: data/vlm_captions_cc12m_*.parquet
  - CommonPool: data/vlm_captions_common-pool_*.parquet
  - RedCaps: data/vlm_captions_redcaps_*.parquet
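The paths above are shell-style glob patterns, so the Training Set pattern also matches every file covered by the three source-specific patterns. A minimal sketch of that overlap, using Python's `fnmatch`; the split names used as dictionary keys and the shard file names are assumptions for illustration, and the real names in the repository may differ:

```python
from fnmatch import fnmatch

# Patterns copied from the configuration above.
# The keys are assumed split names, not confirmed by the source.
PATTERNS = {
    "train": "data/vlm_captions_*.parquet",
    "cc12m": "data/vlm_captions_cc12m_*.parquet",
    "commonpool": "data/vlm_captions_common-pool_*.parquet",
    "redcaps": "data/vlm_captions_redcaps_*.parquet",
}

# Hypothetical shard names, for illustration only.
EXAMPLE_FILES = [
    "data/vlm_captions_cc12m_00.parquet",
    "data/vlm_captions_common-pool_00.parquet",
    "data/vlm_captions_redcaps_00.parquet",
]


def matching_splits(path):
    """Return the names of all patterns that match the given file path."""
    return [name for name, pattern in PATTERNS.items() if fnmatch(path, pattern)]


for path in EXAMPLE_FILES:
    print(path, "->", matching_splits(path))
```

Under these assumptions, each example file matches both the training pattern and exactly one source-specific pattern.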
Details
- Total Image‑Text Pairs: 16,896,214 (16.9M)
  - CommonPool: 6,538,898 (6.5M)
  - CC12M: 9,066,455 (9.1M)
  - RedCaps: 1,290,861 (1.3M)
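The per-source counts can be checked against the reported total with a few lines of Python. This is only an arithmetic sanity check on the figures listed above; it does not read the dataset itself:

```python
# Pair counts copied from the Details list above.
COUNTS = {
    "CommonPool": 6_538_898,
    "CC12M": 9_066_455,
    "RedCaps": 1_290_861,
}
REPORTED_TOTAL = 16_896_214

total = sum(COUNTS.values())
assert total == REPORTED_TOTAL, f"expected {REPORTED_TOTAL}, got {total}"

# Print each source's share of the full dataset.
for name, count in COUNTS.items():
    print(f"{name}: {count:,} pairs ({count / total:.1%} of total)")
```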
Data Download
- Parquet Files:
  - Via Git LFS:
      git lfs install
      git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
  - Via HuggingFace API:
      from datasets import load_dataset
      ds = load_dataset("tomg-group-umd/pixelprose")
  - Direct Link: access the data directory to download required files.
Columns
- uid: unique image identifier
- url: image URL
- key: image‑related key
- status: status returned by vlm_model
- original_caption: original inherited caption
- vlm_model: model used for captioning
- vlm_caption: dense caption from PixelProse
- toxicity: score for general harmful behavior
- severe_toxicity: score for extremely harmful or abusive language
- obscene: score for obscene or inappropriate language
- identity_attack: score for language targeting individuals or groups based on identity
- insult: score for language meant to insult or demean
- threat: score for language conveying threats of harm
- sexual_explicit: score for language containing explicit sexual content
- watermark_class_id: watermark class (0 = watermarked image, 1 = non‑watermarked, 2 = non‑watermarked with text)
- watermark_class_score: prediction scores for each watermark class, range [0, 1]
- aesthetic_score: aesthetic rating, range [0, 10]
- error_message: error message returned by vlm_model
- width / height: image dimensions used for running vlm_model
- original_width / original_height: original image dimensions
- exif: EXIF metadata of the image file
- sha256: SHA256 hash of the image file
- image_id, author, subreddit, score: attributes inherited from RedCaps (not available in CC12M and CommonPool)
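The score columns lend themselves to row-level filtering. The sketch below is one possible way to do that, not a method described by the dataset authors: the helper names `keep_row` and `filter_rows`, the threshold values, and the sample rows are all hypothetical, and only the column names and the watermark class meanings come from the list above.

```python
def keep_row(row, max_toxicity=0.1, min_aesthetic=5.0):
    """Return True if a row passes these illustrative quality checks.

    The thresholds are arbitrary example values, not recommendations.
    """
    if not row.get("vlm_caption"):
        return False  # no caption text available
    if row.get("watermark_class_id") == 0:
        return False  # class 0 = watermarked image
    toxicity = row.get("toxicity")
    if toxicity is None or toxicity > max_toxicity:
        return False
    aesthetic = row.get("aesthetic_score")
    if aesthetic is None or aesthetic < min_aesthetic:
        return False
    return True


def filter_rows(rows, **thresholds):
    """Lazily yield the rows that pass keep_row."""
    return (row for row in rows if keep_row(row, **thresholds))


# Made-up rows for illustration; not taken from the dataset.
sample = [
    {"uid": "a", "vlm_caption": "A red bicycle.", "watermark_class_id": 1,
     "toxicity": 0.01, "aesthetic_score": 6.2},
    {"uid": "b", "vlm_caption": "A stock photo.", "watermark_class_id": 0,
     "toxicity": 0.01, "aesthetic_score": 7.0},
    {"uid": "c", "vlm_caption": "A street sign.", "watermark_class_id": 2,
     "toxicity": 0.50, "aesthetic_score": 6.0},
    {"uid": "d", "vlm_caption": "A blurry wall.", "watermark_class_id": 1,
     "toxicity": 0.02, "aesthetic_score": 3.0},
]

kept = [row["uid"] for row in filter_rows(sample)]
print(kept)
```

Because `filter_rows` accepts any iterable of dict-like rows, it should also work on rows streamed with `load_dataset("tomg-group-umd/pixelprose", streaming=True)`, assuming the columns are named as listed above.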
Topics
Image Processing
Natural Language Processing
Source
Organization: huggingface
Created: 6/14/2024