| WebLI | 10B | 12B | 12B | PaLI: A Jointly-Scaled Multilingual Language-Image Model | Link | Captions(109 languages) |
| LAION-5B | 5.9B | 5.9B | 5.9B | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions(Multiple languages) |
| LAION-en | 2.3B | 2.3B | 2.3B | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions(English) |
| ALIGN | 1.8B | 1.8B | 1.8B | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Link | Captions(English) |
| DataComp | 1.4B | 1.4B | 1.4B | DataComp: In search of the next generation of multimodal datasets | Link | Captions(English) |
| COYO | 747M | 747M | 747M | COYO-700M: Large-scale Image-Text Pair Dataset | Link | Captions(English) |
| LAION-COCO | 600M | 600M | 600M | LAION-COCO: 600M Synthetic Captions from LAION2B-EN | Link | Captions(English) |
| LAION-400M | 400M | 400M | 400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Link | Captions(English) |
| Episodic WebLI | 400M | 400M | 400M | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - | Captions(English) |
| CLIP | 400M | 400M | 400M | Learning Transferable Visual Models From Natural Language Supervision | Link | Captions(English) |
| LTIP | 312M | 312M | 312M | Flamingo: a Visual Language Model for Few-Shot Learning | - | Captions(English) |
| FILIP | 300M | 300M | 300M | FILIP: Fine-grained Interactive Language-Image Pre-Training | - | Captions(English) |
| LAION-zh | 142M | 142M | 142M | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions(Chinese) |
| Obelics | 353M | 115B | 141M | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Link | Interleaved image-text web documents |
| MMC4 | 571M | 43B | 101.2M | Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | Link | Interleaved image-text |
| Wukong | 101M | 101M | 101M | Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework | Link | Captions(Chinese) |
| M3W | 185M | 182GB | 43.3M | Flamingo: a Visual Language Model for Few-Shot Learning | - | Interleaved image-text(English) |
| WIT | 11.5M | 37.6M | 37.6M | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Link | Captions(Multiple languages) |
| GQA | 113K | 22M | 22M | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Link | Visual Reasoning and Compositional Question Answering(English) |
| CC12M | 12.4M | 12.4M | 12.4M | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | Link | Captions(English) |
| Red Caps | 12M | 12M | 12M | RedCaps: Web-curated image-text data created by the people, for the people | Link | Captions(English) |
| Visual Genome | 108K | 4.5M | 4.5M | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Link | Annotations(English) |
| DVQA | 300K | 3.5M | 3.5M | DVQA: Understanding Data Visualizations via Question Answering | Link | Question answering(English) |
| CC3M | 3.3M | 3.3M | 3.3M | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Link | Captions(English) |
| MS-COCO | 328K | 2.5M | 2.5M | Microsoft COCO: Common Objects in Context | Link | Object detection,Segmentation,Captions(English) |
| AI Challenger Captions | 300K | 1.5M | 1.5M | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Link | Captions(Chinese) |
| VQA v2 | 265K | 1.4M | 1.4M | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Link | Visual question answering(English) |
| SBU(Image Caption) | 1M | 1M | 1M | Im2Text: Describing Images Using 1 Million Captioned Photographs | Link | Captions(English) |
| OCR-VQA | 207K | 1M | 1M | OCR-VQA: Visual Question Answering by Reading Text in Images | Link | Visual question answering(English) |
| COCO Caption | 164K | 1M | 1M | Microsoft COCO Captions: Data Collection and Evaluation Server | Link | Captions(English) |
| CC595k | 595K | 595K | 595K | Visual Instruction Tuning | Link | Captions(English) |
| Visual-7W | 47.3K | 328K | 328K | Visual7W: Grounded Question Answering in Images | - | Visual question answering(English) |
| Flickr30k | 31K | 158K | 158K | From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions | Link | Annotations(English) |
| Text Captions | 28K | 145K | 145K | TextCaps: a Dataset for Image Captioning with Reading Comprehension | - | Captions(English) |
| RefCOCO | 20K | 142K | 142K | ReferItGame: Referring to Objects in Photographs of Natural Scenes | - | Referring expressions(English) |