Open Source Community
LAION
Publisher of ultra-large image-text datasets such as LAION-400M and LAION-5B, along with various CLIP-related datasets.
Updated 5/1/2024
Description
Dataset Overview
Generative AI (Image Datasets)
General Image Datasets
| Name | Description | URL |
|---|---|---|
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, along with various CLIP-related datasets and models. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine‑learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| laion-high-resolution-chinese | Subset of the Laion5B‑high‑resolution multimodal dataset, about 2.66 M image‑text pairs (Chinese only). | https://huggingface.co/datasets/wanng/laion-high-resolution-chinese |
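Conceptual Captions, like the LAION releases, is distributed not as images but as a TSV of (caption, image URL) rows that you download yourself. A minimal sketch of parsing such a TSV (the `caption<TAB>url` column order matches the public Train_GCC-training.tsv release, but treat it as an assumption and verify against your copy):

```python
import csv

def parse_caption_tsv(lines):
    """Parse Conceptual-Captions-style TSV rows into (caption, url) pairs.

    Assumes each row is `caption<TAB>image_url`; rows without exactly
    two fields are skipped rather than raising.
    """
    pairs = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) == 2:
            caption, url = row
            pairs.append((caption.strip(), url.strip()))
    return pairs

# Usage: pass any iterable of lines, e.g. an open file handle.
sample = ["a dog running on the beach\thttp://example.com/dog.jpg"]
print(parse_caption_tsv(sample))
```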
Virtual Try‑On Datasets
| Name | Description | URL |
|---|---|---|
| StreetTryOn | Outdoor (in-the-wild) virtual try-on dataset with 12,364 training images and 2,089 validation images. | https://github.com/cuiaiyu/street-tryon-benchmark |
| CLOTH4D | Large‑scale 4D dataset containing 3D human bodies, clothing, texture models, SMPL pose parameters, and high‑resolution images. | https://github.com/AemikaChow/CLOTH4D |
| DressCode | High-resolution, multi-category virtual try-on dataset of paired garment and person images (upper-body, lower-body, and dresses). | https://docs.google.com/forms/d/e/1FAIpQLSeWVzxWcj3JSALtthuw-2QDAbf2ymiK37sA4pRQD4tZz2vqsw/viewform https://arxiv.org/pdf/2204.08532.pdf |
| VITON‑HD | High‑resolution virtual try‑on dataset with 13,679 pairs of 1024 × 768 images. | https://www.dropbox.com/s/10bfat0kg4si1bu/zalando-hd-resized.zip?dl=0 https://psh01087.github.io/VITON-HD/ |
| VITON | First image-based virtual try-on dataset, containing 16,253 image pairs. | https://drive.google.com/file/d/1MxCUvKxejnwWnoZ-KoCyMCXo3TLhRuTo/view http://openaccess.thecvf.com/content_cvpr_2018/papers/Han_VITON_An_Image-Based_CVPR_2018_paper.pdf |
| MPV | Multi‑pose virtual try‑on dataset with 35,687/13,524 person/clothing images. | https://drive.google.com/drive/folders/1e3ThRpSj8j9PaCUw8IrqzKPDVJK_grcA https://arxiv.org/abs/1902.11026 |
| Deep Fashion3D | Large‑scale 3D clothing dataset with diverse styles and rich annotations. | https://arxiv.org/abs/2003.12753 |
| DeepFashion MultiModal | Multimodal virtual try‑on dataset containing unpaired person and clothing images. | https://github.com/yumingj/DeepFashion-MultiModal |
| Digital Wardrobe | High‑quality 3D clothing dataset from real‑consumer photos with 2D‑3D aligned annotations. | http://virtualhumans.mpi-inf.mpg.de/mgn/ |
| TailorNet Dataset | Paired images with consistent geometry and pose of 3D humans wearing clothes for garment transfer. | https://github.com/zycliao/TailorNet_dataset http://virtualhumans.mpi-inf.mpg.de/tailornet/ |
| CLOTH3D | First 3D clothing dataset containing digital garments and 3D human models. | https://arxiv.org/abs/1912.02792 |
| 3DPeople | Dataset of 80 3D human subjects in varied clothing and poses. | https://www.albertpumarola.com/research/3DPeople/index.html |
| THUman Dataset | High‑resolution 3D textured human dataset with 7,000+ models and 200+ subjects. | http://www.liuyebin.com/deephuman/deephuman.html |
| Garment Dataset | 3D clothing dataset with digitally created garments suitable for real humans. | http://geometry.cs.ucl.ac.uk/projects/2018/garment_design/ |
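Several of the 2D try-on datasets above (VITON, VITON-HD, MPV) ship person and garment images as paired files. A minimal sketch of enumerating such pairs, assuming a VITON-HD-style directory layout with `image/` and `cloth/` subfolders whose paired files share a filename stem (the layout and `.jpg` extension are assumptions, not guaranteed by every release):

```python
from pathlib import Path

def list_tryon_pairs(root):
    """Yield (person_image, cloth_image) paths that share a filename stem.

    Assumes a VITON-HD-style layout with `image/` and `cloth/`
    subdirectories; stems missing from either side are skipped.
    """
    root = Path(root)
    cloths = {p.stem: p for p in (root / "cloth").glob("*.jpg")}
    for person in sorted((root / "image").glob("*.jpg")):
        if person.stem in cloths:
            yield person, cloths[person.stem]
```

Iterating the generator gives ready-made (person, garment) path pairs to feed a data loader.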
Generative AI (Video Datasets)
General Video Datasets
| Name | Description | URL |
|---|---|---|
| (none listed) | - | - |
Multimodal Model Datasets
Pre‑training Alignment Datasets
| Name | Description | URL |
|---|---|---|
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, along with various CLIP-related datasets and models. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine‑learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| COYO‑700M | Large‑scale image‑text pair dataset for training and evaluating image‑text matching models. | https://github.com/kakaobrain/coyo-dataset/ |
| ShareGPT4V | Large image‑text dataset containing GPT‑4 generated captions to improve multimodal models. | https://arxiv.org/pdf/2311.12793.pdf |
| AS‑1B | Dataset from the All‑Seeing project, with over 1 billion regions annotated with semantic tags, question‑answer pairs, and captions for open‑world panoptic visual recognition. | https://arxiv.org/pdf/2308.01907.pdf |
| InternVid | Large‑scale video‑text dataset for multimodal understanding and generation. | https://arxiv.org/pdf/2307.06942.pdf |
| MS‑COCO | Microsoft COCO dataset for large‑scale object detection, segmentation, and captioning. | https://arxiv.org/pdf/1405.0312.pdf |
| SBU Captions | SBU captioned photo dataset containing 1 M images with user‑provided captions collected from Flickr. | https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf |
| Conceptual Captions | Cleaned web‑scraped image alt‑text dataset for training image captioning models. | https://aclanthology.org/P18-1238.pdf |
| LAION‑400M | Open, large‑scale dataset containing 400 M CLIP‑filtered image‑text pairs. | https://arxiv.org/pdf/2111.02114.pdf https://laion.ai/projects/ https://huggingface.co/laion |
| VG Captions | Visual Genome dataset linking structured image concepts with language via crowdsourced annotations. | https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf |
| Flickr30k | Flickr30k Entities dataset with 30 k images, five captions each, annotated with bounding boxes and entity mentions. | https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf |
| AI‑Caps | AI Challenger: large Chinese dataset containing millions of images and natural‑language descriptions. | https://arxiv.org/pdf/1711.06475.pdf |
| Wukong Captions | 100 M‑scale Chinese cross‑modal pre‑training benchmark dataset. | https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf |
| GRIT | Grounded multimodal language model dataset containing image‑paragraph alignments. | https://arxiv.org/pdf/2306.14824.pdf |
| Youku‑mPLUG | 10 M‑scale Chinese video‑language pre‑training dataset. | https://arxiv.org/pdf/2306.04362.pdf |
| MSR‑VTT | Large‑scale video description dataset bridging video and language. | https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf |
| Webvid10M | Large video‑text dataset for joint video‑language representation learning. | https://arxiv.org/pdf/2104.00650.pdf |
| WavCaps | WavCaps: ChatGPT‑assisted weakly‑labelled audio caption dataset. | https://arxiv.org/pdf/2303.17395.pdf |
| AISHELL‑1 | AISHELL‑1: Open Mandarin speech corpus and ASR benchmark. | https://arxiv.org/pdf/1709.05522.pdf |
| AISHELL‑2 | AISHELL‑2: Industrial‑scale Mandarin ASR dataset. | https://arxiv.org/pdf/1808.10583.pdf |
| VSDial‑CN | Chinese visual‑semantic dialogue dataset for multimodal language model research. | https://arxiv.org/pdf/2305.04160.pdf |
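LAION-400M describes itself as "CLIP-filtered": an (image, text) pair is kept only when the cosine similarity of its CLIP image and text embeddings clears a cutoff (0.3 in the LAION-400M paper). A toy sketch of that filtering rule over precomputed embeddings (the embeddings below are illustrative placeholders; a real pipeline would compute them with a CLIP model):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_filter(pairs, threshold=0.3):
    """Keep metadata of (image_emb, text_emb, meta) triples whose
    image-text cosine similarity reaches the threshold -- the
    LAION-400M filtering rule (0.3 is the paper's reported cutoff)."""
    return [meta for img, txt, meta in pairs if cosine(img, txt) >= threshold]

# Illustrative embeddings: an aligned pair survives, a mismatched one is dropped.
pairs = [
    ([0.9, 0.1], [0.8, 0.2], "dog photo / 'a dog'"),
    ([0.9, 0.1], [0.1, 0.9], "dog photo / 'a spreadsheet'"),
]
print(clip_filter(pairs))
```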
Multimodal Instruction Tuning Datasets
| Name | Description | URL |
|---|---|---|
| CogVLM‑SFT‑311K | Core alignment corpus used in the early training of CogVLM v1.0, built from ~3.5 k high‑quality samples selected from the open‑source MiniGPT‑4 data (minigpt4‑3500), then combined with LLaVA‑Instruct‑150K and machine‑translated into Chinese. | https://github.com/THUDM/CogVLM/blob/main/dataset.md |
| ALLaVA‑4V | Multimodal instruction dataset generated by GPT‑4V. | https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V |
| IDK | Visual instruction dataset built around "I don't know" (IDK) responses for hallucination mitigation. | https://github.com/ncsoft/idk |
| CAP2QA | Visual instruction dataset aligned with images. | https://github.com/ncsoft/cap2qa |
| M3DBench | Large‑scale 3D instruction tuning dataset. | https://github.com/OpenM3D/M3DBench |
| ViP‑LLaVA‑Instruct | Mix of LLaVA‑1.5 instruction data and region‑level visual prompts. | https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct |
| LVIS‑Instruct4V | Visual instruction dataset generated by self‑instruction with GPT‑4V. | https://huggingface.co/datasets/X2FD/LVIS-Instruct4V |
| ComVint | Synthetic instruction dataset for complex visual reasoning. | https://github.com/RUCAIBox/ComVint#comvint-data |
| SparklesDialogue | Machine‑generated dialogue dataset for word‑level interleaved multi‑image and text interaction, built to strengthen instruction‑following LLMs across multiple images and dialogue turns. | https://github.com/HYPJUDY/Sparkles#sparklesdialogue |
| StableLLaVA | Cheap and effective method for collecting visual instruction tuning data. | https://github.com/icoz69/StableLLAVA |
| M‑HalDetect | Dataset for training and benchmarking hallucination detection and prevention models. | Coming soon |
| MGVLID | High‑quality instruction tuning dataset containing image‑text and region‑text pairs. | - |
| BuboGPT | BuboGPT: Visual grounding in multimodal LLMs. | https://huggingface.co/datasets/magicr/BuboGPT |
| SVIT | SVIT: Expanded visual instruction tuning. | https://huggingface.co/datasets/BAAI/SVIT |
| mPLUG‑DocOwl | mPLUG‑DocOwl: Modular multimodal LLM for document understanding. | https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocLLM |
| PF‑1M | Visual instruction tuning using Polite Flamingo. | https://huggingface.co/datasets/chendelong/PF-1M/tree/main |
| ChartLlama | ChartLlama: Multimodal LLM for chart understanding and generation. | https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset |
| LLaVAR | LLaVAR: Enhanced visual instruction tuning for text‑rich image understanding. | https://llavar.github.io/#data |
| MotionGPT | MotionGPT: Treating human motion as a foreign language. | https://github.com/OpenMotionLab/MotionGPT |
| LRV‑Instruction | Reducing hallucinations in large multimodal models via robust instruction tuning. | https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction |
| Macaw‑LLM | Macaw‑LLM: Multimodal language modeling integrating image, audio, video, and text. | https://github.com/lyuchenyang/Macaw-LLM/tree/main/data |
| LAMM‑Dataset | LAMM: Language‑Assisted Multimodal instruction tuning dataset, framework, and benchmark. | https://github.com/OpenLAMM/LAMM#lamm-dataset |
| Video‑ChatGPT | Video‑ChatGPT: Detailed video understanding via large visual and language models. | https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder |
| MIMIC‑IT | MIMIC‑IT: Multimodal contextual instruction tuning. | https://github.com/Luodian/Otter/blob/main/mimic-it/README.md |
| M³IT | M³IT: Large dataset for multimodal multilingual instruction tuning. | https://huggingface.co/datasets/MMInstruction/M3IT |
| LLaVA‑Med | LLaVA‑Med: Large language and vision assistant trained on biomedical data within a day. | Coming soon |
| GPT4Tools | GPT4Tools: Teaching large language models to use tools via self‑instruction. | Link |
| MULTIS | ChatBridge: Using large language models as language catalysts to bridge modalities. | Coming soon |
| DetGPT | DetGPT: Reasoning‑based detection of required content. | Link |
| PMC‑VQA | PMC‑VQA: Visual instruction tuning for medical visual question answering. | Coming soon |
| VideoChat | VideoChat: Chat‑centered video understanding. | Link |
| X‑LLM | X‑LLM: Guiding advanced large language models by treating multimodality as a foreign language. | Link |
| LMEye | LMEye: Interactive perception network for large language models. | Link |
| cc‑sbu‑align | MiniGPT‑4: Enhancing visual‑language understanding via advanced large language models. | Link |
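Many of the visual-instruction sets above follow the LLaVA-Instruct record shape: an image reference plus alternating human/assistant turns, with an `<image>` token marking where visual features are spliced in. A minimal sketch of building one such record (field names follow the LLaVA-Instruct-150K convention; treat them as an assumption when targeting any specific dataset):

```python
import json

def make_llava_record(sample_id, image_path, turns):
    """Build a LLaVA-style instruction record.

    `turns` is a list of (role, text) with role in {"human", "gpt"};
    a leading "<image>" token is prefixed to the first human turn to
    mark where the image embedding is inserted during training.
    """
    conversations = []
    for i, (role, text) in enumerate(turns):
        if i == 0 and role == "human":
            text = "<image>\n" + text
        conversations.append({"from": role, "value": text})
    return {"id": sample_id, "image": image_path, "conversations": conversations}

record = make_llava_record(
    "000000033471", "coco/train2017/000000033471.jpg",
    [("human", "What is unusual about this image?"),
     ("gpt", "A man is ironing clothes on the back of a moving taxi.")],
)
print(json.dumps(record, indent=2))
```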
Topics
Multimodal Learning
CLIP Models
Source
Organization: github
Created: 3/9/2024