JUHE API Marketplace

LAION

Published ultra-large image-text datasets such as LAION-400M and LAION-5B, as well as various CLIP datasets.

Updated 5/1/2024

Description

Dataset Overview

Generative AI (Image Datasets)

General Image Datasets

| Name | Description | URL |
| --- | --- | --- |
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine-learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| laion-high-resolution-chinese | Subset of the LAION-5B high-resolution multimodal dataset; about 2.66 M image-text pairs (Chinese only). | https://huggingface.co/datasets/wanng/laion-high-resolution-chinese |
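Conceptual Captions is distributed as TSV files of caption/image-URL pairs, one pair per line. A minimal sketch of parsing that shape (the sample rows below are hypothetical placeholders, not real dataset entries):

```python
import csv
import io

# Hypothetical rows in the Conceptual Captions TSV shape:
# caption <TAB> image URL, one pair per line.
SAMPLE_TSV = (
    "a dog running on the beach\thttp://example.com/img1.jpg\n"
    "city skyline at night\thttp://example.com/img2.jpg\n"
)

def load_caption_pairs(tsv_text):
    """Parse (caption, url) pairs from a Conceptual Captions-style TSV."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(caption, url) for caption, url in reader]

pairs = load_caption_pairs(SAMPLE_TSV)
print(len(pairs))   # 2
print(pairs[0][0])  # a dog running on the beach
```

In practice the URLs must still be fetched (and many have rotted), so pipelines usually download images separately and keep only pairs whose URL resolves.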

Virtual Try‑On Datasets

| Name | Description | URL |
| --- | --- | --- |
| StreetTryOn | Outdoor virtual try-on dataset with 12,364 training images and 2,089 validation images. | https://github.com/cuiaiyu/street-tryon-benchmark |
| CLOTH4D | Large-scale 4D dataset containing 3D human bodies, clothing, texture models, SMPL pose parameters, and high-resolution images. | https://github.com/AemikaChow/CLOTH4D |
| DressCode | Dataset focusing on modeling the underlying 3D geometry and appearance of people and their clothing. | https://docs.google.com/forms/d/e/1FAIpQLSeWVzxWcj3JSALtthuw-2QDAbf2ymiK37sA4pRQD4tZz2vqsw/viewform https://arxiv.org/pdf/2204.08532.pdf |
| VITON-HD | High-resolution virtual try-on dataset with 13,679 pairs of 1024 × 768 images. | https://www.dropbox.com/s/10bfat0kg4si1bu/zalando-hd-resized.zip?dl=0 https://psh01087.github.io/VITON-HD/ |
| VITON | First image-based virtual try-on dataset, containing 16,253 image pairs. | https://drive.google.com/file/d/1MxCUvKxejnwWnoZ-KoCyMCXo3TLhRuTo/view http://openaccess.thecvf.com/content_cvpr_2018/papers/Han_VITON_An_Image-Based_CVPR_2018_paper.pdf |
| MPV | Multi-pose virtual try-on dataset with 35,687 person images and 13,524 clothing images. | https://drive.google.com/drive/folders/1e3ThRpSj8j9PaCUw8IrqzKPDVJK_grcA https://arxiv.org/abs/1902.11026 |
| Deep Fashion3D | Large-scale 3D clothing dataset with diverse styles and rich annotations. | https://arxiv.org/abs/2003.12753 |
| DeepFashion-MultiModal | Multimodal virtual try-on dataset containing unpaired person and clothing images. | https://github.com/yumingj/DeepFashion-MultiModal |
| Digital Wardrobe | High-quality 3D clothing dataset built from real consumer photos, with aligned 2D-3D annotations. | http://virtualhumans.mpi-inf.mpg.de/mgn/ |
| TailorNet Dataset | Paired images of clothed 3D humans with consistent geometry and pose, for garment transfer. | https://github.com/zycliao/TailorNet_dataset http://virtualhumans.mpi-inf.mpg.de/tailornet/ |
| CLOTH3D | First 3D clothing dataset containing digital garments and 3D human models. | https://arxiv.org/abs/1912.02792 |
| 3DPeople | Dataset of 80 3D human subjects in varied clothing and poses. | https://www.albertpumarola.com/research/3DPeople/index.html |
| THUman Dataset | High-resolution 3D textured human dataset with 7,000+ models from 200+ subjects. | http://www.liuyebin.com/deephuman/deephuman.html |
| Garment Dataset | 3D clothing dataset of digitally created garments fitted to real humans. | http://geometry.cs.ucl.ac.uk/projects/2018/garment_design/ |

Generative AI (Video Datasets)

General Video Datasets

(none listed)

Multimodal Model Datasets

Pre‑training Alignment Datasets

| Name | Description | URL |
| --- | --- | --- |
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine-learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| COYO-700M | Large-scale image-text pair dataset for training and evaluating image-text matching models. | https://github.com/kakaobrain/coyo-dataset/ |
| ShareGPT4V | Large image-text dataset of GPT-4V-generated captions for improving multimodal models. | https://arxiv.org/pdf/2311.12793.pdf |
| AS-1B | All-scene dataset with over 1 billion regions annotated with semantic labels, QA pairs, and captions, for panoptic visual recognition. | https://arxiv.org/pdf/2308.01907.pdf |
| InternVid | Large-scale video-text dataset for multimodal understanding and generation. | https://arxiv.org/pdf/2307.06942.pdf |
| MS-COCO | Microsoft COCO dataset for large-scale object detection, segmentation, and captioning. | https://arxiv.org/pdf/1405.0312.pdf |
| SBU Captions | SBU Captioned Photo Dataset: 1 M images with user-provided captions collected from Flickr. | https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf |
| Conceptual Captions | Cleaned, web-scraped image alt-text dataset for training image captioning models. | https://aclanthology.org/P18-1238.pdf |
| LAION-400M | Open, large-scale dataset of 400 M CLIP-filtered image-text pairs. | https://arxiv.org/pdf/2111.02114.pdf https://laion.ai/projects/ https://huggingface.co/laion |
| VG Captions | Visual Genome dataset linking structured image concepts to language via crowdsourced annotations. | https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf |
| Flickr30k | Flickr30k Entities: 30 k images with five captions each, annotated with bounding boxes and entity mentions. | https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf |
| AI-Caps | AI Challenger: large Chinese dataset with millions of images and natural-language descriptions. | https://arxiv.org/pdf/1711.06475.pdf |
| Wukong Captions | 100 M-scale Chinese cross-modal pre-training benchmark dataset. | https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf |
| GRIT | Grounded image-text pair dataset for training grounded multimodal language models. | https://arxiv.org/pdf/2306.14824.pdf |
| Youku-mPLUG | 10 M-scale Chinese video-language pre-training dataset. | https://arxiv.org/pdf/2306.04362.pdf |
| MSR-VTT | Large-scale video description dataset bridging video and language. | https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf |
| WebVid-10M | Large video-text dataset for joint video-language representation learning. | https://arxiv.org/pdf/2104.00650.pdf |
| WavCaps | ChatGPT-assisted, weakly-labelled audio captioning dataset. | https://arxiv.org/pdf/2303.17395.pdf |
| AISHELL-1 | Open-source Mandarin speech corpus and ASR benchmark. | https://arxiv.org/pdf/1709.05522.pdf |
| AISHELL-2 | Industrial-scale Mandarin ASR dataset. | https://arxiv.org/pdf/1808.10583.pdf |
| VSDial-CN | Chinese visual-semantic dialogue dataset for multimodal language model research. | https://arxiv.org/pdf/2305.04160.pdf |
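Several of the image-text sets above (LAION-400M in particular) were filtered with CLIP: image and text embeddings are L2-normalized and pairs are kept or ranked by cosine similarity. A minimal sketch of that similarity-based matching with NumPy, using toy 2-D vectors in place of real CLIP embeddings:

```python
import numpy as np

def retrieve(image_emb, text_emb):
    """Rank texts for each image by cosine similarity, CLIP-style.

    Both inputs are (n, d) arrays of embeddings; rows are L2-normalized
    so that the dot product equals cosine similarity.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T          # (n_images, n_texts) similarity matrix
    return sim.argmax(axis=1)  # index of the best-matching text per image

# Toy embeddings: image i should match text i.
images = np.array([[1.0, 0.1], [0.1, 1.0]])
texts = np.array([[0.9, 0.0], [0.0, 0.9]])
print(retrieve(images, texts))  # [0 1]
```

Dataset-scale filtering applies the same score to each scraped (image, caption) pair and drops pairs below a threshold (LAION-400M used 0.3 with CLIP ViT-B/32 embeddings).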

Multimodal Instruction Tuning Datasets

| Name | Description | URL |
| --- | --- | --- |
| CogVLM-SFT-311K | Core alignment corpus used to initialize CogVLM v1.0, built from ~3.5 k high-quality samples selected from the open-source minigpt4-3500 set, later combined with LLaVA-Instruct-150K and machine-translated into Chinese. | https://github.com/THUDM/CogVLM/blob/main/dataset.md |
| ALLaVA-4V | Multimodal instruction dataset generated by GPT-4V. | https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V |
| IDK | Visual instruction dataset for "know what you don't know" hallucination mitigation. | https://github.com/ncsoft/idk |
| CAP2QA | Image-aligned visual instruction dataset. | https://github.com/ncsoft/cap2qa |
| M3DBench | Large-scale 3D instruction tuning dataset. | https://github.com/OpenM3D/M3DBench |
| ViP-LLaVA-Instruct | Mix of LLaVA-1.5 instruction data and region-level visual prompts. | https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct |
| LVIS-Instruct4V | Visual instruction dataset generated by self-instruction with GPT-4V. | https://huggingface.co/datasets/X2FD/LVIS-Instruct4V |
| ComVint | Synthetic instruction dataset for complex visual reasoning. | https://github.com/RUCAIBox/ComVint#comvint-data |
| SparklesDialogue | Machine-generated dialogue dataset for interleaved multi-image and text interaction, built to strengthen instruction-following LLMs across multiple images and dialogue turns. | https://github.com/HYPJUDY/Sparkles#sparklesdialogue |
| StableLLaVA | Visual instruction tuning data collected with a cheap and effective synthesis pipeline. | https://github.com/icoz69/StableLLAVA |
| M-HalDetect | Dataset for training and benchmarking hallucination detection and prevention models. | Coming soon |
| MGVLID | High-quality instruction tuning dataset containing image-text and region-text pairs. | - |
| BuboGPT | BuboGPT: visual grounding in multimodal LLMs. | https://huggingface.co/datasets/magicr/BuboGPT |
| SVIT | SVIT: scaled-up visual instruction tuning. | https://huggingface.co/datasets/BAAI/SVIT |
| mPLUG-DocOwl | Modular multimodal LLM for document understanding. | https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocLLM |
| PF-1M | Visual instruction tuning data collected with Polite Flamingo. | https://huggingface.co/datasets/chendelong/PF-1M/tree/main |
| ChartLlama | Multimodal LLM for chart understanding and generation. | https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset |
| LLaVAR | Enhanced visual instruction tuning for text-rich image understanding. | https://llavar.github.io/#data |
| MotionGPT | MotionGPT: treating human motion as a foreign language. | https://github.com/OpenMotionLab/MotionGPT |
| LRV-Instruction | Robust instruction tuning data for reducing hallucination in large multimodal models. | https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction |
| Macaw-LLM | Multimodal language modeling integrating image, audio, video, and text. | https://github.com/lyuchenyang/Macaw-LLM/tree/main/data |
| LAMM-Dataset | LAMM: Language-Assisted Multimodal instruction tuning dataset, framework, and benchmark. | https://github.com/OpenLAMM/LAMM#lamm-dataset |
| Video-ChatGPT | Detailed video understanding via large vision and language models. | https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder |
| MIMIC-IT | Multimodal in-context instruction tuning. | https://github.com/Luodian/Otter/blob/main/mimic-it/README.md |
| M³IT | Large-scale dataset for multimodal multilingual instruction tuning. | https://huggingface.co/datasets/MMInstruction/M3IT |
| LLaVA-Med | Large language-and-vision assistant for biomedicine, trained in under a day. | Coming soon |
| GPT4Tools | Teaching large language models to use tools via self-instruction. | Link |
| MULTIS | ChatBridge: using LLMs as a language catalyst to bridge modalities. | Coming soon |
| DetGPT | DetGPT: reasoning-based detection of required content. | Link |
| PMC-VQA | Visual instruction tuning for medical visual question answering. | Coming soon |
| VideoChat | Chat-centric video understanding. | Link |
| X-LLM | Bootstrapping advanced LLMs by treating multimodality as a foreign language. | Link |
| LMEye | Interactive perception network for large language models. | Link |
| cc-sbu-align | Alignment data from MiniGPT-4: enhancing vision-language understanding with advanced LLMs. | Link |
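Many of these sets (LLaVA-Instruct-150K, which the CogVLM entry builds on, being the common template) store each sample as a JSON record with an image reference and a "conversations" list of alternating human/gpt turns. A minimal sketch of building and sanity-checking one such record; the id, path, and text below are hypothetical placeholders:

```python
import json

# A record in the LLaVA-Instruct style: an image reference plus
# alternating human/gpt turns. All values here are made up.
record = {
    "id": "sample-0001",
    "image": "images/sample-0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A red bicycle leaning against a wall."},
    ],
}

def validate(rec):
    """Check the basic invariants: turns alternate, starting with 'human'."""
    turns = rec["conversations"]
    assert turns, "record needs at least one turn"
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        assert turn["from"] == expected, f"turn {i} should be from {expected}"

validate(record)
line = json.dumps(record)  # one record per line in a JSONL training file
```

Validating the alternation up front is worthwhile: most SFT loaders mask loss over the human turns by position, so a misordered record silently corrupts training targets.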



Topics

Multimodal Learning
CLIP Models

Source

Organization: github

Created: 3/9/2024
