JUHE API Marketplace

LAION

Published ultra-large image-text datasets such as LAION-400M and LAION-5B, as well as various CLIP datasets.

Updated 5/1/2024

Description

Dataset Overview

Generative AI (Image Datasets)

General Image Datasets

| Name | Description | URL |
| --- | --- | --- |
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine-learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| laion-high-resolution-chinese | Subset of the LAION-5B high-resolution multimodal dataset; about 2.66 M image-text pairs (Chinese only). | https://huggingface.co/datasets/wanng/laion-high-resolution-chinese |
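Conceptual Captions is distributed as TSV files of caption/image-URL pairs, one pair per line. A minimal sketch of parsing that shape (the sample rows below are hypothetical placeholders, not real dataset entries):

```python
import csv
import io

# Hypothetical rows in the Conceptual Captions TSV shape:
# caption <TAB> image URL, one pair per line.
SAMPLE_TSV = (
    "a dog running on the beach\thttp://example.com/img1.jpg\n"
    "city skyline at night\thttp://example.com/img2.jpg\n"
)

def load_caption_pairs(tsv_text):
    """Parse (caption, url) pairs from a Conceptual Captions-style TSV."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(caption, url) for caption, url in reader]

pairs = load_caption_pairs(SAMPLE_TSV)
print(len(pairs))   # 2
print(pairs[0][0])  # a dog running on the beach
```

In practice the URLs must still be fetched (and many have rotted), so pipelines usually download images separately and keep only pairs whose URL resolves.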

Virtual Try‑On Datasets

| Name | Description | URL |
| --- | --- | --- |
| StreetTryOn | Outdoor virtual try-on dataset with 12,364 training images and 2,089 validation images. | https://github.com/cuiaiyu/street-tryon-benchmark |
| CLOTH4D | Large-scale 4D dataset containing 3D human bodies, clothing, texture models, SMPL pose parameters, and high-resolution images. | https://github.com/AemikaChow/CLOTH4D |
| DressCode | Dataset focusing on modeling the underlying 3D geometry and appearance of people and their clothing. | https://docs.google.com/forms/d/e/1FAIpQLSeWVzxWcj3JSALtthuw-2QDAbf2ymiK37sA4pRQD4tZz2vqsw/viewform https://arxiv.org/pdf/2204.08532.pdf |
| VITON-HD | High-resolution virtual try-on dataset with 13,679 pairs of 1024 × 768 images. | https://www.dropbox.com/s/10bfat0kg4si1bu/zalando-hd-resized.zip?dl=0 https://psh01087.github.io/VITON-HD/ |
| VITON | First image-based virtual try-on dataset, containing 16,253 image pairs. | https://drive.google.com/file/d/1MxCUvKxejnwWnoZ-KoCyMCXo3TLhRuTo/view http://openaccess.thecvf.com/content_cvpr_2018/papers/Han_VITON_An_Image-Based_CVPR_2018_paper.pdf |
| MPV | Multi-pose virtual try-on dataset with 35,687 person images and 13,524 clothing images. | https://drive.google.com/drive/folders/1e3ThRpSj8j9PaCUw8IrqzKPDVJK_grcA https://arxiv.org/abs/1902.11026 |
| Deep Fashion3D | Large-scale 3D clothing dataset with diverse styles and rich annotations. | https://arxiv.org/abs/2003.12753 |
| DeepFashion-MultiModal | Multimodal virtual try-on dataset containing unpaired person and clothing images. | https://github.com/yumingj/DeepFashion-MultiModal |
| Digital Wardrobe | High-quality 3D clothing dataset built from real consumer photos, with aligned 2D-3D annotations. | http://virtualhumans.mpi-inf.mpg.de/mgn/ |
| TailorNet Dataset | Paired images of clothed 3D humans with consistent geometry and pose, for garment transfer. | https://github.com/zycliao/TailorNet_dataset http://virtualhumans.mpi-inf.mpg.de/tailornet/ |
| CLOTH3D | First 3D clothing dataset containing digital garments and 3D human models. | https://arxiv.org/abs/1912.02792 |
| 3DPeople | Dataset of 80 3D human subjects in varied clothing and poses. | https://www.albertpumarola.com/research/3DPeople/index.html |
| THUman Dataset | High-resolution 3D textured human dataset with 7,000+ models from 200+ subjects. | http://www.liuyebin.com/deephuman/deephuman.html |
| Garment Dataset | 3D clothing dataset of digitally created garments fitted to real humans. | http://geometry.cs.ucl.ac.uk/projects/2018/garment_design/ |

Generative AI (Video Datasets)

General Video Datasets

(none listed)

Multimodal Model Datasets

Pre‑training Alignment Datasets

| Name | Description | URL |
| --- | --- | --- |
| LAION | Published LAION-400M, LAION-5B, and other ultra-large image-text datasets, as well as various types of CLIP data. | https://laion.ai/projects/ https://huggingface.co/laion |
| Conceptual Captions Dataset | Dataset of (image URL, caption) pairs designed for training and evaluating machine-learning image captioning systems. | https://github.com/google-research-datasets/conceptual-captions http://ai.google.com/research/ConceptualCaptions |
| COYO-700M | Large-scale image-text pair dataset for training and evaluating image-text matching models. | https://github.com/kakaobrain/coyo-dataset/ |
| ShareGPT4V | Large image-text dataset of GPT-4V-generated captions for improving multimodal models. | https://arxiv.org/pdf/2311.12793.pdf |
| AS-1B | All-scene dataset with over 1 billion regions annotated with semantic labels, QA pairs, and captions, for panoptic visual recognition. | https://arxiv.org/pdf/2308.01907.pdf |
| InternVid | Large-scale video-text dataset for multimodal understanding and generation. | https://arxiv.org/pdf/2307.06942.pdf |
| MS-COCO | Microsoft COCO dataset for large-scale object detection, segmentation, and captioning. | https://arxiv.org/pdf/1405.0312.pdf |
| SBU Captions | SBU Captioned Photo Dataset: 1 M images with user-provided captions collected from Flickr. | https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf |
| Conceptual Captions | Cleaned, web-scraped image alt-text dataset for training image captioning models. | https://aclanthology.org/P18-1238.pdf |
| LAION-400M | Open, large-scale dataset of 400 M CLIP-filtered image-text pairs. | https://arxiv.org/pdf/2111.02114.pdf https://laion.ai/projects/ https://huggingface.co/laion |
| VG Captions | Visual Genome dataset linking structured image concepts to language via crowdsourced annotations. | https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf |
| Flickr30k | Flickr30k Entities: 30 k images with five captions each, annotated with bounding boxes and entity mentions. | https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf |
| AI-Caps | AI Challenger: large Chinese dataset with millions of images and natural-language descriptions. | https://arxiv.org/pdf/1711.06475.pdf |
| Wukong Captions | 100 M-scale Chinese cross-modal pre-training benchmark dataset. | https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf |
| GRIT | Grounded image-text pair dataset for training grounded multimodal language models. | https://arxiv.org/pdf/2306.14824.pdf |
| Youku-mPLUG | 10 M-scale Chinese video-language pre-training dataset. | https://arxiv.org/pdf/2306.04362.pdf |
| MSR-VTT | Large-scale video description dataset bridging video and language. | https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf |
| WebVid-10M | Large video-text dataset for joint video-language representation learning. | https://arxiv.org/pdf/2104.00650.pdf |
| WavCaps | ChatGPT-assisted, weakly-labelled audio captioning dataset. | https://arxiv.org/pdf/2303.17395.pdf |
| AISHELL-1 | Open-source Mandarin speech corpus and ASR benchmark. | https://arxiv.org/pdf/1709.05522.pdf |
| AISHELL-2 | Industrial-scale Mandarin ASR dataset. | https://arxiv.org/pdf/1808.10583.pdf |
| VSDial-CN | Chinese visual-semantic dialogue dataset for multimodal language model research. | https://arxiv.org/pdf/2305.04160.pdf |
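Several of the image-text sets above (LAION-400M in particular) were filtered with CLIP: image and text embeddings are L2-normalized and pairs are kept or ranked by cosine similarity. A minimal sketch of that similarity-based matching with NumPy, using toy 2-D vectors in place of real CLIP embeddings:

```python
import numpy as np

def retrieve(image_emb, text_emb):
    """Rank texts for each image by cosine similarity, CLIP-style.

    Both inputs are (n, d) arrays of embeddings; rows are L2-normalized
    so that the dot product equals cosine similarity.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T          # (n_images, n_texts) similarity matrix
    return sim.argmax(axis=1)  # index of the best-matching text per image

# Toy embeddings: image i should match text i.
images = np.array([[1.0, 0.1], [0.1, 1.0]])
texts = np.array([[0.9, 0.0], [0.0, 0.9]])
print(retrieve(images, texts))  # [0 1]
```

Dataset-scale filtering applies the same score to each scraped (image, caption) pair and drops pairs below a threshold (LAION-400M used 0.3 with CLIP ViT-B/32 embeddings).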

Multimodal Instruction Tuning Datasets

| Name | Description | URL |
| --- | --- | --- |
| CogVLM-SFT-311K | Core alignment corpus used to initialize CogVLM v1.0, built from ~3.5 k high-quality samples selected from the open-source minigpt4-3500 set, later combined with LLaVA-Instruct-150K and machine-translated into Chinese. | https://github.com/THUDM/CogVLM/blob/main/dataset.md |
| ALLaVA-4V | Multimodal instruction dataset generated by GPT-4V. | https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V |
| IDK | Visual instruction dataset for "know what you don't know" hallucination mitigation. | https://github.com/ncsoft/idk |
| CAP2QA | Image-aligned visual instruction dataset. | https://github.com/ncsoft/cap2qa |
| M3DBench | Large-scale 3D instruction tuning dataset. | https://github.com/OpenM3D/M3DBench |
| ViP-LLaVA-Instruct | Mix of LLaVA-1.5 instruction data and region-level visual prompts. | https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct |
| LVIS-Instruct4V | Visual instruction dataset generated by self-instruction with GPT-4V. | https://huggingface.co/datasets/X2FD/LVIS-Instruct4V |
| ComVint | Synthetic instruction dataset for complex visual reasoning. | https://github.com/RUCAIBox/ComVint#comvint-data |
| SparklesDialogue | Machine-generated dialogue dataset for interleaved multi-image and text interaction, built to strengthen instruction-following LLMs across multiple images and dialogue turns. | https://github.com/HYPJUDY/Sparkles#sparklesdialogue |
| StableLLaVA | Visual instruction tuning data collected with a cheap and effective synthesis pipeline. | https://github.com/icoz69/StableLLAVA |
| M-HalDetect | Dataset for training and benchmarking hallucination detection and prevention models. | Coming soon |
| MGVLID | High-quality instruction tuning dataset containing image-text and region-text pairs. | - |
| BuboGPT | BuboGPT: visual grounding in multimodal LLMs. | https://huggingface.co/datasets/magicr/BuboGPT |
| SVIT | SVIT: scaled-up visual instruction tuning. | https://huggingface.co/datasets/BAAI/SVIT |
| mPLUG-DocOwl | Modular multimodal LLM for document understanding. | https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocLLM |
| PF-1M | Visual instruction tuning data collected with Polite Flamingo. | https://huggingface.co/datasets/chendelong/PF-1M/tree/main |
| ChartLlama | Multimodal LLM for chart understanding and generation. | https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset |
| LLaVAR | Enhanced visual instruction tuning for text-rich image understanding. | https://llavar.github.io/#data |
| MotionGPT | MotionGPT: treating human motion as a foreign language. | https://github.com/OpenMotionLab/MotionGPT |
| LRV-Instruction | Robust instruction tuning data for reducing hallucination in large multimodal models. | https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction |
| Macaw-LLM | Multimodal language modeling integrating image, audio, video, and text. | https://github.com/lyuchenyang/Macaw-LLM/tree/main/data |
| LAMM-Dataset | LAMM: Language-Assisted Multimodal instruction tuning dataset, framework, and benchmark. | https://github.com/OpenLAMM/LAMM#lamm-dataset |
| Video-ChatGPT | Detailed video understanding via large vision and language models. | https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder |
| MIMIC-IT | Multimodal in-context instruction tuning. | https://github.com/Luodian/Otter/blob/main/mimic-it/README.md |
| M³IT | Large-scale dataset for multimodal multilingual instruction tuning. | https://huggingface.co/datasets/MMInstruction/M3IT |
| LLaVA-Med | Large language-and-vision assistant for biomedicine, trained in under a day. | Coming soon |
| GPT4Tools | Teaching large language models to use tools via self-instruction. | Link |
| MULTIS | ChatBridge: using LLMs as a language catalyst to bridge modalities. | Coming soon |
| DetGPT | DetGPT: reasoning-based detection of required content. | Link |
| PMC-VQA | Visual instruction tuning for medical visual question answering. | Coming soon |
| VideoChat | Chat-centric video understanding. | Link |
| X-LLM | Bootstrapping advanced LLMs by treating multimodality as a foreign language. | Link |
| LMEye | Interactive perception network for large language models. | Link |
| cc-sbu-align | Alignment data from MiniGPT-4: enhancing vision-language understanding with advanced LLMs. | Link |
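Many of these sets (LLaVA-Instruct-150K, which the CogVLM entry builds on, being the common template) store each sample as a JSON record with an image reference and a "conversations" list of alternating human/gpt turns. A minimal sketch of building and sanity-checking one such record; the id, path, and text below are hypothetical placeholders:

```python
import json

# A record in the LLaVA-Instruct style: an image reference plus
# alternating human/gpt turns. All values here are made up.
record = {
    "id": "sample-0001",
    "image": "images/sample-0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A red bicycle leaning against a wall."},
    ],
}

def validate(rec):
    """Check the basic invariants: turns alternate, starting with 'human'."""
    turns = rec["conversations"]
    assert turns, "record needs at least one turn"
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        assert turn["from"] == expected, f"turn {i} should be from {expected}"

validate(record)
line = json.dumps(record)  # one record per line in a JSONL training file
```

Validating the alternation up front is worthwhile: most SFT loaders mask loss over the human turns by position, so a misordered record silently corrupts training targets.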



Topics

Multimodal Learning
CLIP Models

Source

Organization: github

Created: 3/9/2024
