Dataset asset · Open Source Community · Multilingual Processing · South African Theatre

WebLI

A dataset of 10B images and 12B texts, used to train multilingual language-image models.

Source: github
Created: Sep 2, 2024
Updated: Sep 20, 2024
Signals: 514 views
Availability: Linked source ready
Overview

Dataset description and usage context

Awesome-MLLM-Datasets

Dataset Overview

This project collects and organizes datasets for training multimodal large models, including but not limited to pre-training data, instruction-tuning data, and in-context learning data. The goal is a comprehensive resource hub that gives researchers easier access to high-quality datasets when developing and optimizing multimodal AI systems.

Dataset Categories

Pre-training Datasets

| Name | Images | Texts | Image-Text Pairs | Paper | Link | Type |
|---|---|---|---|---|---|---|
| WebLI | 10B | 12B | 12B | PaLI: A Jointly-Scaled Multilingual Language-Image Model | Link | Captions (109 languages) |
| LAION-5B | 5.9B | 5.9B | 5.9B | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions (multiple languages) |
| LAION-en | 2.3B | 2.3B | 2.3B | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions (English) |
| ALIGN | 1.8B | 1.8B | 1.8B | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Link | Captions (English) |
| DataComp | 1.4B | 1.4B | 1.4B | DATACOMP: In search of the next generation of multimodal datasets | Link | Captions (English) |
| COYO | 747M | 747M | 747M | COYO-700M: Large-scale Image-Text Pair Dataset | Link | Captions (English) |
| LAION-COCO | 600M | 600M | 600M | LAION COCO: 600M Synthetic Captions from LAION2B-EN | Link | Captions (English) |
| LAION-400M | 400M | 400M | 400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Link | Captions (English) |
| Episodic WebLI | 400M | 400M | 400M | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - | Captions (English) |
| CLIP | 400M | 400M | 400M | Learning Transferable Visual Models From Natural Language Supervision | Link | Captions (English) |
| LTIP | 312M | 312M | 312M | Flamingo: a Visual Language Model for Few-Shot Learning | - | Captions (English) |
| FILIP | 300M | 300M | 300M | FILIP: Fine-grained Interactive Language-Image Pre-Training | - | Captions (English) |
| LAION-zh | 142M | 142M | 142M | LAION-5B: An open large-scale dataset for training next generation image-text models | Link | Captions (Chinese) |
| Obelics | 353M | 115M | 141M | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Link | Interleaved image-text web documents |
| MMC4 | 571M | 43B | 101.2M | Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | Link | Interleaved image-text |
| Wukong | 101M | 101M | 101M | WuKong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework | Link | Captions (Chinese) |
| M3W | 185M | 182GB | 43.3M | Flamingo: a Visual Language Model for Few-Shot Learning | - | Captions (English) |
| WIT | 11.5M | 37.6M | 37.6M | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Link | Captions (English) |
| GQA | 113K | 22M | 22M | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Link | Visual reasoning and compositional question answering (English) |
| CC12M | 12.4M | 12.4M | 12.4M | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | Link | Captions (English) |
| Red Caps | 12M | 12M | 12M | RedCaps: Web-curated image-text data created by the people, for the people | Link | Captions (English) |
| Visual Genome | 108K | 4.5M | 4.5M | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Link | Annotations (English) |
| DVQA | 300K | 3.5M | 3.5M | DVQA: Understanding Data Visualizations via Question Answering | Link | Question answering (English) |
| CC3M | 3.3M | 3.3M | 3.3M | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Link | Captions (English) |
| MS-COCO | 328K | 2.5M | 2.5M | Microsoft COCO: Common Objects in Context | Link | Object detection, segmentation, captions (English) |
| AI Challenger Captions | 300K | 1.5M | 1.5M | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Link | Captions (English) |
| VQA v2 | 265K | 1.4M | 1.4M | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Link | Visual question answering (English) |
| SBU (Image Caption) | 1M | 1M | 1M | Im2Text: Describing Images Using 1 Million Captioned Photographs | Link | Captions (English) |
| OCR-VQA | 207K | 1M | 1M | OCR-VQA: Visual Question Answering by Reading Text in Images | Link | Visual question answering (English) |
| COCO Caption | 164K | 1M | 1M | Microsoft COCO Captions: Data Collection and Evaluation Server | Link | Object detection, segmentation, captions (English) |
| CC595k | 595K | 595K | 595K | Visual Instruction Tuning | Link | Captions (English) |
| Visual-7W | 47.3K | 328K | 328K | Visual7W: Grounded Question Answering in Images | - | - |
| Flickr30k | 31K | 158K | 158K | From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions | Link | Annotations (English) |
| Text Captions | 28K | 145K | 145K | TextCaps: a Dataset for Image Captioning with Reading Comprehension | - | - |
| RefCOCO | 20K | 142K | 142K | ReferItGame: Referring to Objects in Photographs of Natural Scenes | - | - |
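The counts above mix K/M/B suffixes, which makes direct comparison awkward. A minimal Python sketch (hypothetical helper, not part of the repository) for normalizing these suffixed counts into integers:

```python
# Normalize human-readable counts ("10B", "12.4M", "595K") for comparison.

def parse_count(s: str) -> int:
    """Convert a suffixed count like "10B", "12.4M", or "595K" to an integer."""
    suffixes = {"K": 10**3, "M": 10**6, "B": 10**9}
    s = s.strip()
    if s and s[-1].upper() in suffixes:
        return int(float(s[:-1]) * suffixes[s[-1].upper()])
    return int(s)

# Image counts for a few rows of the pre-training table above
image_counts = {"WebLI": "10B", "LAION-5B": "5.9B", "CC12M": "12.4M", "CC595k": "595K"}
sizes = {name: parse_count(n) for name, n in image_counts.items()}

# Print datasets largest-first
for name, n in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n:,}")
```

This is only convenient for rough ordering; the table's own strings remain the source of record for exact figures.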

Multimodal Instruction-Tuning Datasets

  • To be added

In-Context Learning Datasets

  • To be added

Multimodal Chain-of-Thought Datasets

  • To be added

Multimodal RLHF Datasets

  • To be added

Evaluation Benchmark Datasets

  • To be added