Explore high-quality datasets for your AI and machine learning projects.
MMPedestron Benchmark Dataset is a multimodal pedestrian detection dataset that includes sub‑datasets such as CrowdHuman, COCO‑Person, FLIR, and PEDRo.
Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Flickr30K is a multimodal dataset containing images and text, used for training and validating algorithms that mine similarity between images and text. COCO is a large, rich image dataset primarily used for object detection, segmentation, and image captioning tasks.
Chinese OCRBench is a dataset specifically designed for evaluating Chinese OCR tasks, filling the evaluation gap for multimodal large language models in this domain. It comprises 3,410 images and 3,410 question‑answer pairs sourced from the ReCTS and ESTVQA datasets. Each annotation includes the image filename, the question, and the answer, making the dataset suitable for OCR benchmarking and research.
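Given the annotation layout described above, a minimal evaluation loop can be sketched as follows. The record keys `image`, `question`, and `answer` are illustrative guesses rather than the benchmark's exact schema, and the scoring is plain case-insensitive exact match, not OCRBench's official metric:

```python
import json

# Hypothetical annotation records mirroring the fields described above
# (image filename, question, answer); check the real OCRBench files for
# the exact key names.
annotations = json.loads("""[
  {"image": "rects_0001.jpg", "question": "图中的文字是什么?", "answer": "欢迎光临"},
  {"image": "estvqa_0042.jpg", "question": "What does the sign say?", "answer": "EXIT"}
]""")

def exact_match_accuracy(records, predictions):
    """Score OCR QA predictions by case-insensitive exact match."""
    correct = sum(
        1 for rec, pred in zip(records, predictions)
        if pred.strip().lower() == rec["answer"].strip().lower()
    )
    return correct / len(records)

preds = ["欢迎光临", "exit"]
print(exact_match_accuracy(annotations, preds))  # -> 1.0
```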
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- zh
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: laion2B-multi-chinese-subset
task_categories:
- feature-extraction
---

# laion2B-multi-chinese-subset

- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)
- Docs: [Fengshenbang-Docs](https://fengshenbang-doc.readthedocs.io/)

## Brief Introduction

The Chinese portion of the multilingual, multimodal Laion2B dataset: around 143M image-text pairs.

## Dataset Information

Roughly 143M Chinese image-text pairs in total, occupying about 19 GB (URLs and other text metadata only; the images themselves are not included).

- Homepage: [laion-5b](https://laion.ai/blog/laion-5b/)
- Huggingface: [laion/laion2B-multi](https://huggingface.co/datasets/laion/laion2B-multi)

## Download

```bash
mkdir laion2b_chinese_release && cd laion2b_chinese_release

for i in {00000..00012}; do
  wget https://huggingface.co/datasets/IDEA-CCNL/laion2B-multi-chinese-subset/resolve/main/data/train-$i-of-00013.parquet
done

cd ..
```

## License

CC-BY-4.0

## Citation

If you use this resource in your work, please cite our [paper](https://arxiv.org/abs/2209.02970):

```text
@article{fengshenbang,
  author  = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title   = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal = {CoRR},
  volume  = {abs/2209.02970},
  year    = {2022}
}
```

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

```text
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```
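The shard naming used in the download loop can also be reproduced in Python when scripting the download. A minimal sketch that only constructs the 13 shard URLs, leaving the actual fetching to a downloader of your choice:

```python
# Build the 13 parquet shard URLs for laion2B-multi-chinese-subset,
# mirroring the bash loop's {00000..00012} range.
BASE = ("https://huggingface.co/datasets/IDEA-CCNL/"
        "laion2B-multi-chinese-subset/resolve/main/data")

shard_urls = [f"{BASE}/train-{i:05d}-of-00013.parquet" for i in range(13)]

print(shard_urls[0])   # ends with train-00000-of-00013.parquet
print(len(shard_urls))  # -> 13
```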
Publisher of ultra‑large image‑text datasets such as LAION‑400M and LAION‑5B, along with various CLIP‑related datasets.
Needle In A Multimodal Haystack (MM‑NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) in understanding long multimodal documents. The benchmark requires models to answer specific questions based on key information scattered throughout multimodal documents. MM‑NIAH's evaluation data comprises three tasks: retrieval, counting, and reasoning. Key information (called “needles”) is embedded in the document's text or images; those inserted into text are referred to as text needles, and those inserted into images as image needles. Experimental results indicate that current MLLMs perform poorly when handling image‑based key information.
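The text-needle setup can be illustrated with a toy sketch. The helper below is hypothetical and only covers text needles; the real benchmark embeds needles inside long interleaved image-text documents and also uses image needles:

```python
import random

def insert_text_needle(document: str, needle: str, rng: random.Random) -> str:
    """Insert a 'text needle' at a random sentence boundary in a long document.

    Toy illustration of MM-NIAH's text-needle construction: the model must
    later answer a question whose key information is this inserted sentence.
    """
    sentences = document.split(". ")
    pos = rng.randrange(len(sentences) + 1)
    sentences.insert(pos, needle)
    return ". ".join(sentences)

rng = random.Random(0)
haystack = ". ".join(f"Filler sentence number {i}" for i in range(100))
needle = "The secret code is 7421"
doc = insert_text_needle(haystack, needle, rng)
assert needle in doc  # the needle is recoverable somewhere in the haystack
```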
The OMEGA Labs Bittensor Subnet dataset is a multimodal dataset aimed at accelerating artificial general intelligence (AGI) research and development. Provided via the Bittensor decentralized network, it aspires to become the world’s largest multimodal dataset, encompassing a wide range of human knowledge and creativity. The dataset includes over 1 million hours of video and more than 30 million two‑minute video clips, covering over 50 scene types and more than 15,000 action phrases. Advanced models are used to map video components into a unified latent space, facilitating the development of powerful AGI models with potential impact across multiple industries.
The SigLIP model is a shape‑optimized model pre‑trained on the WebLI dataset at a resolution of 384 × 384. It was introduced in the paper "Sigmoid Loss for Language‑Image Pre‑Training" by Zhai et al. and first released in Google Research's big_vision repository. SigLIP is a CLIP‑style multimodal model with an improved loss function: the sigmoid loss scores each image‑text pair independently, so it needs no global normalization over all pairwise similarities in the batch. This allows scaling to larger batch sizes while also performing better at smaller ones. The model is primarily used for zero‑shot image classification and image‑text retrieval. It was trained on WebLI; images are resized to 384 × 384 and normalized, text is tokenized and padded to 64 tokens, and training took three days on 16 TPU‑v4 chips.
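The sigmoid loss at the heart of SigLIP can be sketched in a few lines of NumPy. This is an illustrative toy, not the big_vision implementation; in the real model the temperature `t` and bias `b` are learnable parameters, fixed here for simplicity:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each image-text pair is scored independently,
    so no batch-wide softmax normalization is required."""
    # L2-normalize embeddings, as in CLIP-style models
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b
    n = len(img)
    labels = 2 * np.eye(n) - 1  # +1 for matching pairs, -1 for all others
    # -log sigmoid(label * logit), via logaddexp for numerical stability
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
matched = siglip_loss(img, img.copy())        # aligned pairs -> lower loss
mismatched = siglip_loss(img, img[::-1].copy())  # shuffled pairs -> higher loss
print(matched, mismatched)
```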