JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

cmm-math

Mathematics Education
Multimodal Data

CMM‑Math is a Chinese multimodal mathematics dataset containing over 28,000 high‑quality samples covering 12 grades from primary school to high school. It includes diverse question types such as multiple‑choice and fill‑in‑the‑blank, with detailed solutions. Some questions involve visual context, making the dataset more challenging. The dataset is split into a training set (22,000+ samples) and an evaluation set (5,000+ samples).
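As a rough illustration of how such samples might be handled, the sketch below filters for the visual-context questions mentioned above. The field names (`question`, `grade`, `type`, `image`) are assumptions for illustration, not CMM‑Math's actual schema.

```python
# Hypothetical CMM-Math-style records; field names are illustrative only.
samples = [
    {"question": "1/2 + 1/3 = ?", "grade": 5,
     "type": "fill-in-the-blank", "image": None},
    {"question": "Which angle in the figure is obtuse?", "grade": 8,
     "type": "multiple-choice", "image": "fig_001.png"},
]

def visual_samples(records):
    """Keep only the questions that include a visual context."""
    return [r for r in records if r["image"] is not None]

print(len(visual_samples(samples)))  # → 1
```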

huggingface
View Details

MHAD: Multimodal Home Activity Dataset

Multimodal Data
Home Activity Recognition

The MHAD dataset was jointly collected by JD Health, Huazhong University of Science and Technology, and Zhejiang University. It is the first multimodal dataset captured in real home environments, featuring multiple camera angles and a wide range of household scenarios. It includes the most comprehensive set of physiological signals to date and is a valuable resource for computer vision, machine learning, and biomedical engineering research.

github
View Details

Intern · WanJuan 1.0

Multimodal Data
AI Research

Intern·WanJuan 1.0 is the first open‑source release of the Intern·WanJuan multimodal corpus, comprising text, image‑text, and video datasets with a total volume exceeding 2 TB. Drawing on the large‑model data alliance, Shanghai AI Lab performed fine‑grained cleaning, deduplication, and value alignment, producing a dataset that is multimodal, meticulously processed, value‑aligned, easy to use, and efficient.

github
View Details

HUVER

Unmanned Aerial Vehicles
Multimodal Data

The HUVER dataset contains 6,051 unique drone configurations, each described by multiple formats such as grammar strings, RGB images, and GLB files. Additionally, each configuration includes an English textual descriptor that details the drone’s features in natural language. The dataset supports tasks such as image‑to‑text, image‑to‑3D, and feature extraction, and is curated by Abhiram Karri, Gary Stump, Christopher McComb, and Binyang Song under the MIT License.
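To make the parallel representations concrete, the sketch below groups one configuration's formats and selects the pair used for an image‑to‑text task. The field names and values are assumptions based on the description above, not HUVER's actual schema.

```python
# Hypothetical HUVER-style record; keys and values are illustrative only.
config = {
    "grammar_string": "<design-grammar encoding>",  # placeholder value
    "rgb_image": "drone_0001.png",
    "glb_file": "drone_0001.glb",
    "description": "A quadcopter with four rotors and a slim central body.",
}

def image_to_text_pair(record):
    """Select the (image, description) pair for an image-to-text task."""
    return record["rgb_image"], record["description"]

print(image_to_text_pair(config)[0])  # → drone_0001.png
```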

huggingface
View Details

Social-Media-Dataset

Social Media
Multimodal Data

This dataset contains over 1 million tweets crawled from Twitter. After filtering and processing, the multimodal text‑image pairs were retained, and emojis and embedded text were extracted, yielding a dataset with four modalities.
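One of the processing steps described above, separating emojis from tweet text, can be sketched as follows. The Unicode ranges cover only the common emoji blocks; a production pipeline would use a fuller table.

```python
import re

# Matches common emoji blocks only (illustrative, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001F5FF"   # symbols & pictographs
    "\U0001F600-\U0001F64F"    # emoticons
    "\U0001F680-\U0001F6FF"    # transport & map symbols
    "\u2600-\u27BF]"           # misc symbols, dingbats
)

def split_modalities(tweet: str):
    """Separate plain text from emojis, two of the four modalities."""
    emojis = EMOJI_RE.findall(tweet)
    text = EMOJI_RE.sub("", tweet).strip()
    return text, emojis

text, emojis = split_modalities("Great game tonight 🔥🔥 #sports")
print(len(emojis))  # → 2
```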

github
View Details

IMed-361M

Medical Image Segmentation
Multimodal Data

The IMed‑361M dataset is the largest publicly available multimodal interactive medical image segmentation dataset, containing 6.4 million images, 273.4 million masks (an average of 56 masks per image), 14 imaging modalities, and 204 segmentation targets. It ensures diversity across six anatomical groups and offers fine‑grained annotations, with most masks covering less than 2% of the image area, as well as broad applicability: 83% of images have resolutions between 256×256 and 1024×1024. IMed‑361M provides 14.4 times as many masks as MedTrinity‑25M, significantly exceeding other datasets in both scale and mask count.
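The fine‑grained annotation property cited above (most masks covering under 2% of the image) can be checked with a simple coverage computation. This is a minimal sketch; the mask size and region are illustrative, not taken from the dataset.

```python
def coverage(mask):
    """Fraction of pixels set to 1 in a 2-D binary mask."""
    total = sum(len(row) for row in mask)
    covered = sum(sum(row) for row in mask)
    return covered / total

# A 512x512 mask with a 20x20 region of interest (illustrative sizes).
mask = [[0] * 512 for _ in range(512)]
for r in range(100, 120):
    for c in range(100, 120):
        mask[r][c] = 1

print(coverage(mask) < 0.02)  # → True
```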

github
View Details

UniMed

Medical Imaging
Multimodal Data

UniMed is a large‑scale, open‑source multimodal medical dataset created by Mohammed Bin Zayed University of AI and other institutions. It contains over 5.3 million image‑text pairs across six imaging modalities: X‑ray, CT, MRI, Ultrasound, Pathology, and Fundus. The dataset is built by converting modality‑specific classification datasets into image‑text format using large language models, and augmenting them with existing medical image‑text data, enabling scalable pre‑training of visual‑language models (VLMs). UniMed aims to alleviate the scarcity of publicly available large‑scale medical image‑text data and supports tasks such as zero‑shot classification and cross‑modal generalization.
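The conversion strategy described above, turning a modality‑specific classification label into a caption, can be sketched with a template. The templates and label names below are illustrative assumptions, not UniMed's actual prompts.

```python
# Hypothetical caption templates per imaging modality (illustrative only).
TEMPLATES = {
    "xray": "A chest X-ray showing {label}.",
    "fundus": "A fundus photograph with signs of {label}.",
}

def to_image_text(image_path: str, modality: str, label: str):
    """Build an (image, caption) pair from a classification example."""
    caption = TEMPLATES[modality].format(label=label)
    return image_path, caption

pair = to_image_text("img_042.png", "xray", "pneumonia")
print(pair[1])  # → A chest X-ray showing pneumonia.
```

In practice the description says large language models generate the captions; a fixed template is just the simplest stand‑in for that step.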

arXiv
View Details

MedTrinity-25M

Medical Data Analysis
Multimodal Data

MedTrinity‑25M is a large‑scale multimodal medical dataset with multigranular annotations. It extracts key information from collected data, integrates metadata to generate coarse descriptions, locates regions of interest, and gathers medical knowledge, then prompts large language models to generate fine‑grained descriptions.
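The pipeline steps above (coarse description, region of interest, medical knowledge, then an LLM prompt) can be sketched as a prompt‑assembly function. All field values are illustrative assumptions, not MedTrinity‑25M's actual prompts.

```python
def build_prompt(coarse: str, roi: tuple, knowledge: str) -> str:
    """Assemble the pieces into a prompt for fine-grained description."""
    return (
        f"Coarse description: {coarse}\n"
        f"Region of interest (x, y, w, h): {roi}\n"
        f"Relevant medical knowledge: {knowledge}\n"
        "Write a fine-grained description of the region."
    )

prompt = build_prompt(
    "Axial CT of the chest.",                      # from metadata
    (120, 80, 40, 40),                             # located ROI (illustrative)
    "Ground-glass opacity can indicate early infection.",
)
print(prompt.splitlines()[0])  # → Coarse description: Axial CT of the chest.
```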

github
View Details