High-Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Browse by Category

Intern·WanJuan 1.0

Intern·WanJuan 1.0 is the first open-source release of the Intern·WanJuan multimodal corpus. It comprises text, image-text, and video datasets, with a total data volume exceeding 2 TB. Built on the large-model data alliance, the corpus underwent fine-grained cleaning, deduplication, and value alignment by Shanghai AI Lab. The result is a multimodal, carefully processed, value-aligned dataset that is easy and efficient to use.