Intern · WanJuan 1.0

Intern·WanJuan 1.0 is the first open-source release of the Intern·WanJuan multimodal corpus. It comprises text, image-text, and video datasets with a total volume of more than 2 TB. Drawing on the large-model data alliance, Shanghai AI Lab applied fine-grained cleaning, deduplication, and value alignment to produce a corpus that is multimodal, carefully processed, value-aligned, and easy to use.

Updated 10/20/2023

Description

Intern·WanJuan 1.0 Dataset Overview

Intern·WanJuan 1.0 Introduction

Intern·WanJuan 1.0 is the inaugural open-source release of the Intern·WanJuan multimodal corpus. It has three components (text, image-text, and video) with a combined size of over 2 TB. The corpus draws on the large-model data alliance, and Shanghai AI Lab performed fine-grained cleaning, deduplication, and value alignment on it. The result is a carefully processed, value-aligned multimodal dataset that is easy to use efficiently.

Features

  • Multimodal Integration: Includes text, images, and video across domains such as technology, literature, media, education, and law, which helps improve the knowledge coverage, logical reasoning, and generalisation of models trained on it.
  • Fine-grained Processing: Language filtering, text extraction, format standardisation, data filtering, and cleaning make the data suitable for downstream model training.
  • Value Alignment: Content is aligned with mainstream Chinese values; algorithmic filtering combined with human evaluation improves the purity of the corpus.
  • Usability & Efficiency: A unified format, detailed field descriptions, and tooling guidance make it quick to apply the data to training LLMs and multimodal LLMs.
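
The cleaning and deduplication steps listed above are not specified in detail on this page. As a rough illustration only, the sketch below shows one common approach, exact deduplication by hashing normalised text. The function names and the choice of NFKC normalisation are assumptions made for this example and are not taken from the Shanghai AI Lab pipeline.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalise(text: str) -> str:
    """Apply Unicode NFKC normalisation and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def drop_exact_duplicates(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalised form has not been seen before."""
    seen = set()
    for doc in docs:
        # Hash the normalised text so the set stores fixed-size digests
        # instead of whole documents.
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Hypothetical input: the first two strings differ only in spacing.
docs = ["Hello  world", "Hello world", "Another document"]
unique = list(drop_exact_duplicates(docs))
print(unique)
```

This only catches duplicates that are identical after normalisation. Near-duplicate detection, language filtering, and value alignment would need separate steps.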

Applications

Intern·WanJuan 1.0 has been used to train Intern Multimodal, Intern Puyu, and other large models. Models trained on it have shown strong performance in semantic understanding, knowledge QA, visual understanding, and visual-language tasks.

Intern·WanJuan 1.0 – Text Sub‑Dataset

Overview

The text sub-dataset aggregates cleaned pre-training corpora from webpages, encyclopaedias, books, patents, textbooks, exam questions, and other sources. It contains more than 500 million documents and over 1 TB of data. Data are stored in a unified JSONL format with two fields: id and content (plain text or Markdown). Rigorous cleaning, deduplication, and value alignment are intended to make the corpus safe, reliable, and high-quality for pre-training.

Example

{
    "id": "BkORdv3xK7IA0HG7pccr",
    "content": "*诗作[222]\n录自索菲娅·马克思的笔记本\n#### 人生\n时光倏忽即逝,\n宛如滔滔流水;\n..."
}
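
A minimal sketch of reading this format in Python, assuming each line of a file holds one JSON object with the id and content fields described above. The function name iter_records is hypothetical, and the in-memory sample stands in for a real file; its content value is an abbreviated excerpt of the example record.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line that has both documented fields."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "id" not in record or "content" not in record:
            raise ValueError(f"line {line_number}: missing 'id' or 'content'")
        yield record


# Stand-in for an opened file, e.g. open(path, encoding="utf-8").
sample = io.StringIO(
    '{"id": "BkORdv3xK7IA0HG7pccr", "content": "#### 人生\\n时光倏忽即逝"}\n'
)
records = list(iter_records(sample))
print(records[0]["id"], len(records[0]["content"]))
```

Reading line by line keeps memory use flat, which matters for a corpus of this size.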

Intern·WanJuan 1.0 – Image‑Text Sub‑Dataset

Overview

Derived from public webpages, the image-text sub-dataset contains documents in which images and text are interleaved. It comprises over 22 million documents (≈140 GB without images) covering news events, people, natural scenery, social life, and other topics. Images are provided as URLs rather than embedded in the data.

Example

{
    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [{
        "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
        "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
        "caption": "",
        "width": "400",
        "height": "266"
    }],
    "content": "![](http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg)\n..."
}
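
The record above suggests a few checks a consumer might run. The sketch below extracts image URLs from the Markdown content, converts the string width and height values to integers, and compares downloaded bytes against the sha256 field. It assumes that sha256 is the hex digest of the image file's bytes, which this page does not state. All function names and the demo record are hypothetical.

```python
import hashlib
import json
import re
from typing import List

# Matches Markdown image references such as ![alt](http://example.com/a.jpg)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_urls_in_content(content: str) -> List[str]:
    """Return image URLs in the order they appear in the Markdown content."""
    return IMAGE_REF.findall(content)


def normalise_image(entry: dict) -> dict:
    """Copy an img_list entry, converting width/height strings to integers."""
    fixed = dict(entry)
    fixed["width"] = int(entry["width"])
    fixed["height"] = int(entry["height"])
    return fixed


def matches_sha256(image_bytes: bytes, expected_hex: str) -> bool:
    """Compare image bytes with a hex digest (assumed meaning of 'sha256')."""
    return hashlib.sha256(image_bytes).hexdigest() == expected_hex.lower()


# Hypothetical record with the same field layout as the example above.
record = json.loads(
    '{"id": "demo", "img_list": [{"url": "http://example.com/a.jpg", '
    '"sha256": "", "caption": "", "width": "400", "height": "266"}], '
    '"content": "![](http://example.com/a.jpg)\\nSome text"}'
)
listed = {img["url"] for img in record["img_list"]}
referenced = image_urls_in_content(record["content"])
print(all(url in listed for url in referenced))
print(normalise_image(record["img_list"][0])["width"])
```

Because images are distributed as URLs, some links may no longer resolve; checking the digest after download is one way to detect content that has changed since collection.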

Intern·WanJuan 1.0 – Video Sub‑Dataset

Overview

The video sub-dataset originates from China Media Group and Shanghai Media Group and contains a variety of programme videos. It includes over 1,000 video files (≈900 GB) spanning domains such as military, arts, sports, nature, documentary, and science education.

Example

{
    "id": "video_id_123",
    "content": "这是一段关于历史纪录片的视频,详细介绍了中国的古代文明和历史事件。"
}

Download Links

To download the full dataset, visit: https://opendatalab.org.cn/WanJuan1.0

License

Intern·WanJuan 1.0 is released under the CC BY 4.0 license. You may share and adapt the dataset provided you give appropriate attribution, link to the license, and indicate if changes were made.

Special Notes

Some subsets may be governed by additional licenses. Please review the relevant documentation before use.



Topics

Multimodal Data
AI Research

Source

Organization: github

Created: 8/14/2023
