
Intern · WanJuan 1.0

Intern·WanJuan 1.0 is the first open‑source release of the Intern·WanJuan multimodal corpus. It comprises text, image‑text, and video datasets totaling more than 2 TB. Drawing on the large‑model data alliance, Shanghai AI Lab applied fine‑grained cleaning, deduplication, and value alignment to the data.

Source

GitHub

Created

Aug 14, 2023

Updated

Oct 20, 2023


Overview

Dataset description and usage context

Intern·WanJuan 1.0 Dataset Overview

Intern·WanJuan 1.0 Introduction

Intern·WanJuan 1.0 is the inaugural open‑source release of the Intern·WanJuan multimodal corpus. It has text, image‑text, and video components with a combined size of over 2 TB. The corpus is built on the large‑model data alliance; Shanghai AI Lab performed fine‑grained cleaning, deduplication, and value alignment, aiming for a dataset that integrates multiple modalities, is carefully processed and value‑aligned, and is easy and efficient to use.

Features

Multimodal Integration: Includes text, images, and video across domains such as technology, literature, media, education, and law, intended to improve the knowledge coverage, logical reasoning, and generalisation of models trained on it.
Fine‑grained Processing: Language filtering, text extraction, format standardisation, data filtering, and cleaning make the data suitable for downstream model training.
Value Alignment: Content aligns with mainstream Chinese values; both algorithmic filtering and human evaluation are used to improve corpus purity.
Usability & Efficiency: A unified format, detailed field descriptions, and tooling guidance allow the data to be used quickly for training LLMs and multimodal LLMs.

Applications

Intern·WanJuan 1.0 has been used to train Intern Multimodal, Intern Puyu, and other large models, which are reported to perform well in semantic understanding, knowledge QA, visual understanding, and visual‑language tasks.

Intern·WanJuan 1.0 – Text Sub‑Dataset

Overview

The text sub‑dataset aggregates cleaned pre‑training corpora from webpages, encyclopaedias, books, patents, textbooks, exam questions, etc., totaling over 500 million documents and more than 1 TB. Data are stored in a unified JSONL format, one JSON object per line, with two fields: id and content (plain text or Markdown). Rigorous cleaning, deduplication, and value alignment are intended to make it a safe, reliable, high‑quality pre‑training corpus.

Example

{
    "id": "BkORdv3xK7IA0HG7pccr",
    "content": "*诗作[222]\n录自索菲娅·马克思的笔记本\n#### 人生\n时光倏忽即逝，\n宛如滔滔流水；\n..."
}
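
A minimal sketch of reading the text sub‑dataset, assuming each line of a JSONL file holds one object with the id and content fields described above. The function name iter_records is hypothetical, and an in‑memory sample stands in for a real file so the snippet runs on its own. Streaming line by line avoids loading a multi‑gigabyte file into memory.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line; raise on a missing field."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "id" not in record or "content" not in record:
            raise ValueError(f"line {line_no}: missing 'id' or 'content'")
        yield record


# In-memory stand-in for a JSONL file (assumption: real files use UTF-8).
sample = io.StringIO(
    '{"id": "BkORdv3xK7IA0HG7pccr", "content": "#### 人生\\n时光倏忽即逝，"}\n'
    "\n"
)
records = list(iter_records(sample))
```

For a file on disk, the same function could be called as iter_records(open(path, encoding="utf-8")), where path is whichever file you downloaded.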

Intern·WanJuan 1.0 – Image‑Text Sub‑Dataset

Overview

Derived from public webpages, the image‑text sub‑dataset consists of documents in which images and text are interleaved. It comprises over 22 million documents (≈140 GB excluding image files) covering news events, people, natural scenery, social life, etc. Images are referenced by URL rather than bundled with the data.

Example

{
    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [{
        "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
        "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
        "caption": "",
        "width": "400",
        "height": "266"
    }],
    "content": "![](http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg)\n..."
}
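
Because images are given as URLs, a consumer has to fetch them separately. The sketch below shows one way to parse img_list and check fetched bytes against the sha256 field. It assumes that sha256 is the hex digest of the raw image bytes and that width and height are decimal strings, as in the example above; neither is confirmed by the source. ImageRef, parse_images, and matches_digest are hypothetical names, and no network access is performed.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRef:
    url: str
    sha256: str
    caption: str
    width: int
    height: int


def parse_images(record: dict) -> List[ImageRef]:
    """Convert the img_list entries of one record into typed objects."""
    return [
        ImageRef(
            url=img["url"],
            sha256=img["sha256"],
            caption=img.get("caption", ""),
            width=int(img["width"]),    # stored as a string in the example
            height=int(img["height"]),
        )
        for img in record.get("img_list", [])
    ]


def matches_digest(data: bytes, ref: ImageRef) -> bool:
    """Assumption: ref.sha256 is the hex SHA-256 of the image file bytes."""
    return hashlib.sha256(data).hexdigest() == ref.sha256.lower()


record = {
    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [{
        "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
        "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
        "caption": "",
        "width": "400",
        "height": "266",
    }],
}
images = parse_images(record)
```

After downloading the bytes for images[0].url with the HTTP client of your choice, matches_digest(data, images[0]) indicates whether the file still matches the recorded digest, which helps detect images that have changed or been replaced since the dataset was built.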

Intern·WanJuan 1.0 – Video Sub‑Dataset

Overview

The video sub‑dataset originates from China Media Group and Shanghai Media Group, containing diverse program videos. Over 1 000 video files (≈900 GB) span domains such as military, arts, sports, nature, documentary, science education, etc.

Example

{
    "id": "video_id_123",
    "content": "这是一段关于历史纪录片的视频，详细介绍了中国的古代文明和历史事件。"
}

Download Links

To download the full dataset, visit: https://opendatalab.org.cn/WanJuan1.0

License

Intern·WanJuan 1.0 is released under the CC BY 4.0 license. You may share and adapt the dataset provided you give appropriate attribution, link to the license, and indicate if changes were made.

Special Notes

Some subsets may be governed by additional licenses. Please review the relevant documentation before use.
