
p208p2002/wudao

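The WuDao (悟道) dataset is a large dataset for text generation tasks, described as containing more than 1 trillion tokens. It is about 125GB in size (compressed as .parquet) and corresponds to the 220G release of WuDao. It spans many categories, such as technology, economy, and entertainment, for a total of 59,100,001 records. Cite the original authors when using it.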

Source: hugging_face
Created: Nov 28, 2025
Updated: May 9, 2024
Overview

Dataset description and usage context

WuDao (悟道) Dataset

Basic information

  • Language: Chinese
  • Task category: text generation
  • Data scale: greater than 1TB
  • Configuration: default
    • Data files:
      • Split: train
      • Path: *.parquet

Dataset description

  • Size: about 125GB (compressed as .parquet), corresponding to the 220G release of WuDao.

  • Citation:

    @misc{c6a3fe684227415a9db8e21bac4a15ab,
      author = {Zhao Xue and Hanyu Zhao and Sha Yuan and Yequan Wang},
      title = {{WuDaoCorpora Text}},
      year = 2022,
      month = dec,
      publisher = {Science Data Bank},
      version = {V1},
      doi = {10.57760/sciencedb.o00126.00004},
      url = {https://doi.org/10.57760/sciencedb.o00126.00004}
    }

Usage

    from datasets import load_dataset

    # Stream the train split instead of downloading all ~125GB of parquet files first
    dataset = load_dataset("p208p2002/wudao", streaming=True, split="train")
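Because the split is loaded in streaming mode, records arrive lazily and can be inspected without fetching the whole dataset. The sketch below is one way to peek at the first few records. The helper names `take_first` and `preview_wudao` are hypothetical, and the record schema is not documented on this page, so the code makes no assumption about field names.

```python
from itertools import islice
from typing import Any, Iterable, List


def take_first(records: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def preview_wudao(n: int = 3) -> List[Any]:
    """Stream the train split and return its first n records.

    Requires the `datasets` package and network access.
    Nothing is downloaded until this function is called.
    """
    # Imported lazily so that take_first stays usable offline.
    from datasets import load_dataset

    stream = load_dataset("p208p2002/wudao", streaming=True, split="train")
    return take_first(stream, n)
```

Calling `preview_wudao(3)` and printing the keys of each returned record is a quick way to discover the actual column names before writing any filtering logic.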

Category statistics

    {
      "_total": 59100001,
      "豆瓣话题": 209027,
      "科技": 1278068,
      "经济": 1096215,
      "汽车": 1368193,
      "娱乐": 1581947,
      "农业": 1129758,
      "军事": 420949,
      "社会": 446228,
      "游戏": 754703,
      "教育": 1133453,
      "体育": 660858,
      "旅行": 821573,
      "国际": 630386,
      "房产": 387786,
      "文化": 710648,
      "法律": 36585,
      "股票": 1205,
      "博客": 15467790,
      "日报": 16971,
      "评论": 13867,
      "孕育常识": 48291,
      "健康": 15291,
      "财经": 54656,
      "医学问答": 314771,
      "资讯": 1066180,
      "科普文章": 60581,
      "百科": 27273280,
      "酒业": 287,
      "经验": 609195,
      "新闻": 846810,
      "小红书攻略": 185379,
      "生活": 23,
      "网页文本": 115830,
      "观点": 1268,
      "海外": 4,
      "户外": 5,
      "美容": 7,
      "理论": 247,
      "天气": 540,
      "文旅": 2999,
      "信托": 62,
      "保险": 70,
      "水利资讯": 17,
      "时尚": 1123,
      "亲子": 39,
      "百家号文章": 335591,
      "黄金": 216,
      "党建": 1,
      "期货": 330,
      "快讯": 41,
      "国内": 15,
      "国学": 614,
      "公益": 15,
      "能源": 7,
      "创新": 6
    }
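The category counts are heavily skewed, which matters when sampling from the dataset. As a rough illustration, the sketch below turns counts into shares of `_total`. The function name `category_shares` is hypothetical, and `stats_subset` holds only four of the categories listed above, copied by hand for brevity.

```python
from typing import Dict


def category_shares(stats: Dict[str, int]) -> Dict[str, float]:
    """Map each category to its fraction of `_total`, largest first."""
    total = stats["_total"]
    shares = {
        name: count / total
        for name, count in stats.items()
        if name != "_total"  # the total is a summary row, not a category
    }
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# A hand-copied subset of the statistics above, not the full table.
stats_subset = {
    "_total": 59100001,
    "百科": 27273280,  # encyclopedia
    "博客": 15467790,  # blog
    "娱乐": 1581947,   # entertainment
    "法律": 36585,     # law
}

for name, share in category_shares(stats_subset).items():
    print(f"{name}: {share:.1%}")
```

On these figures, 百科 (encyclopedia) and 博客 (blog) together account for roughly 72% of all records, so uniform sampling over records will mostly return those two categories.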
