Explore high-quality datasets for your AI and machine learning projects.
Weather Captioned Dataset is a multimodal time‑series text dataset that includes weather data from the Max‑Planck‑Institute for Biogeochemistry, Jena, and forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.
The dataset comprises 728,321 Wikipedia biographies and is intended for evaluating text-generation algorithms. Each article provides the tokenized opening paragraph and infobox. It is used to assess algorithms that generate text from structured data, particularly in the biography domain.
The dataset contains two primary features, prompt and story, both of string type. It is split into training, validation, and test sets with 1,400, 200, and 400 examples respectively. The download size is 4,002,221 bytes; the total size is 6,296,928 bytes.
This poetry corpus, extracted from Project Gutenberg, contains approximately three million lines of poetry and is particularly suitable for computational creative text generation.
The dataset contains Chinese and English math word problems with answers, suitable for text-generation tasks, especially generating math word problems. It is split into a training set (7,473 samples) and a test set (1,319 samples). Features include the question, the answer, a Chinese version of the question, and an answer-only field.
The Gutenberg: The Italian Cook Book dataset targets text-generation tasks. Its content, in English, covers food and recipes, and the dataset contains fewer than 1K entries.
The xbookcn_short_story dataset contains Chinese short stories for text-generation tasks. Each story is split into multiple chunks, and a Qwen-instruct model generated four summaries of varying lengths for each. Features include the source, category, title, content, content length, URL, and the four summaries. The dataset size is between 100 MB and 1 GB; the training set comprises 627,195 samples.
The WuDao dataset is a large corpus for text-generation tasks containing more than one trillion tokens. It is roughly 125 GB when compressed to .parquet format, corresponding to the WuDao 220G release. The data spans many categories, such as technology, economics, and entertainment, totaling 59,100,001 records. Users must cite the original authors.
The dataset contains the fields prompt, chosen (the preferred response), rejected (the rejected response), and best_response, each with its respective data type. It has a single training split of 15,798 samples, with a total size of 26,999,980 bytes.
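A record with these four fields can be sketched as follows; only the field names come from the description above, and the text values are invented for illustration:

```python
# Hypothetical preference record -- field names (prompt, chosen,
# rejected, best_response) follow the dataset description; the
# values are invented for illustration.
record = {
    "prompt": "Explain what a binary search does.",
    "chosen": "Binary search halves the sorted search range each step.",
    "rejected": "It checks every element one by one.",
    "best_response": "Binary search halves the sorted search range each step.",
}

# Records like this are commonly consumed as
# (prompt, preferred, dispreferred) triples for preference training.
pair = (record["prompt"], record["chosen"], record["rejected"])
```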
A Chinese story dataset generated with Qwen-series models, modeled after the TinyStories dataset. All data are AI-generated; the dataset is unfiltered and does not guarantee uniform distribution, safety, harmlessness, or any other properties. The seed information used for generation was chosen at random and carries no specific meaning.
This dataset is suitable for text generation and question‑answering tasks, primarily in Chinese. It contains two main fields, `conversations` and `tools`; `conversations` is a list of objects with string fields `from` and `value`, and `tools` is a string field. The dataset size ranges from 1K to 10K entries and is released under the Apache 2.0 license. It can be used in LLaMA Factory by specifying `dataset: glaive_toolcall_zh`.
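A minimal sketch of what one record with this schema could look like: `conversations` as a list of objects with string `from`/`value` fields, and `tools` as a string. The role names and the tool definition here are assumptions for illustration, not taken from the actual dataset:

```python
import json

# Hypothetical record matching the described schema. Only the
# top-level shape (`conversations` list with `from`/`value` strings,
# `tools` as a string) comes from the description; everything else
# is invented for illustration.
record = {
    "conversations": [
        {"from": "human", "value": "今天北京的天气怎么样?"},
        {"from": "function_call",
         "value": '{"name": "get_weather", "arguments": {"city": "北京"}}'},
    ],
    "tools": json.dumps(
        [{"name": "get_weather", "description": "查询城市天气",
          "parameters": {"city": "string"}}],
        ensure_ascii=False,
    ),
}

# Validate the shape: every turn has string `from` and `value` fields,
# and `tools` parses as JSON.
assert all(isinstance(t["from"], str) and isinstance(t["value"], str)
           for t in record["conversations"])
tools = json.loads(record["tools"])
```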
The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.
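The tree-of-messages structure described above can be sketched in a few lines; the field names `role`, `text`, and `replies` and the message contents here are assumptions for illustration, not the dataset's exact JSON schema:

```python
# Minimal sketch of a conversation tree: a root prompt with nested
# replies, each message carrying a "prompter" or "assistant" role.
# Field names and contents are illustrative assumptions.
tree = {
    "role": "prompter",
    "text": "What is a message tree?",
    "replies": [
        {
            "role": "assistant",
            "text": "A prompt with nested replies.",
            "replies": [
                {"role": "prompter", "text": "Thanks!", "replies": []},
            ],
        },
    ],
}

def flatten(node):
    """Depth-first list of (role, text) pairs for one conversation tree."""
    out = [(node["role"], node["text"])]
    for child in node["replies"]:
        out.extend(flatten(child))
    return out
```

Walking the tree depth-first like this yields one linear conversation thread per root-to-leaf path.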
The alpaca-data-gpt4-chinese-zhtw dataset contains traditional Chinese instruction-following data generated by GPT-4 for fine-tuning large language models. The dataset originates from a GitHub repository and is a Chinese translation of the original English version. It comprises 52K instruction-following entries, formatted like the Alpaca dataset, but with outputs generated by GPT-4. The three primary fields are: instruction (task description), input (optional task context or input), and output (GPT-4-generated answer). Compared with the original Alpaca dataset, this version uses GPT-4 for response generation, yielding higher-quality and longer responses. The dataset is suitable for text generation, dialogue, and question-answering tasks.
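The three-field Alpaca layout can be sketched as a record plus a simple prompt renderer; the field names follow the description above, while the example content and the template wording are assumptions for illustration:

```python
# Hypothetical Alpaca-style record: the field names (instruction,
# input, output) follow the dataset description; the traditional
# Chinese content is invented for illustration.
record = {
    "instruction": "將下列句子翻譯成英文。",
    "input": "今天天氣很好。",
    "output": "The weather is nice today.",
}

def to_prompt(r):
    """Render a record into one training string, Alpaca-style.
    The template wording is an assumption, not the dataset's own."""
    if r["input"]:
        return (f"Instruction: {r['instruction']}\n"
                f"Input: {r['input']}\n"
                f"Response: {r['output']}")
    return f"Instruction: {r['instruction']}\nResponse: {r['output']}"
```

Records with an empty `input` field simply drop the Input line, which is the usual convention for this format.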