Explore high-quality datasets for your AI and machine learning projects.
Weather Captioned Dataset is a multimodal time‑series text dataset that includes weather data from the Max‑Planck‑Institute for Biogeochemistry, Jena, and forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.
The dataset comprises 728,321 Wikipedia biographies and is intended for evaluating text-generation algorithms. Each article provides the tokenized opening paragraph and infobox. It is used to assess algorithms that generate text from structured data, particularly in the biography domain.
The dataset contains two primary features, prompt and story, both of string type. It is split into training, validation, and test sets with 1,400, 200, and 400 examples respectively. The download size is 4,002,221 bytes; the total size is 6,296,928 bytes.
This poetry corpus, extracted from Project Gutenberg, contains approximately three million lines of poetry and is particularly suitable for computational creative text generation.
The dataset contains Chinese and English math word problems with answers, suitable for text-generation tasks, especially generating math word problems. It is split into a training set (7,473 samples) and a test set (1,319 samples). Features include the question, the answer, a Chinese version of the question, and an answer-only field.
The Gutenberg: The Italian Cook Book dataset targets text-generation tasks. Its content, in English, covers food and recipes, and the dataset contains fewer than 1K entries.
The xbookcn_short_story dataset contains Chinese short stories for text-generation tasks. Each story is split into multiple chunks, and a Qwen-instruct model generated four summaries of varying lengths for each. Features include the source, category, title, content, content length, URL, and the four summaries. The dataset size is between 100 MB and 1 GB; the training set comprises 627,195 samples.
The WuDao dataset is a large corpus for text-generation tasks containing more than one trillion tokens. It is roughly 125 GB when compressed to .parquet format, corresponding to the WuDao 220G release. The data spans many categories, such as technology, economics, and entertainment, totaling 59,100,001 records. Users must cite the original authors.
The dataset contains the fields prompt, chosen (the preferred response), rejected (the rejected response), and best_response, each with its respective data type. It has a single training split of 15,798 samples, with a total size of 26,999,980 bytes.
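A record with these four fields can be sketched as follows; only the field names come from the description above, and the text values are invented for illustration:

```python
# Hypothetical preference record -- field names (prompt, chosen,
# rejected, best_response) follow the dataset description; the
# values are invented for illustration.
record = {
    "prompt": "Explain what a binary search does.",
    "chosen": "Binary search halves the sorted search range each step.",
    "rejected": "It checks every element one by one.",
    "best_response": "Binary search halves the sorted search range each step.",
}

# Records like this are commonly consumed as
# (prompt, preferred, dispreferred) triples for preference training.
pair = (record["prompt"], record["chosen"], record["rejected"])
```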
A Chinese story dataset generated with Qwen-series models, modeled after the TinyStories dataset. All data are AI-generated; the dataset is unfiltered and does not guarantee uniform distribution, safety, harmlessness, or any other properties. The seed information used for generation was chosen at random and carries no specific meaning.
This dataset is suitable for text generation and question‑answering tasks, primarily in Chinese. It contains two main fields, `conversations` and `tools`; `conversations` is a list of objects with string fields `from` and `value`, and `tools` is a string field. The dataset size ranges from 1K to 10K entries and is released under the Apache 2.0 license. It can be used in LLaMA Factory by specifying `dataset: glaive_toolcall_zh`.
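A minimal sketch of what one record with this schema could look like: `conversations` as a list of objects with string `from`/`value` fields, and `tools` as a string. The role names and the tool definition here are assumptions for illustration, not taken from the actual dataset:

```python
import json

# Hypothetical record matching the described schema. Only the
# top-level shape (`conversations` list with `from`/`value` strings,
# `tools` as a string) comes from the description; everything else
# is invented for illustration.
record = {
    "conversations": [
        {"from": "human", "value": "今天北京的天气怎么样?"},
        {"from": "function_call",
         "value": '{"name": "get_weather", "arguments": {"city": "北京"}}'},
    ],
    "tools": json.dumps(
        [{"name": "get_weather", "description": "查询城市天气",
          "parameters": {"city": "string"}}],
        ensure_ascii=False,
    ),
}

# Validate the shape: every turn has string `from` and `value` fields,
# and `tools` parses as JSON.
assert all(isinstance(t["from"], str) and isinstance(t["value"], str)
           for t in record["conversations"])
tools = json.loads(record["tools"])
```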
The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.
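The tree-of-messages structure described above can be sketched in a few lines; the field names `role`, `text`, and `replies` and the message contents here are assumptions for illustration, not the dataset's exact JSON schema:

```python
# Minimal sketch of a conversation tree: a root prompt with nested
# replies, each message carrying a "prompter" or "assistant" role.
# Field names and contents are illustrative assumptions.
tree = {
    "role": "prompter",
    "text": "What is a message tree?",
    "replies": [
        {
            "role": "assistant",
            "text": "A prompt with nested replies.",
            "replies": [
                {"role": "prompter", "text": "Thanks!", "replies": []},
            ],
        },
    ],
}

def flatten(node):
    """Depth-first list of (role, text) pairs for one conversation tree."""
    out = [(node["role"], node["text"])]
    for child in node["replies"]:
        out.extend(flatten(child))
    return out
```

Walking the tree depth-first like this yields one linear conversation thread per root-to-leaf path.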
The alpaca-data-gpt4-chinese-zhtw dataset contains traditional Chinese instruction-following data generated by GPT-4 for fine-tuning large language models. The dataset originates from a GitHub repository and is a Chinese translation of the original English version. It comprises 52K instruction-following entries, formatted like the Alpaca dataset, but with outputs generated by GPT-4. The three primary fields are: instruction (task description), input (optional task context or input), and output (GPT-4-generated answer). Compared with the original Alpaca dataset, this version uses GPT-4 for response generation, yielding higher-quality and longer responses. The dataset is suitable for text generation, dialogue, and question-answering tasks.
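The three-field Alpaca layout can be sketched as a record plus a simple prompt renderer; the field names follow the description above, while the example content and the template wording are assumptions for illustration:

```python
# Hypothetical Alpaca-style record: the field names (instruction,
# input, output) follow the dataset description; the traditional
# Chinese content is invented for illustration.
record = {
    "instruction": "將下列句子翻譯成英文。",
    "input": "今天天氣很好。",
    "output": "The weather is nice today.",
}

def to_prompt(r):
    """Render a record into one training string, Alpaca-style.
    The template wording is an assumption, not the dataset's own."""
    if r["input"]:
        return (f"Instruction: {r['instruction']}\n"
                f"Input: {r['input']}\n"
                f"Response: {r['output']}")
    return f"Instruction: {r['instruction']}\nResponse: {r['output']}"
```

Records with an empty `input` field simply drop the Input line, which is the usual convention for this format.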