JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Sort:

Browse by Category

Weather Captioned Dataset

Weather Data
Text Generation

Weather Captioned Dataset is a multimodal time‑series text dataset that includes weather data from the Max‑Planck‑Institute for Biogeochemistry, Jena, and forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.

github
View Details

WikiBio (wikipedia biography dataset)

Text Generation
Biography Data

The dataset collects 728,321 Wikipedia biographies, intended to evaluate text generation algorithms. Each article provides the opening paragraph and infobox (both tokenized). It is used to assess algorithms that generate text from structured data, particularly in the biography domain.

github
View Details

NeviduJ/Sample_WritingPrompts

Natural Language Processing
Text Generation

The dataset contains three primary features: prompt and story, both of string type. The dataset is split into training, validation, and test sets with 1,400, 200, and 400 examples respectively. Download size is 4,002,221 bytes, total size 6,296,928 bytes.

hugging_face
View Details

Gutenberg Poetry Corpus

Poetry
Text Generation

This is a poetry corpus extracted from Project Gutenberg, containing approximately three million lines of poetry, particularly suitable for creative computational poetry text generation applications.

github
View Details

swulling/gsm8k_chinese

Mathematical Word Problems
Text Generation

The dataset contains Chinese and English math word problems with answers, suitable for text generation tasks, especially the generation of math application problems. It is split into a training set (7,473 samples) and a test set (1,319 samples). Features include question, answer, Chinese question, and answer‑only fields.

hugging_face
View Details

btyt7/the-italian-cook-book

Cooking
Text Generation

The dataset named Gutenberg: The Italian Cook Book primarily involves text generation tasks, with content related to food and recipes, in English, and its size is less than 1K.

hugging_face
View Details

chinese_porn_novel

Text Generation
Adult Content

The xbookcn_short_story dataset contains Chinese short stories for text generation tasks. Each story is split into multiple chunks, and the Qwen‑instruct model generates four summaries of varying lengths. Features include source, category, title, content, content length, URL, and four summaries. The dataset size ranges from 100 MB to 1 GB; the training set comprises 627,195 samples.

huggingface
View Details

p208p2002/wudao

Large-Scale Text Data
Text Generation

悟道(WuDao)数据集是一个用于文本生成任务的大型数据集,包含超过1万亿个token。数据集大小约为125GB(压缩为.parquet格式),对应悟道220G版本。数据集包含多种类别,如科技、经济、娱乐等,共计59100001条数据。使用时需引用原作者信息。

hugging_face
View Details

sardinelab/MT-pref

Text Generation
Text Evaluation

The dataset contains multiple fields such as prompt, chosen (selected response), rejected (rejected response), best_response (best response), each with its respective data type. The dataset is split into a training set comprising 15,798 samples, with a total size of 26,999,980.39076906 bytes.

hugging_face
View Details

zhoukz/TinyStories-Qwen

Text Generation
Chinese Stories

A Chinese story dataset generated using Qwen series models, modeled after the TinyStories dataset. All data are AI‑generated; the dataset is unfiltered and does not guarantee uniform distribution, safety, harmlessness, or any other properties. The seed information used for generation was randomly selected without any specific meaning.

hugging_face
View Details

llamafactory/glaive_toolcall_zh

Text Generation
Question Answering Systems

This dataset is suitable for text generation and question‑answering tasks, primarily in Chinese. It contains two main fields, `conversations` and `tools`; `conversations` is a list of objects with string fields `from` and `value`, and `tools` is a string field. The dataset size ranges from 1K to 10K entries and is released under the Apache 2.0 license. It can be used in LLaMA Factory by specifying `dataset: glaive_toolcall_zh`.

hugging_face
View Details

OpenAssistant/oasst2

Dialogue Systems
Text Generation

The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.

hugging_face
View Details

erhwenkuo/alpaca-data-gpt4-chinese-zhtw

Text Generation
Model Fine-tuning

The dataset named alpaca-data-gpt4-chinese-zhtw contains traditional Chinese instruction‑following data generated by GPT‑4 for fine‑tuning large language models. The dataset originates from a GitHub repository and is a Chinese translation of the original English version. It comprises 52 K instruction‑following entries, formatted like the Alpaca dataset, but with outputs generated by GPT‑4. The three primary fields are: instruction (task description), input (optional task context or input), and output (GPT‑4‑generated answer). Compared with the original Alpaca dataset, this version leverages GPT‑4 for response generation, resulting in higher quality and longer responses. The dataset is suitable for text generation, dialogue, and question‑answering tasks.

hugging_face
View Details