High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Sort:

Browse by Category

allenai/WildChat

WildChat是一个包含65万条人类用户与ChatGPT之间对话的数据集。该数据集通过向在线用户免费提供OpenAI的GPT-3.5和GPT-4访问权限收集而成。数据集涵盖了多种用户与聊天机器人的交互场景，如模糊的用户请求、代码转换、话题转换、政治讨论等。WildChat既可作为指令微调的数据集，也可作为研究用户行为的宝贵资源。需要注意的是，该数据集包含有毒的用户输入和ChatGPT的响应，并提供了一个无毒的子集。数据集支持多语言，包含66种语言，并且已经过脱敏处理。

hugging_face

View Details

IBM/doc2dial

Dialogue Systems

Question Answering

Doc2dial is a document‑grounded goal‑oriented dialogue dataset containing more than 4,500 annotated dialogues (approximately 14 turns per dialogue) based on over 450 documents from four domains. Compared with earlier document‑based dialogue corpora, Doc2dial covers a wider range of information‑seeking scenarios. Supported tasks include question answering, and the dataset is monolingual (English). Its structure comprises dialogue, document, and reading‑comprehension domains, each with detailed field descriptions.

hugging_face

View Details

Cornell Movie Dialogs Corpus

Natural Language Processing

Dialogue Systems

The Cornell Movie Dialogs Corpus is a collection of fictional dialogues extracted from movie scripts. Due to its richness and diversity, it is well suited for training and evaluating dialogue agents.

github

View Details

pfb30/multi_woz_v22

Dialogue Systems

Natural Language Processing

The Multi‑Domain Wizard‑of‑Oz (MultiWOZ) dataset is a fully annotated collection of written human‑human dialogues spanning multiple domains and topics. Version 2.1 fixes numerous annotation errors from the original release, while version 2.2 further corrects dialogue state errors, redefines the ontology, and introduces standardized slot‑span annotations. The dataset supports tasks such as dialogue modeling, intent‑state tracking, and dialogue act prediction. It is split into training, validation, and test sets containing 8,437, 1,000, and 1,000 dialogues respectively.

hugging_face

View Details

OpenAssistant/oasst2

Dialogue Systems

Text Generation

The OpenAssistant Conversations Release 2 (OASST2) dataset contains message trees, each rooted by an initial prompt message and potentially multiple child messages as replies, which themselves may have further replies. All messages have a role attribute, either "assistant" or "prompter". The dataset includes multilingual messages and provides detailed JSON examples illustrating the message and conversation‑tree structure. It also supplies primary file information, statistics, and instructions on loading the dataset with HuggingFace Datasets.

hugging_face

View Details