High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

IBM/doc2dial

Doc2dial is a document-grounded, goal-oriented dialogue dataset containing more than 4,500 annotated dialogues, averaging about 14 turns each, based on over 450 documents from four domains. Compared with earlier document-based dialogue corpora, Doc2dial covers a wider range of information-seeking scenarios. Supported tasks include question answering, and the dataset is monolingual (English). It is organized into three parts (dialogue, document, and reading comprehension), each with detailed field descriptions.

huggingface

CodeFeedback-Python105K

Python Programming

Question Answering

This dataset is a subset of the `m-a-p/CodeFeedback-Filtered-Instruction` dataset, made up of the 104,848 samples written in Python. Each sample has two string features, 'query' and 'response'. The dataset has a single training split containing all 104,848 samples. It is suitable for question-answering tasks, is in English, and falls in the 100,000 to 1,000,000 sample size range.

huggingface
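The subset described above (Python-only samples reduced to 'query' and 'response') can be illustrated with a minimal sketch. It runs on a few in-memory records rather than the real data. The `lang` and `answer` field names on the source records, and the helper `select_python_samples`, are assumptions made for illustration; the listing does not state how the subset was actually produced.

```python
from typing import Dict, Iterable, List


def select_python_samples(
    records: Iterable[Dict[str, str]],
    lang_field: str = "lang",      # assumed name of the language tag
    answer_field: str = "answer",  # assumed name of the source answer field
) -> List[Dict[str, str]]:
    """Keep Python-tagged records and reduce them to 'query'/'response'."""
    subset = []
    for rec in records:
        # Case-insensitive match on the (hypothetical) language tag.
        if rec.get(lang_field, "").lower() == "python":
            subset.append({"query": rec["query"], "response": rec[answer_field]})
    return subset


# Toy records standing in for the source dataset (made up for this sketch).
source = [
    {"query": "Reverse a string.", "answer": "s[::-1]", "lang": "python"},
    {"query": "Sum a vector.", "answer": "std::accumulate(...)", "lang": "cpp"},
    {"query": "Read a file.", "answer": "open(path).read()", "lang": "Python"},
]

subset = select_python_samples(source)
print(len(subset), sorted(subset[0].keys()))
```

Every record in the result carries only the two string features named in the description, which is the shape a question-answering pipeline would consume.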
