High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

g-ronimo/oasst2_top4k_en

The dataset's primary feature is messages, whose entries each carry the sub-features content and role. It is split into a training set of 4,000 samples and a test set of 400 samples. The data were selected from the top-ranked dialogues in OpenAssistant/oasst2, then deduplicated and filtered by similarity (long answers with similarity > 0.8 were excluded). The dataset contains English content only and was produced with a dedicated script.
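The deduplication and similarity-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's script: the similarity measure (`difflib.SequenceMatcher`), the function names `filter_similar` and `dedupe_and_filter`, and the choice to apply the filter to all answers rather than only "long" ones are assumptions, since the listing states only the 0.8 threshold.

```python
from difflib import SequenceMatcher


def filter_similar(answers, threshold=0.8):
    """Keep an answer only if its similarity to every kept answer is <= threshold.

    Assumption: similarity is the SequenceMatcher ratio in [0, 1]; the original
    script's measure is not stated in the listing.
    """
    kept = []
    for text in answers:
        if all(SequenceMatcher(None, text, k).ratio() <= threshold for k in kept):
            kept.append(text)
    return kept


def dedupe_and_filter(answers, threshold=0.8):
    """Drop exact duplicates (preserving order), then drop near-duplicates."""
    unique = list(dict.fromkeys(answers))
    return filter_similar(unique, threshold)


if __name__ == "__main__":
    sample = [
        "The capital of France is Paris.",
        "The capital of France is Paris.",   # exact duplicate
        "The capital of France is Paris!",   # near duplicate
        "Photosynthesis converts light into chemical energy.",
    ]
    for answer in dedupe_and_filter(sample):
        print(answer)
```

The pairwise comparison is quadratic in the number of answers, which is acceptable for a few thousand samples; a larger corpus would likely need an approximate method instead.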

hugging_face


Education Dialogue Dataset

Education Dialogue

Dialogue Generation

The Education Dialogue dataset comprises teacher-student dialogues generated by Gemini Ultra. The teacher is prompted to teach a specific topic, while the student is prompted with a set of learning preferences. The dataset includes 40,000 training examples and 7,234 test examples, each consisting of a complete teacher-student conversation plus metadata on the topic and the teacher and student preferences.

github


noobmaster29/Verified-Camel-zh

Dialogue Generation

Multidisciplinary QA

This is a Chinese version of the Verified-Camel dataset, translated directly with GPT-4. It covers dialogue, question answering, and text generation tasks in English and Chinese, with labels spanning physics, chemistry, mathematics, biology, culture, and logic. The dataset size is under 1K.

hugging_face
