Explore high-quality datasets for your AI and machine learning projects.
Each sample has a messages feature whose entries carry two sub-features: content and role. The dataset is split into a training set of 4,000 samples and a test set of 400 samples. The samples were selected from top-ranked dialogues in OpenAssistant/oasst2, then deduplicated and similarity-filtered (long answers with similarity > 0.8 were excluded). The dataset contains only English content and was generated with a script.
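The deduplication and similarity filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script the dataset authors used: the similarity metric (here `difflib.SequenceMatcher`), the `MIN_LONG_ANSWER_CHARS` cutoff for what counts as a "long" answer, the helper names, and the assumption that `messages` is a list of `{"role", "content"}` dictionaries are all hypothetical. Only the 0.8 threshold comes from the description.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8    # stated in the dataset description
MIN_LONG_ANSWER_CHARS = 200   # hypothetical cutoff for a "long" answer


def last_answer(sample):
    """Return the content of the final assistant message, or '' if there is none."""
    for message in reversed(sample["messages"]):
        if message["role"] == "assistant":
            return message["content"]
    return ""


def filter_similar(samples):
    """Drop exact duplicates, then drop long answers too similar to a kept one."""
    kept = []
    seen = set()
    for sample in samples:
        # Exact-duplicate check on the full conversation.
        key = tuple((m["role"], m["content"]) for m in sample["messages"])
        if key in seen:
            continue
        seen.add(key)

        answer = last_answer(sample)
        # Similarity check applies only to long answers (assumed definition).
        # autojunk=False keeps the ratio reliable for strings over 200 chars.
        too_similar = len(answer) >= MIN_LONG_ANSWER_CHARS and any(
            SequenceMatcher(None, answer, last_answer(k), autojunk=False).ratio()
            > SIMILARITY_THRESHOLD
            for k in kept
        )
        if not too_similar:
            kept.append(sample)
    return kept
```

Pairwise comparison like this is quadratic in the number of samples, which is workable at a few thousand dialogues; a larger corpus would likely call for an approximate method such as MinHash instead.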
The Education Dialogue dataset comprises teacher‑student dialogues generated by Gemini Ultra. Teachers are prompted to teach a specific topic, while students are prompted with their learning preferences. The dataset includes 40,000 training examples and 7,234 test examples, each a complete teacher‑student conversation with metadata on the topic and on the teacher and student preferences.
This is a Chinese version of the Verified‑Camel dataset, translated directly with GPT‑4. It covers dialogue, question answering, and text generation tasks in English and Chinese, with labels spanning physics, chemistry, mathematics, biology, culture, and logic. The dataset contains fewer than 1,000 samples.