JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.


smoltalk-chinese

Language Models
Chinese Language Processing

smoltalk‑chinese is a Chinese fine‑tuning dataset modeled on SmolTalk, designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data and covers more than 700,000 entries, drawn from several sources: tasks adapted from magpie‑ultra, other SmolTalk tasks, simulated everyday dialogues, and mathematics problems from the Chinese version of Math23K. Generation followed strict standards to ensure quality and diversity, and experiments show that models fine‑tuned on smoltalk‑chinese achieve significant gains on multiple metrics.

Hugging Face

LiveBench

Language Models
Benchmarking

LiveBench is a large‑language‑model (LLM) benchmark created jointly by Abacus.AI, NYU, Nvidia, UMD, and USC. It contains 18 tasks spanning mathematics, programming, reasoning, language understanding, instruction following, and data analysis. LiveBench's questions are sourced from up‑to‑date materials such as recent math competitions, arXiv papers, news articles, and datasets, and answers are automatically scored against objective facts, eliminating the need for LLM or human judges. The benchmark aims to address data contamination issues in traditional evaluations, ensuring fairness and validity.

arXiv
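Because every LiveBench answer is checked against objective ground truth, grading reduces to string comparison rather than judging. A minimal sketch of that idea (the normalization rules here are an illustrative assumption, not LiveBench's actual grading code):

```python
# Illustrative sketch of LiveBench-style automatic scoring: model
# answers are checked against objective ground-truth answers, so no
# LLM or human judge is needed. The normalization below is an
# assumption, not LiveBench's actual grading logic.

def exact_match_score(predictions, references):
    """Return the fraction of predictions that match the reference
    answer after lowercasing and whitespace normalization."""
    def norm(s):
        return " ".join(s.strip().lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_score(["  42 ", "Paris"], ["42", "paris"]))  # 1.0
```

Scoring this way is deterministic and cheap, which is what lets the benchmark refresh its questions frequently without re-running a judge model.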

wangrongsheng/RerankerLLM-Dataset

Information Retrieval
Language Models

This dataset supports training and testing of re‑ranking large language models (LLMs). It contains queries sampled from the MS MARCO dataset along with passage rankings predicted by ChatGPT. Files include 10K‑ and 100K‑scale query sets and their corresponding ChatGPT predictions.

Hugging Face
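One common way to use predicted rankings like these is to distill them into pairwise preference examples for reranker training. A sketch of that conversion (the record layout and field names are illustrative assumptions, not the dataset's actual file format):

```python
# Sketch: converting a ChatGPT-predicted passage ranking for one
# MS MARCO query into pairwise training examples for a reranker.
# The field names ("query", "positive", "negative") are illustrative
# assumptions, not the dataset's actual schema.

def ranking_to_pairs(query, ranked_passages):
    """For each pair (i, j) with passage i ranked above passage j,
    emit a training example preferring i over j."""
    pairs = []
    for i, pos in enumerate(ranked_passages):
        for neg in ranked_passages[i + 1:]:
            pairs.append({"query": query, "positive": pos, "negative": neg})
    return pairs

pairs = ranking_to_pairs("what is ms marco", ["p1", "p2", "p3"])
print(len(pairs))  # 3 ranked passages -> 3 preference pairs
```

A ranking of n passages yields n·(n−1)/2 pairs, so longer predicted lists produce substantially more supervision per query.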

ChineseWebText2.0

Natural Language Processing
Language Models

ChineseWebText 2.0 is a large‑scale, high‑quality Chinese web‑text dataset containing 3.8 TB of data. Each text is accompanied by a quality score, single‑label and multi‑label domain tags, and a toxicity classification with scores, enabling LLM researchers to select data against custom quality thresholds. The dataset was constructed and filtered using the MDFG‑tool, ensuring high data quality and multidimensional fine‑grained annotation.

Hugging Face
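The per-text scores make threshold-based selection straightforward. A minimal sketch of filtering by quality and toxicity (the field names are illustrative assumptions, not the dataset's actual schema):

```python
# Sketch: selecting web-text records by quality and toxicity
# thresholds, in the spirit of ChineseWebText 2.0's per-text scores.
# Field names ("text", "quality_score", "toxicity_score", "domain")
# are illustrative assumptions, not the dataset's actual schema.

def select_records(records, min_quality=0.9, max_toxicity=0.1):
    """Keep records that clear the quality bar and stay under the
    toxicity ceiling."""
    return [
        r for r in records
        if r["quality_score"] >= min_quality
        and r["toxicity_score"] <= max_toxicity
    ]

sample = [
    {"text": "...", "quality_score": 0.95, "toxicity_score": 0.02, "domain": "news"},
    {"text": "...", "quality_score": 0.60, "toxicity_score": 0.01, "domain": "forum"},
    {"text": "...", "quality_score": 0.97, "toxicity_score": 0.40, "domain": "forum"},
]
print(len(select_records(sample)))  # only the first record passes
```

The domain tags support the same pattern: add a predicate on `r["domain"]` to carve out a domain-specific subset at a chosen quality level.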

MBZUAI/VideoInstruct-100K

Video Understanding
Language Models

VideoInstruct100K is a high‑quality video‑dialogue dataset created through human‑in‑the‑loop and semi‑automatic annotation techniques. The Q&A content covers video summarization, description‑based question answering (exploring spatial, temporal, relational, and reasoning concepts), and creative/generative question answering.

Hugging Face

ToxicityPrompts/RealToxicityPrompts

Toxicity Evaluation
Language Models

The RealToxicityPrompts dataset contains 100K sentence fragments extracted from the web, intended to help researchers further address toxic degeneration in neural language models. Each instance includes a prompt and its metadata, with toxicity scores generated via the Perspective API. The dataset is built from the OpenWebText corpus, composed of English web pages extracted from Reddit URLs, and was stratified‑sampled across toxicity ranges. Language: English. License: Apache 2.0.

Hugging Face
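Stratified sampling across toxicity ranges ensures the final set is not dominated by the (far more common) low-toxicity text. A sketch of the idea (the bin edges, field name, and per-bin count are illustrative assumptions, not the dataset's actual construction code):

```python
import random

# Sketch: stratified sampling across toxicity-score ranges, as
# described for RealToxicityPrompts. The bin edges, the "toxicity"
# field name, and the per-bin count are illustrative assumptions.

def stratified_sample(prompts, per_bin, bins=(0.0, 0.25, 0.5, 0.75, 1.01), seed=0):
    """Draw up to `per_bin` prompts from each toxicity range [lo, hi)."""
    rng = random.Random(seed)
    chosen = []
    for lo, hi in zip(bins, bins[1:]):
        bucket = [p for p in prompts if lo <= p["toxicity"] < hi]
        rng.shuffle(bucket)
        chosen.extend(bucket[:per_bin])
    return chosen

prompts = [{"toxicity": t} for t in (0.05, 0.1, 0.3, 0.6, 0.9)]
print(len(stratified_sample(prompts, per_bin=1)))  # one draw per non-empty bin
```

Without the stratification step, a uniform sample of web text would contain very few high-toxicity prompts, making the upper score ranges too sparse to evaluate.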

shareAI/doc2markmap

Language Models
Mind Maps

This dataset is designed to improve the ability of small‑parameter language models to convert articles into markmaps (markdown‑based mind maps). The source documents were collected from WeChat public accounts and CSDN, then processed through multiple rounds of transformation and cleaning using large language models and complex prompting. The dataset is intended for research and educational purposes only.

Hugging Face

chiayewken/bamboogle

Language Models
Natural Language Processing

The Bamboogle dataset contains data for studying the compositionality gap in language models. It includes two features—question and answer—and consists of a test split with 125 examples, totalling 10,747 bytes. The dataset is associated with the paper "Measuring and Narrowing the Compositionality Gap in Language Models" and is released under the MIT License.

Hugging Face

52AI/TinyStoriesZh

Language Models
Children's Stories

The TinyStories dataset is used to explore the capability boundaries of small language models (LMs), specifically how small LMs can still tell stories fluently. The stories were generated by GPT‑3.5 and GPT‑4, with difficulty limited to a level understandable by 3–4‑year‑old children. TinyStoriesZh provides Chinese versions machine‑translated from the English originals.

Hugging Face