Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category: Language Models (showing 9 of 9 datasets)

smoltalk-chinese

Language Models · Chinese Language Processing

smoltalk-chinese is a Chinese fine-tuning dataset modeled on the SmolTalk dataset, designed to provide high-quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, comprising more than 700,000 entries drawn from several sources: tasks adapted from magpie-ultra, other SmolTalk tasks, simulated daily-life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity, and experiments show that models fine-tuned on smoltalk-chinese achieve clear gains across multiple evaluation metrics.

Source: Hugging Face · Updated: Jan 2, 2025

LiveBench

Language Models · Benchmarking

LiveBench is a large‑language‑model (LLM) benchmark created jointly by Abacus.AI, NYU, Nvidia, UMD, and USC. It contains 18 tasks spanning mathematics, programming, reasoning, language understanding, instruction following, and data analysis. LiveBench's questions are sourced from up‑to‑date materials such as recent math competitions, arXiv papers, news articles, and datasets, and answers are automatically scored against objective facts, eliminating the need for LLM or human judges. The benchmark aims to address data contamination issues in traditional evaluations, ensuring fairness and validity.
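
Because answers are scored against objective reference values rather than judged by a model or a human, the scoring step reduces to comparing normalized strings. The sketch below is a hypothetical illustration of that idea, not LiveBench's actual scoring code; the normalization rules are assumptions.

```python
# Hypothetical sketch of ground-truth scoring in the LiveBench style:
# predictions are compared against objective reference answers, so no
# LLM or human judge is needed. The normalization here is an assumption.
def normalize(ans: str) -> str:
    """Lowercase, trim, and collapse whitespace for a fair comparison."""
    return " ".join(ans.lower().split())

def score(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of questions whose normalized answer matches the reference exactly."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(references[qid])
        for qid in references
    )
    return correct / len(references)

refs = {"q1": "42", "q2": "Paris"}
preds = {"q1": " 42 ", "q2": "London"}
print(score(preds, refs))  # 0.5
```

Real LiveBench tasks use per-task scorers (e.g. numeric tolerance for math), but the principle is the same: a deterministic comparison against a known answer.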

Source: arXiv · Updated: Jun 28, 2024

wangrongsheng/RerankerLLM-Dataset

Information Retrieval · Language Models

This dataset supports training and testing of re-ranking large language models (LLMs). It contains queries sampled from the MS MARCO dataset along with rankings predicted by ChatGPT. Files include 10K- and 100K-scale query sets and their corresponding ChatGPT predictions.
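
A consumer of such data ultimately needs to apply a predicted ranking back to the retrieved candidates. The sketch below shows that step under an assumed format (candidate ids listed in predicted order); it is not the dataset's exact schema.

```python
# Hedged sketch: applying an LLM-predicted ranking to a candidate list,
# as in reranker training data built from MS MARCO queries. The layout
# (candidate ids in predicted order) is an assumption about the format.
def rerank(candidates: dict[str, str], predicted_order: list[str]) -> list[str]:
    """Return candidate texts in the order predicted by the reranker,
    keeping any unranked candidates at the end in their original order."""
    ranked = [candidates[cid] for cid in predicted_order if cid in candidates]
    leftover = [text for cid, text in candidates.items() if cid not in predicted_order]
    return ranked + leftover

candidates = {"d1": "passage one", "d2": "passage two", "d3": "passage three"}
print(rerank(candidates, ["d3", "d1"]))
# ['passage three', 'passage one', 'passage two']
```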

Source: Hugging Face · Updated: Apr 5, 2024

ChineseWebText2.0

Natural Language Processing · Language Models

ChineseWebText 2.0 is a large-scale, high-quality Chinese web-text dataset containing 3.8 TB of data. Each text is accompanied by a quality score, single-label and multi-label domain tags, and toxicity classification and scores, enabling LLM researchers to select data against custom quality thresholds. The dataset was constructed and filtered with the MDFG-tool, ensuring high data quality and multidimensional, fine-grained metadata.
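
Threshold-based selection over such per-text metadata might look like the sketch below. The field names ("quality", "toxicity", "domains") are illustrative assumptions, not ChineseWebText 2.0's exact schema.

```python
# Hedged sketch of threshold-based data selection over per-text metadata.
# Field names are illustrative assumptions about the record layout.
def select(records, min_quality=0.9, max_toxicity=0.1, domain=None):
    """Yield texts whose quality/toxicity scores pass the thresholds."""
    for rec in records:
        if rec["quality"] < min_quality:
            continue
        if rec["toxicity"] > max_toxicity:
            continue
        if domain is not None and domain not in rec["domains"]:
            continue
        yield rec["text"]

sample = [
    {"text": "a", "quality": 0.95, "toxicity": 0.01, "domains": ["news"]},
    {"text": "b", "quality": 0.50, "toxicity": 0.01, "domains": ["news"]},
    {"text": "c", "quality": 0.97, "toxicity": 0.40, "domains": ["forum"]},
]
print(list(select(sample, domain="news")))  # ['a']
```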

Source: Hugging Face · Updated: Nov 27, 2024

MBZUAI/VideoInstruct-100K

Video Understanding · Language Models

VideoInstruct100K is a high‑quality video‑dialogue dataset created through human‑in‑the‑loop and semi‑automatic annotation techniques. The Q&A content covers video summarization, description‑based question answering (exploring spatial, temporal, relational, and reasoning concepts), and creative/generative question answering.

Source: Hugging Face · Updated: Sep 29, 2023

ToxicityPrompts/RealToxicityPrompts

Toxicity Evaluation · Language Models

The RealToxicityPrompts dataset contains 100,000 sentence-level prompts extracted from the web, intended to help researchers study and mitigate toxic degeneration in neural language models. Each instance includes a prompt and its metadata, with toxicity scores generated via the Perspective API. The dataset is built from the OpenWebText corpus (English web text extracted from outbound Reddit URLs), and prompts were stratified-sampled across toxicity ranges. Language: English. License: Apache 2.0.

Source: Hugging Face · Updated: May 8, 2024

shareAI/doc2markmap

Language Models · Mind Maps

This dataset is designed to improve the ability of small‑parameter language models to convert articles into markmaps (markdown‑based mind maps). The source documents were collected from WeChat public accounts and CSDN, then processed through multiple rounds of transformation and cleaning using large language models and complex prompting. The dataset is intended for research and educational purposes only.
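
The target of this task, a markmap, is simply a nested tree rendered from markdown headings. The sketch below illustrates that target structure by parsing "#"-style headings into a tree; it is an illustration of the output format, not the dataset's construction pipeline.

```python
# Hedged sketch: parse markdown headings into the nested tree a
# markmap renders. Illustrates the target format, not the dataset pipeline.
def headings_to_tree(markdown: str) -> list[dict]:
    """Turn '#'-style headings into nested {"title", "children"} nodes."""
    root: list[dict] = []
    stack = [(0, root)]  # (heading level, children list to append into)
    for line in markdown.splitlines():
        if not line.startswith("#"):
            continue
        level = len(line) - len(line.lstrip("#"))
        node = {"title": line.lstrip("#").strip(), "children": []}
        while stack[-1][0] >= level:  # climb back up to the parent level
            stack.pop()
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root

doc = "# Topic\n## Part A\n### Detail\n## Part B"
tree = headings_to_tree(doc)
print(tree[0]["title"], [c["title"] for c in tree[0]["children"]])
# Topic ['Part A', 'Part B']
```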

Source: Hugging Face · Updated: Jul 4, 2024

chiayewken/bamboogle

Language Models · Natural Language Processing

The Bamboogle dataset contains data for studying the compositionality gap in language models. It has two fields, question and answer, and consists of a single test split with 125 examples totalling 10,747 bytes. The dataset accompanies the paper "Measuring and Narrowing the Compositionality Gap in Language Models" and is released under the MIT License.
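
The associated paper defines the compositionality gap as the fraction of 2-hop questions where the model answers both sub-questions correctly yet fails the composed question. A minimal sketch of that metric, with an assumed record layout:

```python
# Hedged sketch of the compositionality-gap metric: among questions
# where both sub-questions are answered correctly, the fraction the
# model still misses when the hops are composed. Record layout assumed.
def compositionality_gap(results: list[dict]) -> float:
    """results: [{"sub1_ok": bool, "sub2_ok": bool, "composed_ok": bool}, ...]"""
    both_ok = [r for r in results if r["sub1_ok"] and r["sub2_ok"]]
    if not both_ok:
        return 0.0
    missed = sum(not r["composed_ok"] for r in both_ok)
    return missed / len(both_ok)

results = [
    {"sub1_ok": True,  "sub2_ok": True,  "composed_ok": False},
    {"sub1_ok": True,  "sub2_ok": True,  "composed_ok": True},
    {"sub1_ok": True,  "sub2_ok": False, "composed_ok": False},  # excluded
]
print(compositionality_gap(results))  # 0.5
```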

Source: Hugging Face · Updated: Oct 27, 2023

52AI/TinyStoriesZh

Language Models · Children's Stories

The TinyStories dataset is used to explore the capability boundaries of small language models (LMs), in particular how very small LMs can still generate fluent stories. The stories are generated by GPT-3.5 and GPT-4, with difficulty limited to a level a 3-4-year-old child can understand. The Chinese stories are machine translations of the English originals.

Source: Hugging Face · Updated: Aug 19, 2023