FIT‑RS is a large‑scale fine‑grained instruction‑tuning dataset containing 1,800,851 high‑quality instruction samples, designed to enhance the fine‑grained understanding capabilities of Remote Sensing Large Multimodal Models (RSLMMs).
WebVi3D is a multi‑view image dataset containing 320 M frames extracted from 16 M video clips, used for training See3D models. The dataset scales up training data by automatically filtering out video clips with inconsistent multi‑view information or insufficient observations, yielding a high‑quality, diverse multi‑view image collection.
This dataset is intended to support fine‑tuning of small yet powerful models (e.g., Qwen2 0.5B and SmolLM 135M/360M) that struggle with JSON‑structured data generation tasks. It contains three fields—`query`, `schema`, and `response`—representing the user's plain‑text query, the desired output JSON schema, and an LLM response that conforms to the schema. The data were synthesized by large language models such as Llama 3.1 8B and Claude 3.5 Sonnet and will be updated regularly.
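The three‑field layout can be illustrated with a toy record. The field names (`query`, `schema`, `response`) come from the description above; the values below are invented for illustration and are not drawn from the actual dataset.

```python
import json

# Hypothetical example record; real rows follow the same three-field layout
# (query, schema, response) but with different content.
record = {
    "query": "Give me the company's name and founding year.",
    "schema": json.dumps({
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founding_year": {"type": "integer"},
        },
        "required": ["name", "founding_year"],
    }),
    "response": json.dumps({"name": "Acme Corp", "founding_year": 1999}),
}

# A well-formed response parses as JSON and supplies every required key.
parsed = json.loads(record["response"])
required = json.loads(record["schema"])["required"]
assert all(key in parsed for key in required)
```

During fine‑tuning, `query` and `schema` would typically form the prompt and `response` the target completion, so the model learns to emit JSON that conforms to an arbitrary user‑supplied schema.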
The MLCE dataset aggregates Chinese medical examination and competition datasets to support evaluation of large language models' specialized medical abilities and to enable targeted training, with the aim of advancing comprehensive medical LLMs.
MathInstruct is a carefully curated instruction‑tuning dataset that is lightweight yet versatile. It aggregates 13 math reasoning datasets, six of which are newly curated in this work. The dataset uniquely focuses on a mix of chain‑of‑thought (CoT) and program‑of‑thought (PoT) reasoning, ensuring broad coverage across mathematical domains. It is used for text generation tasks, primarily in English, with sizes ranging from 100 k to 1 M examples. It is associated with models based on Llama‑2 and Code Llama, ranging from 7 B to 70 B parameters. License information for each subset is provided.
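The CoT/PoT distinction can be sketched with a toy problem (the problem and both rationales below are our own illustration, not samples from MathInstruct): a CoT sample writes the reasoning as natural‑language steps, while a PoT sample writes it as a short program whose execution produces the answer.

```python
# Toy problem: "A train travels at 60 km/h for 3 hours, then at 80 km/h
# for 2 hours. How far does it travel in total?"

# Chain-of-thought (CoT): the rationale is rendered as text the model
# must generate and a parser must read the answer out of.
cot_rationale = (
    "At 60 km/h for 3 hours the train covers 60 * 3 = 180 km. "
    "At 80 km/h for 2 hours it covers 80 * 2 = 160 km. "
    "In total it travels 180 + 160 = 340 km."
)

# Program-of-thought (PoT): the same reasoning as executable code, so the
# final answer comes from running the program rather than parsing prose.
def pot_solution():
    leg1 = 60 * 3   # distance on the first leg, km
    leg2 = 80 * 2   # distance on the second leg, km
    return leg1 + leg2

answer = pot_solution()
print(answer)  # 340
```

Mixing both styles in one instruction‑tuning set lets a model learn free‑form derivations (CoT) alongside tool‑style computation (PoT), which is the hybrid coverage the dataset description emphasizes.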