FIT‑RS is a large‑scale fine‑grained instruction‑tuning dataset containing 1,800,851 high‑quality instruction samples, designed to enhance the fine‑grained understanding capabilities of Remote Sensing Large Multimodal Models (RSLMMs).
WebVi3D is a multi‑view image dataset containing 320 M frames extracted from 16 M video clips, used for training See3D models. The dataset scales up training data by automatically filtering out video clips with inconsistent multi‑view information or insufficient observations, yielding a high‑quality, diverse multi‑view image collection.
This dataset is intended to support fine‑tuning of small yet powerful models (e.g., Qwen2 0.5B and SmolLM 135M/360M) that struggle with JSON‑structured data generation tasks. It contains three fields—`query`, `schema`, and `response`—representing the user's plain‑text query, the desired output JSON schema, and an LLM response that conforms to the schema. The data were synthesized by large language models such as Llama 3.1 8B and Claude 3.5 Sonnet and will be updated regularly.
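The three‑field layout can be illustrated with a toy record. The field names (`query`, `schema`, `response`) come from the description above; the values below are invented for illustration and are not drawn from the actual dataset.

```python
import json

# Hypothetical example record; real rows follow the same three-field layout
# (query, schema, response) but with different content.
record = {
    "query": "Give me the company's name and founding year.",
    "schema": json.dumps({
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founding_year": {"type": "integer"},
        },
        "required": ["name", "founding_year"],
    }),
    "response": json.dumps({"name": "Acme Corp", "founding_year": 1999}),
}

# A well-formed response parses as JSON and supplies every required key.
parsed = json.loads(record["response"])
required = json.loads(record["schema"])["required"]
assert all(key in parsed for key in required)
```

During fine‑tuning, `query` and `schema` would typically form the prompt and `response` the target completion, so the model learns to emit JSON that conforms to an arbitrary user‑supplied schema.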
The MLCE dataset aggregates Chinese medical examination and competition datasets to support evaluation of large language models' specialized medical abilities and to enable targeted training, with the aim of advancing comprehensive medical LLMs.
MathInstruct is a carefully curated instruction‑tuning dataset that is lightweight yet versatile. It aggregates 13 math reasoning datasets, six of which are newly curated in this work. The dataset uniquely focuses on a mix of chain‑of‑thought (CoT) and program‑of‑thought (PoT) reasoning, ensuring broad coverage across mathematical domains. It is used for text generation tasks, primarily in English, with sizes ranging from 100 k to 1 M examples. It is associated with models based on Llama‑2 and Code Llama, ranging from 7 B to 70 B parameters. License information for each subset is provided.
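The CoT/PoT distinction can be sketched with a toy problem (the problem and both rationales below are our own illustration, not samples from MathInstruct): a CoT sample writes the reasoning as natural‑language steps, while a PoT sample writes it as a short program whose execution produces the answer.

```python
# Toy problem: "A train travels at 60 km/h for 3 hours, then at 80 km/h
# for 2 hours. How far does it travel in total?"

# Chain-of-thought (CoT): the rationale is rendered as text the model
# must generate and a parser must read the answer out of.
cot_rationale = (
    "At 60 km/h for 3 hours the train covers 60 * 3 = 180 km. "
    "At 80 km/h for 2 hours it covers 80 * 2 = 160 km. "
    "In total it travels 180 + 160 = 340 km."
)

# Program-of-thought (PoT): the same reasoning as executable code, so the
# final answer comes from running the program rather than parsing prose.
def pot_solution():
    leg1 = 60 * 3   # distance on the first leg, km
    leg2 = 80 * 2   # distance on the second leg, km
    return leg1 + leg2

answer = pot_solution()
print(answer)  # 340
```

Mixing both styles in one instruction‑tuning set lets a model learn free‑form derivations (CoT) alongside tool‑style computation (PoT), which is the hybrid coverage the dataset description emphasizes.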