This dataset was automatically generated during the evaluation of the model CultriX/MonaTrix-v4-7B-DPO. It comprises 63 configurations, each mapping to a specific evaluation task. Each run creates a split named after its timestamp; the `train` split always points to the latest results. An additional `results` configuration aggregates outcomes from all runs for metric computation on the Open LLM Leaderboard.
This is a multi-label emotion classification dataset based on the GoEmotions taxonomy. The dataset was annotated with custom tags by a team of 12 engineers. Evaluation results for three models (RoBERTa, BERT-cased, and BERT-uncased) on this dataset are also presented.
This dataset was automatically created during the evaluation of the model CalderaAI/13B-Legerdemain-L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs; the results of each run are stored within each configuration as a split named after the run's timestamp. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also shows how to load run details using the `load_dataset` function from the `datasets` library; the latest run results are provided in JSON format, with metrics such as EM, F1, and accuracy for the various tasks.
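The loading pattern these leaderboard-details cards describe can be sketched as follows. This is a minimal sketch, not the card's verbatim snippet: the repository and configuration names in the example comment are illustrative placeholders, and should be replaced with the names listed in the actual dataset card.

```python
def load_latest_run(repo_id: str, config: str):
    """Load the latest results for one evaluation task.

    The "train" split always points to the most recent run;
    earlier runs live in splits named after their timestamps.
    """
    # Lazy import so the sketch can be read without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id, config, split="train")


# Example (requires network access; both names are illustrative):
# details = load_latest_run(
#     "open-llm-leaderboard/details_CalderaAI__13B-Legerdemain-L2",
#     "harness_arc_challenge_25",
# )
```

Passing a timestamp-named split instead of `"train"` would retrieve an earlier run's results from the same configuration.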
UniSim-Bench is a multimodal perception similarity benchmark created by New York University and EPFL, containing seven multimodal perception similarity tasks across 25 datasets. It covers various image-to-text tasks and is designed to evaluate model generalisation across tasks. The benchmark aggregates existing perception tasks and trains models using multi-task learning. UniSim-Bench is widely used to assess and improve multimodal perception models, especially for cross-modal similarity evaluation and generative-model quality assessment.
This dataset was automatically created during the evaluation run of model yleo/EmertonOmniBeagle-7B-dpo on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to an evaluated task, containing results from a single run. The "train" split always points to the latest results. An additional configuration named "results" stores aggregated results from all runs, used to compute and display aggregated metrics on the Open LLM Leaderboard. The README also provides a Python example for loading the dataset using the 🤗 datasets library and includes the latest results for a specific run.
DAHL is a benchmark for evaluating hallucination in long-form biomedical text generation, curated by Seoul National University. It comprises 8,573 questions across 29 categories sourced from PubMed Central biomedical research papers. Questions were automatically generated and manually filtered to ensure high quality and answerability. DAHL evaluates hallucination in large language models in the biomedical domain by decomposing model responses into atomic units for factual-accuracy assessment, offering a deeper evaluation than traditional multiple-choice tasks. Its primary applications lie in biomedical and clinical research, addressing factual conflicts in generated text.
This dataset was automatically generated during the evaluation runs of the model Danielbrdz/Barcenas-Tiny-1.1b-DPO. It comprises 63 configurations, each representing a distinct evaluation task. For each run, a split named after the run's timestamp is created; the "train" split always points to the latest results. An additional "results" configuration stores aggregated metrics for all runs, which are used to compute and display aggregate scores on the Open LLM Leaderboard.
A-Eval is a benchmark for evaluating chat large language models (LLMs) of various scales from an application-driven perspective. The dataset contains 678 question-answer pairs spanning 5 categories, 27 sub-categories, and 3 difficulty levels. A-Eval provides clear empirical and engineering guidelines for selecting the "best" model for real-world applications.
VBench++ is a comprehensive video generation model evaluation benchmark jointly created by Nanyang Technological University and the Shanghai Artificial Intelligence Laboratory. The benchmark comprises 16 dimensions, each with about 100 text prompts, to assess the performance of video generation models. It covers aspects such as video quality and conditional consistency, aiming to reveal model strengths and weaknesses through fine-grained evaluation. The research team designed multi-level evaluation dimensions and validated the results with human-preference annotations to ensure alignment with human perception. VBench++ addresses key challenges in video-generation evaluation, including technical quality assessment and model trustworthiness assessment.
This dataset was automatically created during the evaluation run of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to an evaluation task. The dataset was generated from a single run; within each configuration, each run's results are stored in a split named after the run's timestamp. The "train" split always points to the latest results. Additionally, a "results" configuration stores the aggregated results of all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard.
This dataset is a benchmark for evaluating language-model performance across a range of tasks, including the assessment of models that have been fine-tuned on multiple tasks.
The dataset's fields are `id`, `question`, `question_concept`, `choices` (with `label` and `text`), `answerKey`, `input`, `extracted_baseline_llama_1b`, `reasoning_64_a128_mix_mmlu_csqa_gsm8k_even`, `baseline_llama_1b`, `output_w_reasoning_llama_1b`, `extracted_output_w_reasoning_llama_1b`, and `eval_baseline_vs_mixed_reasoning`. It consists of a single validation split with 1,221 samples totaling 7,882,991 bytes.
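A sketch of one record's shape follows. Only the field names come from the dataset card; every value is hypothetical, and the nested `choices` layout (parallel `label` and `text` lists, as in CommonsenseQA-style datasets) is an assumption.

```python
# Hypothetical record; only the field names come from the dataset card.
sample = {
    "id": "example-0",
    "question": "Where would you put a plate after washing it?",
    "question_concept": "plate",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["cupboard", "table", "sink", "floor", "oven"],
    },
    "answerKey": "A",
    "input": "...",                      # prompt sent to the model
    "baseline_llama_1b": "...",          # raw baseline completion
    "extracted_baseline_llama_1b": "A",  # parsed answer letter
    "reasoning_64_a128_mix_mmlu_csqa_gsm8k_even": "...",
    "output_w_reasoning_llama_1b": "...",
    "extracted_output_w_reasoning_llama_1b": "A",
    "eval_baseline_vs_mixed_reasoning": "tie",  # hypothetical comparison label
}
```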
---
language:
- en
license: apache-2.0
---

# MiniMuSiQue by Morph Labs

**https://morph.so/blog/self-teaching/**

We describe two evaluation datasets that we have derived from the MuSiQue multi-hop question-answering dataset: MiniMuSiQue-hard (filtered for questions answerable by GPT-4 but not GPT-3.5, where performance significantly degrades if the first pivot document is removed) and MiniMuSiQue-easy (a larger dataset of convoluted off-distribution single-hop question-answer pairs).

## Table of Contents

1. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#dataset-description" target="_blank">Dataset Description</a>**
2. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#uses" target="_blank">Uses</a>**
3. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#contact" target="_blank">Contact</a>**
4. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#blogpost-and-citation" target="_blank">Blogpost and Citation</a>**

### Dataset Description

We refined the MuSiQue dataset to focus on questions that demand complex multi-hop reasoning, selecting questions which (1) GPT-4 could answer but GPT-3.5 could not, and which (2) were not answerable without the context relevant to the first reasoning step (the "first hop pivot document"). Specifically, we selected 768 random examples from the MuSiQue training set and ranked them by a combined score of difficulty (measured by the difference in ROUGE-L recall between GPT-4 and GPT-3.5) and the necessity for multi-hop reasoning (assessed by the change in ROUGE-L recall when the first hop pivot document was removed). We refer to the top-ranked 128 examples as MiniMuSiQue, and obtain MiniMuSiQue-hard by associating the original difficult MuSiQue multi-hop question-answer pair to each example.

To additionally test off-distribution single-hop factual recall, for each example we synthesized convoluted off-distribution single-hop question-answer pairs for up to five entities per document in MiniMuSiQue, resulting in the much larger single-hop dataset MiniMuSiQue-easy. Each MiniMuSiQue example consists of twenty documents sampled from different Wikipedia articles, to which we associate a hard MuSiQue multi-hop reasoning question for MiniMuSiQue, and many single-hop questions for MiniMuSiQue-easy.

- **Developed by:** **<a href="https://www.morph.so" target="_blank">Morph Labs</a>**
- **Refined from:** **<a href="https://arxiv.org/abs/2108.00573" target="_blank">MuSiQue</a>**
- **Language(s):** English
- **License:** **<a href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank">Apache 2.0</a>**

## Uses

A particularly challenging form of question for models historically has been multi-hop questions, which require a series of interconnected reasoning steps over multiple documents. However, creating multi-hop questions that truly necessitate knowledge-based reasoning is challenging. For instance, early benchmarks like HotpotQA were found to be largely solvable through shortcuts. Constructing questions and corresponding contexts that avoid such shortcuts, and verifying their effectiveness, requires a comprehensive dataset development process. The MuSiQue dataset addresses many weaknesses of prior work and contains difficult multi-hop questions less susceptible to shortcuts. We derive MiniMuSiQue from the original MuSiQue to better assess model capabilities to answer multi-hop questions that truly necessitate knowledge-based reasoning.

## Contact

hello@morph.so

## Blogpost and Citation

**https://morph.so/blog/self-teaching/**

    @misc{MiniMuSiQue,
      title={MiniMuSiQue},
      author={Morph Labs, Jesse Michael Han, Eric Yu, Bentley Long, Pranav Mital, Brando Miranda},
      year={2023}
    }
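The ranking described above is built from ROUGE-L recall, i.e. the length of the longest common subsequence (LCS) between a reference answer and a model answer, divided by the reference length. A minimal token-level sketch follows; the whitespace tokenization and the additive form of `combined_score` are assumptions for illustration, not Morph Labs' exact implementation.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L recall: LCS length over reference length (whitespace tokens)."""
    ref, cand = reference.split(), candidate.split()
    return lcs_len(ref, cand) / len(ref) if ref else 0.0


def combined_score(answer: str, gpt4_out: str, gpt35_out: str,
                   gpt4_no_pivot_out: str) -> float:
    """Assumed form of the ranking score: difficulty (GPT-4 minus GPT-3.5
    recall) plus multi-hop necessity (recall drop without the pivot doc)."""
    difficulty = rouge_l_recall(answer, gpt4_out) - rouge_l_recall(answer, gpt35_out)
    necessity = rouge_l_recall(answer, gpt4_out) - rouge_l_recall(answer, gpt4_no_pivot_out)
    return difficulty + necessity
```

Examples ranked by this score reward questions that GPT-4 answers well, GPT-3.5 answers poorly, and that GPT-4 fails without the first hop pivot document.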
This dataset was automatically generated during the evaluation of the model TheBloke/VicUnlocked-30B-LoRA-HF. It contains three configurations, each corresponding to an evaluation task. The dataset was created from two runs; the results of each run are stored within the configurations as splits named after the run timestamps. The "train" split always points to the latest results. Additionally, a "results" configuration stores the aggregated results of all runs for computing and displaying aggregate metrics on the Open LLM Leaderboard.