Dataset asset · Open Source Community · Natural Language Processing · Model Evaluation

open-llm-leaderboard-old/details_CalderaAI__13B-Legerdemain-L2

This dataset was automatically created during the evaluation of the model CalderaAI/13B‑Legerdemain‑L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs, with each run represented as a specific split within each configuration. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also provides an example of how to load run details using the `load_dataset` function from the `datasets` library. The latest run results are provided in JSON format, showing metrics such as EM, F1, and accuracy for various tasks.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 12, 2023
Signals
131 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Introduction

The dataset is automatically generated during the evaluation of the model CalderaAI/13B‑Legerdemain‑L2 on the Open LLM Leaderboard.

Dataset Structure

  • The dataset contains 64 configurations, each corresponding to an evaluation task.
  • It is created from two runs; each run appears as a specific split within each configuration, with split names using timestamps.
  • The "train" split always points to the latest results.
  • An extra configuration named "results" stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard.

Data Loading Example

from datasets import load_dataset

# Load the latest Winogrande (5-shot) run details; the "train" split
# always points to the most recent results.
data = load_dataset(
    "open-llm-leaderboard/details_CalderaAI__13B-Legerdemain-L2",
    "harness_winogrande_5",
    split="train"
)

Latest Results

The most recent run (2023‑10‑12T20:33:10.328879) yields:

{
    "all": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318,
        "acc": 0.4412861505062612,
        "acc_stderr": 0.010705008172209724
    },
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535
    }
}
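For downstream analysis, the aggregated block can be parsed directly with the standard library. A minimal sketch, assuming the JSON shown above (the metric values are copied from it; the helper name is mine, not part of any leaderboard tooling):

```python
import json

# Aggregated metrics copied from the latest run shown above
# (stderr fields and the "drop" entry omitted for brevity).
latest = json.loads("""
{
    "all": {"em": 0.002726510067114094, "f1": 0.06216547818791966,
            "acc": 0.4412861505062612},
    "harness|gsm8k|5": {"acc": 0.13040181956027294},
    "harness|winogrande|5": {"acc": 0.7521704814522494}
}
""")

def best_by_accuracy(results):
    # Hypothetical helper: return the per-task entry with the highest
    # "acc", skipping the "all" aggregate.
    tasks = {k: v for k, v in results.items() if k != "all" and "acc" in v}
    return max(tasks, key=lambda k: tasks[k]["acc"])

print(best_by_accuracy(latest))  # → harness|winogrande|5
```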

Configuration Details

Examples of configuration entries:

  • harness_arc_challenge_25

    • Split: 2023_08_09T11_34_37.986977
    • Path: **/details_harness|arc:challenge|25_2023-08-09T11:34:37.986977.parquet
    • Split: latest
    • Path: same as above
  • harness_drop_3

    • Split: 2023_10_12T20_33_10.328879
    • Path: **/details_harness|drop|3_2023-10-12T20-33-10.328879.parquet
    • Split: latest
    • Path: same as above

(Additional configurations follow the same pattern.)
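The split names above appear to be derived from the run timestamps by swapping the `-` and `:` separators for underscores. A small sketch of that assumed convention (the function is mine, inferred from the configuration entries, not part of the dataset tooling):

```python
def timestamp_to_split(ts: str) -> str:
    # Map a run timestamp such as "2023-10-12T20:33:10.328879" to the
    # split-name form "2023_10_12T20_33_10.328879" by replacing the
    # "-" and ":" separators with underscores (assumed convention,
    # inferred from the configuration entries above).
    return ts.replace("-", "_").replace(":", "_")

print(timestamp_to_split("2023-10-12T20:33:10.328879"))
# → 2023_10_12T20_33_10.328879
```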
