Dataset asset · Open Source Community · Natural Language Processing · Model Evaluation

open-llm-leaderboard-old/details_CalderaAI__13B-Legerdemain-L2

This dataset was automatically created during the evaluation of the model CalderaAI/13B‑Legerdemain‑L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs, with each run represented as a specific split within each configuration. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also provides an example of how to load run details using the `load_dataset` function from the `datasets` library. The latest run results are provided in JSON format, showing metrics such as EM, F1, and accuracy for various tasks.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 12, 2023
Signals
131 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Introduction

The dataset is automatically generated during the evaluation of the model CalderaAI/13B‑Legerdemain‑L2 on the Open LLM Leaderboard.

Dataset Structure

  • The dataset contains 64 configurations, each corresponding to an evaluation task.
  • It is created from two runs; each run appears as a specific split within each configuration, with split names using timestamps.
  • The "train" split always points to the latest results.
  • An extra configuration named "results" stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard.

Data Loading Example

from datasets import load_dataset

# Load the latest Winogrande (5-shot) run details; the "train" split
# always points to the most recent results.
data = load_dataset(
    "open-llm-leaderboard/details_CalderaAI__13B-Legerdemain-L2",
    "harness_winogrande_5",
    split="train"
)

Latest Results

The most recent run (2023‑10‑12T20:33:10.328879) yields:

{
    "all": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318,
        "acc": 0.4412861505062612,
        "acc_stderr": 0.010705008172209724
    },
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535
    }
}
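For downstream analysis, the aggregated block can be parsed directly with the standard library. A minimal sketch, assuming the JSON shown above (the metric values are copied from it; the helper name is mine, not part of any leaderboard tooling):

```python
import json

# Aggregated metrics copied from the latest run shown above
# (stderr fields and the "drop" entry omitted for brevity).
latest = json.loads("""
{
    "all": {"em": 0.002726510067114094, "f1": 0.06216547818791966,
            "acc": 0.4412861505062612},
    "harness|gsm8k|5": {"acc": 0.13040181956027294},
    "harness|winogrande|5": {"acc": 0.7521704814522494}
}
""")

def best_by_accuracy(results):
    # Hypothetical helper: return the per-task entry with the highest
    # "acc", skipping the "all" aggregate.
    tasks = {k: v for k, v in results.items() if k != "all" and "acc" in v}
    return max(tasks, key=lambda k: tasks[k]["acc"])

print(best_by_accuracy(latest))  # → harness|winogrande|5
```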

Configuration Details

Examples of configuration entries:

  • harness_arc_challenge_25

    • Split: 2023_08_09T11_34_37.986977
    • Path: **/details_harness|arc:challenge|25_2023-08-09T11:34:37.986977.parquet
    • Split: latest
    • Path: same as above
  • harness_drop_3

    • Split: 2023_10_12T20_33_10.328879
    • Path: **/details_harness|drop|3_2023-10-12T20-33-10.328879.parquet
    • Split: latest
    • Path: same as above

(Additional configurations follow the same pattern.)
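The split names above appear to be derived from the run timestamps by swapping the `-` and `:` separators for underscores. A small sketch of that assumed convention (the function is mine, inferred from the configuration entries, not part of the dataset tooling):

```python
def timestamp_to_split(ts: str) -> str:
    # Map a run timestamp such as "2023-10-12T20:33:10.328879" to the
    # split-name form "2023_10_12T20_33_10.328879" by replacing the
    # "-" and ":" separators with underscores (assumed convention,
    # inferred from the configuration entries above).
    return ts.replace("-", "_").replace(":", "_")

print(timestamp_to_split("2023-10-12T20:33:10.328879"))
# → 2023_10_12T20_33_10.328879
```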
