open-llm-leaderboard-old/details_CalderaAI__13B-Legerdemain-L2
This dataset was automatically created during the evaluation of the model CalderaAI/13B-Legerdemain-L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs, with each run represented as a specific split within each configuration. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also provides an example of how to load run details using the `load_dataset` function from the `datasets` library. The latest run results are provided in JSON format, showing metrics such as EM, F1, and accuracy for various tasks.
Dataset description and usage context
Dataset Overview
Dataset Introduction
The dataset is automatically generated during the evaluation of the model CalderaAI/13B-Legerdemain-L2 on the Open LLM Leaderboard.
Dataset Structure
- The dataset contains 64 configurations, each corresponding to an evaluation task.
- It is created from two runs. Each run appears as its own split within each configuration, and the split is named after the run timestamp (for example, 2023_10_12T20_33_10.328879).
- The "train" split always points to the latest results.
- An extra configuration named "results" stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard.
Data Loading Example
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_CalderaAI__13B-Legerdemain-L2",
    "harness_winogrande_5",
    split="train"
)
Latest Results
The most recent run (2023-10-12T20:33:10.328879) yields:
{
    "all": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318,
        "acc": 0.4412861505062612,
        "acc_stderr": 0.010705008172209724
    },
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535
    }
}
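The "all" entry above appears to be a plain per-metric average over the tasks that report that metric. This is an inference from the numbers shown here, not a documented rule of the leaderboard. The short sketch below recomputes it from the per-task figures; the `aggregate` helper is hypothetical and written only for this check.

```python
import math
from statistics import mean

# Per-task figures copied from the latest-results JSON in this card.
per_task = {
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318,
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541,
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535,
    },
}

# The "all" entry as reported in the same JSON.
reported_all = {
    "em": 0.002726510067114094,
    "em_stderr": 0.0005340111700415904,
    "f1": 0.06216547818791966,
    "f1_stderr": 0.0013785278979549318,
    "acc": 0.4412861505062612,
    "acc_stderr": 0.010705008172209724,
}


def aggregate(tasks):
    """Average each metric over the tasks that report it (assumed rule)."""
    collected = {}
    for metrics in tasks.values():
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


recomputed = aggregate(per_task)

# Compare every reported aggregate with the recomputed one.
matches = {
    name: math.isclose(recomputed[name], value, rel_tol=1e-9)
    for name, value in reported_all.items()
}
print(matches)
```

Under this assumption, "acc" is the mean of the gsm8k and winogrande accuracies, while "em" and "f1" simply repeat the drop values because drop is the only task reporting them.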
Configuration Details
Examples of configuration entries:
- harness_arc_challenge_25
  - Split: 2023_08_09T11_34_37.986977
    - Path: **/details_harness|arc:challenge|25_2023-08-09T11:34:37.986977.parquet
  - Split: latest
    - Path: same as above
- harness_drop_3
  - Split: 2023_10_12T20_33_10.328879
    - Path: **/details_harness|drop|3_2023-10-12T20-33-10.328879.parquet
  - Split: latest
    - Path: same as above
(Additional configurations follow the same pattern.)
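The names in this card suggest a simple mapping: a task key such as harness|arc:challenge|25 becomes the configuration name harness_arc_challenge_25, and a run timestamp such as 2023-10-12T20:33:10.328879 becomes the split name 2023_10_12T20_33_10.328879. The sketch below encodes that pattern. Both helper functions are hypothetical, inferred only from the examples listed here, and may not hold for every configuration.

```python
def config_name(task_key: str) -> str:
    """Turn a task key like 'harness|winogrande|5' into a configuration name.

    Assumption: '|' and ':' are both replaced by '_', as in the entries above.
    """
    return task_key.replace("|", "_").replace(":", "_")


def split_name(timestamp: str) -> str:
    """Turn a run timestamp into the matching split name.

    Assumption: '-' and ':' are both replaced by '_', while 'T' and the
    fractional-seconds dot are kept, as in the splits listed above.
    """
    return timestamp.replace("-", "_").replace(":", "_")


# Values taken from this card.
print(config_name("harness|arc:challenge|25"))
print(split_name("2023-10-12T20:33:10.328879"))
```

The resulting strings could then be passed as the configuration name and `split` argument of `load_dataset` to select one specific run instead of the latest one.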