open-llm-leaderboard-old/details_CalderaAI__13B-Legerdemain-L2

This dataset was automatically created during the evaluation of the model CalderaAI/13B-Legerdemain-L2 on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to an evaluation task. The dataset was generated from two runs, with each run represented as a specific split within each configuration. The "train" split always points to the latest results. An additional "results" configuration stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard. The README also provides an example of how to load run details using the `load_dataset` function from the `datasets` library. The latest run results are provided in JSON format, showing metrics such as EM, F1, and accuracy for various tasks.

Updated 10/12/2023
hugging_face

Description

Dataset Overview

Dataset Introduction

The dataset was automatically generated during the evaluation of the model CalderaAI/13B-Legerdemain-L2 on the Open LLM Leaderboard.

Dataset Structure

  • The dataset contains 64 configurations, each corresponding to an evaluation task.
  • It is created from two runs; each run appears as a specific split within each configuration, with split names using timestamps.
  • The "train" split always points to the latest results.
  • An extra configuration named "results" stores aggregated results from all runs for computing and displaying aggregated metrics on the Open LLM Leaderboard.

Data Loading Example

from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_CalderaAI__13B-Legerdemain-L2",
    "harness_winogrande_5",
    split="train"
)
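
The same call can in principle target the aggregated "results" configuration instead of a single task. The following is a minimal sketch, assuming the repository ID follows the `details_<org>__<model>` pattern visible in the example above, that the "results" configuration exposes a "train" split like the task configurations, and that the `datasets` library is installed. The helper names `details_repo` and `load_results` are hypothetical and not part of any library.

```python
def details_repo(model_id: str, org: str = "open-llm-leaderboard") -> str:
    """Build the details repository ID for a model (assumed naming pattern)."""
    # "CalderaAI/13B-Legerdemain-L2" -> "details_CalderaAI__13B-Legerdemain-L2"
    return f"{org}/details_{model_id.replace('/', '__')}"


def load_results(model_id: str, split: str = "train"):
    """Load the aggregated "results" configuration (requires network access)."""
    # Imported lazily so the pure helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(details_repo(model_id), "results", split=split)


if __name__ == "__main__":
    # Only builds the repository ID; call load_results(...) to fetch data.
    print(details_repo("CalderaAI/13B-Legerdemain-L2"))
```

Passing a timestamped split name in place of "train" should select one specific run, provided that split exists in the chosen configuration.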

Latest Results

The most recent run (2023-10-12T20:33:10.328879) yields:

{
    "all": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318,
        "acc": 0.4412861505062612,
        "acc_stderr": 0.010705008172209724
    },
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "em_stderr": 0.0005340111700415904,
        "f1": 0.06216547818791966,
        "f1_stderr": 0.0013785278979549318
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535
    }
}
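
The "all" entry above appears consistent with an unweighted mean of each metric over the tasks that report it. The sketch below checks this using only the numbers shown in the JSON; the averaging rule is an inference from those numbers, not a documented description of how the leaderboard aggregates, and the `aggregate` helper is hypothetical.

```python
from statistics import mean

# Per-task metrics copied from the latest-run JSON above.
tasks = {
    "harness|drop|3": {
        "em": 0.002726510067114094,
        "f1": 0.06216547818791966,
    },
    "harness|gsm8k|5": {
        "acc": 0.13040181956027294,
        "acc_stderr": 0.0092756303245541,
    },
    "harness|winogrande|5": {
        "acc": 0.7521704814522494,
        "acc_stderr": 0.01213438601986535,
    },
}


def aggregate(per_task: dict) -> dict:
    """Average each metric over the tasks that report it (unweighted)."""
    metrics = sorted({name for scores in per_task.values() for name in scores})
    return {
        name: mean(scores[name] for scores in per_task.values() if name in scores)
        for name in metrics
    }


overall = aggregate(tasks)
print(overall["acc"])  # approximately 0.4413, in line with the "all" entry
```

Under this reading, the reported `acc_stderr` is a plain average of the two task standard errors, which is a reporting convention rather than a standard error of the averaged accuracy.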

Configuration Details

Examples of configuration entries:

  • harness_arc_challenge_25

    • Split: 2023_08_09T11_34_37.986977
    • Path: **/details_harness|arc:challenge|25_2023-08-09T11:34:37.986977.parquet
    • Split: latest
    • Path: same as above
  • harness_drop_3

    • Split: 2023_10_12T20_33_10.328879
    • Path: **/details_harness|drop|3_2023-10-12T20-33-10.328879.parquet
    • Split: latest
    • Path: same as above

(Additional configurations follow the same pattern.)
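
The naming pattern in these entries can be sketched as two small string conversions: result keys such as `harness|drop|3` map to configuration names, and run timestamps map to split names. The rules below are inferred only from the examples shown here, so other task names may need additional handling; `config_name` and `split_name` are hypothetical helpers.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as "harness|drop|3" to a configuration name."""
    # Replace the "|" and ":" separators with underscores.
    return re.sub(r"[|:]", "_", task_key)


def split_name(timestamp: str) -> str:
    """Map a run timestamp to its split name."""
    # Replace "-" and ":" with underscores; the fractional seconds keep their dot.
    return re.sub(r"[-:]", "_", timestamp)


print(config_name("harness|arc:challenge|25"))    # harness_arc_challenge_25
print(split_name("2023-10-12T20:33:10.328879"))   # 2023_10_12T20_33_10.328879
```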

Topics

Model Evaluation
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
