open-llm-leaderboard-old/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k

This dataset was created automatically during the evaluation run of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to one evaluation task. The dataset was generated from a single run. Each run is stored as a split within each configuration, named after the run's timestamp, and the 'train' split always points to the latest results. A separate 'results' configuration stores the aggregated results of all runs, which are used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Updated 4/9/2024
hugging_face

Description

Dataset Overview

Dataset Name

Evaluation run of OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k

Dataset Summary

This dataset was automatically created during the evaluation of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 63 configurations, each corresponding to an evaluation task.
  • It was generated from a single run. Each run is available as a split within each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration "results" stores aggregated results of all runs, used for computing and displaying aggregated metrics on the Open LLM Leaderboard.
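The split naming described above implies that the latest run can be found by comparing timestamps. The following is a minimal sketch of that idea, not part of the dataset tooling: the helper name latest_run is hypothetical, the second timestamp is invented for illustration, and the exact spelling of split names in the hosted dataset may differ from the ISO-8601 form used here.

```python
from datetime import datetime


def latest_run(run_timestamps):
    """Return the most recent timestamp from a list of ISO-8601 strings."""
    # Compare parsed datetimes rather than raw strings so the ordering
    # does not depend on string formatting details.
    return max(run_timestamps, key=datetime.fromisoformat)


runs = [
    "2024-04-09T06:57:17.996714",  # run timestamp reported in this card
    "2024-04-08T12:00:00.000000",  # hypothetical earlier run, illustration only
]

latest = latest_run(runs)
print(latest)
```

With a single run, as in this dataset, the timestamped split and the "train" split would refer to the same results.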

Data Loading Example

from datasets import load_dataset

# Load one task configuration; the "train" split points to the latest run.
data = load_dataset(
    "open-llm-leaderboard/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k",
    "harness_winogrande_5",
    split="train",
)

Latest Results

The latest results are from the 2024-04-09T06:57:17.996714 run:

{
    "all": {
        "acc": 0.6783573743130548,
        "acc_stderr": 0.031630411639720406,
        "acc_norm": 0.6843006798291303,
        "acc_norm_stderr": 0.032244439733683676,
        "mc1": 0.39657282741738065,
        "mc1_stderr": 0.017124930942023518,
        "mc2": 0.5584410548633238,
        "mc2_stderr": 0.014920454151130717
    },
    "harness|arc:challenge|25": {
        "acc": 0.5358361774744027,
        "acc_stderr": 0.01457381366473572,
        "acc_norm": 0.5793515358361775,
        "acc_norm_stderr": 0.014426211252508403
    },
    "harness|hellaswag|10": {
        "acc": 0.5926110336586338,
        "acc_stderr": 0.004903441680003823,
        "acc_norm": 0.788388767177853,
        "acc_norm_stderr": 0.004076158744346766
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.38,
        "acc_stderr": 0.048783173121456316,
        "acc_norm": 0.38,
        "acc_norm_stderr": 0.048783173121456316
    },
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6222222222222222,
        "acc_stderr": 0.04188307537595852,
        "acc_norm": 0.6222222222222222,
        "acc_norm_stderr": 0.04188307537595852
    },
    "harness|hendrycksTest-astronomy|5": {
        "acc": 0.7763157894736842,
        "acc_stderr": 0.033911609343436025,
        "acc_norm": 0.7763157894736842,
        "acc_norm_stderr": 0.033911609343436025
    },
    "harness|hendrycksTest-business_ethics|5": {
        "acc": 0.75,
        "acc_stderr": 0.04351941398892446,
        "acc_norm": 0.75,
        "acc_norm_stderr": 0.04351941398892446
    },
    "harness|hendrycksTest-clinical_knowledge|5": {
        "acc": 0.7245283018867924,
        "acc_stderr": 0.027495663683724057,
        "acc_norm": 0.7245283018867924,
        "acc_norm_stderr": 0.027495663683724057
    },
    "harness|hendrycksTest-college_biology|5": {
        "acc": 0.7222222222222222,
        "acc_stderr": 0.03745554791462457,
        "acc_norm": 0.7222222222222222,
        "acc_norm_stderr": 0.03745554791462457
    },
    "harness|hendrycksTest-college_chemistry|5": {
        "acc": 0.55,
        "acc_stderr": 0.05,
        "acc_norm": 0.55,
        "acc_norm_stderr": 0.05
    },
    "harness|hendrycksTest-college_computer_science|5": {
        "acc": 0.6,
        "acc_stderr": 0.04923659639173309,
        "acc_norm": 0.6,
        "acc_norm_stderr": 0.04923659639173309
    },
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.48,
        "acc_stderr": 0.05021167315686779,
        "acc_norm": 0.48,
        "acc_norm_stderr": 0.05021167315686779
    },
    "harness|hendrycksTest-college_medicine|5": {
        "acc": 0.6994219653179191,
        "acc_stderr": 0.0349610148119118,
        "acc_norm": 0.6994219653179191,
        "acc_norm_stderr": 0.0349610148119118
    },
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.4215686274509804,
        "acc_stderr": 0.049135952012744975,
        "acc_norm": 0.4215686274509804,
        "acc_norm_stderr": 0.049135952012744975
    },
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.81,
        "acc_stderr": 0.039427724440366234,
        "acc_norm": 0.81,
        "acc_norm_stderr": 0.039427724440366234
    },
    "harness|hendrycksTest-conceptual_physics|5": {
        "acc": 0.6723404255319149,
        "acc_stderr": 0.030683020843231004,
        "acc_norm": 0.6723404255319149,
        "acc_norm_stderr": 0.030683020843231004
    },
    "harness|hendrycksTest-econometrics|5": {
        "acc": 0.5614035087719298,
        "acc_stderr": 0.04668000738510455,
        "acc_norm": 0.5614035087719298,
        "acc_norm_stderr": 0.04668000738510455
    },
    "harness|hendrycksTest-electrical_engineering|5": {
        "acc": 0.7103448275862069,
        "acc_stderr": 0.03780019230438014,
        "acc_norm": 0.7103448275862069,
        "acc_norm_stderr": 0.03780019230438014
    },
    "harness|hendrycksTest-elementary_mathematics|5": {
        "acc": 0.5555555555555556,
        "acc_stderr": 0.02559185776138218,
        "acc_norm": 0.5555555555555556,
        "acc_norm_stderr": 0.02559185776138218
    },
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.5317460317460317,
        "acc_stderr": 0.04463112720677172,
        "acc_norm": 0.5317460317460317,
        "acc_norm_stderr": 0.04463112720677172
    },
    "harness|hendrycksTest-global_facts|5": {
        "acc": 0.44,
        "acc_stderr": 0.04988876515698589,
        "acc_norm": 0.44,
        "acc_norm_stderr": 0.04988876515698589
    },
    "harness|hendrycksTest-high_school_biology|5": {
        "acc": 0.8161290322580645,
        "acc_stderr": 0.02203721734026782,
        "acc_norm": 0.8161290322580645,
        "acc_norm_stderr": 0.02203721734026782
    },
    "harness|hendrycksTest-high_school_chemistry|5": {
        "acc": 0.5960591133004927,
        "acc_stderr": 0.03452453903822032,
        "acc_norm": 0.5960591133004927,
        "acc_norm_stderr": 0.03452453903822032
    },
    "harness|hendrycksTest-high_school_computer_science|5": {
        "acc": 0.75,
        "acc_stderr": 0.04351941398892446,
        "acc_norm": 0.75,
        "acc_norm_stderr": 0.04351941398892446
    },
    "harness|hendrycksTest-high_school_european_history|5": {
        "acc": 0.8363636363636363,
        "acc_stderr": 0.02888787239548795,
        "acc_norm": 0.8363636363636363,
        "acc_norm_stderr": 0.02888787239548795
    },
    "harness|hendrycksTest-high_school_geography|5": {
        "acc": 0.8737373737373737,
        "acc_stderr": 0.023664359402880215,
        "acc_norm": 0.8737373737373737,
        "acc_norm_stderr": 0.023664359402880215
    },
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.8911917098445595,
        "acc_stderr": 0.02247325333276875,
        "acc_norm": 0.8911917098445595,
        "acc_norm_stderr": 0.02247325333276875
    },
    "harness|hendrycksTest-high_school_macroeconomics|5": {
        "acc": 0.6974358974358974,
        "acc_stderr": 0.023290888053772732,
        "acc_norm": 0.6974358974358974,
        "acc_norm_stderr": 0.023290888053772732
    },
    "harness|hendrycksTest-high_school_mathematics|5": {
        "acc": 0.4074074074074074,
        ...
    },
    ...
}
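To help read the figures above, here is a small sketch that splits a result key into its parts and turns a score and its standard error into an approximate 95% interval. The two entries are copied from the results above. The helper names parse_task_key and normal_interval are hypothetical, reading the trailing number in each key as the few-shot count is an assumption, and the normal approximation (score ± 1.96 × stderr) is a common convention rather than something this card specifies.

```python
def parse_task_key(key):
    """Split 'harness|<task>|<n>' into (suite, task, n)."""
    suite, task, n_shot = key.split("|")
    return suite, task, int(n_shot)


def normal_interval(value, stderr, z=1.96):
    """Approximate 95% interval under a normal approximation."""
    return value - z * stderr, value + z * stderr


# Values copied from the latest results shown above.
results = {
    "harness|arc:challenge|25": {
        "acc_norm": 0.5793515358361775,
        "acc_norm_stderr": 0.014426211252508403,
    },
    "harness|hellaswag|10": {
        "acc_norm": 0.788388767177853,
        "acc_norm_stderr": 0.004076158744346766,
    },
}

summary = {}
for key, metrics in results.items():
    _, task, n_shot = parse_task_key(key)
    low, high = normal_interval(metrics["acc_norm"], metrics["acc_norm_stderr"])
    summary[task] = (n_shot, low, high)
    print(f"{task} ({n_shot}-shot, assumed): "
          f"{metrics['acc_norm']:.3f} [{low:.3f}, {high:.3f}]")
```

Intervals like these give a rough sense of how much of a difference between two scores could be due to sampling noise alone.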

Topics

Model Evaluation
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
