Dataset asset
Open Source Community · Natural Language Processing · Model Evaluation

open-llm-leaderboard-old/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k

This dataset was automatically created during the evaluation run of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to an evaluation task. The dataset is generated from a single run; each run is stored as a split within each configuration, named after the run timestamp, and the 'train' split always points to the latest results. An additional 'results' configuration stores the aggregated results of all runs, used for computing and displaying aggregated metrics on the Open LLM Leaderboard.

Source
hugging_face
Created
Nov 28, 2025
Updated
Apr 9, 2024
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

Evaluation run of OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k

Dataset Summary

This dataset was automatically created during the evaluation of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 63 configurations, each corresponding to an evaluation task.
  • It is generated from a single run; each run can be accessed as a specific split within each configuration, with the split name using the run timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration "results" stores aggregated results of all runs, used for computing and displaying aggregated metrics on the Open LLM Leaderboard.
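The timestamp-named splits above can be handled programmatically. The sketch below assumes split names encode the run timestamp with underscores in place of the "-" and ":" of ISO 8601 (e.g. "2024_04_09T06_57_17.996714", matching the run timestamp shown in the latest results); the exact naming scheme is an assumption, not documented behavior:

```python
from datetime import datetime

def parse_run_split(name: str) -> datetime:
    # Assumed format: run timestamp with "_" replacing the "-" and ":"
    # of ISO 8601, e.g. "2024_04_09T06_57_17.996714".
    return datetime.strptime(name, "%Y_%m_%dT%H_%M_%S.%f")

def latest_split(names: list[str]) -> str:
    # "train" is an alias for the newest run, so skip it and pick
    # the most recent timestamped split.
    timestamped = [n for n in names if n != "train"]
    return max(timestamped, key=parse_run_split)

splits = ["train", "2024_03_30T11_02_33.000000", "2024_04_09T06_57_17.996714"]
print(latest_split(splits))  # 2024_04_09T06_57_17.996714
```

This mirrors what the "train" alias does for you on the Hub: you would normally just load `split="train"`, but parsing the names is useful when comparing runs across time.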

Data Loading Example

from datasets import load_dataset
data = load_dataset(
    "open-llm-leaderboard/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k",
    "harness_winogrande_5",
    split="train"
)

Latest Results

The latest results are from the 2024-04-09T06:57:17.996714 run:

{
    "all": {
        "acc": 0.6783573743130548,
        "acc_stderr": 0.031630411639720406,
        "acc_norm": 0.6843006798291303,
        "acc_norm_stderr": 0.032244439733683676,
        "mc1": 0.39657282741738065,
        "mc1_stderr": 0.017124930942023518,
        "mc2": 0.5584410548633238,
        "mc2_stderr": 0.014920454151130717
    },
    "harness|arc:challenge|25": {
        "acc": 0.5358361774744027,
        "acc_stderr": 0.01457381366473572,
        "acc_norm": 0.5793515358361775,
        "acc_norm_stderr": 0.014426211252508403
    },
    "harness|hellaswag|10": {
        "acc": 0.5926110336586338,
        "acc_stderr": 0.004903441680003823,
        "acc_norm": 0.788388767177853,
        "acc_norm_stderr": 0.004076158744346766
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.38,
        "acc_stderr": 0.048783173121456316,
        "acc_norm": 0.38,
        "acc_norm_stderr": 0.048783173121456316
    },
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6222222222222222,
        "acc_stderr": 0.04188307537595852,
        "acc_norm": 0.6222222222222222,
        "acc_norm_stderr": 0.04188307537595852
    },
    "harness|hendrycksTest-astronomy|5": {
        "acc": 0.7763157894736842,
        "acc_stderr": 0.033911609343436025,
        "acc_norm": 0.7763157894736842,
        "acc_norm_stderr": 0.033911609343436025
    },
    "harness|hendrycksTest-business_ethics|5": {
        "acc": 0.75,
        "acc_stderr": 0.04351941398892446,
        "acc_norm": 0.75,
        "acc_norm_stderr": 0.04351941398892446
    },
    "harness|hendrycksTest-clinical_knowledge|5": {
        "acc": 0.7245283018867924,
        "acc_stderr": 0.027495663683724057,
        "acc_norm": 0.7245283018867924,
        "acc_norm_stderr": 0.027495663683724057
    },
    "harness|hendrycksTest-college_biology|5": {
        "acc": 0.7222222222222222,
        "acc_stderr": 0.03745554791462457,
        "acc_norm": 0.7222222222222222,
        "acc_norm_stderr": 0.03745554791462457
    },
    "harness|hendrycksTest-college_chemistry|5": {
        "acc": 0.55,
        "acc_stderr": 0.05,
        "acc_norm": 0.55,
        "acc_norm_stderr": 0.05
    },
    "harness|hendrycksTest-college_computer_science|5": {
        "acc": 0.6,
        "acc_stderr": 0.04923659639173309,
        "acc_norm": 0.6,
        "acc_norm_stderr": 0.04923659639173309
    },
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.48,
        "acc_stderr": 0.05021167315686779,
        "acc_norm": 0.48,
        "acc_norm_stderr": 0.05021167315686779
    },
    "harness|hendrycksTest-college_medicine|5": {
        "acc": 0.6994219653179191,
        "acc_stderr": 0.0349610148119118,
        "acc_norm": 0.6994219653179191,
        "acc_norm_stderr": 0.0349610148119118
    },
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.4215686274509804,
        "acc_stderr": 0.049135952012744975,
        "acc_norm": 0.4215686274509804,
        "acc_norm_stderr": 0.049135952012744975
    },
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.81,
        "acc_stderr": 0.039427724440366234,
        "acc_norm": 0.81,
        "acc_norm_stderr": 0.039427724440366234
    },
    "harness|hendrycksTest-conceptual_physics|5": {
        "acc": 0.6723404255319149,
        "acc_stderr": 0.030683020843231004,
        "acc_norm": 0.6723404255319149,
        "acc_norm_stderr": 0.030683020843231004
    },
    "harness|hendrycksTest-econometrics|5": {
        "acc": 0.5614035087719298,
        "acc_stderr": 0.04668000738510455,
        "acc_norm": 0.5614035087719298,
        "acc_norm_stderr": 0.04668000738510455
    },
    "harness|hendrycksTest-electrical_engineering|5": {
        "acc": 0.7103448275862069,
        "acc_stderr": 0.03780019230438014,
        "acc_norm": 0.7103448275862069,
        "acc_norm_stderr": 0.03780019230438014
    },
    "harness|hendrycksTest-elementary_mathematics|5": {
        "acc": 0.5555555555555556,
        "acc_stderr": 0.02559185776138218,
        "acc_norm": 0.5555555555555556,
        "acc_norm_stderr": 0.02559185776138218
    },
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.5317460317460317,
        "acc_stderr": 0.04463112720677172,
        "acc_norm": 0.5317460317460317,
        "acc_norm_stderr": 0.04463112720677172
    },
    "harness|hendrycksTest-global_facts|5": {
        "acc": 0.44,
        "acc_stderr": 0.04988876515698589,
        "acc_norm": 0.44,
        "acc_norm_stderr": 0.04988876515698589
    },
    "harness|hendrycksTest-high_school_biology|5": {
        "acc": 0.8161290322580645,
        "acc_stderr": 0.02203721734026782,
        "acc_norm": 0.8161290322580645,
        "acc_norm_stderr": 0.02203721734026782
    },
    "harness|hendrycksTest-high_school_chemistry|5": {
        "acc": 0.5960591133004927,
        "acc_stderr": 0.03452453903822032,
        "acc_norm": 0.5960591133004927,
        "acc_norm_stderr": 0.03452453903822032
    },
    "harness|hendrycksTest-high_school_computer_science|5": {
        "acc": 0.75,
        "acc_stderr": 0.04351941398892446,
        "acc_norm": 0.75,
        "acc_norm_stderr": 0.04351941398892446
    },
    "harness|hendrycksTest-high_school_european_history|5": {
        "acc": 0.8363636363636363,
        "acc_stderr": 0.02888787239548795,
        "acc_norm": 0.8363636363636363,
        "acc_norm_stderr": 0.02888787239548795
    },
    "harness|hendrycksTest-high_school_geography|5": {
        "acc": 0.8737373737373737,
        "acc_stderr": 0.023664359402880215,
        "acc_norm": 0.8737373737373737,
        "acc_norm_stderr": 0.023664359402880215
    },
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.8911917098445595,
        "acc_stderr": 0.02247325333276875,
        "acc_norm": 0.8911917098445595,
        "acc_norm_stderr": 0.02247325333276875
    },
    "harness|hendrycksTest-high_school_macroeconomics|5": {
        "acc": 0.6974358974358974,
        "acc_stderr": 0.023290888053772732,
        "acc_norm": 0.6974358974358974,
        "acc_norm_stderr": 0.023290888053772732
    },
    "harness|hendrycksTest-high_school_mathematics|5": {
        "acc": 0.4074074074074074
    }
}
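To illustrate how the per-task entries relate to an aggregated score, the snippet below macro-averages the accuracies of two tasks copied from the results above. This is only a simplified sketch: the real "all" block aggregates every task, and the leaderboard uses normalized accuracy for some benchmarks.

```python
# Per-task accuracies copied from the results dump above; the real
# "all" aggregate spans all tasks, so this two-task mean is a sketch.
per_task_acc = {
    "harness|arc:challenge|25": 0.5358361774744027,
    "harness|hellaswag|10": 0.5926110336586338,
}
macro_acc = sum(per_task_acc.values()) / len(per_task_acc)
print(round(macro_acc, 4))  # 0.5642
```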