open-llm-leaderboard-old/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k
This dataset was automatically created during the evaluation run of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard. It comprises 63 configurations, one per evaluation task. The dataset is generated from a single run; each run is stored as a split within each configuration, named after the run timestamp, and the 'train' split always points to the latest results. An additional 'results' configuration stores the aggregated results of all runs, used to compute and display the aggregated metrics on the Open LLM Leaderboard.
Dataset description and usage context
Dataset Overview
Dataset Name
Evaluation run of OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k
Dataset Summary
This dataset was automatically created during the evaluation of the model OpenBuddy/openbuddy-qwen1.5-14b-v21.1-32k on the Open LLM Leaderboard.
Dataset Composition
- The dataset contains 63 configurations, each corresponding to an evaluation task.
- It is generated from a single run; each run is stored as a split within each configuration, named after the run timestamp.
- The "train" split always points to the latest results.
- An additional configuration "results" stores aggregated results of all runs, used for computing and displaying aggregated metrics on the Open LLM Leaderboard.
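The configuration names appear to map onto the harness task identifiers by replacing the `|` and `:` separators with underscores (an assumption inferred from the `harness_winogrande_5` config used in the loading example; verify against the dataset's actual configuration list). A minimal sketch of that mapping:

```python
def task_to_config(task_id: str) -> str:
    """Map a harness task id (e.g. 'harness|arc:challenge|25') to a
    dataset configuration name usable with load_dataset.

    Assumes the convention that '|' and ':' become '_', which matches
    the 'harness_winogrande_5' config shown in the loading example.
    """
    return task_id.replace("|", "_").replace(":", "_")


print(task_to_config("harness|arc:challenge|25"))  # harness_arc_challenge_25
```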
Data Loading Example
from datasets import load_dataset

# Load the latest results for the 5-shot Winogrande task.
data = load_dataset(
    "open-llm-leaderboard/details_OpenBuddy__openbuddy-qwen1.5-14b-v21.1-32k",
    "harness_winogrande_5",
    split="train",
)
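Because splits are named after run timestamps, selecting the most recent run programmatically requires parsing those names. A minimal sketch, assuming split names follow a `2024_04_09T06_57_17.996714` pattern (the run timestamp with separators replaced by underscores; the exact format should be confirmed against the dataset's split names):

```python
from datetime import datetime


def parse_run_split(name: str) -> datetime:
    """Parse a run-split name such as '2024_04_09T06_57_17.996714'
    into a datetime so runs can be sorted chronologically."""
    return datetime.strptime(name, "%Y_%m_%dT%H_%M_%S.%f")


# Given several run splits, the latest is simply the max by timestamp.
splits = ["2024_03_30T11_02_33.123456", "2024_04_09T06_57_17.996714"]
latest = max(splits, key=parse_run_split)
print(latest)  # 2024_04_09T06_57_17.996714
```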
Latest Results
The latest results are from the 2024-04-09T06:57:17.996714 run:
{
"all": {
"acc": 0.6783573743130548,
"acc_stderr": 0.031630411639720406,
"acc_norm": 0.6843006798291303,
"acc_norm_stderr": 0.032244439733683676,
"mc1": 0.39657282741738065,
"mc1_stderr": 0.017124930942023518,
"mc2": 0.5584410548633238,
"mc2_stderr": 0.014920454151130717
},
"harness|arc:challenge|25": {
"acc": 0.5358361774744027,
"acc_stderr": 0.01457381366473572,
"acc_norm": 0.5793515358361775,
"acc_norm_stderr": 0.014426211252508403
},
"harness|hellaswag|10": {
"acc": 0.5926110336586338,
"acc_stderr": 0.004903441680003823,
"acc_norm": 0.788388767177853,
"acc_norm_stderr": 0.004076158744346766
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.38,
"acc_stderr": 0.048783173121456316,
"acc_norm": 0.38,
"acc_norm_stderr": 0.048783173121456316
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6222222222222222,
"acc_stderr": 0.04188307537595852,
"acc_norm": 0.6222222222222222,
"acc_norm_stderr": 0.04188307537595852
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.7763157894736842,
"acc_stderr": 0.033911609343436025,
"acc_norm": 0.7763157894736842,
"acc_norm_stderr": 0.033911609343436025
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.75,
"acc_stderr": 0.04351941398892446,
"acc_norm": 0.75,
"acc_norm_stderr": 0.04351941398892446
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.7245283018867924,
"acc_stderr": 0.027495663683724057,
"acc_norm": 0.7245283018867924,
"acc_norm_stderr": 0.027495663683724057
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7222222222222222,
"acc_stderr": 0.03745554791462457,
"acc_norm": 0.7222222222222222,
"acc_norm_stderr": 0.03745554791462457
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.55,
"acc_stderr": 0.05,
"acc_norm": 0.55,
"acc_norm_stderr": 0.05
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.6,
"acc_stderr": 0.04923659639173309,
"acc_norm": 0.6,
"acc_norm_stderr": 0.04923659639173309
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.48,
"acc_stderr": 0.05021167315686779,
"acc_norm": 0.48,
"acc_norm_stderr": 0.05021167315686779
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.6994219653179191,
"acc_stderr": 0.0349610148119118,
"acc_norm": 0.6994219653179191,
"acc_norm_stderr": 0.0349610148119118
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.4215686274509804,
"acc_stderr": 0.049135952012744975,
"acc_norm": 0.4215686274509804,
"acc_norm_stderr": 0.049135952012744975
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.81,
"acc_stderr": 0.039427724440366234,
"acc_norm": 0.81,
"acc_norm_stderr": 0.039427724440366234
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.6723404255319149,
"acc_stderr": 0.030683020843231004,
"acc_norm": 0.6723404255319149,
"acc_norm_stderr": 0.030683020843231004
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.5614035087719298,
"acc_stderr": 0.04668000738510455,
"acc_norm": 0.5614035087719298,
"acc_norm_stderr": 0.04668000738510455
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.7103448275862069,
"acc_stderr": 0.03780019230438014,
"acc_norm": 0.7103448275862069,
"acc_norm_stderr": 0.03780019230438014
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.5555555555555556,
"acc_stderr": 0.02559185776138218,
"acc_norm": 0.5555555555555556,
"acc_norm_stderr": 0.02559185776138218
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.5317460317460317,
"acc_stderr": 0.04463112720677172,
"acc_norm": 0.5317460317460317,
"acc_norm_stderr": 0.04463112720677172
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.44,
"acc_stderr": 0.04988876515698589,
"acc_norm": 0.44,
"acc_norm_stderr": 0.04988876515698589
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.8161290322580645,
"acc_stderr": 0.02203721734026782,
"acc_norm": 0.8161290322580645,
"acc_norm_stderr": 0.02203721734026782
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.5960591133004927,
"acc_stderr": 0.03452453903822032,
"acc_norm": 0.5960591133004927,
"acc_norm_stderr": 0.03452453903822032
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.75,
"acc_stderr": 0.04351941398892446,
"acc_norm": 0.75,
"acc_norm_stderr": 0.04351941398892446
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.8363636363636363,
"acc_stderr": 0.02888787239548795,
"acc_norm": 0.8363636363636363,
"acc_norm_stderr": 0.02888787239548795
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.8737373737373737,
"acc_stderr": 0.023664359402880215,
"acc_norm": 0.8737373737373737,
"acc_norm_stderr": 0.023664359402880215
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.8911917098445595,
"acc_stderr": 0.02247325333276875,
"acc_norm": 0.8911917098445595,
"acc_norm_stderr": 0.02247325333276875
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6974358974358974,
"acc_stderr": 0.023290888053772732,
"acc_norm": 0.6974358974358974,
"acc_norm_stderr": 0.023290888053772732
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.4074074074074074
}
}
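The top-level "all" block aggregates the per-task metrics. As a rough illustration only (not the leaderboard's exact aggregation, which applies its own per-benchmark weighting), an unweighted macro-average of `acc` over a few of the task entries shown above can be computed like this:

```python
# A small hand-copied subset of the per-task results above, for illustration.
results = {
    "harness|arc:challenge|25": {"acc": 0.5358361774744027},
    "harness|hellaswag|10": {"acc": 0.5926110336586338},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.38},
}

# Macro-average accuracy: unweighted mean over tasks.
macro_acc = sum(v["acc"] for v in results.values()) / len(results)
print(round(macro_acc, 4))  # 0.5028
```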