kensho/bizbench

---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: task
    dtype: string
  - name: context
    dtype: string
  - name: context_type
    dtype: string
  - name: options
    sequence: string
  - name: program
    dtype: string
  splits:
  - name: train
    num_bytes: 52823429
    num_examples: 14377
  - name: test
    num_bytes: 15720371
    num_examples: 4673
  download_size: 23760863
  dataset_size: 68543800
---

<p align="left">
  <img src="bizbench_pyramid.png">
</p>

# BizBench: A Quantitative Reasoning Benchmark for Business and Finance

Public dataset for [BizBench](https://arxiv.org/abs/2311.06602).

Answering questions within business and finance requires reasoning, precision, and a wide breadth of technical knowledge. Together, these requirements make this domain difficult for large language models (LLMs). We introduce BizBench, a benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises **eight quantitative reasoning tasks**, focusing on question-answering (QA) over financial data via program synthesis. We include three financially-themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate the reasoning capabilities required for financial QA: reading comprehension of financial text and tables for extracting intermediate values, and understanding financial concepts and formulas needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to parse financial documents, and capacity to solve problems with code.

We conducted an in-depth evaluation of open-source and commercial LLMs, comparing and contrasting the behavior of code-focused and language-focused models. We demonstrate that the current bottleneck in performance is due to LLMs' limited business and financial understanding, highlighting the value of a challenging benchmark for quantitative reasoning within this domain.

We have also developed a heavily curated leaderboard with a held-out test set open to submissions: [https://benchmarks.kensho.com/](https://benchmarks.kensho.com/). This set was manually curated by financial professionals and further cleaned by hand in order to ensure the highest quality. A sample pipeline for using this dataset can be found at [https://github.com/kensho-technologies/benchmarks-pipeline](https://github.com/kensho-technologies/benchmarks-pipeline).

## Dataset Statistics

| Dataset | Train/Few-Shot Data | Test Data |
| --- | --- | --- |
| **Program Synthesis** | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4668 | 795 |
| CodeTATQA | 2856 | 2000 |
| **Quantity Extraction** | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6846 | 2000 |
| **Domain Knowledge** | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |
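The card frames the benchmark as QA via program synthesis, and each record carries a `program` and an `answer` string. As a rough illustration of how a generated program might be scored against a gold answer, here is a minimal sketch. It rests on assumptions the card does not confirm: that programs are Python, that they bind their result to a variable named `answer`, and that a 1% relative tolerance is acceptable. The helper names `run_program` and `is_correct` are hypothetical and are not taken from the Kensho pipeline.

```python
import math


def run_program(program: str, result_name: str = "answer"):
    """Execute a Python program string and return the value bound to result_name.

    Assumption: the program stores its final value in a variable called
    `answer`; the dataset card does not document the real convention.
    Warning: exec runs arbitrary code, so use it only on trusted input
    or inside a sandbox.
    """
    namespace: dict = {}
    exec(program, namespace)
    return namespace.get(result_name)


def is_correct(program: str, gold_answer: str, rel_tol: float = 1e-2) -> bool:
    """Compare a program's result with the gold answer string."""
    try:
        predicted = run_program(program)
    except Exception:
        return False  # a crashing program counts as wrong
    try:
        # Numeric answers: compare within a relative tolerance.
        return math.isclose(float(predicted), float(gold_answer), rel_tol=rel_tol)
    except (TypeError, ValueError):
        # Non-numeric answers: fall back to a case-insensitive string match.
        return str(predicted).strip().lower() == gold_answer.strip().lower()


# Hypothetical toy program (not drawn from the dataset): a gross margin calculation.
toy_program = "revenue = 120.0\ncost = 90.0\nanswer = (revenue - cost) / revenue"
print(is_correct(toy_program, "0.25"))  # prints True
```

Treating a crash as an incorrect answer keeps the scorer total over all records; a real harness would also need timeouts and isolation before running model-generated code.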

Updated 6/3/2024
hugging_face

Description

BizBench Dataset Overview

Dataset Information

License

  • Apache 2.0

Configuration

  • Default configuration
    • Training data path: data/train-*
    • Test data path: data/test-*

Features

  • question: string
  • answer: string
  • task: string
  • context: string
  • context_type: string
  • options: sequence of strings
  • program: string
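The feature list above can be turned into a small schema check for loaded records. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the dataset is reachable under the id `kensho/bizbench` shown on this page. `EXPECTED_FEATURES`, `check_record`, and `load_bizbench` are illustrative names, not part of any official tooling.

```python
from typing import Any

# Feature names and Python types, transcribed from the list above.
EXPECTED_FEATURES: dict[str, type] = {
    "question": str,
    "answer": str,
    "task": str,
    "context": str,
    "context_type": str,
    "options": list,  # sequence of strings
    "program": str,
}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name, expected in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: unused fields may come back as None rather than "".
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    options = record.get("options")
    if isinstance(options, list) and not all(isinstance(o, str) for o in options):
        problems.append("options: every element should be a string")
    return problems


def load_bizbench(split: str = "train"):
    """Load one split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the schema check above also works offline.
    from datasets import load_dataset

    return load_dataset("kensho/bizbench", split=split)
```

With network access, something like `[p for r in load_bizbench("test") for p in check_record(r)]` would collect any mismatches between the downloaded records and the schema listed here.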

Data Splits

  • Training set
    • Bytes: 52,823,429
    • Samples: 14,377
  • Test set
    • Bytes: 15,720,371
    • Samples: 4,673

Size

  • Download size: 23,760,863 bytes
  • Dataset size: 68,543,800 bytes
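The byte and sample counts above can be cross-checked with a little arithmetic. The sketch below only uses the figures listed on this page; the variable names are illustrative, and the per-example averages are rough indicators rather than documented properties of the dataset.

```python
# Figures transcribed from the Data Splits and Size listings above.
TRAIN_BYTES = 52_823_429
TEST_BYTES = 15_720_371
DOWNLOAD_BYTES = 23_760_863
DATASET_BYTES = 68_543_800
TRAIN_EXAMPLES = 14_377
TEST_EXAMPLES = 4_673

# The dataset size should equal the sum of the per-split byte counts.
splits_match_total = TRAIN_BYTES + TEST_BYTES == DATASET_BYTES

# Ratio of the download size to the full dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

# Rough average size of one example in each split.
avg_train_bytes = TRAIN_BYTES / TRAIN_EXAMPLES
avg_test_bytes = TEST_BYTES / TEST_EXAMPLES

print(f"splits sum to dataset size:  {splits_match_total}")
print(f"download / dataset size:     {download_ratio:.1%}")
print(f"avg bytes per train example: {avg_train_bytes:,.0f}")
print(f"avg bytes per test example:  {avg_test_bytes:,.0f}")
```

The two split sizes do add up to the stated dataset size, and the download is roughly a third of that total.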

Statistics

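| Dataset | Train/Few-Shot Data | Test Data |
| --- | --- | --- |
| **Program Synthesis** | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4,668 | 795 |
| CodeTATQA | 2,856 | 2,000 |
| **Quantity Extraction** | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6,846 | 2,000 |
| **Domain Knowledge** | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |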
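The per-task counts in the statistics above can be tallied and compared with the split sizes listed earlier. The sketch below transcribes the counts by hand; `STATS` and `column_total` are illustrative names. The train/few-shot column sums to 14,377, which matches the training split. The test column sums to 6,385, while the public test split is listed with 4,673 samples; the page does not explain the difference.

```python
# (train_or_few_shot, test) counts per task, transcribed from the statistics above.
# None marks a cell that is blank in the table.
STATS = {
    "Program Synthesis": {
        "FinCode": (7, 47),
        "CodeFinQA": (4668, 795),
        "CodeTATQA": (2856, 2000),
    },
    "Quantity Extraction": {
        "ConvFinQA (E)": (None, 629),
        "TAT-QA (E)": (None, 120),
        "SEC-Num": (6846, 2000),
    },
    "Domain Knowledge": {
        "FinKnow": (None, 744),
        "FormulaEval": (None, 50),
    },
}


def column_total(column: int) -> int:
    """Sum one column (0 = train/few-shot, 1 = test), treating blanks as zero."""
    return sum(
        counts[column] or 0
        for tasks in STATS.values()
        for counts in tasks.values()
    )


train_total = column_total(0)
test_total = column_total(1)
num_tasks = sum(len(tasks) for tasks in STATS.values())

print(f"tasks: {num_tasks}, train/few-shot: {train_total:,}, test: {test_total:,}")
```

The task count of eight agrees with the "eight quantitative reasoning tasks" named in the dataset description.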

Topics

Commercial Finance
Quantitative Reasoning

Source

Organization: hugging_face

Created: Unknown
