kensho/bizbench

---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: task
    dtype: string
  - name: context
    dtype: string
  - name: context_type
    dtype: string
  - name: options
    sequence: string
  - name: program
    dtype: string
  splits:
  - name: train
    num_bytes: 52823429
    num_examples: 14377
  - name: test
    num_bytes: 15720371
    num_examples: 4673
  download_size: 23760863
  dataset_size: 68543800
---

<p align="left"> <img src="bizbench_pyramid.png"> </p>

# BizBench: A Quantitative Reasoning Benchmark for Business and Finance

Public dataset for [BizBench](https://arxiv.org/abs/2311.06602).

Answering questions within business and finance requires reasoning, precision, and a wide breadth of technical knowledge. Together, these requirements make this domain difficult for large language models (LLMs). We introduce BizBench, a benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises **eight quantitative reasoning tasks**, focusing on question-answering (QA) over financial data via program synthesis. We include three financially themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate the reasoning capabilities required for financial QA: reading comprehension of financial text and tables for extracting intermediate values, and understanding financial concepts and formulas needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to parse financial documents, and capacity to solve problems with code. We conducted an in-depth evaluation of open-source and commercial LLMs, comparing and contrasting the behavior of code-focused and language-focused models. We demonstrate that the current bottleneck in performance is due to LLMs' limited business and financial understanding, highlighting the value of a challenging benchmark for quantitative reasoning within this domain.

We have also developed a heavily curated leaderboard with a held-out test set open to submission: [https://benchmarks.kensho.com/](https://benchmarks.kensho.com/). This set was manually curated by financial professionals and further cleaned by hand in order to ensure the highest quality.

A sample pipeline for using this dataset can be found at [https://github.com/kensho-technologies/benchmarks-pipeline](https://github.com/kensho-technologies/benchmarks-pipeline).

## Dataset Statistics

| Dataset | Train/Few Shot Data | Test Data |
| --- | --- | --- |
| **Program Synthesis** | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4668 | 795 |
| CodeTATQA | 2856 | 2000 |
| **Quantity Extraction** | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6846 | 2000 |
| **Domain Knowledge** | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |
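Because the benchmark frames QA as program synthesis, scoring a prediction comes down to running a generated program and comparing its result with the reference `answer` string. The sketch below is one possible way to do that, not the authors' evaluation code. It assumes, hypothetically, that the program leaves its result in a variable named `result`, that answers are numeric, and that a 1% relative tolerance is acceptable; the helper names `run_program` and `is_correct` are illustrative only.

```python
import math


def run_program(program: str, result_var: str = "result"):
    """Execute a generated program and return the value bound to `result_var`.

    The `result` variable name is a hypothetical convention, not the dataset's.
    """
    namespace = {}
    exec(program, namespace)  # unsafe on untrusted code; sandbox this in practice
    return namespace[result_var]


def is_correct(program: str, answer: str, rel_tol: float = 0.01) -> bool:
    """Compare the executed result with the reference answer string."""
    try:
        predicted = float(run_program(program))
        expected = float(answer)
    except Exception:
        return False  # crashes and non-numeric outputs count as wrong
    return math.isclose(predicted, expected, rel_tol=rel_tol)


# Made-up example for illustration; it is not taken from the dataset.
program = "revenue = 120.0\ncost = 90.0\nresult = (revenue - cost) / revenue * 100"
print(is_correct(program, "25.0"))
```

Treating any exception as an incorrect answer keeps a single faulty program from aborting a whole evaluation run.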

Updated 6/3/2024

Description

BizBench Dataset Overview

Dataset Information

License

Apache 2.0

Configuration

Default configuration
- Training data path: data/train-*
- Test data path: data/test-*
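The split-to-path mapping above can be written down as a small sketch. It assumes the Hugging Face `datasets` library is installed and that the repository ID is `kensho/bizbench`, as in the title; the helper name `load_bizbench` is hypothetical. The import is deferred so the mapping can be inspected without the library or network access.

```python
# Split name -> file glob, as listed in the default configuration.
DATA_FILES = {"train": "data/train-*", "test": "data/test-*"}


def load_bizbench(split: str):
    """Load one split of kensho/bizbench (hypothetical helper; needs network access)."""
    if split not in DATA_FILES:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(DATA_FILES)}")
    # Imported lazily so this module still imports where `datasets` is missing.
    from datasets import load_dataset

    return load_dataset("kensho/bizbench", split=split)
```

Calling `load_bizbench("train")` would then download and return the training split.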

Features

question: string
answer: string
task: string
context: string
context_type: string
options: sequence of strings
program: string
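As a rough illustration of this schema, the seven features can be mirrored in a small record type. The class name `BizBenchExample`, the default values, and the sample rows are assumptions made for this sketch; in particular, the `context_type` value `"text"` is invented and may not match the values used in the dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class BizBenchExample:
    """One row with the seven listed features (class name is hypothetical)."""

    question: str
    answer: str
    task: str
    context: str
    context_type: str
    options: List[str] = field(default_factory=list)  # empty default is an assumption
    program: str = ""  # empty default is an assumption


def count_by_task(rows):
    """Count examples per task name."""
    return Counter(row.task for row in rows)


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    BizBenchExample("What is 2 + 2?", "4", "FinCode", "", "text", program="result = 2 + 2"),
    BizBenchExample("Pick one.", "A", "FinKnow", "", "text", options=["A", "B"]),
    BizBenchExample("What is 3 * 3?", "9", "FinCode", "", "text"),
]
print(count_by_task(rows))
```

Grouping on the `task` field like this is how a single split can be separated into the individual benchmark tasks.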

Data Splits

Training set
- Bytes: 52,823,429
- Samples: 14,377
Test set
- Bytes: 15,720,371
- Samples: 4,673

Size

Download size: 23,760,863 bytes
Dataset size: 68,543,800 bytes
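The size figures are internally consistent: the two split sizes add up to the reported dataset size, and the download is roughly a third of that. A short check, with constant names chosen for this sketch:

```python
TRAIN_BYTES = 52_823_429
TEST_BYTES = 15_720_371
DATASET_BYTES = 68_543_800
DOWNLOAD_BYTES = 23_760_863

# The reported dataset size equals the sum of the two split sizes.
assert TRAIN_BYTES + TEST_BYTES == DATASET_BYTES

# Fraction of the reported dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES
print(f"download/dataset ratio: {download_ratio:.3f}")
```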

Statistics

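| Dataset | Train/Few Shot Data | Test Data |
| --- | --- | --- |
| **Program Synthesis** | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4,668 | 795 |
| CodeTATQA | 2,856 | 2,000 |
| **Quantity Extraction** | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6,846 | 2,000 |
| **Domain Knowledge** | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |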
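The per-task counts can be totalled and compared with the split sizes. In the sketch below, blank train cells are treated as 0, which is an assumption. The train column sums to the 14,377 training samples reported under Data Splits. The test column sums to 6,385, which is larger than the 4,673 samples in the public test split; the source does not say how the two figures relate.

```python
# (train/few-shot, test) counts per task, copied from the statistics table.
# Blank train cells are treated as 0 (an assumption).
COUNTS = {
    "FinCode": (7, 47),
    "CodeFinQA": (4668, 795),
    "CodeTATQA": (2856, 2000),
    "ConvFinQA (E)": (0, 629),
    "TAT-QA (E)": (0, 120),
    "SEC-Num": (6846, 2000),
    "FinKnow": (0, 744),
    "FormulaEval": (0, 50),
}

train_total = sum(train for train, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())
print(train_total, test_total)
```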

Topics

Commercial Finance

Quantitative Reasoning

Source

Organization: hugging_face

Created: Unknown
