kensho/bizbench
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: task
    dtype: string
  - name: context
    dtype: string
  - name: context_type
    dtype: string
  - name: options
    sequence: string
  - name: program
    dtype: string
  splits:
  - name: train
    num_bytes: 52823429
    num_examples: 14377
  - name: test
    num_bytes: 15720371
    num_examples: 4673
  download_size: 23760863
  dataset_size: 68543800
---

<p align="left">
  <img src="bizbench_pyramid.png">
</p>

# BizBench: A Quantitative Reasoning Benchmark for Business and Finance

Public dataset for [BizBench](https://arxiv.org/abs/2311.06602).

Answering questions within business and finance requires reasoning, precision, and a wide breadth of technical knowledge. Together, these requirements make this domain difficult for large language models (LLMs). We introduce BizBench, a benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises **eight quantitative reasoning tasks**, focusing on question-answering (QA) over financial data via program synthesis. We include three financially themed code-generation tasks from newly collected and augmented QA data. Additionally, we isolate the reasoning capabilities required for financial QA: reading comprehension of financial text and tables for extracting intermediate values, and understanding financial concepts and formulas needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to parse financial documents, and capacity to solve problems with code. We conducted an in-depth evaluation of open-source and commercial LLMs, comparing and contrasting the behavior of code-focused and language-focused models.
We demonstrate that the current bottleneck in performance is due to LLMs' limited business and financial understanding, highlighting the value of a challenging benchmark for quantitative reasoning within this domain.

We have also developed a heavily curated leaderboard with a held-out test set open to submission: [https://benchmarks.kensho.com/](https://benchmarks.kensho.com/). This set was manually curated by financial professionals and further cleaned by hand in order to ensure the highest quality.

A sample pipeline for using this dataset can be found at [https://github.com/kensho-technologies/benchmarks-pipeline](https://github.com/kensho-technologies/benchmarks-pipeline).

## Dataset Statistics

| Dataset | Train/Few Shot Data | Test Data |
| --- | --- | --- |
| **Program Synthesis** | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4668 | 795 |
| CodeTATQA | 2856 | 2000 |
| **Quantity Extraction** | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6846 | 2000 |
| **Domain Knowledge** | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |
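The per-task counts in the table above can, in principle, be recomputed from the `task` feature of each row. The following is a minimal sketch of such a tally. The helper name `count_by_task` is hypothetical, and because the card does not state the exact strings stored in `task`, the sketch does not hardcode any task names.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_by_task(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how many rows carry each value of the `task` feature."""
    # Each row is expected to be a mapping with at least a `task` key,
    # which is how rows look when iterating over a loaded split.
    return dict(Counter(str(row["task"]) for row in rows))
```

Passing the rows of one split to `count_by_task` should give one entry per task, which can then be compared against the corresponding column of the table.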
Description
BizBench Dataset Overview
Dataset Information
License
- Apache 2.0
Configuration
- Default configuration
- Training data path: data/train-*
- Test data path: data/test-*
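With the default configuration, both splits can presumably be fetched through the Hugging Face `datasets` library using the repository id `kensho/bizbench`. The sketch below assumes that library is installed and that network access is available. `load_bizbench` and `split_sizes` are hypothetical helper names, not part of any published pipeline.

```python
from typing import Dict, Mapping, Sized


def load_bizbench():
    """Download both splits from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that split_sizes() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("kensho/bizbench")


def split_sizes(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split."""
    return {name: len(rows) for name, rows in splits.items()}
```

If the hosted copy matches this card, `split_sizes(load_bizbench())` should report 14,377 training examples and 4,673 test examples.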
Features
- question: string
- answer: string
- task: string
- context: string
- context_type: string
- options: sequence of strings
- program: string
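To make the schema concrete, the sketch below models one row as a typed dictionary and assembles a plain-text prompt from it. The prompt layout is purely illustrative and is not the format used by the BizBench authors. It also assumes that `options` holds the candidate answers for multiple-choice rows and is empty otherwise, which the card does not state.

```python
from typing import List, TypedDict


class BizBenchRecord(TypedDict):
    """One row, mirroring the seven features listed above."""

    question: str
    answer: str
    task: str
    context: str
    context_type: str
    options: List[str]
    program: str


def build_prompt(record: BizBenchRecord) -> str:
    """Assemble a plain-text prompt (hypothetical layout)."""
    parts = []
    if record["context"]:
        parts.append("Context:\n" + record["context"])
    parts.append("Question: " + record["question"])
    # Label any candidate answers (A), (B), ...; tolerate a missing list.
    for index, option in enumerate(record["options"] or []):
        parts.append(f"({chr(ord('A') + index)}) {option}")
    parts.append("Answer:")
    return "\n".join(parts)
```

The `answer` and `program` fields are left out of the prompt on purpose, since they hold the reference outputs a model would be scored against.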
Data Splits
- Training set
  - Bytes: 52,823,429
  - Samples: 14,377
- Test set
  - Bytes: 15,720,371
  - Samples: 4,673
Size
- Download size: 23,760,863 bytes
- Dataset size: 68,543,800 bytes
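The size figures are mutually consistent: the dataset size equals the sum of the two split sizes, and the download is roughly a third of that, presumably because the hosted files are compressed. A short arithmetic check, using only the numbers listed above (the constant and function names are illustrative):

```python
TRAIN_BYTES = 52_823_429
TEST_BYTES = 15_720_371
DATASET_BYTES = 68_543_800
DOWNLOAD_BYTES = 23_760_863


def download_fraction(download_bytes: int, dataset_bytes: int) -> float:
    """Fraction of the full dataset size that must be downloaded."""
    return download_bytes / dataset_bytes


# 52,823,429 + 15,720,371 = 68,543,800, matching the reported dataset size.
splits_total = TRAIN_BYTES + TEST_BYTES

# About 0.347, i.e. the download is roughly 35% of the full dataset size.
fraction = download_fraction(DOWNLOAD_BYTES, DATASET_BYTES)
```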
Statistics
| Dataset | Train/Few-Shot Data | Test Data |
|---|---|---|
| Program Synthesis | | |
| FinCode | 7 | 47 |
| CodeFinQA | 4,668 | 795 |
| CodeTATQA | 2,856 | 2,000 |
| Quantity Extraction | | |
| ConvFinQA (E) | | 629 |
| TAT-QA (E) | | 120 |
| SEC-Num | 6,846 | 2,000 |
| Domain Knowledge | | |
| FinKnow | | 744 |
| FormulaEval | | 50 |
Source
Organization: hugging_face
Created: Unknown