Tags: Open Source Community, Multimodal Learning, Chinese OCR
SWHL/ChineseOCRBench
Chinese OCRBench is a benchmark dataset for evaluating Chinese OCR, built to fill the gap in Chinese-language evaluation of multimodal large language models. It comprises 3,410 images and 3,410 question-answer pairs drawn from the ReCTS and ESTVQA datasets. Each annotation records the source dataset name, sample id, question, answers, type, and image filename, which makes the set suitable for OCR benchmarking and research.
Source
hugging_face
Created
Nov 28, 2025
Updated
Apr 30, 2024
Dataset Overview
Name
- Chinese OCRBench
Purpose
- Dedicated benchmark for Chinese OCR tasks, addressing the lack of Chinese evaluation for multimodal LLMs.
Composition
- 3,410 images and 3,410 QA pairs.
- Sources: ReCTS and ESTVQA datasets.
Detailed Composition
| Dataset | Images | Questions |
|---|---|---|
| ESTVQA | 709 | 709 |
| ReCTS | 2,701 | 2,701 |
| Total | 3,410 | 3,410 |
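The split above can be checked after loading. The following is a minimal sketch, assuming the per-sample `dataset_name` field identifies the source dataset; the helper name `count_by_source` and the stand-in labels are illustrative, and only `ESTVQA_cn` appears in the sample output on this page.

```python
from collections import Counter

# Per-source counts as stated in the table above.
EXPECTED = {"ESTVQA": 709, "ReCTS": 2701}

def count_by_source(names):
    """Count how many samples carry each dataset_name value."""
    return Counter(names)

# The table's total is the sum of its two rows.
total = sum(EXPECTED.values())
print(total)

# Stand-in labels for illustration; with the real data, pass the
# dataset_name column of the test split instead.
demo_names = ["ESTVQA_cn", "ESTVQA_cn", "other_source"]  # "other_source" is hypothetical
counts = count_by_source(demo_names)
print(counts)
```

The exact label strings used for each source are not listed here, so compare the counts against the table by inspecting the keys that the real data produces.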
Annotation Format
- Each sample includes: dataset_name, id, question, answers, type, file_name.
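A quick way to confirm that a loaded sample matches this field list is sketched below. The helper `missing_fields` is hypothetical, and the demo record mirrors the printed sample in the Loading Example, which shows an `image` field and no `file_name`, so the loaded schema may differ slightly from the list above.

```python
# Field names as listed in the annotation format above.
REQUIRED_FIELDS = ("dataset_name", "id", "question", "answers", "type", "file_name")

def missing_fields(sample, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from a sample dict."""
    return [name for name in required if name not in sample]

# Demo record copied from the sample output on this page (image omitted).
demo = {
    "dataset_name": "ESTVQA_cn",
    "id": 0,
    "question": "这家店的名字是什么?",
    "answers": "禾不锈钢",
    "type": "Chinese",
}
print(missing_fields(demo))  # ['file_name']
```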
Usage
- Intended for use together with the MultimodalOCR evaluation script.
Loading Example
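from datasets import load_dataset

# Download the benchmark from the Hugging Face Hub
dataset = load_dataset("SWHL/ChineseOCRBench")
# Select the test split
test_data = dataset["test"]
# Print the first sample
print(test_data[0])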
# {image: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=760x1080 at 0x12544E770>, dataset_name: ESTVQA_cn, id: 0, question: 这家店的名字是什么?, answers: 禾不锈钢, type: Chinese}
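Once samples are loaded, model predictions can be scored against the `answers` field. The sketch below is not the MultimodalOCR evaluation script; it assumes a simple containment rule (a prediction counts as correct if it contains a reference answer after whitespace is removed), and the names `normalize`, `is_correct`, and `accuracy` are illustrative.

```python
def normalize(text):
    """Remove all whitespace and lowercase any Latin characters."""
    return "".join(str(text).split()).lower()

def is_correct(prediction, answers):
    """True if any non-empty reference answer appears inside the prediction."""
    if isinstance(answers, str):
        answers = [answers]  # accept a single string or a list of strings
    pred = normalize(prediction)
    refs = [normalize(a) for a in answers]
    return any(ref in pred for ref in refs if ref)

def accuracy(predictions, references):
    """Fraction of predictions judged correct; 0.0 for empty input."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)

# Demo with the answer from the sample output above; the predictions are made up.
preds = ["这家店叫禾不锈钢。", "不知道"]
refs = ["禾不锈钢", "禾不锈钢"]
print(accuracy(preds, refs))  # 0.5
```

For reported results, use the MultimodalOCR script itself, since its matching rules may differ from this containment check.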
License
- Apache‑2.0
Language
- Chinese
Size
- 1K < n < 10K