SWHL/ChineseOCRBench
Chinese OCRBench is a benchmark dataset for evaluating multimodal large language models on Chinese OCR tasks, a domain that previously lacked a dedicated evaluation set. It comprises 3,410 images and 3,410 question-answer pairs sourced from the ReCTS and ESTVQA datasets. Each annotation records the source dataset name, sample id, question, answers, type, and image file name, which makes the dataset suitable for OCR benchmarking and research.
Updated 4/30/2024
hugging_face
Description
Dataset Overview
Name
- Chinese OCRBench
Purpose
- Dedicated benchmark for Chinese OCR tasks, addressing the lack of Chinese evaluation for multimodal LLMs.
Composition
- 3,410 images and 3,410 QA pairs.
- Sources: ReCTS and ESTVQA datasets.
Detailed Composition
| Dataset | Images | Questions |
|---|---|---|
| ESTVQA | 709 | 709 |
| ReCTS | 2,701 | 2,701 |
| Total | 3,410 | 3,410 |
Annotation Format
- Each sample includes the fields dataset_name, id, question, answers, type, and file_name.
Usage
- For evaluation, use this dataset together with the MultimodalOCR evaluation script (recommended).
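The MultimodalOCR script itself is not reproduced here. As a rough illustration of how a model's predictions could be scored against the `answers` field, the sketch below uses a containment rule: a prediction counts as correct when the reference answer appears inside it after normalization. That rule, and the helper names `normalize`, `is_correct`, and `accuracy`, are assumptions made for this example and may differ from what the official script does.

```python
def normalize(text: str) -> str:
    """Lowercase and remove all whitespace so spacing differences are ignored."""
    return "".join(text.lower().split())


def is_correct(prediction: str, answer: str) -> bool:
    """Assumed rule: correct if the normalized answer occurs in the prediction."""
    target = normalize(answer)
    # An empty reference answer is never counted as a match.
    return bool(target) and target in normalize(prediction)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions judged correct under the assumed rule."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(is_correct(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)
```

A containment rule tolerates models that answer in full sentences, at the cost of accepting some over-long outputs; for reported results, the recommended evaluation script should be used.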
Loading Example
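# Requires the Hugging Face `datasets` library.
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and select the test split.
dataset = load_dataset("SWHL/ChineseOCRBench")
test_data = dataset["test"]

# Inspect the first sample.
print(test_data[0])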
# {image: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=760x1080 at 0x12544E770>, dataset_name: ESTVQA_cn, id: 0, question: 这家店的名字是什么?, answers: 禾不锈钢, type: Chinese}
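Once the split is loaded, one way to sanity-check it against the composition table is to count samples per source dataset. The sketch below is a minimal illustration: `count_by_source` is a hypothetical helper, and the stand-in records are invented for the demonstration. The label `ESTVQA_cn` comes from the sample output above, while `ReCTS` is an assumed label for the second source.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_source(samples: Iterable[Mapping]) -> Counter:
    """Count samples per value of the `dataset_name` field."""
    return Counter(sample["dataset_name"] for sample in samples)


# Stand-in records for illustration only; "ReCTS" is an assumed label.
demo_samples = [
    {"dataset_name": "ESTVQA_cn", "id": 0},
    {"dataset_name": "ReCTS", "id": 1},
    {"dataset_name": "ReCTS", "id": 2},
]

for name, count in sorted(count_by_source(demo_samples).items()):
    print(name, count)
```

Passing the loaded `test_data` split instead of `demo_samples` should yield per-source counts that can be compared with the table (709 for ESTVQA and 2,701 for ReCTS), provided the labels in the data map onto those two sources.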
License
- Apache-2.0
Language
- Chinese
Size
- 1K < n < 10K
Topics
Chinese OCR
Multimodal Learning
Source
Organization: hugging_face
Created: Unknown