Dataset asset · Open Source Community · Multimodal Learning · Chinese OCR

SWHL/ChineseOCRBench

Chinese OCRBench is a dataset designed for evaluating Chinese OCR tasks, filling an evaluation gap for multimodal large language models in this domain. It comprises 3,410 images and 3,410 question-answer pairs sourced from the ReCTS and ESTVQA datasets. Each annotation records the image file name, the question, the answer, and related fields, which makes the dataset suitable for OCR benchmarking and research.

Source
Hugging Face
Created
Nov 28, 2025
Updated
Apr 30, 2024

Dataset Overview

Name

  • Chinese OCRBench

Purpose

  • Dedicated benchmark for Chinese OCR tasks, addressing the lack of Chinese-language evaluation for multimodal LLMs.

Composition

  • 3,410 images and 3,410 QA pairs.
  • Sources: ReCTS and ESTVQA datasets.

Detailed Composition

Dataset   Images   Questions
ESTVQA    709      709
ReCTS     2,701    2,701
Total     3,410    3,410

Annotation Format

  • Each sample includes: dataset_name, id, question, answers, type, file_name.
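
Before running an evaluation, it can help to confirm that each record carries the fields listed above. The following is a minimal sketch, not part of the dataset's tooling: the helper name missing_fields is hypothetical, and the file_name value in the example record is made up for illustration.

```python
from typing import Any, List, Mapping

# Field names as listed in the annotation format above.
EXPECTED_FIELDS = ("dataset_name", "id", "question", "answers", "type", "file_name")


def missing_fields(sample: Mapping[str, Any]) -> List[str]:
    """Return the expected annotation fields that are absent from a sample."""
    return [name for name in EXPECTED_FIELDS if name not in sample]


# Illustrative record: the file_name value is hypothetical.
example = {
    "dataset_name": "ESTVQA_cn",
    "id": 0,
    "question": "这家店的名字是什么?",
    "answers": "禾不锈钢",
    "type": "Chinese",
    "file_name": "example.jpg",
}

print(missing_fields(example))
```

An empty list means every listed field is present. A record loaded through the datasets library may expose the picture under an image key as well, so the check only looks for missing fields and ignores extra ones.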

Usage

  • Intended to be used together with the MultimodalOCR evaluation script.
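
For a quick sanity check without the full evaluation script, a simple containment-based scorer can be sketched as below. This is an assumption about one reasonable scoring rule, not the MultimodalOCR script's actual logic: the function names normalize, is_correct, and accuracy are hypothetical, and a prediction counts as correct when any reference answer appears inside it after lowercasing and whitespace removal.

```python
from typing import List, Sequence, Union


def normalize(text: str) -> str:
    """Lowercase the text and remove all whitespace."""
    return "".join(text.lower().split())


def is_correct(prediction: str, answers: Union[str, Sequence[str]]) -> bool:
    """True if any non-empty reference answer is contained in the prediction."""
    if isinstance(answers, str):
        answers = [answers]
    pred = normalize(prediction)
    return any(normalize(a) in pred for a in answers if normalize(a))


def accuracy(predictions: List[str], references: List[Union[str, Sequence[str]]]) -> float:
    """Fraction of predictions that contain one of their reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Model outputs here are invented for illustration.
print(is_correct("这家店的名字是禾不锈钢。", "禾不锈钢"))
print(accuracy(["禾不锈钢", "不知道"], ["禾不锈钢", "禾不锈钢"]))
```

Containment matching is lenient toward verbose model outputs. Scores produced this way are not comparable with published numbers unless the official script applies the same rule.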

Loading Example

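from datasets import load_dataset

# Download the benchmark from the Hugging Face Hub (cached after the first call).
dataset = load_dataset("SWHL/ChineseOCRBench")

# Select the "test" split and print the first sample.
test_data = dataset["test"]
print(test_data[0])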
# {image: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=760x1080 at 0x12544E770>, dataset_name: ESTVQA_cn, id: 0, question: 这家店的名字是什么?, answers: 禾不锈钢, type: Chinese}
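
Once the split is loaded, the per-source counts in the composition table can be cross-checked by tallying the dataset_name column. The sketch below works on a plain list of names so that it runs without downloading anything. The helper count_by_source is hypothetical, and the sample values are illustrative: only ESTVQA_cn appears in the output above, and the label used for ReCTS samples is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_source(dataset_names: Iterable[str]) -> Dict[str, int]:
    """Count how many samples carry each dataset_name value."""
    return dict(Counter(dataset_names))


# Illustrative values; on the real split, pass test_data["dataset_name"].
names = ["ESTVQA_cn", "ReCTS", "ReCTS"]
counts = count_by_source(names)
print(counts)
print(sum(counts.values()))
```

On the real data, the total should equal 3,410 if the composition table holds. Reading only the dataset_name column also avoids decoding every image.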

License

  • Apache‑2.0

Language

  • Chinese

Size

  • 1K < n < 10K