
HebrewManuscripts

This dataset provides images of Hebrew letters and a stop symbol for training and evaluating optical character recognition (OCR) models. It supports OCR of Hebrew text, educational tools for Hebrew learners, and digitization of historical Hebrew manuscripts.

Source: huggingface
Created: Oct 18, 2024
Updated: Oct 19, 2024
Signals: 407 views
Availability: Linked source ready
Overview

Dataset description and usage context

Hebrew Letter Recognition Dataset

Dataset Description

The dataset contains Hebrew letters and a stop symbol for training and evaluating OCR models. It supports:

  • Hebrew text OCR
  • Educational tools for Hebrew learners
  • Digitization of historical Hebrew manuscripts

Dataset Structure

Data are organized into directories, each corresponding to a specific Hebrew letter or the stop symbol. Each directory holds multiple .jpg images showing the character in various fonts, sizes, and variations.

Directory Layout:

/dataset/
    /א/      (images of the letter "Aleph")
    /ב/      (images of the letter "Bet")
    /ג/      (images of the letter "Gimel")
    ...
    /stop/   (images of the stop symbol ".")

  • Number of classes: 29 (28 Hebrew letters + 1 stop symbol)
  • File format: .jpg
  • Image size: Typically 64 × 64 pixels

Example Directory Structure:

dataset/
    א/
        1.jpg
        2.jpg
        ...
    ב/
        1.jpg
        2.jpg
        ...
    stop/
        1.jpg
        2.jpg
        ...
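
The layout above maps directly onto a list of (image path, class name) pairs. The following is a minimal sketch using only the Python standard library. The function name `list_samples` and the root path `dataset` are illustrative assumptions, not part of the dataset itself.

```python
from pathlib import Path


def list_samples(root):
    """Return sorted (image_path, class_name) pairs found under root.

    Assumes one sub-directory per class, each holding .jpg files,
    as in the directory layout described above.
    """
    samples = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files at the top level
        for image_path in sorted(class_dir.glob("*.jpg")):
            # The directory name doubles as the class label.
            samples.append((image_path, class_dir.name))
    return samples
```

Calling `list_samples("dataset")` on a local copy would then yield one entry per image, with the label taken from the name of the enclosing directory.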

Class Labels:

  • The class list names 27 letter forms (the 22 base letters plus the 5 final forms ך, ם, ן, ף, ץ) and the stop symbol:
    • א, ב, ג, ד, ה, ו, ז, ח, ט, י, ך, כ, ל, ם, מ, ן, נ, ס, ע, ף, פ, ץ, צ, ק, ר, ש, ת, stop (.)
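
Most training code needs integer class indices rather than directory names. The sketch below copies the class names exactly as listed above and assigns indices in that order. The index order and the helper names `encode_label` and `decode_label` are assumptions made for illustration. The dataset does not prescribe a numbering.

```python
# Class names exactly as they appear in the class list above.
CLASS_NAMES = [
    "א", "ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט",
    "י", "ך", "כ", "ל", "ם", "מ", "ן", "נ", "ס",
    "ע", "ף", "פ", "ץ", "צ", "ק", "ר", "ש", "ת",
    "stop",
]

# Assumed numbering: position in the list above.
CLASS_TO_INDEX = {name: index for index, name in enumerate(CLASS_NAMES)}


def encode_label(class_name):
    """Map a class (directory) name to its integer index."""
    return CLASS_TO_INDEX[class_name]


def decode_label(index):
    """Map an integer index back to its class name."""
    return CLASS_NAMES[index]
```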

Dataset Uses

The dataset can be used to train machine‑learning models for:

  • Hebrew letter recognition: Building models that recognize individual Hebrew letters from scanned documents or photos.
  • OCR systems: Developing OCR pipelines for printed or handwritten Hebrew documents.
  • Educational tools: Creating real‑time letter‑recognition applications for Hebrew language learning.

Pre‑Processing

Steps:

  • Resize: Resize all images to a uniform size (e.g., 64 × 64 pixels) before feeding them into a CNN.
  • Normalize: Scale pixel values to the range [0, 1] by dividing by 255.
  • Data augmentation (optional): Apply rotations, flips, and scaling to improve model robustness.
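
The resize and normalize steps above can be sketched with NumPy alone, assuming the image has already been decoded into an 8-bit array. Nearest-neighbour sampling is used here only to keep the example dependency-free. An imaging library with proper interpolation would be the more usual choice in practice. The function names are illustrative, and the optional augmentation step is not shown.

```python
import numpy as np


def resize_nearest(image, size=(64, 64)):
    """Resize an (H, W) or (H, W, C) array with nearest-neighbour sampling."""
    height, width = image.shape[:2]
    new_height, new_width = size
    # For each output row/column, pick the source row/column it maps to.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    return image[rows[:, None], cols]


def normalize(image):
    """Scale 8-bit pixel values to 32-bit floats in [0, 1]."""
    return image.astype(np.float32) / 255.0


def preprocess(image, size=(64, 64)):
    """Resize to a uniform size, then normalize."""
    return normalize(resize_nearest(image, size))
```

Resizing before normalizing keeps the intermediate array in 8-bit form, which is cheaper to index than the float version.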

Statistics

  • Total images: 307
  • Number of classes: 29 (28 letters + 1 stop symbol)
  • File format: .jpg
  • Average images per class: Approximately 10‑15
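
As a quick check on the figures above, 307 images over 29 classes gives roughly 10.6 images per class, which sits at the low end of the stated 10‑15 range. The sketch below shows that arithmetic and a way to recount a local copy. The function names and the one-directory-per-class assumption are illustrative.

```python
from collections import Counter
from pathlib import Path


def images_per_class(root):
    """Count .jpg files in each class directory directly under root."""
    counts = Counter()
    for image_path in Path(root).glob("*/*.jpg"):
        counts[image_path.parent.name] += 1
    return counts


def average_per_class(counts):
    """Mean number of images per class (0.0 if there are no classes)."""
    return sum(counts.values()) / len(counts) if counts else 0.0


# Using the totals stated above: 307 images over 29 classes.
stated_average = 307 / 29  # about 10.6
```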

License

The dataset is released under the MIT License. You may freely use, modify, and distribute the dataset provided you retain attribution to the original authors.

Citation

Please cite the dataset as follows:

@misc{hebrew-letter-dataset,
  title={Hebrew Letter Recognition Dataset},
  author={Benjamin Schnabel},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/your-dataset}},
}

Contributions

We welcome contributions of additional Hebrew letter variants or improvements to the dataset via pull requests or issue reports.
