HebrewManuscripts

This dataset provides images of Hebrew letters and a stop symbol for training and evaluating optical character recognition (OCR) models. It supports OCR of Hebrew text, educational tools for Hebrew learners, and digitization of historical Hebrew manuscripts.

Updated 10/19/2024
huggingface

Description

Hebrew Letter Recognition Dataset

Dataset Description

The dataset contains Hebrew letters and a stop symbol for training and evaluating OCR models. It supports:

  • Hebrew text OCR
  • Educational tools for Hebrew learners
  • Digitization of historical Hebrew manuscripts

Dataset Structure

Data are organized into directories, each corresponding to a specific Hebrew letter or the stop symbol. Each directory holds multiple .jpg images showing the character in various fonts, sizes, and variations.

Directory Layout:

/dataset/
    /א/      (images of the letter "Aleph")
    /ב/      (images of the letter "Bet")
    /ג/      (images of the letter "Gimel")
    ...
    /stop/   (images of the stop symbol ".")

  • Number of classes: 29 (28 Hebrew letters + 1 stop symbol)
  • File format: .jpg
  • Image size: Typically 64 × 64 pixels

Example Directory Structure:

dataset/
    א/
        1.jpg
        2.jpg
        ...
    ב/
        1.jpg
        2.jpg
        ...
    stop/
        1.jpg
        2.jpg
        ...
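The layout above maps naturally onto (image path, label) pairs, with the directory name serving as the label. The following is a minimal sketch using only the Python standard library; the function name `list_samples` and the root path are illustrative assumptions, not part of the dataset.

```python
from pathlib import Path


def list_samples(root):
    """Return sorted (image_path, label) pairs for every .jpg under root/<label>/.

    The label is simply the name of the class directory (a Hebrew letter or "stop").
    Files with other extensions are ignored.
    """
    samples = []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for image_path in sorted(class_dir.glob("*.jpg")):
            samples.append((image_path, class_dir.name))
    return samples
```

Calling `list_samples("dataset")` on a local copy would return one pair per image, which can then be passed to whichever image loader the training code uses.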

Class Labels:

  • The dataset card lists the following 28 class names (22 base letters, the 5 final forms ך, ם, ן, ף, ץ, and the stop symbol), one fewer than the stated count of 29:
    • א, ב, ג, ד, ה, ו, ז, ח, ט, י, ך, כ, ל, ם, מ, ן, נ, ס, ע, ף, פ, ץ, צ, ק, ר, ש, ת, stop (.)
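Classifiers usually work with integer class indices rather than directory names. The sketch below builds such a mapping from the class names listed above. The index order simply follows that list and is an assumption for illustration; the dataset itself does not define one.

```python
# Class names copied from the list above; the ordering is an arbitrary choice.
CLASS_NAMES = [
    "א", "ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט", "י",
    "ך", "כ", "ל", "ם", "מ", "ן", "נ", "ס", "ע", "ף",
    "פ", "ץ", "צ", "ק", "ר", "ש", "ת", "stop",
]

# Lookup table from class name to integer index.
CLASS_TO_INDEX = {name: index for index, name in enumerate(CLASS_NAMES)}


def encode(label):
    """Map a class name (directory name) to its integer index."""
    return CLASS_TO_INDEX[label]


def decode(index):
    """Map an integer index back to its class name."""
    return CLASS_NAMES[index]
```

For example, `encode("א")` gives `0`, and `decode(encode("stop"))` gives back `"stop"`.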

Dataset Uses

The dataset can be used to train machine‑learning models for:

  • Hebrew letter recognition: Building models that recognize individual Hebrew letters from scanned documents or photos.
  • OCR systems: Developing OCR pipelines for printed or handwritten Hebrew documents.
  • Educational tools: Creating real‑time letter‑recognition applications for Hebrew language learning.

Pre‑Processing

Steps:

  • Resize: Resize all images to a uniform size (e.g., 64 × 64 pixels) before feeding them into a CNN.
  • Normalize: Scale pixel values to the range [0, 1] by dividing by 255.
  • Data augmentation (optional): Apply small rotations, scaling, and flips to improve model robustness. Use flips with care, since a mirrored letter may no longer resemble any valid Hebrew letter.
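The steps above can be sketched with NumPy alone, assuming each image has already been decoded into a `uint8` array of shape (H, W) or (H, W, C). The function names and the nearest-neighbour resize are illustrative choices, not the dataset author's pipeline; an imaging library would normally handle decoding, higher-quality resizing, rotation, and scaling, which are left out here.

```python
import numpy as np


def resize_nearest(image, size=64):
    """Resize an (H, W) or (H, W, C) array to size x size by nearest-neighbour sampling."""
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size  # source row for each output row
    cols = np.arange(size) * width // size   # source column for each output column
    return image[rows[:, None], cols]


def normalize(image):
    """Scale uint8 pixel values from [0, 255] to float32 values in [0, 1]."""
    return image.astype(np.float32) / 255.0


def preprocess(image, size=64):
    """Resize, then normalize."""
    return normalize(resize_nearest(image, size))


def random_flip(image, rng, probability=0.5):
    """Optional augmentation: mirror the image left to right with the given probability.

    Mirroring changes the shape of a letter, so inspect the results before relying on it.
    """
    if rng.random() < probability:
        return np.fliplr(image)
    return image
```

A typical call would be `preprocess(pixels)` for evaluation, and `random_flip(preprocess(pixels), np.random.default_rng(0))` during training if flips turn out to be useful.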

Statistics

  • Total images: 307
  • Number of classes: 29 (28 letters + 1 stop symbol)
  • File format: .jpg
  • Images per class: Approximately 10 to 15 (mean 307 / 29 ≈ 10.6)
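With only about ten images per class, it is worth checking the per-class counts on a local copy before training. The sketch below assumes the labels are available as a plain list of class names (for instance, collected from the directory names); the helper names are hypothetical.

```python
from collections import Counter


def class_counts(labels):
    """Count how many images carry each class label."""
    return Counter(labels)


def mean_per_class(counts):
    """Average number of images per class."""
    return sum(counts.values()) / len(counts)


# Figures taken from the statistics above.
TOTAL_IMAGES = 307
NUM_CLASSES = 29
STATED_MEAN = round(TOTAL_IMAGES / NUM_CLASSES, 2)  # 10.59
```

Comparing `min(counts.values())` and `max(counts.values())` against the stated range is a quick way to spot classes that are under-represented.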

License

The dataset is released under the MIT License. You may freely use, modify, and distribute the dataset, provided that the original copyright notice and license text are retained in all copies.

Citation

Please cite the dataset as follows:

@misc{hebrew-letter-dataset,
  title={Hebrew Letter Recognition Dataset},
  author={Benjamin Schnabel},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/your-dataset}},
}

Contributions

We welcome contributions of additional Hebrew letter variants or improvements to the dataset via pull requests or issue reports.

Topics

Optical Character Recognition
Hebrew

Source

Organization: huggingface

Created: 10/18/2024
