HebrewManuscripts
This dataset provides images of Hebrew letters and a stop symbol for training and evaluating optical character recognition (OCR) models. It supports OCR of Hebrew text, educational tools for Hebrew learners, and digitization of historical Hebrew manuscripts.
Description
Hebrew Letter Recognition Dataset
Dataset Description
The dataset contains images of Hebrew letters and a stop symbol for training and evaluating OCR models. It supports:
- Hebrew text OCR
- Educational tools for Hebrew learners
- Digitization of historical Hebrew manuscripts
Dataset Structure
Data are organized into directories, each corresponding to a specific Hebrew letter or the stop symbol. Each directory holds multiple .jpg images showing the character in various fonts, sizes, and variations.
Directory Layout:
/dataset/
    /א/      (images of the letter "Aleph")
    /ב/      (images of the letter "Bet")
    /ג/      (images of the letter "Gimel")
    ...
    /stop/   (images of the stop symbol ".")
- Number of classes: 29 (28 Hebrew letters + 1 stop symbol)
- File format: .jpg
- Image size: Typically 64 × 64 pixels
Example Directory Structure:
dataset/
    א/
        1.jpg
        2.jpg
        ...
    ב/
        1.jpg
        2.jpg
        ...
    stop/
        1.jpg
        2.jpg
        ...
Class Labels:
- The dataset includes the following classes (letters and stop symbol):
- א, ב, ג, ד, ה, ו, ז, ח, ט, י, ך, כ, ל, ם, מ, ן, נ, ס, ע, ף, פ, ץ, צ, ק, ר, ש, ת, stop (.)
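A minimal sketch of how the directory-per-class layout described above could be indexed for training. The function name `index_dataset` and the choice to assign class indices by sorted directory name are assumptions made for illustration; they are not part of the dataset itself.

```python
from pathlib import Path


def index_dataset(root):
    """Return (class_names, samples) for a directory-per-class layout.

    `samples` is a list of (image_path, class_index) pairs.
    Hypothetical helper: the dataset does not ship a loader.
    """
    root = Path(root)
    # Each subdirectory name (e.g. "א" or "stop") is treated as a class label.
    # Sorting makes the label-to-index mapping reproducible (an assumption).
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())

    samples = []
    for class_index, name in enumerate(class_names):
        # Only .jpg files are collected, matching the stated file format.
        for image_path in sorted((root / name).glob("*.jpg")):
            samples.append((image_path, class_index))
    return class_names, samples
```

Calling `index_dataset("dataset")` on a local copy would return the class names found on disk together with one `(path, index)` pair per image, which can then be fed to whatever training framework is in use.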
Dataset Uses
The dataset can be used to train machine‑learning models for:
- Hebrew letter recognition: Building models that recognize individual Hebrew letters from scanned documents or photos.
- OCR systems: Developing OCR pipelines for printed or handwritten Hebrew documents.
- Educational tools: Creating real‑time letter‑recognition applications for Hebrew language learning.
Pre‑Processing
Steps:
- Resize: Resize all images to a uniform size (e.g., 64 × 64 pixels) before feeding them into a CNN.
- Normalize: Scale pixel values to the range [0, 1] by dividing by 255.
- Data augmentation (optional): Apply rotations, flips, and scaling to improve model robustness.
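The pre-processing steps above can be sketched with Pillow and NumPy. This is an illustrative sketch, not the authors' pipeline: the function names, the conversion to grayscale, the white fill colour, and the ±10° rotation range are all assumptions. Only rotation is shown for the augmentation step, for brevity.

```python
import numpy as np
from PIL import Image


def preprocess(img, size=(64, 64)):
    """Resize a PIL image and scale its pixel values to [0, 1]."""
    # Grayscale conversion is an assumption; the source does not state
    # whether the images are colour or grayscale.
    img = img.convert("L").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0


def augment_rotate(img, max_degrees=10.0, rng=None):
    """Rotate by a random small angle (hypothetical augmentation helper)."""
    if rng is None:
        rng = np.random.default_rng()
    angle = float(rng.uniform(-max_degrees, max_degrees))
    # Corners exposed by the rotation are filled with white, assuming
    # dark glyphs on a light background.
    return img.convert("L").rotate(angle, fillcolor=255)
```

A typical call would be `preprocess(augment_rotate(Image.open(path)))`, which yields a `float32` array of shape `(64, 64)` ready to be stacked into a batch.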
Statistics
- Total images: 307
- Number of classes: 29 (28 letters + 1 stop symbol)
- File format: .jpg
- Average images per class: Approximately 10‑15
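As a quick consistency check, 307 images across 29 classes gives about 10.6 images per class, which falls within the stated range. The sketch below shows one way these statistics could be recomputed from a local copy; the helper names `images_per_class` and `summarize` are hypothetical.

```python
from pathlib import Path


def images_per_class(root):
    """Map each class directory name to its number of .jpg files."""
    root = Path(root)
    return {
        p.name: len(list(p.glob("*.jpg")))
        for p in sorted(root.iterdir())
        if p.is_dir()
    }


def summarize(counts):
    """Return (total_images, number_of_classes, mean_images_per_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    mean = total / n_classes if n_classes else 0.0
    return total, n_classes, mean
```

Running `summarize(images_per_class("dataset"))` on a downloaded copy makes it easy to compare the counts on disk against the figures listed here.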
License
The dataset is released under the MIT License. You may freely use, modify, and distribute the dataset provided you retain attribution to the original authors.
Citation
Please cite the dataset as follows:
@misc{hebrew-letter-dataset,
  title={Hebrew Letter Recognition Dataset},
  author={Benjamin Schnabel},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/your-dataset}},
}
Contributions
We welcome contributions of additional Hebrew letter variants or improvements to the dataset via pull requests or issue reports.
Source
Organization: huggingface
Created: 10/18/2024