---
license: apache-2.0
metrics:
- cer
pipeline_tag: image-to-text
---

# Model description

**Model Name:** cyrillic-htr-model

**Model Type:** Transformer-based OCR (TrOCR)

**Base Model:** microsoft/trocr-large-handwritten

**Purpose:** Handwritten text recognition

**Languages:** Cyrillic

**License:** Apache 2.0

This model is a fine-tuned version of microsoft/trocr-large-handwritten, specialized for recognizing handwritten Cyrillic text. So far it has been trained on a dataset of 740 pages dating from the 17th to the 20th centuries.

# Model Architecture

The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:

- The encoder processes images of handwritten text.
- The decoder generates corresponding text output.

# Intended Use

This model is designed for handwritten text recognition and is intended for use in:

- Document digitization (e.g., archival work, historical manuscripts)
- Transcription of handwritten notes

# Training data

The training dataset contains more than 30,000 handwritten text rows.

# Evaluation

The model was evaluated on a test dataset. The key metrics are:

**Character Error Rate (CER):** 8

**Test Dataset Description:** approximately 33,400 text rows
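
For readers who want to compute a comparable figure on their own data, the following is a minimal sketch of the usual CER definition: the total character-level edit distance divided by the total number of reference characters. The card does not state how its CER was computed or whether the reported value is a percentage, so the function names and the corpus-level averaging used here are assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def levenshtein(reference: str, hypothesis: str) -> int:
    """Return the minimum number of character edits turning hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: summed edit distance over summed reference length (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(ref, hyp) for ref, hyp in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical ground truth and model output for two text rows.
    refs = ["привет мир", "архив"]
    hyps = ["привет мир", "архиб"]
    print(f"CER: {character_error_rate(refs, hyps):.2%}")
```

Summing edits over the whole test set before dividing weights long rows more heavily than averaging per-row CER values; the two conventions can give noticeably different numbers on short rows.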

# How to Use the Model

You can use the model with Hugging Face’s pipeline function or by manually loading the processor and model. The example below loads them manually.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor and model. The processor files are assumed to live in the
# "processor" subfolder of the model repository.
processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/cyrillic-htr-model", subfolder="processor")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-htr-model")

# Open an image of a handwritten text row; TrOCR expects RGB input
image = Image.open("path_to_image.png").convert("RGB")

# Preprocess and predict
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)
```
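
When transcribing many text rows, for example all rows cropped from one page, it can help to process them in batches. The sketch below wraps the same three calls used above (`processor(...)`, `model.generate`, `processor.batch_decode`) in a small helper. The names `chunks` and `transcribe_lines` and the default batch size are hypothetical choices for illustration; the helper takes an already loaded processor and model as arguments.

```python
from typing import Any, Iterator, List, Sequence


def chunks(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transcribe_lines(images: Sequence[Any], processor: Any, model: Any,
                     batch_size: int = 8) -> List[str]:
    """Transcribe text-row images in batches, preserving input order.

    `images` is expected to hold RGB PIL images of single text rows;
    `processor` and `model` are the objects loaded as shown above.
    """
    texts: List[str] = []
    for batch in chunks(images, batch_size):
        # Preprocess the whole batch into one tensor of pixel values.
        pixel_values = processor(images=list(batch), return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

With the objects from the example above, `transcribe_lines([image], processor, model)` would return a one-element list containing the transcription. A suitable batch size depends on available memory.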

# Limitations and Biases

The model was trained primarily on handwritten text that uses basic Cyrillic characters, so recognition accuracy may be lower for text that contains other characters.
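
One way to judge whether your own material falls within this limitation is to scan existing transcriptions for letters outside the expected range. The sketch below is only an illustration: the card does not define "basic Cyrillic", so treating it as the Unicode "Cyrillic" block (U+0400 to U+04FF) is an assumption, and the function name is hypothetical.

```python
from typing import List

# Assumption: "basic Cyrillic" is taken to mean the Unicode "Cyrillic" block.
BASIC_CYRILLIC_START = 0x0400
BASIC_CYRILLIC_END = 0x04FF


def letters_outside_basic_cyrillic(text: str) -> List[str]:
    """Return the sorted, de-duplicated letters of `text` outside U+0400..U+04FF.

    Digits, punctuation, and whitespace are ignored; only alphabetic
    characters are checked.
    """
    unexpected = {
        ch for ch in text
        if ch.isalpha() and not (BASIC_CYRILLIC_START <= ord(ch) <= BASIC_CYRILLIC_END)
    }
    return sorted(unexpected)


if __name__ == "__main__":
    # Hypothetical transcription mixing Cyrillic and Latin letters.
    print(letters_outside_basic_cyrillic("Привѣтъ, world 123"))
```

A non-empty result suggests the text contains letters the model may not have seen often during training, such as Latin script or characters from the extended Cyrillic blocks.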

# Future Work

Potential improvements for this model include:

- Expanding training data: Incorporating more diverse handwriting styles and languages.
- Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.

# Citation

If you use this model in your work, please cite it as:

    @misc{cyrillic_htr_model_2025,
      author = {Kansallisarkisto},
      title = {Cyrillic HTR Model: Handwritten Text Recognition},
      year = {2025},
      publisher = {Hugging Face},
      howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-htr-model/}},
    }

## Model Card Authors

Author: Kansallisarkisto