---
license: apache-2.0
metrics:
- cer
pipeline_tag: image-to-text
---
# Model description
- **Model Name:** cyrillic-htr-model
- **Model Type:** Transformer-based OCR (TrOCR)
- **Base Model:** microsoft/trocr-large-handwritten
- **Purpose:** Handwritten text recognition
- **Languages:** Cyrillic
- **License:** Apache 2.0

This model is a fine-tuned version of microsoft/trocr-large-handwritten, specialized for recognizing handwritten Cyrillic text. So far it has been trained on a dataset of 740 pages dating from the 17th to the 20th centuries.
# Model Architecture
The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:
- The encoder processes images of handwritten text.
- The decoder generates corresponding text output.
# Intended Use
This model is designed for handwritten text recognition and is intended for use in:
- Document digitization (e.g., archival work, historical manuscripts)
- Transcription of handwritten notes
# Training data
The training dataset contains more than 30,000 handwritten text rows.
# Evaluation
The model was evaluated on a test dataset. Key metrics:
- **Character Error Rate (CER):** 8
- **Test Dataset Description:** approximately 33,400 text rows
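
This card does not state how its CER figure was computed or normalized. As a point of reference, CER is commonly defined as the character-level edit distance between prediction and reference, divided by the number of reference characters. The sketch below follows that common definition in plain Python; the function names `levenshtein` and `cer` are illustrative and are not taken from the model's evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, predictions) -> float:
    """Corpus-level CER: total edit distance over total reference length."""
    total_edits = sum(levenshtein(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    return total_edits / total_chars
```

For example, `cer(["москва"], ["масква"])` counts one substitution over six reference characters. Multiplying the result by 100 expresses it as a percentage.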
# How to Use the Model
You can use the model directly with Hugging Face’s pipeline function or by manually loading the processor and model.
```python
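from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor and model from the Hugging Face Hub.
# Assumption: the processor files live in the repository's "processor"
# subfolder, as the original path suggested. A Hub repo id cannot itself
# contain a third path segment, so the subfolder is passed separately.
processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/cyrillic-htr-model", subfolder="processor")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-htr-model")

# Open an image of handwritten text; TrOCR expects three-channel RGB input
image = Image.open("path_to_image.png").convert("RGB")

# Preprocess the image, generate token ids, and decode them to text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)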
```
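
For digitization work that involves many text-row images, the same three calls can be applied in batches. The following is a minimal sketch, not part of the original model card: `transcribe_rows` and `batch_size` are hypothetical names, and `processor` and `model` are assumed to be the objects loaded in the example above.

```python
def transcribe_rows(images, processor, model, batch_size=8):
    """Transcribe a list of RGB PIL images of text rows, batch_size at a time.

    Returns one decoded string per input image, in input order.
    """
    texts = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # The processor resizes and normalizes every image to the same shape,
        # so the whole batch becomes a single tensor.
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

A call might look like `transcribe_rows(row_images, processor, model, batch_size=4)`, where `row_images` is a list of images opened with `Image.open(...).convert("RGB")`. A smaller `batch_size` lowers memory use at the cost of more forward passes.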
# Limitations and Biases
The model was trained primarily on handwritten text that uses basic Cyrillic characters.
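
One way to gauge whether a given collection falls outside this coverage is to scan existing transcriptions for letters beyond a basic set. The sketch below is illustrative only. The card does not define "basic Cyrillic", so the set used here (Unicode U+0410 to U+044F plus Ё and ё, i.e. the modern Russian alphabet) is an assumption, and `non_basic_letters` is a hypothetical helper name.

```python
# Assumption: "basic Cyrillic" is taken to mean the modern Russian alphabet,
# i.e. А..я (U+0410..U+044F) plus Ё and ё. The model card does not define it.
BASIC_CYRILLIC = {chr(cp) for cp in range(0x0410, 0x0450)} | {"Ё", "ё"}


def non_basic_letters(text: str) -> set:
    """Return the letters in text that fall outside the assumed basic set.

    Digits, punctuation, and whitespace are ignored; letters from other
    scripts (for example Latin) are reported as well.
    """
    return {ch for ch in text if ch.isalpha() and ch not in BASIC_CYRILLIC}
```

For instance, `non_basic_letters("мѣсто")` reports the pre-reform letter `ѣ`, which suggests the text may be harder for the model than text using only the basic set.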
# Future Work
Potential improvements for this model include:
- Expanding training data: Incorporating more diverse handwriting styles and languages.
- Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.
# Citation
If you use this model in your work, please cite it as:
@misc{cyrillic_htr_model_2025,
author = {Kansallisarkisto},
title = {Cyrillic HTR Model: Handwritten Text Recognition},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-htr-model/}},
}
## Model Card Authors
Author: Kansallisarkisto