---
license: apache-2.0
metrics:
- cer
pipeline_tag: image-to-text
---

# Model description

**Model Name:** cyrillic-htr-model

**Model Type:** Transformer-based OCR (TrOCR)

**Base Model:** microsoft/trocr-large-handwritten

**Purpose:** Handwritten text recognition

**Languages:** Cyrillic

**License:** Apache 2.0

This model is a fine-tuned version of the microsoft/trocr-large-handwritten model, specialized for recognizing handwritten Cyrillic text. So far it has been trained on a dataset of 740 pages dating from the 17th to the 20th centuries.

# Model Architecture

The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:

- The encoder processes images of handwritten text.
- The decoder generates the corresponding text output.

# Intended Use

This model is designed for handwritten text recognition and is intended for use in:

- Document digitization (e.g., archival work, historical manuscripts)
- Transcription of handwritten notes

# Training data

The training dataset includes more than 30000 samples of handwritten text rows.

# Evaluation

The model was evaluated on a test dataset. The key metrics are:

**Character Error Rate (CER):** 8

**Test Dataset Description:** size ~33 400 text rows

# How to Use the Model

You can use the model directly with Hugging Face’s pipeline function or by manually loading the processor and model.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor (stored under the repository's "processor" path) and the model
processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/cyrillic-htr-model/processor")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-htr-model")

# Open an image of a single handwritten text row; TrOCR expects 3-channel RGB input
image = Image.open("path_to_image.png").convert("RGB")

# Preprocess the image, generate token ids, and decode them to text
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)
```

# Limitations and Biases

The model was trained primarily on handwritten text that uses basic Cyrillic characters.

# Future Work

Potential improvements for this model include:

- Expanding training data: Incorporating more diverse handwriting styles and languages.
- Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.

# Citation

If you use this model in your work, please cite it as:

    @misc{cyrillic_htr_model_2025,
      author = {Kansallisarkisto},
      title = {Cyrillic HTR Model: Handwritten Text Recognition},
      year = {2025},
      publisher = {Hugging Face},
      howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-htr-model/}},
    }

## Model Card Authors

Author: Kansallisarkisto
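
# Example: computing CER

The Evaluation section reports a Character Error Rate but does not state which implementation or scale was used. As a rough illustration only, the sketch below assumes the common definition: total character-level edit distance divided by the total number of reference characters. The function names `levenshtein` and `cer` are hypothetical helpers written for this example and are not part of the model or of any library.

```python
from typing import List


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level CER as a fraction: total edits / total reference characters.
    Assumed definition; the card does not specify how its CER was computed."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # One substituted character out of six reference characters
    print(f"{cer(['привет'], ['привед']):.3f}")  # 0.167
```

In practice, `hypotheses` would be the strings returned by `processor.batch_decode(...)` for each text row image, and `references` the corresponding ground-truth transcriptions.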