riikkamarttila committed on
Commit c31a608 · verified · 1 Parent(s): ce541fa

Update README.md

Files changed (1)
  1. README.md +104 -3
README.md CHANGED
@@ -1,3 +1,104 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ metrics:
+ - cer
+ pipeline_tag: image-to-text
+ ---
+ # Model Description
+
+ **Model Name:** cyrillic-htr-model
+
+ **Model Type:** Transformer-based OCR (TrOCR)
+
+ **Base Model:** microsoft/trocr-large-handwritten
+
+ **Purpose:** Handwritten text recognition
+
+ **Languages:** Cyrillic
+
+ **License:** Apache 2.0
+
+ This model is a fine-tuned version of the microsoft/trocr-large-handwritten model, specialized for recognizing handwritten Cyrillic text. So far it has been trained on a dataset of 740 pages dating from the 17th to the 20th century.
+
+ # Model Architecture
+
+ The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:
+
+ - The encoder processes images of handwritten text.
+ - The decoder generates corresponding text output.
+
+ # Intended Use
+
+ This model is designed for handwritten text recognition and is intended for use in:
+
+ - Document digitization (e.g., archival work, historical manuscripts)
+ - Handwritten notes transcription
+
+ # Training Data
+
+ The training dataset contains more than 30,000 rows of handwritten text.
+
+ # Evaluation
+
+ The model was evaluated on a test dataset. The key metrics are:
+
+ **Character Error Rate (CER):** 8
+
+ **Test Dataset Description:** approximately 33,400 text rows
+
+ # How to Use the Model
+
+ You can use the model directly with Hugging Face’s pipeline function or by manually loading the processor and model.
+
+ ```python
+ from transformers import TrOCRProcessor, VisionEncoderDecoderModel
+ from PIL import Image
+
+ # Load the processor and the model from the Hugging Face Hub.
+ # Assumption: the processor files live in the repository's "processor" subfolder,
+ # as the path in the original snippet suggests.
+ processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/cyrillic-htr-model", subfolder="processor")
+ model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-htr-model")
+
+ # Open an image of handwritten text and convert it to 3-channel RGB, which TrOCR expects
+ image = Image.open("path_to_image.png").convert("RGB")
+
+ # Preprocess the image, generate token ids, and decode them to text
+ pixel_values = processor(images=image, return_tensors="pt").pixel_values
+ generated_ids = model.generate(pixel_values)
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+ print(generated_text)
+ ```
+
+ # Limitations and Biases
+
+ The model was trained primarily on handwritten text that uses basic Cyrillic characters.
+
+ # Future Work
+
+ Potential improvements for this model include:
+
+ - Expanding training data: Incorporating more diverse handwriting styles and languages.
+ - Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.
+
+ # Citation
+
+ If you use this model in your work, please cite it as:
+
+ @misc{cyrillic_htr_model_2025,
+
+ author = {Kansallisarkisto},
+
+ title = {Cyrillic HTR Model: Handwritten Text Recognition},
+
+ year = {2025},
+
+ publisher = {Hugging Face},
+
+ howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-htr-model/}},
+
+ }
+
+ ## Model Card Authors
+
+ Author: Kansallisarkisto
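
Editorial note on the Evaluation section of the README above: the card reports a Character Error Rate (CER) but does not say how it was computed or in which unit. The following is a minimal sketch of one common definition, assuming corpus-level CER (total character edit distance divided by total reference length). The function names `levenshtein` and `cer` and the sample strings are illustrative and do not come from the repository.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER as a fraction: total edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical ground-truth rows and model outputs, for illustration only
    refs = ["привет мир", "архив"]
    hyps = ["привет мар", "архив"]
    print(f"CER: {100 * cer(refs, hyps):.1f} %")
```

Under this definition, and if the reported value of 8 is a percentage (an assumption, since the card does not state the unit), it would correspond to roughly 8 character edits per 100 reference characters. Averaging per-row CER values instead of pooling all characters gives a different number, so the two conventions should not be compared directly.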