---
license: cc-by-4.0
task_categories:
  - text2text-generation
language:
  - la
size_categories:
  - 1M<n<10M
tags:
  - medieval
  - editing
  - normalization
  - Georges
pretty_name: Normalized Georges 1913 Model
version: 1.0.0
---
# Normalization Model for Medieval Latin

## **Overview**
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms.

The model is part of the *Burchards Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing.

## **Model Architecture**
The model is a sequence-to-sequence (Seq2Seq) architecture with attention; a code sketch follows the parameter list below. Key components include:

1. **Embedding Layer**:
   - Converts character indices into dense vector representations.

2. **Bidirectional LSTM Encoder**:
   - Encodes the input sequence and captures bidirectional context.

3. **Attention Mechanism**:
   - Aligns decoder outputs with relevant encoder outputs for better context-awareness.

4. **LSTM Decoder**:
   - Decodes the normalized sequence character-by-character.

5. **Projection Layer**:
   - Maps decoder outputs to character probabilities.

### Model Parameters
- **Embedding Dimension**: 64  
- **Hidden Dimension**: 128  
- **Number of Layers**: 3  
- **Dropout**: 0.3  
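
The exact wiring is defined in `train_model.py` (see below). Purely as an illustration, a minimal PyTorch sketch of such an architecture with the hyperparameters above might look like this; the Luong-style dot-product attention and the encoder projection are assumptions, not a readout of the released weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqNormalizer(nn.Module):
    """Character-level Seq2Seq with attention (illustrative sketch)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128,
                 n_layers=3, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        # Fold the two encoder directions back to hid_dim so attention
        # and the decoder share one representation space.
        self.enc_proj = nn.Linear(2 * hid_dim, hid_dim)
        self.decoder = nn.LSTM(emb_dim, hid_dim, n_layers,
                               dropout=dropout, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))         # (B, S, 2H)
        enc_out = self.enc_proj(enc_out)                       # (B, S, H)
        dec_out, _ = self.decoder(self.embedding(tgt))         # (B, T, H)
        # Align each decoder step with the encoder positions.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))   # (B, T, S)
        context = torch.bmm(F.softmax(scores, dim=2), enc_out) # (B, T, H)
        # Project the attended representation to character logits.
        return self.out(torch.cat([dec_out, context], dim=2))  # (B, T, V)
```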

## **Dataset**
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization).
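
Because the dataset is plain tab-separated text, loading it takes only a few lines. A minimal sketch (the file name is a placeholder, not the actual dataset file):

```python
# Read the tab-separated (variant, normalized) pairs.
# "georges-1913.tsv" is a placeholder name.
pairs = []
with open("georges-1913.tsv", encoding="utf-8") as f:
    for line in f:
        variant, normalized = line.rstrip("\n").split("\t")
        pairs.append((variant, normalized))
```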

### Sample Data
| Orthographic Variant | Normalized Form    |
|-----------------------|--------------------|
|`circumcalcabicis`|`circumcalcabitis`|
|`peruincaturi`|`pervincaturi`|
|`tepidaremtur`|`tepidarentur`|
|`exmovemdis`|`exmovendis`|
|`comvomavisset`|`convomavisset`|
|`permeiemdis`|`permeiendis`|
|`permeditacissime`|`permeditatissime`|
|`conspersu`|`conspersu`|
|`pręviridancissimę`|`praeviridantissimae`|
|`relaxavisses`|`relaxavisses`|
|`edentaveratis`|`edentaveratis`|
|`amhelioris`|`anhelioris`|
|`remediatae`|`remediatae`|
|`discruciavero`|`discruciavero`|
|`imterplicavimus`|`interplicavimus`|
|`peraequata`|`peraequata`|
|`ignicomantissimorum`|`ignicomantissimorum`|
|`pręfvltvro`|`praefulturo`|

## **Training**
The model was trained with the following parameters; a sketch of the corresponding training loop follows the list.

- **Loss**: CrossEntropyLoss (ignoring the padding index).
- **Optimizer**: Adam with a learning rate of 0.0005.
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate when the validation loss stagnates.
- **Gradient Clipping**: max norm of 1.0.
- **Batch Size**: 4096.
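
A minimal sketch of a loop with these settings is shown below; `model`, `train_loader`, `val_loader`, and the epoch count are assumptions supplied by the surrounding script, not part of this repository's documented API:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, pad_idx, epochs=20):
    """Sketch of a training loop mirroring the parameters listed above."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

    for _ in range(epochs):
        model.train()
        for src, tgt in train_loader:             # batches of 4096 pairs
            optimizer.zero_grad()
            logits = model(src, tgt[:, :-1])      # teacher forcing
            loss = criterion(logits.flatten(0, 1), tgt[:, 1:].flatten())
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        # Evaluate and let the scheduler react to validation-loss stagnation.
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(s, t[:, :-1]).flatten(0, 1),
                          t[:, 1:].flatten()).item()
                for s, t in val_loader) / max(len(val_loader), 1)
        scheduler.step(val_loss)
```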

## **Use Cases**
This model can be used for:

- Normalizing orthographic variants in medieval Latin texts to the standard forms of Georges 1913.


## **Known Limitations**
The dataset has not undergone data augmentation and may carry substantial bias, particularly against irregular forms such as Greek loanwords like "presbyter".


## **How to Use**

### **Saved Files**

- `normalization_model.pth`: trained PyTorch model weights.
- `vocab.pkl`: vocabulary mapping for the dataset.
- `config.json`: configuration file with model hyperparameters.

### **Training**
To train the model, run the `train_model.py` script from the project's GitHub repository.

### **Usage for Inference**

Use the `test_model.py` script from the project's GitHub repository; a minimal inference sketch is shown below.
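
As a hedged illustration of how the saved files could be combined for greedy decoding (the vocabulary layout and the `<sos>`/`<eos>` token names are assumptions; `test_model.py` defines the actual conventions):

```python
import json
import pickle
import torch

with open("config.json") as f:
    config = json.load(f)              # hyperparameters; exact keys undocumented here
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)             # assumed layout: {character: index}
inv_vocab = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(len(vocab))  # class sketched in the architecture section
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

@torch.no_grad()
def normalize(word: str, max_len: int = 40) -> str:
    """Greedily decode the normalized form of a single word."""
    src = torch.tensor([[vocab[c] for c in word]])
    out = [vocab["<sos>"]]             # assumed start-of-sequence token
    for _ in range(max_len):
        logits = model(src, torch.tensor([out]))
        nxt = logits[0, -1].argmax().item()
        if nxt == vocab["<eos>"]:      # assumed end-of-sequence token
            break
        out.append(nxt)
    return "".join(inv_vocab[i] for i in out[1:])

print(normalize("pręviridancissimę"))  # expected: praeviridantissimae
```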

## **Acknowledgments**
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchards Dekret Digital*.

Its creation was made possible by the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

## **License**
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en))

## **Citation**
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).