---
license: cc-by-4.0
task_categories:
- text2text-generation
language:
- la
size_categories:
- 1M<n<10M
tags:
- medieval
- editing
- normalization
- Georges
pretty_name: Normalized Georges 1913 Model
version: 1.0.0
---
# Normalization Model for Medieval Latin
## **Overview**
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms.
The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing.
## **Model Architecture**
The model is a sequence-to-sequence (Seq2Seq) architecture with attention. Key components include:
1. **Embedding Layer**:
- Converts character indices into dense vector representations.
2. **Bidirectional LSTM Encoder**:
- Encodes the input sequence and captures bidirectional context.
3. **Attention Mechanism**:
- Aligns decoder outputs with relevant encoder outputs for better context-awareness.
4. **LSTM Decoder**:
- Decodes the normalized sequence character-by-character.
5. **Projection Layer**:
- Maps decoder outputs to character probabilities.
### Model Parameters
- **Embedding Dimension**: 64
- **Hidden Dimension**: 128
- **Number of Layers**: 3
- **Dropout**: 0.3
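The published training code on GitHub is the authoritative reference; purely as an illustration, the architecture described above can be sketched in PyTorch roughly as follows. The layer names, the additive attention formulation, and the teacher-forced decoding loop are assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Hypothetical sketch of the model card's architecture:
    embedding -> bidirectional LSTM encoder -> attention -> LSTM decoder -> projection."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_layers=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional encoder: outputs have dimension 2 * hid_dim
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        # Decoder consumes the embedded target character plus an attention context
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, n_layers,
                               dropout=dropout, batch_first=True)
        # Additive attention: scores one decoder state against each encoder position
        self.attn = nn.Linear(hid_dim + 2 * hid_dim, 1)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # src: (B, S), tgt: (B, T) -- batches of character indices
        enc_out, _ = self.encoder(self.embedding(src))          # (B, S, 2H)
        B, T = tgt.shape
        hidden, outputs = None, []
        for t in range(T):  # teacher forcing: feed the gold character at each step
            emb = self.embedding(tgt[:, t:t + 1])               # (B, 1, E)
            # Query = top-layer decoder hidden state (zeros at the first step)
            query = hidden[0][-1] if hidden is not None \
                else enc_out.new_zeros(B, enc_out.size(2) // 2)
            scores = self.attn(torch.cat(
                [query.unsqueeze(1).expand(-1, enc_out.size(1), -1), enc_out], dim=-1))
            weights = torch.softmax(scores, dim=1)              # (B, S, 1)
            context = (weights * enc_out).sum(dim=1, keepdim=True)  # (B, 1, 2H)
            out, hidden = self.decoder(torch.cat([emb, context], dim=-1), hidden)
            outputs.append(self.proj(out))                      # (B, 1, V)
        return torch.cat(outputs, dim=1)                        # (B, T, V)
```

Note that because the encoder is bidirectional, its outputs are twice the hidden dimension (256 here), which is why the attention and decoder input sizes include a `2 * hid_dim` term.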
## **Dataset**
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization).
### Sample Data
| Orthographic Variant | Normalized Form |
|-----------------------|--------------------|
|`circumcalcabicis`|`circumcalcabitis`|
|`peruincaturi`|`pervincaturi`|
|`tepidaremtur`|`tepidarentur`|
|`exmovemdis`|`exmovendis`|
|`comvomavisset`|`convomavisset`|
|`permeiemdis`|`permeiendis`|
|`permeditacissime`|`permeditatissime`|
|`conspersu`|`conspersu`|
|`pręviridancissimę`|`praeviridantissimae`|
|`relaxavisses`|`relaxavisses`|
|`edentaveratis`|`edentaveratis`|
|`amhelioris`|`anhelioris`|
|`remediatae`|`remediatae`|
|`discruciavero`|`discruciavero`|
|`imterplicavimus`|`interplicavimus`|
|`peraequata`|`peraequata`|
|`ignicomantissimorum`|`ignicomantissimorum`|
|`pręfvltvro`|`praefulturo`|
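Because the training data is simply tab-separated variant/normalized pairs like the rows above, a loader needs nothing beyond the standard library. A minimal sketch (the file name is a placeholder; see the dataset page for the actual files):

```python
def load_pairs(path):
    """Read tab-separated (orthographic variant, normalized form) pairs,
    one pair per line, skipping blank lines."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            variant, normalized = line.split("\t")
            pairs.append((variant, normalized))
    return pairs

# pairs = load_pairs("georges_pairs.tsv")  # placeholder file name
```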
## **Training**
The model was trained with the following parameters:
- **Loss**: CrossEntropyLoss (ignores padding index).
- **Optimizer**: Adam with a learning rate of 0.0005.
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate on validation loss stagnation.
- **Gradient Clipping**: Max norm of 1.0.
- **Batch Size**: 4096.
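As a sketch only, the settings listed above translate to the following PyTorch setup. The stand-in model, vocabulary size, and padding index are placeholders for illustration; the real loop iterates over batches of 4096 pairs using the Seq2Seq model described earlier:

```python
import torch
import torch.nn as nn

PAD_IDX = 0    # assumption: index 0 is the padding character
VOCAB = 30     # placeholder vocabulary size

# Stand-in model for illustration only; substitute the real Seq2Seq network.
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)          # ignores padding
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

def train_step(src, tgt):
    """One optimization step: forward, loss, backward, clip, update."""
    optimizer.zero_grad()
    logits = model(src)                                        # (B, T, V)
    loss = criterion(logits.reshape(-1, VOCAB), tgt.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# After each epoch, step the scheduler on the validation loss so the
# learning rate drops when validation loss stagnates:
# scheduler.step(val_loss)
```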
## **Use Cases**
This model can be used for:
- Applying normalization based on Georges 1913.
## **Known Limitations**
The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms such as Greek loanwords like "presbyter."
## **How to Use**
### **Saved Files**
- `normalization_model.pth`: Trained PyTorch model weights.
- `vocab.pkl`: Vocabulary mapping for the dataset.
- `config.json`: Configuration file with model hyperparameters.
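A hypothetical loading sequence for these three artifacts might look as follows. The keys inside `config.json` and the structure of `vocab.pkl` are assumptions here; the actual formats are defined by the scripts on GitHub. For illustration, the snippet first writes dummy versions of the files instead of downloading the real ones:

```python
import json
import pickle
import torch
import torch.nn as nn

# --- Illustration only: create dummy artifacts. In practice, download
# normalization_model.pth, vocab.pkl, and config.json from this repository. ---
torch.save(nn.Linear(4, 4).state_dict(), "normalization_model.pth")
with open("vocab.pkl", "wb") as f:
    pickle.dump({"a": 0, "b": 1}, f)          # assumed: char -> index mapping
with open("config.json", "w") as f:           # assumed key names
    json.dump({"emb_dim": 64, "hid_dim": 128, "n_layers": 3, "dropout": 0.3}, f)

# --- Loading side: what an inference script would do first ---
with open("config.json") as f:
    config = json.load(f)
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
state_dict = torch.load("normalization_model.pth", map_location="cpu")
# model = Seq2SeqModel(**config); model.load_state_dict(state_dict)  # hypothetical
```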
### **Training**
To train the model, run the `train_model.py` script on GitHub.
### **Usage for Inference**
For inference, use the `test_model.py` script on GitHub.
## **Acknowledgments**
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.
Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.
## **License**
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en))
## **Citation**
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).