|
--- |
|
license: cc-by-4.0 |
|
task_categories: |
|
- text2text-generation |
|
language: |
|
- la |
|
size_categories: |
|
- 1M<n<10M |
|
tags: |
|
- medieval |
|
- editing |
|
- normalization |
|
- Georges |
|
pretty_name: Normalized Georges 1913 Model |
|
version: 1.0.0 |
|
--- |
|
# Normalization Model for Medieval Latin |
|
|
|
## **Overview** |
|
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms. |
|
|
|
The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing. |
|
|
|
## **Model Architecture** |
|
The model is a sequence-to-sequence (Seq2Seq) architecture with attention. Key components include: |
|
|
|
1. **Embedding Layer**: |
|
- Converts character indices into dense vector representations. |
|
|
|
2. **Bidirectional LSTM Encoder**: |
|
- Encodes the input sequence and captures bidirectional context. |
|
|
|
3. **Attention Mechanism**: |
|
- Aligns decoder outputs with relevant encoder outputs for better context-awareness. |
|
|
|
4. **LSTM Decoder**: |
|
- Decodes the normalized sequence character-by-character. |
|
|
|
5. **Projection Layer**: |
|
- Maps decoder outputs to character probabilities. |
|
|
|
### Model Parameters |
|
- **Embedding Dimension**: 64 |
|
- **Hidden Dimension**: 128 |
|
- **Number of Layers**: 3 |
|
- **Dropout**: 0.3 |
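
The components and hyperparameters above map onto a standard PyTorch Seq2Seq. The following is a minimal sketch under the stated dimensions; the class name, the `pad_idx` convention, and the dot-product attention variant are assumptions for illustration, not the project's exact implementation.

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Character-level normalizer: embedding -> BiLSTM encoder ->
    attention -> LSTM decoder -> projection (sketch, not official code)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128,
                 n_layers=3, dropout=0.3, pad_idx=0):
        super().__init__()
        # 1. Embedding: character indices -> dense vectors
        #    (shared by encoder and decoder, same character alphabet)
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        # 2. Bidirectional LSTM encoder
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
                               dropout=dropout, bidirectional=True,
                               batch_first=True)
        # 4. LSTM decoder; input is the embedded previous character
        #    concatenated with the attention context vector
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, 2 * hid_dim,
                               num_layers=n_layers, dropout=dropout,
                               batch_first=True)
        # 5. Projection: decoder state -> character logits
        self.proj = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))       # (B, S, 2H)
        dec_emb = self.embedding(tgt)                        # (B, T, E)
        context = enc_out.new_zeros(src.size(0), 1, enc_out.size(-1))
        hidden, logits = None, []
        for t in range(tgt.size(1)):
            step_in = torch.cat([dec_emb[:, t:t + 1], context], dim=-1)
            dec_out, hidden = self.decoder(step_in, hidden)  # (B, 1, 2H)
            # 3. Attention: align the decoder state with encoder states
            scores = torch.bmm(dec_out, enc_out.transpose(1, 2))
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
            logits.append(self.proj(dec_out + context))
        return torch.cat(logits, dim=1)                      # (B, T, V)
```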
|
|
|
## **Dataset** |
|
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization). |
|
|
|
### Sample Data |
|
| Orthographic Variant | Normalized Form |
|----------------------|-----------------|
| `circumcalcabicis` | `circumcalcabitis` |
| `peruincaturi` | `pervincaturi` |
| `tepidaremtur` | `tepidarentur` |
| `exmovemdis` | `exmovendis` |
| `comvomavisset` | `convomavisset` |
| `permeiemdis` | `permeiendis` |
| `permeditacissime` | `permeditatissime` |
| `conspersu` | `conspersu` |
| `pręviridancissimę` | `praeviridantissimae` |
| `relaxavisses` | `relaxavisses` |
| `edentaveratis` | `edentaveratis` |
| `amhelioris` | `anhelioris` |
| `remediatae` | `remediatae` |
| `discruciavero` | `discruciavero` |
| `imterplicavimus` | `interplicavimus` |
| `peraequata` | `peraequata` |
| `ignicomantissimorum` | `ignicomantissimorum` |
| `pręfvltvro` | `praefulturo` |
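
Because the dataset is plain tab-separated pairs, it can be read with the Python standard library. A minimal sketch, assuming a local copy of the data; the file name `georges-1913.tsv` is illustrative:

```python
import csv

# Read tab-separated (variant, normalized) pairs from a local copy
# of the dataset; the file name is an assumption.
with open("georges-1913.tsv", encoding="utf-8", newline="") as f:
    pairs = [(variant, normalized)
             for variant, normalized in csv.reader(f, delimiter="\t")]

print(len(pairs))   # roughly 5 million pairs
print(pairs[0])     # e.g. ('circumcalcabicis', 'circumcalcabitis')
```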
|
|
|
## **Training** |
|
The model is trained with the following settings; a minimal sketch of this setup follows the list.
|
- **Loss**: CrossEntropyLoss (ignores padding index). |
|
- **Optimizer**: Adam with a learning rate of 0.0005. |
|
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate on validation loss stagnation. |
|
- **Gradient Clipping**: Max norm of 1.0. |
|
- **Batch Size**: 4096. |
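
As a rough illustration, the settings above translate into the following PyTorch setup. `model`, `train_loader`, `pad_idx`, and the validation pass are assumed from the sketches above and are not part of the published scripts.

```python
import torch
import torch.nn as nn

# Loss ignores padded positions; optimizer and scheduler as listed above.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for src, tgt in train_loader:                # batches of 4096 pairs
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])         # teacher forcing
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tgt[:, 1:].reshape(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# After each epoch, step the scheduler on the validation loss (not shown),
# so the learning rate drops when validation loss stagnates.
scheduler.step(val_loss)
```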
|
|
|
## **Use Cases**
|
This model can be used for: |
|
|
|
- Normalizing orthographic variants in medieval Latin texts to the standard forms of Georges 1913.
|
|
|
|
|
## **Known limitations** |
|
The dataset has not been augmented and may therefore be substantially biased against irregular forms, such as Greek loanwords like *presbyter*.
|
|
|
|
|
## **How to Use** |
|
|
|
### **Saved Files** |
|
|
|
- `normalization_model.pth`: Trained PyTorch model weights.

- `vocab.pkl`: Vocabulary mapping for the dataset.

- `config.json`: Configuration file with model hyperparameters.
|
|
|
### **Training** |
|
To train the model, run the `train_model.py` script on GitHub.
|
|
|
### **Usage for Inference** |
|
|
|
For inference, use the `test_model.py` script on GitHub; a minimal sketch of the decoding procedure is shown below.
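
A greedy-decoding sketch, assuming the `Seq2SeqNormalizer` class from the architecture section, a `{char: index}` vocabulary in `vocab.pkl`, constructor-compatible keys in `config.json`, and `<sos>`/`<eos>` special tokens; the actual script may differ.

```python
import json
import pickle

import torch

with open("config.json") as f:
    config = json.load(f)                     # assumed: constructor kwargs
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                    # assumed: {char: index}
inv_vocab = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(len(vocab), **config)
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

def normalize(word, max_len=40):
    """Greedily decode the normalized form of a single word."""
    src = torch.tensor([[vocab[c] for c in word]])
    out = [vocab["<sos>"]]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([out]))
            nxt = logits[0, -1].argmax().item()
            if nxt == vocab["<eos>"]:
                break
            out.append(nxt)
    return "".join(inv_vocab[i] for i in out[1:])

print(normalize("pręviridancissimę"))  # expected: praeviridantissimae
```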
|
|
|
## **Acknowledgments** |
|
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.
|
|
|
Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service. |
|
|
|
## **License** |
|
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en)) |
|
|
|
## **Citation** |
|
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).