|
--- |
|
license: cc-by-4.0 |
|
task_categories: |
|
- text2text-generation |
|
language: |
|
- la |
|
size_categories: |
|
- 1M<n<10M |
|
tags: |
|
- medieval |
|
- editing |
|
- normalization |
|
- Georges |
|
pretty_name: Normalized Georges 1913 Model |
|
version: 1.0.0 |
|
--- |
|
# Normalization Model for Medieval Latin |
|
|
|
## **Overview** |
|
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms. |
|
|
|
The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing. |
|
|
|
## **Model Architecture** |
|
The model is a sequence-to-sequence (Seq2Seq) architecture with attention. Key components include: |
|
|
|
1. **Embedding Layer**: |
|
- Converts character indices into dense vector representations. |
|
|
|
2. **Bidirectional LSTM Encoder**: |
|
- Encodes the input sequence and captures bidirectional context. |
|
|
|
3. **Attention Mechanism**: |
|
- Aligns decoder outputs with relevant encoder outputs for better context-awareness. |
|
|
|
4. **LSTM Decoder**: |
|
- Decodes the normalized sequence character-by-character. |
|
|
|
5. **Projection Layer**: |
|
- Maps decoder outputs to character probabilities. |
|
|
|
### Model Parameters |
|
- **Embedding Dimension**: 64 |
|
- **Hidden Dimension**: 128 |
|
- **Number of Layers**: 3 |
|
- **Dropout**: 0.3 |
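
The components and hyperparameters above map onto a standard PyTorch Seq2Seq. The following is a minimal sketch under the stated dimensions; the class name, the `pad_idx` convention, and the dot-product attention variant are assumptions for illustration, not the project's exact implementation.

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Character-level normalizer: embedding -> BiLSTM encoder ->
    attention -> LSTM decoder -> projection (sketch, not official code)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128,
                 n_layers=3, dropout=0.3, pad_idx=0):
        super().__init__()
        # 1. Embedding: character indices -> dense vectors
        #    (shared by encoder and decoder, same character alphabet)
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        # 2. Bidirectional LSTM encoder
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
                               dropout=dropout, bidirectional=True,
                               batch_first=True)
        # 4. LSTM decoder; input is the embedded previous character
        #    concatenated with the attention context vector
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, 2 * hid_dim,
                               num_layers=n_layers, dropout=dropout,
                               batch_first=True)
        # 5. Projection: decoder state -> character logits
        self.proj = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))       # (B, S, 2H)
        dec_emb = self.embedding(tgt)                        # (B, T, E)
        context = enc_out.new_zeros(src.size(0), 1, enc_out.size(-1))
        hidden, logits = None, []
        for t in range(tgt.size(1)):
            step_in = torch.cat([dec_emb[:, t:t + 1], context], dim=-1)
            dec_out, hidden = self.decoder(step_in, hidden)  # (B, 1, 2H)
            # 3. Attention: align the decoder state with encoder states
            scores = torch.bmm(dec_out, enc_out.transpose(1, 2))
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
            logits.append(self.proj(dec_out + context))
        return torch.cat(logits, dim=1)                      # (B, T, V)
```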
|
|
|
## **Dataset** |
|
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization). |
|
|
|
### Sample Data |
|
| Orthographic Variant | Normalized Form |
|----------------------|-----------------|
| `circumcalcabicis` | `circumcalcabitis` |
| `peruincaturi` | `pervincaturi` |
| `tepidaremtur` | `tepidarentur` |
| `exmovemdis` | `exmovendis` |
| `comvomavisset` | `convomavisset` |
| `permeiemdis` | `permeiendis` |
| `permeditacissime` | `permeditatissime` |
| `conspersu` | `conspersu` |
| `pręviridancissimę` | `praeviridantissimae` |
| `relaxavisses` | `relaxavisses` |
| `edentaveratis` | `edentaveratis` |
| `amhelioris` | `anhelioris` |
| `remediatae` | `remediatae` |
| `discruciavero` | `discruciavero` |
| `imterplicavimus` | `interplicavimus` |
| `peraequata` | `peraequata` |
| `ignicomantissimorum` | `ignicomantissimorum` |
| `pręfvltvro` | `praefulturo` |
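
Because the dataset is plain tab-separated pairs, it can be read with the Python standard library. A minimal sketch, assuming a local copy of the data; the file name `georges-1913.tsv` is illustrative:

```python
import csv

# Read tab-separated (variant, normalized) pairs from a local copy
# of the dataset; the file name is an assumption.
with open("georges-1913.tsv", encoding="utf-8", newline="") as f:
    pairs = [(variant, normalized)
             for variant, normalized in csv.reader(f, delimiter="\t")]

print(len(pairs))   # roughly 5 million pairs
print(pairs[0])     # e.g. ('circumcalcabicis', 'circumcalcabitis')
```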
|
|
|
## **Training** |
|
The model is trained with the following settings; a minimal sketch of this setup follows the list.
|
- **Loss**: CrossEntropyLoss (ignores padding index). |
|
- **Optimizer**: Adam with a learning rate of 0.0005. |
|
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate on validation loss stagnation. |
|
- **Gradient Clipping**: Max norm of 1.0. |
|
- **Batch Size**: 4096. |
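
As a rough illustration, the settings above translate into the following PyTorch setup. `model`, `train_loader`, `pad_idx`, and the validation pass are assumed from the sketches above and are not part of the published scripts.

```python
import torch
import torch.nn as nn

# Loss ignores padded positions; optimizer and scheduler as listed above.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for src, tgt in train_loader:                # batches of 4096 pairs
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])         # teacher forcing
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tgt[:, 1:].reshape(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# After each epoch, step the scheduler on the validation loss (not shown),
# so the learning rate drops when validation loss stagnates.
scheduler.step(val_loss)
```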
|
|
|
## **Use Cases**
|
This model can be used for: |
|
|
|
- Normalizing orthographic variants in medieval Latin texts to the standard forms of Georges 1913.
|
|
|
|
|
## **Known limitations** |
|
The dataset has not been augmented and may therefore be substantially biased against irregular forms, such as Greek loanwords like *presbyter*.
|
|
|
|
|
## **How to Use** |
|
|
|
### **Saved Files** |
|
|
|
- `normalization_model.pth`: Trained PyTorch model weights.

- `vocab.pkl`: Vocabulary mapping for the dataset.

- `config.json`: Configuration file with model hyperparameters.
|
|
|
### **Training** |
|
To train the model, run the `train_model.py` script on GitHub.
|
|
|
### **Usage for Inference** |
|
|
|
For inference, use the `test_model.py` script on GitHub; a minimal sketch of the decoding procedure is shown below.
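
A greedy-decoding sketch, assuming the `Seq2SeqNormalizer` class from the architecture section, a `{char: index}` vocabulary in `vocab.pkl`, constructor-compatible keys in `config.json`, and `<sos>`/`<eos>` special tokens; the actual script may differ.

```python
import json
import pickle

import torch

with open("config.json") as f:
    config = json.load(f)                     # assumed: constructor kwargs
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                    # assumed: {char: index}
inv_vocab = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(len(vocab), **config)
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

def normalize(word, max_len=40):
    """Greedily decode the normalized form of a single word."""
    src = torch.tensor([[vocab[c] for c in word]])
    out = [vocab["<sos>"]]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([out]))
            nxt = logits[0, -1].argmax().item()
            if nxt == vocab["<eos>"]:
                break
            out.append(nxt)
    return "".join(inv_vocab[i] for i in out[1:])

print(normalize("pręviridancissimę"))  # expected: praeviridantissimae
```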
|
|
|
## **Acknowledgments** |
|
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.
|
|
|
Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service. |
|
|
|
## **License** |
|
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en)) |
|
|
|
## **Citation** |
|
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).