---
license: cc-by-nc-4.0
language:
- de
- frr
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Northern Frisian translation model

This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model fine-tuned for translating between German and the Northern Frisian dialects Mooringer Frasch and Wiringhiirder Freesk, following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).

While the additional data introduced with the new dialect has improved the model's performance for German <-> Mooring translations compared to [nllb-deu-moo](https://huggingface.co/CmdCody/nllb-deu-moo), the extended training has at the same time degraded performance for other languages. For example, translating English to Mooring still works relatively well, whereas translating Mooring to English does not.

## Data

1. Mooring <-> German:
   The Mooring dataset for fine-tuning consisted of 9339 sentence pairs. Most examples (roughly 5100) were taken directly from ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf), published by the Nordfriisk Instituut. The Python [sentence-splitter library](https://pypi.org/project/sentence-splitter/) was used for sentence splitting. The splitting was not perfect, especially in cases of direct speech, so manual re-alignment and further splitting were necessary. In addition, the texts about larks from "Föögle önj Nordfraschlönj" (Marie Tångeberg, 1992), a translation of Theodor Storm's story "Bulemanns Haus", and roughly 3000 examples taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988) were added. Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
2. Wiringhiirder <-> German:
   The Wiringhiirder dataset consisted of 7529 sentence pairs taken from the books ["Di muon fuon e halie"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_muon_fuon_e_halie.pdf) and ["Di tofel"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_tofel.pdf) by Peter Jensen, published by the Nordfriisk Instituut. Similar measures were taken as for Rüm Hart above. For evaluation, sentences were collected from Wikipedia; however, the evaluation set remains very small and is barely enough to detect overfitting.

## Usage

How to use the model:

```python
!pip install transformers==4.33

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer


def create_tokenizer_with_new_langs(model_id, new_langs):
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    for new_lang in new_langs:
        old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
        new_token_id = old_len - 1
        if new_lang in tokenizer.added_tokens_encoder:
            new_token_id = tokenizer.added_tokens_encoder[new_lang] - 1
        tokenizer.lang_code_to_id[new_lang] = new_token_id
        tokenizer.id_to_lang_code[new_token_id] = new_lang
        # always move "<mask>" to the last position
        tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
        tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
        tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
        if new_lang not in tokenizer._additional_special_tokens:
            tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}
    return tokenizer


def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # a and b scale the generation length budget with the input length
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)


path = "CmdCody/nllb-deu-frr"
tokenizer = create_tokenizer_with_new_langs(path, ['moo_Latn', 'wir_Latn'])
model = AutoModelForSeq2SeqLM.from_pretrained(path)

translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
```

## Training

The model was trained in a Google Colab notebook for 4 epochs with a batch size of 16, following the above-mentioned blog post with two notable adaptations:

1. The data iteration was changed to make sure that the model sees each example in the dataset exactly once per epoch.
2. After tokenization and batching, the complete data set is shuffled before each epoch so that all translation directions are mixed. However, each batch only contains examples for one direction.

## Evaluation

Metrics on the evaluation data sets:

|            | BLEU  | ChrF++ |
|------------|-------|--------|
| Moo -> Deu | 55.78 | 70.73  |
| Deu -> Moo | 50.19 | 67.76  |
| Wir -> Deu | 67.22 | 80.16  |
| Deu -> Wir | 42.35 | 61.08  |

Note: As mentioned above, the Wiringhiirder evaluation set is very small and the resulting metrics should not be compared with the Mooring metrics.
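
The original evaluation script is not part of this card. As a rough sketch, scores of this kind can be computed with [sacrebleu](https://pypi.org/project/sacrebleu/), reusing the `translate` helper, `tokenizer`, and `model` from the Usage section; the source and reference lists below are placeholders, not the actual evaluation data.

```python
# Sketch only (not the original evaluation code): scoring the Deu -> Moo direction
# with sacrebleu (install via `pip install sacrebleu`).
import sacrebleu

# Placeholders: substitute your own parallel evaluation sentences here.
src_sentences = ["<German sentence 1>", "<German sentence 2>"]
ref_sentences = ["<Mooring reference 1>", "<Mooring reference 2>"]

# Translate the German sources into Mooring with the model loaded above.
hypotheses = translate(
    src_sentences,
    tokenizer=tokenizer,
    model=model,
    src_lang='deu_Latn',
    tgt_lang='moo_Latn',
)

bleu = sacrebleu.corpus_bleu(hypotheses, [ref_sentences])
chrf = sacrebleu.corpus_chrf(hypotheses, [ref_sentences], word_order=2)  # word_order=2 gives ChrF++
print(f"BLEU: {bleu.score:.2f}, ChrF++: {chrf.score:.2f}")
```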