---
license: cc-by-nc-4.0
language:
- de
- frr
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Northern Frisian translation model

This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model fine-tuned for translating between German and the Northern Frisian dialects Mooringer Frasch and Wiringhiirder Freesk, following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).

While the additional data introduced with the new dialect has improved the model's performance for German <-> Mooring translations compared to [nllb-deu-moo](https://huggingface.co/CmdCody/nllb-deu-moo), the extended training has at the same time degraded performance for other languages. For example, translating English to Mooring still works relatively well, whereas translating Mooring to English does not.

## Data

1. Mooring <-> German:
   The Mooring dataset for fine-tuning consisted of 9339 sentence pairs. Most examples (roughly 5100) were taken directly from ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf), published by the Nordfriisk Instituut. The Python [sentence-splitter library](https://pypi.org/project/sentence-splitter/) was used for sentence splitting. The splitting was not perfect, especially in cases of direct speech, so manual re-alignment and further splitting were necessary. In addition, the texts about larks from "Föögle önj Nordfraschlönj" (Marie Tångeberg, 1992), a translation of Theodor Storm's story "Bulemanns Haus", and roughly 3000 examples taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988) were added. Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
2. Wiringhiirder <-> German:
   The Wiringhiirder dataset consisted of 7529 sentence pairs taken from the books ["Di muon fuon e halie"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_muon_fuon_e_halie.pdf) and ["Di tofel"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_tofel.pdf) by Peter Jensen, published by the Nordfriisk Instituut. Similar measures were taken as for Rüm Hart above. For evaluation, sentences were collected from Wikipedia; however, the evaluation set remains very small and is barely enough to detect overfitting.

## Usage

How to use the model:

```python
!pip install transformers==4.33

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer


def create_tokenizer_with_new_langs(model_id, new_langs):
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    for new_lang in new_langs:
        old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
        new_token_id = old_len - 1
        if new_lang in tokenizer.added_tokens_encoder:
            new_token_id = tokenizer.added_tokens_encoder[new_lang] - 1
        tokenizer.lang_code_to_id[new_lang] = new_token_id
        tokenizer.id_to_lang_code[new_token_id] = new_lang
        # always move "<mask>" to the last position
        tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
        tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
        tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
        if new_lang not in tokenizer._additional_special_tokens:
            tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}
    return tokenizer


def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # a and b scale the generation length budget with the input length
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)


path = "CmdCody/nllb-deu-frr"
tokenizer = create_tokenizer_with_new_langs(path, ['moo_Latn', 'wir_Latn'])
model = AutoModelForSeq2SeqLM.from_pretrained(path)

translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
```

## Training

The model was trained in a Google Colab notebook for 4 epochs with a batch size of 16, following the above-mentioned blog post with two notable adaptations:

1. The data iteration was changed to make sure that the model sees each example in the dataset exactly once per epoch.
2. After tokenization and batching, the complete data set is shuffled before each epoch so that all translation directions are mixed. However, each batch only contains examples for one direction.

## Evaluation

Metrics on the evaluation data sets:

|            | BLEU  | ChrF++ |
|------------|-------|--------|
| Moo -> Deu | 55.78 | 70.73  |
| Deu -> Moo | 50.19 | 67.76  |
| Wir -> Deu | 67.22 | 80.16  |
| Deu -> Wir | 42.35 | 61.08  |

Note: As mentioned above, the Wiringhiirder evaluation set is very small and the resulting metrics should not be compared with the Mooring metrics.
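
The original evaluation script is not part of this card. As a rough sketch, scores of this kind can be computed with [sacrebleu](https://pypi.org/project/sacrebleu/), reusing the `translate` helper, `tokenizer`, and `model` from the Usage section; the source and reference lists below are placeholders, not the actual evaluation data.

```python
# Sketch only (not the original evaluation code): scoring the Deu -> Moo direction
# with sacrebleu (install via `pip install sacrebleu`).
import sacrebleu

# Placeholders: substitute your own parallel evaluation sentences here.
src_sentences = ["<German sentence 1>", "<German sentence 2>"]
ref_sentences = ["<Mooring reference 1>", "<Mooring reference 2>"]

# Translate the German sources into Mooring with the model loaded above.
hypotheses = translate(
    src_sentences,
    tokenizer=tokenizer,
    model=model,
    src_lang='deu_Latn',
    tgt_lang='moo_Latn',
)

bleu = sacrebleu.corpus_bleu(hypotheses, [ref_sentences])
chrf = sacrebleu.corpus_chrf(hypotheses, [ref_sentences], word_order=2)  # word_order=2 gives ChrF++
print(f"BLEU: {bleu.score:.2f}, ChrF++: {chrf.score:.2f}")
```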