|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- de |
|
- frr |
|
base_model: |
|
- facebook/nllb-200-distilled-600M |
|
pipeline_tag: translation |
|
--- |
|
|
|
# Northern Frisian translation model |
|
This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model fine-tuned for translating between German and
the Northern Frisian dialects Mooringer Frasch and Wiringhiirder Freesk, following
[this great blog post](https://cointegrated.medium.com/a37fc706b865).

While the additional data introduced with the new dialect improved the model's performance for German <-> Mooring translations
compared to [nllb-deu-moo](https://huggingface.co/CmdCody/nllb-deu-moo), the extended training also degraded
its performance for other languages. For example, translating English to Mooring still works relatively well, while translating
Mooring to English does not.
|
|
|
## Data |
|
|
|
1. Mooring <-> German:<br>
   The Mooring dataset for fine-tuning consisted of 9339 sentence pairs.
   Most examples (roughly 5100) were taken directly from
   ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf),
   published by the Nordfriisk Instituut. Sentences were split with the Python
   [sentence-splitter library](https://pypi.org/project/sentence-splitter/) (see the sketch after this list).
   The splitting wasn't perfect, especially in cases of direct speech, so manual re-alignment and further
   splitting were necessary.
   In addition, the texts about larks from Föögle önj Nordfraschlönj (Marie Tångeberg, 1992) were added,
   as well as a translation of the story Bulemanns Haus by Theodor Storm and roughly 3000 examples taken
   from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988.
   Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
|
|
|
2. Wiringhiirder <-> German:<br>
   The Wiringhiirder dataset consisted of 7529 sentence pairs taken from the books
   ["Di muon fuon e halie"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_muon_fuon_e_halie.pdf)
   and ["Di tofel"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/Peter_Jensen__Di_tofel.pdf)
   by Peter Jensen, published by the Nordfriisk Instituut. Similar measures were taken as for Rüm Hart above.
   For evaluation, sentences were collected from Wikipedia; however, the evaluation set remains very small
   and is barely enough to detect overfitting.
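
For illustration, here is a minimal sketch of the sentence-splitting step. It only demonstrates the sentence-splitter API on a made-up German snippet; the actual preprocessing scripts are not part of this repository.

```python
from sentence_splitter import SentenceSplitter

# Split German source text into sentences; cases with direct speech
# still required manual re-alignment afterwards.
splitter = SentenceSplitter(language='de')
print(splitter.split(text='Das ist der erste Satz. "Halt!", rief er. Und hier der letzte.'))
```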
|
|
|
|
|
## Usage |
|
How to use the model: |
|
```python
!pip install transformers==4.33

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer


def create_tokenizer_with_new_langs(model_id, new_langs):
    """Load the NLLB tokenizer and register the new language codes."""
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    for new_lang in new_langs:
        old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
        new_token_id = old_len - 1
        if new_lang in tokenizer.added_tokens_encoder:
            new_token_id = tokenizer.added_tokens_encoder[new_lang] - 1
        tokenizer.lang_code_to_id[new_lang] = new_token_id
        tokenizer.id_to_lang_code[new_token_id] = new_lang
        # always move "mask" to the last position
        tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

        tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
        tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
        if new_lang not in tokenizer._additional_special_tokens:
            tokenizer._additional_special_tokens.append(new_lang)
        # clear the added token encoder; otherwise a new token may end up there by mistake
        tokenizer.added_tokens_encoder = {}
        tokenizer.added_tokens_decoder = {}

    return tokenizer


def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    """Translate a string or list of strings from src_lang to tgt_lang."""
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # allow up to a + b * (input length) output tokens
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)


path = "CmdCody/nllb-deu-frr"
tokenizer = create_tokenizer_with_new_langs(path, ['moo_Latn', 'wir_Latn'])
model = AutoModelForSeq2SeqLM.from_pretrained(path)

translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
```
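
To translate in another direction, pass the language codes explicitly, e.g. German into Wiringhiirder (the example sentence is only illustrative):

```python
translate(
    "Momme wohnt in Niebüll",
    tokenizer=tokenizer,
    model=model,
    src_lang='deu_Latn',
    tgt_lang='wir_Latn',
)
```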
|
|
|
## Training |
|
The model was trained in a Google Colab notebook for 4 epochs with a batch size of 16, following the above-mentioned blog post with two notable adaptations:
1. The data iteration was changed to make sure that the model sees each example in the dataset exactly once per epoch.
2. After tokenization and batching, the complete data set is shuffled before each epoch so that all translation directions are mixed. However, each batch only contains examples for one direction (see the sketch below).
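
A rough sketch of this iteration scheme (not the actual notebook code; the grouping of the data by direction is an assumed structure) could look like this:

```python
import random

def make_epoch_batches(datasets, batch_size=16):
    # datasets: dict mapping a direction tag, e.g. ("deu_Latn", "moo_Latn"),
    # to the list of examples for that direction (assumed structure)
    batches = []
    for direction, examples in datasets.items():
        random.shuffle(examples)
        # each example ends up in exactly one batch per epoch
        for i in range(0, len(examples), batch_size):
            batches.append((direction, examples[i:i + batch_size]))
    # shuffle across directions; each batch itself stays single-direction
    random.shuffle(batches)
    return batches
```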
|
|
|
## Evaluation |
|
Metrics on the evaluation data sets: |
|
|
|
|            | BLEU  | chrF++ |
|------------|-------|--------|
| Moo -> Deu | 55.78 | 70.73  |
| Deu -> Moo | 50.19 | 67.76  |
| Wir -> Deu | 67.22 | 80.16  |
| Deu -> Wir | 42.35 | 61.08  |
|
|
|
Note: As mentioned above, the Wiringhiirder evaluation set is very small, so these metrics should not be directly compared with the Mooring metrics.
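
Corpus-level BLEU and chrF++ scores like the ones above can be computed with sacrebleu, for example as follows (continuing from the usage snippet; whether this exact tooling produced the numbers above is not documented, and the sentence pair is only a placeholder):

```python
import sacrebleu

# Placeholder evaluation data; substitute the real evaluation pairs.
source_sentences = ["Momme booget önj Naibel"]
references = ["Momme wohnt in Niebüll"]

hypotheses = translate(source_sentences, tokenizer=tokenizer, model=model,
                       src_lang='moo_Latn', tgt_lang='deu_Latn')

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 -> chrF++
print(bleu.score, chrf.score)
```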