---
license: apache-2.0
language:
  - es
  - oci
  - arg
tags:
  - translation
  - low-resource
  - aranese
  - occitan
  - multilingual
  - NLLB
  - bloomz
  - WMT24
datasets:
  - OPUS
  - PILAR
  - flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
  - name: TIM-UNIGE WMT24 Multilingual Aranese Model
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: FLORES+
          type: flores
        metrics:
          - name: BLEU
            type: BLEU
            value: 30.1
            verified: true
            args:
              target: spa-arn
          - name: ChrF
            type: ChrF
            value: 49.8
            verified: true
            args:
              target: spa-arn
          - name: TER
            type: TER
            value: 71.5
            verified: true
            args:
              target: spa-arn
metrics:
  - sacrebleu
  - ter
  - chrf
paper:
  - name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
    url: https://doi.org/10.18653/v1/2024.wmt-1.82
---

# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).

## 🧠 Model Description

- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used during training to distinguish the two targets

## 📊 Performance

Evaluated on the **FLORES+ test set**:

| Language  | BLEU | ChrF | TER  |
|-----------|------|------|------|
| Aranese   | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish → Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish → Aragonese outperforms the Apertium baseline by +0.8 BLEU.

## 🗂️ Training Data

- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
  - BLOOMZ-generated Aranese sentences (~59k)
  - Forward translations and back-translations produced with Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)
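
The paper's multistage schedule is not reproduced here, but the idea of balancing scarce real data against plentiful synthetic data can be illustrated with a hypothetical sampling sketch (the sentence pairs and the 70/30 ratio below are invented for the example):

```python
import random

random.seed(0)

# Hypothetical stand-ins: real pairs come from OPUS/PILAR, synthetic ones
# from BLOOMZ generations and Apertium (back-)translations.
real = [("Buenos días.", "Bon dia.")]
synthetic = [("Gracias.", "Mercés.")]

def sample_batch(real, synthetic, synth_ratio=0.7, n=10):
    """Draw n pairs, picking a synthetic pair with probability synth_ratio."""
    return [
        random.choice(synthetic if random.random() < synth_ratio else real)
        for _ in range(n)
    ]

batch = sample_batch(real, synthetic)
print(len(batch))  # 10
```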

## 🛠️ Multilingual Training Setup

We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, using either:

- `oci_Latn` as the shared language tag, or
- a special token prefix such as `<arn>` or `<oci>` to distinguish the two targets
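
The second option can be sketched as a simple preprocessing step; the tag strings match the card, but the function and example pairs below are illustrative rather than the exact training code:

```python
# Illustrative preprocessing: mark each Spanish source with the intended
# target variety so one model can serve both Aranese and Occitan.
TARGET_TAGS = {"aranese": "<arn>", "occitan": "<oci>"}

def tag_source(spanish_text: str, target: str) -> str:
    """Prepend the target-variety tag to a Spanish input sentence."""
    return f"{TARGET_TAGS[target]} {spanish_text}"

tagged = [tag_source("¿Cómo estás?", t) for t in ("aranese", "occitan")]
print(tagged)  # ['<arn> ¿Cómo estás?', '<oci> ¿Cómo estás?']
```

At inference time the same prefix would be prepended to the source text before tokenization.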

## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang tells the NLLB tokenizer that the input is Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese shares the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```

Example output:

```
Com se trape vos aué?
```

## 🔍 Intended Uses

- Translating Spanish texts into **Aranese** or **Occitan**
- Research on **low-resource multilingual MT**
- Applications in **language revitalization** or public health communication

## ⚠️ Limitations

- Aranese corpora remain extremely small
- Because the same `oci_Latn` tag covers both Occitan and Aranese, **disambiguating the target variety may require the special prefix tokens**
- Orthographic inconsistency and dialect variation may affect output quality

## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```

## 👥 Authors

- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea

TIM, University of Geneva