TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. It is a multilingual translation model that translates from Spanish into Aranese and Aragonese, fine-tuned from facebook/nllb-200-distilled-600M with transfer from Occitan.

🧠 Model Description

  • Architecture: NLLB (600M distilled)
  • Fine-tuned with a multilingual multistage approach
  • Includes transfer from Occitan to improve Aranese translation
  • Supports Aranese and Occitan via the oci_Latn language tag
  • Optional special tokens <arn> / <oci> used in training to distinguish the targets

📊 Performance

Evaluated on the FLORES+ test set:

Language    BLEU   chrF   TER
Aranese     30.1   49.8   71.5
Aragonese   61.9   79.5   26.8
  • Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
  • Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.

🗂️ Training Data

  • Real parallel data: OPUS, PILAR (Occitan, Aranese)
  • Synthetic data:
    • BLOOMZ-generated Aranese sentences (~59k)
    • Forward and backtranslations using Apertium
  • Final fine-tuning: FLORES+ dev set (997 segments)
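The data sources above feed a multistage recipe. The sketch below lays that recipe out as an ordered curriculum; the stage contents come from this card, but the ordering of the first two stages and the `train_stage()` helper are illustrative assumptions (only the final FLORES+ dev stage is explicitly stated to come last).

```python
# Illustrative curriculum for the multistage fine-tuning described above.
# Stage contents mirror this card; the ordering of the first two stages and
# the train_stage() placeholder are assumptions for illustration only.
STAGES = [
    ("synthetic", ["bloomz_aranese_~59k", "apertium_forward_and_back_translations"]),
    ("real_parallel", ["opus", "pilar_occitan_aranese"]),
    ("final_flores_dev", ["flores_plus_dev_997_segments"]),
]

def train_stage(seen, datasets):
    # Placeholder for one fine-tuning pass over the given datasets.
    return seen + list(datasets)

seen = []
for name, datasets in STAGES:
    seen = train_stage(seen, datasets)
    print(f"stage {name}: fine-tuned on {datasets}")
```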

🛠️ Multilingual Training Setup

We trained the model on Spanish–Occitan and Spanish–Aranese jointly, using:

  • oci_Latn as the shared language tag
  • Optionally, a special token prefix (<arn> or <oci>) to distinguish the two targets
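Since both targets share the oci_Latn tag, the `<arn>`/`<oci>` prefix is what tells the model which variety to produce. A minimal sketch of how such a prefix could be prepended to the Spanish source before tokenization (the exact prefix handling used in training is an assumption based on the tokens named above):

```python
# Sketch: selecting Aranese vs. Occitan output, which share the oci_Latn tag,
# by prepending the <arn>/<oci> special token to the source sentence.
def add_target_prefix(src_text: str, target: str) -> str:
    prefixes = {"aranese": "<arn>", "occitan": "<oci>"}
    return f"{prefixes[target]} {src_text}"

print(add_target_prefix("¿Cómo se encuentra usted hoy?", "aranese"))
# The prefixed string is then tokenized and translated as usual.
```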

🚀 Quick Example (Spanish → Aranese)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang tells the NLLB tokenizer the source language
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses 'oci_Latn' in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5
)

# Decode and print output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)

Example output:

Coma se tròbe ué?

🔍 Intended Uses

  • Translate Spanish texts into Aranese or Occitan
  • Research in low-resource multilingual MT
  • Applications for language revitalization or public health communication

⚠️ Limitations

  • Aranese corpora remain extremely small
  • Occitan and Aranese share the oci_Latn tag, so selecting the intended target may require the <arn>/<oci> prefix tokens used in training
  • Orthographic inconsistency or dialect variation may affect quality

📚 Citation

@inproceedings{mutal2024timunige,
  title     = {{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}},
  author    = {Mutal, Jonathan and Ormaechea, Lucía},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  year      = {2024},
  pages     = {862--870}
}

👥 Authors

Jonathan Mutal and Lucía Ormaechea (TIM, University of Geneva)
