TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. It is a multilingual translation model that translates from Spanish into Aranese and Aragonese, fine-tuned from facebook/nllb-200-distilled-600M with transfer from Occitan.

🧠 Model Description

  • Architecture: NLLB (600M distilled)
  • Fine-tuned with a multilingual multistage approach
  • Includes transfer from Occitan to improve Aranese translation
  • Supports Aranese and Occitan via the oci_Latn language tag
  • Optional special tokens <arn> / <oci> used in training to distinguish the targets

📊 Performance

Evaluated on the FLORES+ test set:

Language    BLEU   chrF   TER
Aranese     30.1   49.8   71.5
Aragonese   61.9   79.5   26.8
  • Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
  • Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.

🗂️ Training Data

  • Real parallel data: OPUS, PILAR (Occitan, Aranese)
  • Synthetic data:
    • BLOOMZ-generated Aranese sentences (~59k)
    • Forward and backtranslations using Apertium
  • Final fine-tuning: FLORES+ dev set (997 segments)
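The data sources above feed a multistage recipe. The sketch below lays that recipe out as an ordered curriculum; the stage contents come from this card, but the ordering of the first two stages and the `train_stage()` helper are illustrative assumptions (only the final FLORES+ dev stage is explicitly stated to come last).

```python
# Illustrative curriculum for the multistage fine-tuning described above.
# Stage contents mirror this card; the ordering of the first two stages and
# the train_stage() placeholder are assumptions for illustration only.
STAGES = [
    ("synthetic", ["bloomz_aranese_~59k", "apertium_forward_and_back_translations"]),
    ("real_parallel", ["opus", "pilar_occitan_aranese"]),
    ("final_flores_dev", ["flores_plus_dev_997_segments"]),
]

def train_stage(seen, datasets):
    # Placeholder for one fine-tuning pass over the given datasets.
    return seen + list(datasets)

seen = []
for name, datasets in STAGES:
    seen = train_stage(seen, datasets)
    print(f"stage {name}: fine-tuned on {datasets}")
```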

🛠️ Multilingual Training Setup

We trained the model on Spanish–Occitan and Spanish–Aranese jointly, using:

  • oci_Latn as the shared language tag
  • Optionally, a special token prefix (<arn> or <oci>) to distinguish the two targets
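Since both targets share the oci_Latn tag, the `<arn>`/`<oci>` prefix is what tells the model which variety to produce. A minimal sketch of how such a prefix could be prepended to the Spanish source before tokenization (the exact prefix handling used in training is an assumption based on the tokens named above):

```python
# Sketch: selecting Aranese vs. Occitan output, which share the oci_Latn tag,
# by prepending the <arn>/<oci> special token to the source sentence.
def add_target_prefix(src_text: str, target: str) -> str:
    prefixes = {"aranese": "<arn>", "occitan": "<oci>"}
    return f"{prefixes[target]} {src_text}"

print(add_target_prefix("¿Cómo se encuentra usted hoy?", "aranese"))
# The prefixed string is then tokenized and translated as usual.
```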

🚀 Quick Example (Spanish → Aranese)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang tells the NLLB tokenizer the source language
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses 'oci_Latn' in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5
)

# Decode and print output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)

Example output:

Coma se tròbe ué?

🔍 Intended Uses

  • Translate Spanish texts into Aranese or Occitan
  • Research in low-resource multilingual MT
  • Applications for language revitalization or public health communication

⚠️ Limitations

  • Aranese corpora remain extremely small
  • Occitan and Aranese share the oci_Latn tag, so selecting the intended target may require the <arn>/<oci> prefix tokens used in training
  • Orthographic inconsistency or dialect variation may affect quality

📚 Citation

@inproceedings{mutal2024timunige,
  title     = {{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}},
  author    = {Mutal, Jonathan and Ormaechea, Lucía},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  year      = {2024},
  pages     = {862--870}
}

👥 Authors

Jonathan Mutal and Lucía Ormaechea (TIM, University of Geneva)
