# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. It is a multilingual translation model that translates from Spanish into Aranese and Occitan, fine-tuned from `facebook/nllb-200-distilled-600M`.
## 🧠 Model Description

- Architecture: NLLB (600M, distilled)
- Fine-tuned with a multilingual, multistage approach
- Includes transfer from Occitan to improve Aranese translation
- Supports Aranese and Occitan via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used in training to distinguish the two targets
## 📊 Performance

Evaluated on the FLORES+ test set:

| Language  | BLEU | ChrF | TER  |
|-----------|------|------|------|
| Aranese   | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |
- Spanish → Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish → Aragonese outperforms the Apertium baseline by +0.8 BLEU.
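For reference, scores in this style can be computed with `sacrebleu`. The sketch below assumes plain-text hypothesis and reference files; the file names are placeholders, not the shared task's actual evaluation setup.

```python
# A minimal sketch of computing BLEU/ChrF/TER with sacrebleu.
# File names are hypothetical placeholders, not the files used
# for the official FLORES+ evaluation.
import sacrebleu

with open("hypotheses.arn.txt", encoding="utf-8") as f:
    hypotheses = f.read().splitlines()
with open("references.arn.txt", encoding="utf-8") as f:
    references = [f.read().splitlines()]  # list of reference sets

print("BLEU:", sacrebleu.corpus_bleu(hypotheses, references).score)
print("ChrF:", sacrebleu.corpus_chrf(hypotheses, references).score)
print("TER: ", sacrebleu.corpus_ter(hypotheses, references).score)
```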
## 🗂️ Training Data

- Real parallel data: OPUS and PILAR (Occitan, Aranese)
- Synthetic data:
  - BLOOMZ-generated Aranese sentences (~59k)
  - Forward translations and backtranslations produced with Apertium
- Final fine-tuning: FLORES+ dev set (997 segments)
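Purely as an illustration, mixed real and synthetic corpora like those above can be combined into a single training set with 🤗 `datasets`; all file names below are hypothetical stand-ins for the corpora listed.

```python
# Sketch: merging real and synthetic parallel data into one training
# set with Hugging Face `datasets`. File names are hypothetical and
# stand in for the OPUS/PILAR and Apertium/BLOOMZ corpora above.
from datasets import Dataset, concatenate_datasets

def load_pairs(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as s, open(tgt_path, encoding="utf-8") as t:
        return Dataset.from_dict({"es": s.read().splitlines(),
                                  "arn": t.read().splitlines()})

real = load_pairs("pilar.es", "pilar.arn")                   # real parallel data
synthetic = load_pairs("apertium_bt.es", "apertium_bt.arn")  # synthetic pairs

train_set = concatenate_datasets([real, synthetic]).shuffle(seed=42)
print(train_set)
```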
## 🛠️ Multilingual Training Setup

We trained the model on Spanish–Occitan and Spanish–Aranese jointly, using either:

- `oci_Latn` as the shared language tag, or
- a special token prefix such as `<arn>` or `<oci>` to distinguish the two targets (see the sketch below)
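A hedged sketch of the second option: how the `<arn>`/`<oci>` prefixes could be registered and applied during preprocessing. The exact placement of the prefix in our pipeline is not documented here, so treat this as one plausible setup rather than the exact recipe.

```python
# Sketch: registering <arn>/<oci> as special tokens so the model can
# distinguish Aranese from Occitan under the shared oci_Latn code.
# Illustrative only; the exact prefix placement is an assumption.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="spa_Latn", tgt_lang="oci_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Add the variety-distinguishing prefixes and grow the embedding table
tokenizer.add_special_tokens({"additional_special_tokens": ["<arn>", "<oci>"]})
model.resize_token_embeddings(len(tokenizer))

# During preprocessing, each target sentence gets the matching prefix
aranese_target = "<arn> Coma se tròbe ué?"
batch = tokenizer("¿Cómo se encuentra usted hoy?",
                  text_target=aranese_target, return_tensors="pt")
```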
## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer (Spanish is the source language)
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses the 'oci_Latn' code in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
Example output:

```text
Coma se tròbe ué?
```
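Note that `forced_bos_token_id` is required because NLLB models expect the first decoder token to be the target-language code; since Aranese has no code of its own in NLLB, the shared `oci_Latn` code is used (see Limitations below).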
## 🔍 Intended Uses

- Translate Spanish texts into Aranese or Occitan
- Research in low-resource multilingual MT
- Applications in language revitalization or public health communication
## ⚠️ Limitations

- Aranese corpora remain extremely small
- Because the same `oci_Latn` token covers both Occitan and Aranese, disambiguation may require the special prefix tokens (see the sketch below)
- Orthographic inconsistency and dialect variation may affect quality
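If the prefix tokens are present in this checkpoint's vocabulary, one way to steer the output is to prepend them to the source sentence. The sketch below assumes a source-side prefix; both that placement and the tokens' presence in the released checkpoint are assumptions, not documented details of this model.

```python
# Sketch: requesting Aranese explicitly via the <arn> prefix.
# Assumes (1) <arn>/<oci> are in the released checkpoint's
# vocabulary and (2) the prefix was applied on the source side
# during training; both are assumptions, not confirmed details.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("<arn> ¿Cómo se encuentra usted hoy?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    num_beams=5,
    max_length=50,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```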
## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```
## 👥 Authors

- Jonathan Mutal
- Lucía Ormaechea

TIM, University of Geneva