---
license: apache-2.0
language:
  - es
  - oci
  - arg
tags:
  - translation
  - low-resource
  - aranese
  - occitan
  - multilingual
  - NLLB
  - bloomz
  - WMT24
datasets:
  - OPUS
  - PILAR
  - flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
  - name: TIM-UNIGE WMT24 Multilingual Aranese Model
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: FLORES+
          type: flores
        metrics:
          - name: BLEU
            type: BLEU
            value: 30.1
            verified: true
            args:
              target: spa-arn
          - name: ChrF
            type: ChrF
            value: 49.8
            verified: true
            args:
              target: spa-arn
          - name: TER
            type: TER
            value: 71.5
            verified: true
            args:
              target: spa-arn
metrics:
  - sacrebleu
  - ter
  - chrf
paper:
  - name: 'TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24'
    url: https://doi.org/10.18653/v1/2024.wmt-1.82

---

# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. It is a multilingual translation model, fine-tuned from `facebook/nllb-200-distilled-600M`, that translates from Spanish into Aranese and Occitan.

## 🧠 Model Description

- Architecture: NLLB-200 distilled (600M parameters)
- Fine-tuned with a multilingual, multistage approach
- Includes transfer from Occitan to improve Aranese translation
- Supports Aranese and Occitan via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used in training to distinguish the targets

## 📊 Performance

Evaluated on the FLORES+ test set:

| Target language | BLEU | ChrF | TER |
|---|---|---|---|
| Aranese | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish → Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish → Aragonese outperforms the Apertium baseline by +0.8 BLEU.
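
Scores in this format can be recomputed with sacrebleu, the library listed in the metadata. A minimal sketch follows; the file names are placeholders for system outputs and the aligned FLORES+ references, not files shipped with this model:

```python
# pip install sacrebleu
import sacrebleu

# Placeholder file names: one segment per line, hypotheses aligned with references.
with open("hypotheses.arn.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.arn.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# sacrebleu expects a list of reference streams (a single stream here).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])

print(f"BLEU {bleu.score:.1f} | ChrF {chrf.score:.1f} | TER {ter.score:.1f}")
```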

## 🗂️ Training Data

- Real parallel data: OPUS, PILAR (Occitan, Aranese)
- Synthetic data:
  - ~59k BLOOMZ-generated Aranese sentences (see the sketch below)
  - Forward translations and backtranslations produced with Apertium
- Final fine-tuning: FLORES+ dev set (997 segments)
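
The card does not specify the exact generation recipe. Purely as an illustration of BLOOMZ-based synthetic data, a sketch along these lines would work; the checkpoint size, prompt wording, and decoding settings are assumptions, not the authors' setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: the card names BLOOMZ, but not this particular size.
gen_name = "bigscience/bloomz-560m"
gen_tokenizer = AutoTokenizer.from_pretrained(gen_name)
gen_model = AutoModelForCausalLM.from_pretrained(gen_name)

spanish_sentence = "La reunión empieza a las nueve."
# Hypothetical prompt wording.
prompt = f"Translate the following Spanish sentence into Aranese: {spanish_sentence}\nTranslation:"

inputs = gen_tokenizer(prompt, return_tensors="pt")
outputs = gen_model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Keep only the generated continuation as the synthetic Aranese side.
synthetic = gen_tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(synthetic)
```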

## 🛠️ Multilingual Training Setup

We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, using:

- `oci_Latn` as the shared target-language tag, or
- a special token prefix such as `<arn>` or `<oci>` to distinguish the two targets (see the sketch below)
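
The prefix-token variant could be wired up roughly as follows. The `<arn>` / `<oci>` strings come from the description above; everything else (base-checkpoint handling, tokenizer settings, the paired example) is an illustrative assumption, not the authors' exact training code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="spa_Latn", tgt_lang="oci_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Register the target-selector tokens and grow the embeddings to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<arn>", "<oci>"]})
model.resize_token_embeddings(len(tokenizer))

# Prepend the selector to the Spanish source so the model can learn which
# of the two oci_Latn targets is wanted (example pair taken from this card).
batch = tokenizer(
    "<arn> ¿Cómo se encuentra usted hoy?",
    text_target="Coma se tròbe ué?",
    return_tensors="pt",
)
# 'batch' now holds input_ids, attention_mask, and labels, ready for a
# standard seq2seq fine-tuning loop (e.g. transformers.Seq2SeqTrainer).
```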

## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer, declaring the source language (Spanish)
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize the input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese shares the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the translation
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```

Example output:

```
Coma se tròbe ué?
```
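
Equivalently, the high-level `pipeline` API wraps the same steps; `src_lang` and `tgt_lang` are the standard arguments for NLLB-style checkpoints:

```python
from transformers import pipeline

# Same model as above, exposed through the translation pipeline.
translator = pipeline(
    "translation",
    model="jonathanmutal/WMT24-spanish-to-aranese-aragonese",
    src_lang="spa_Latn",
    tgt_lang="oci_Latn",
)
print(translator("¿Cómo se encuentra usted hoy?")[0]["translation_text"])
```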

## 🔍 Intended Uses

- Translate Spanish texts into Aranese or Occitan
- Research in low-resource multilingual MT
- Applications for language revitalization or public-health communication

## ⚠️ Limitations

- Aranese corpora remain extremely small
- Because Occitan and Aranese share the same `oci_Latn` tag, disambiguating between them may require the `<arn>` / `<oci>` prefix tokens described above
- Orthographic inconsistency and dialectal variation may affect output quality

## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title     = {{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}},
  author    = {Mutal, Jonathan and Ormaechea, Lucía},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  year      = {2024},
  pages     = {862--870}
}
```

## 👥 Authors

- Jonathan Mutal
- Lucía Ormaechea