---
license: apache-2.0
language:
- es
- oci
- arg
tags:
- translation
- low-resource
- aranese
- occitan
- multilingual
- NLLB
- bloomz
- WMT24
datasets:
- OPUS
- PILAR
- flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
- name: TIM-UNIGE WMT24 Multilingual Aranese Model
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: FLORES+
      type: flores
    metrics:
    - name: BLEU
      type: bleu
      value: 30.1
      verified: true
      args:
        target: spa-arn
    - name: ChrF
      type: chrf
      value: 49.8
      verified: true
      args:
        target: spa-arn
    - name: TER
      type: ter
      value: 71.5
      verified: true
      args:
        target: spa-arn
metrics:
- sacrebleu
- ter
- chrf
paper:
- name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
  url: https://doi.org/10.18653/v1/2024.wmt-1.82
---
# TIM-UNIGE Multilingual Aranese Model (WMT24)
This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).
## 🧠 Model Description
- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the shared `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used during training to distinguish the two targets
## 📊 Performance
Evaluated on the **FLORES+ test set**:

| Language | BLEU ↑ | ChrF ↑ | TER ↓ |
|-----------|------|------|------|
| Aranese | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.
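ChrF, reported above, is a character n-gram F-score. A minimal pure-Python sketch of the idea (the official scores come from sacreBLEU's implementation, which this simplifies, e.g. it skips whitespace handling details and word n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams; chrF conventionally ignores spaces
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Average F-beta over character n-gram orders 1..max_n (recall-weighted, beta=2)
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        match = sum((hyp & ref).values())
        prec = match / sum(hyp.values())
        rec = match / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Com se trape vos aué?", "Com se trape vos aué?"))  # identical strings -> 100.0
```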
## 🗂️ Training Data
- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
- BLOOMZ-generated Aranese sentences (~59k)
- Forward and backtranslations using Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)
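Back-translation, as used above, turns monolingual target-side text into synthetic training pairs. A schematic sketch with a stand-in for the Apertium call (`apertium_arn_to_spa` and the toy sentences are placeholders, not a real API or real data):

```python
def apertium_arn_to_spa(sentence):
    # Placeholder for a rule-based Apertium arn->spa translation call
    lookup = {"Bon dia": "Buenos días"}
    return lookup.get(sentence, sentence)

# Monolingual Aranese text (toy example)
monolingual_aranese = ["Bon dia"]

# Each monolingual target sentence becomes a (synthetic source, real target) pair
synthetic_pairs = [(apertium_arn_to_spa(t), t) for t in monolingual_aranese]
print(synthetic_pairs)  # [('Buenos días', 'Bon dia')]
```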
## 🛠️ Multilingual Training Setup
We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, using:
- `oci_Latn` as the shared target language tag
- optionally, a special token prefix (`<arn>` or `<oci>`) to distinguish the two varieties
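The prefixing scheme can be sketched as follows (the sentence pairs and field names are illustrative, not the actual training pipeline):

```python
# Sketch: prepend a target-variety token to each Spanish source so one model
# can learn both varieties under the shared oci_Latn code.
# Toy pairs; the real data comes from OPUS, PILAR, and synthetic sources.
spanish_occitan = [("Buenos días", "Bon dia")]
spanish_aranese = [("Buenos días", "Bon dia")]

def with_prefix(pairs, tag):
    # The variety tag goes on the source side; the decoder side still
    # starts from the oci_Latn language token.
    return [{"src": f"{tag} {src}", "tgt": tgt} for src, tgt in pairs]

train_set = with_prefix(spanish_occitan, "<oci>") + with_prefix(spanish_aranese, "<arn>")
for ex in train_set:
    print(ex["src"], "->", ex["tgt"])
```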
## 🚀 Quick Example (Spanish → Aranese)
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang makes the NLLB tokenizer prepend the
# Spanish language token (the default would be eng_Latn)
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses 'oci_Latn' in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
Example output:
```
Com se trape vos aué?
```
## 🔍 Intended Uses
- Translate Spanish texts into **Aranese** or **Occitan**
- Research in **low-resource multilingual MT**
- Applications for **language revitalization** or public health communication
## ⚠️ Limitations
- Aranese corpora remain extremely small
- Since Occitan and Aranese share the `oci_Latn` tag, **disambiguating the target variety may require the `<arn>` / `<oci>` prefixes**
- Orthographic inconsistency or dialect variation may affect quality
## 📚 Citation
```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```
## 👥 Authors
- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea
TIM, University of Geneva