---
license: apache-2.0
language:
- es
- oci
- arg
tags:
- translation
- low-resource
- aranese
- occitan
- multilingual
- NLLB
- bloomz
- WMT24
datasets:
- OPUS
- PILAR
- flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
- name: TIM-UNIGE WMT24 Multilingual Aranese Model
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: FLORES+
      type: flores
    metrics:
    - name: BLEU
      type: BLEU
      value: 30.1
      verified: true
      args:
        target: spa-arn
    - name: ChrF
      type: ChrF
      value: 49.8
      verified: true
      args:
        target: spa-arn
    - name: TER
      type: TER
      value: 71.5
      verified: true
      args:
        target: spa-arn
metrics:
- sacrebleu
- ter
- chrf
paper:
  - name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
    url: https://doi.org/10.18653/v1/2024.wmt-1.82
---

# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).

## 🧠 Model Description

- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` used in training to distinguish the targets

## 📊 Performance

Evaluated on the **FLORES+ test set**:

| Language  | BLEU | ChrF | TER  |
|-----------|------|------|------|
| Aranese   | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.


## 🗂️ Training Data

- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
  - BLOOMZ-generated Aranese sentences (~59k)
  - Forward and backtranslations using Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)

## 🛠️ Multilingual Training Setup

We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, distinguishing the two targets in one of two ways:
- `oci_Latn` as the shared NLLB language tag for both varieties
- a special token prefix (`<arn>` or `<oci>`) prepended to the source to mark the intended target
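
The prefix-token option can be sketched with a small helper. The `<arn>`/`<oci>` token strings come from the description above, but the function name and API are illustrative, not part of the released model:

```python
def with_target_prefix(text: str, target: str) -> str:
    """Prepend a target-variety token (<arn> or <oci>) to a source sentence.

    Sketch only: the exact preprocessing used in training is not shipped
    with this model, so treat this helper as illustrative.
    """
    if target not in {"arn", "oci"}:
        raise ValueError(f"unsupported target variety: {target}")
    return f"<{target}> {text}"

# The prefixed sentence is then tokenized and translated as usual.
print(with_target_prefix("¿Cómo se encuentra usted hoy?", "arn"))
# <arn> ¿Cómo se encuentra usted hoy?
```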

## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer, setting the NLLB source language to Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses 'oci_Latn' in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5
)

# Decode and print output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```

Example output:
```
Com se trape vos aué?
```

## 🔍 Intended Uses

- Translate Spanish texts into **Aranese** or **Occitan**
- Research in **low-resource multilingual MT**
- Applications for **language revitalization** or public health communication

## ⚠️ Limitations

- Aranese corpora remain extremely small
- Because Occitan and Aranese share the `oci_Latn` tag, **disambiguating the two may require the special prefix tokens** (`<arn>` / `<oci>`) used in training
- Orthographic inconsistency or dialect variation may affect quality

## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author = {Mutal, Jonathan and Ormaechea, Lucía},
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year = {2024},
  pages = {862--870}
}
```

## 👥 Authors

- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea  
TIM, University of Geneva