---
license: apache-2.0
language:
- es
- oci
- arg
tags:
- translation
- low-resource
- aranese
- occitan
- multilingual
- NLLB
- bloomz
- WMT24
datasets:
- OPUS
- PILAR
- flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
- name: TIM-UNIGE WMT24 Multilingual Aranese Model
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: FLORES+
      type: flores
    metrics:
    - name: BLEU
      type: bleu
      value: 30.1
      verified: true
      args:
        target: spa-arn
    - name: ChrF
      type: chrf
      value: 49.8
      verified: true
      args:
        target: spa-arn
    - name: TER
      type: ter
      value: 71.5
      verified: true
      args:
        target: spa-arn
metrics:
- sacrebleu
- ter
- chrf
paper:
- name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
  url: https://doi.org/10.18653/v1/2024.wmt-1.82
---
# TIM-UNIGE Multilingual Aranese Model (WMT24)
This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).
## 🧠 Model Description
- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the shared `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used during training to distinguish the two targets
## 📊 Performance
Evaluated on the **FLORES+ test set**:

| Language | BLEU ↑ | ChrF ↑ | TER ↓ |
|-----------|------|------|------|
| Aranese | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.
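ChrF, reported above, is a character n-gram F-score. A minimal pure-Python sketch of the idea (the official scores come from sacreBLEU's implementation, which this simplifies, e.g. it skips whitespace handling details and word n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams; chrF conventionally ignores spaces
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Average F-beta over character n-gram orders 1..max_n (recall-weighted, beta=2)
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        match = sum((hyp & ref).values())
        prec = match / sum(hyp.values())
        rec = match / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Com se trape vos aué?", "Com se trape vos aué?"))  # identical strings -> 100.0
```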
## 🗂️ Training Data
- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
- BLOOMZ-generated Aranese sentences (~59k)
- Forward and backtranslations using Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)
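Back-translation, as used above, turns monolingual target-side text into synthetic training pairs. A schematic sketch with a stand-in for the Apertium call (`apertium_arn_to_spa` and the toy sentences are placeholders, not a real API or real data):

```python
def apertium_arn_to_spa(sentence):
    # Placeholder for a rule-based Apertium arn->spa translation call
    lookup = {"Bon dia": "Buenos días"}
    return lookup.get(sentence, sentence)

# Monolingual Aranese text (toy example)
monolingual_aranese = ["Bon dia"]

# Each monolingual target sentence becomes a (synthetic source, real target) pair
synthetic_pairs = [(apertium_arn_to_spa(t), t) for t in monolingual_aranese]
print(synthetic_pairs)  # [('Buenos días', 'Bon dia')]
```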
## 🛠️ Multilingual Training Setup
We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, using:
- `oci_Latn` as the shared target language tag
- optionally, a special token prefix (`<arn>` or `<oci>`) to distinguish the two varieties
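The prefixing scheme can be sketched as follows (the sentence pairs and field names are illustrative, not the actual training pipeline):

```python
# Sketch: prepend a target-variety token to each Spanish source so one model
# can learn both varieties under the shared oci_Latn code.
# Toy pairs; the real data comes from OPUS, PILAR, and synthetic sources.
spanish_occitan = [("Buenos días", "Bon dia")]
spanish_aranese = [("Buenos días", "Bon dia")]

def with_prefix(pairs, tag):
    # The variety tag goes on the source side; the decoder side still
    # starts from the oci_Latn language token.
    return [{"src": f"{tag} {src}", "tgt": tgt} for src, tgt in pairs]

train_set = with_prefix(spanish_occitan, "<oci>") + with_prefix(spanish_aranese, "<arn>")
for ex in train_set:
    print(ex["src"], "->", ex["tgt"])
```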
## 🚀 Quick Example (Spanish → Aranese)
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang makes the NLLB tokenizer prepend the
# Spanish language token (the default would be eng_Latn)
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses 'oci_Latn' in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
Example output:
```
Com se trape vos aué?
```
## 🔍 Intended Uses
- Translate Spanish texts into **Aranese** or **Occitan**
- Research in **low-resource multilingual MT**
- Applications for **language revitalization** or public health communication
## ⚠️ Limitations
- Aranese corpora remain extremely small
- Since Occitan and Aranese share the `oci_Latn` tag, **disambiguating the target variety may require the `<arn>` / `<oci>` prefixes**
- Orthographic inconsistency or dialect variation may affect quality
## 📚 Citation
```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```
## 👥 Authors
- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea
TIM, University of Geneva