---
license: apache-2.0
language:
  - es
  - oci
  - arg
tags:
  - translation
  - low-resource
  - aranese
  - occitan
  - multilingual
  - NLLB
  - bloomz
  - WMT24
datasets:
  - OPUS
  - PILAR
  - flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
  - name: TIM-UNIGE WMT24 Multilingual Aranese Model
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: FLORES+
          type: flores
        metrics:
          - name: BLEU
            type: BLEU
            value: 30.1
            verified: true
            args:
              target: spa-arn
          - name: ChrF
            type: ChrF
            value: 49.8
            verified: true
            args:
              target: spa-arn
          - name: TER
            type: TER
            value: 71.5
            verified: true
            args:
              target: spa-arn
metrics:
  - sacrebleu
  - ter
  - chrf
paper:
  - name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
    url: https://doi.org/10.18653/v1/2024.wmt-1.82
---

# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).

## 🧠 Model Description

- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` were used during training to distinguish the two targets

## 📊 Performance

Evaluated on the **FLORES+ test set**:

| Language  | BLEU | ChrF | TER  |
|-----------|------|------|------|
| Aranese   | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish → Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish → Aragonese outperforms the Apertium baseline by +0.8 BLEU.

## 🗂️ Training Data

- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
  - BLOOMZ-generated Aranese sentences (~59k)
  - Forward translations and back-translations produced with Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)
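
The paper's multistage schedule is not reproduced here, but the idea of balancing scarce real data against plentiful synthetic data can be illustrated with a hypothetical sampling sketch (the sentence pairs and the 70/30 ratio below are invented for the example):

```python
import random

random.seed(0)

# Hypothetical stand-ins: real pairs come from OPUS/PILAR, synthetic ones
# from BLOOMZ generations and Apertium (back-)translations.
real = [("Buenos días.", "Bon dia.")]
synthetic = [("Gracias.", "Mercés.")]

def sample_batch(real, synthetic, synth_ratio=0.7, n=10):
    """Draw n pairs, picking a synthetic pair with probability synth_ratio."""
    return [
        random.choice(synthetic if random.random() < synth_ratio else real)
        for _ in range(n)
    ]

batch = sample_batch(real, synthetic)
print(len(batch))  # 10
```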

## 🛠️ Multilingual Training Setup

We trained the model jointly on Spanish–Occitan and Spanish–Aranese data, using either:

- `oci_Latn` as the shared language tag, or
- a special token prefix such as `<arn>` or `<oci>` to distinguish the two targets
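
The second option can be sketched as a simple preprocessing step; the tag strings match the card, but the function and example pairs below are illustrative rather than the exact training code:

```python
# Illustrative preprocessing: mark each Spanish source with the intended
# target variety so one model can serve both Aranese and Occitan.
TARGET_TAGS = {"aranese": "<arn>", "occitan": "<oci>"}

def tag_source(spanish_text: str, target: str) -> str:
    """Prepend the target-variety tag to a Spanish input sentence."""
    return f"{TARGET_TAGS[target]} {spanish_text}"

tagged = [tag_source("¿Cómo estás?", t) for t in ("aranese", "occitan")]
print(tagged)  # ['<arn> ¿Cómo estás?', '<oci> ¿Cómo estás?']
```

At inference time the same prefix would be prepended to the source text before tokenization.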

## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang tells the NLLB tokenizer that the input is Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese shares the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the output
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```

Example output:

```
Com se trape vos aué?
```

## 🔍 Intended Uses

- Translating Spanish texts into **Aranese** or **Occitan**
- Research on **low-resource multilingual MT**
- Applications in **language revitalization** or public health communication

## ⚠️ Limitations

- Aranese corpora remain extremely small
- Because the same `oci_Latn` tag covers both Occitan and Aranese, **disambiguating the target variety may require the special prefix tokens**
- Orthographic inconsistency and dialect variation may affect output quality

## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```

## 👥 Authors

- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea

TIM, University of Geneva