---
license: apache-2.0
language:
- es
- oci
- arg
tags:
- translation
- low-resource
- aranese
- occitan
- multilingual
- NLLB
- bloomz
- WMT24
datasets:
- OPUS
- PILAR
- flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
- name: TIM-UNIGE WMT24 Multilingual Aranese Model
results:
- task:
name: Translation
type: translation
dataset:
name: FLORES+
type: flores
metrics:
- name: BLEU
type: BLEU
value: 30.1
verified: true
args:
target: spa-arn
- name: ChrF
type: ChrF
value: 49.8
verified: true
args:
target: spa-arn
- name: TER
type: TER
value: 71.5
verified: true
args:
target: spa-arn
metrics:
- sacrebleu
- ter
- chrf
paper:
- name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
url: https://doi.org/10.18653/v1/2024.wmt-1.82
---
# TIM-UNIGE Multilingual Aranese Model (WMT24)
This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).
## 🧠 Model Description
- Architecture: NLLB (600M distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the `oci_Latn` language tag
- Optional special tokens `<arn>` / `<oci>` used in training to distinguish the targets
## 📊 Performance
Evaluated on **FLORES+ test set**:
| Language | BLEU | ChrF | TER |
|-----------|------|------|------|
| Aranese | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |
- Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.
## 🗂️ Training Data
- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
- BLOOMZ-generated Aranese sentences (~59k)
- Forward and backtranslations using Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)
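Mixing real and synthetic corpora usually involves deduplicating pairs and keeping the dev set out of the training mix. A minimal sketch of such a step (the function and filtering rule are illustrative, not the authors' exact pipeline):

```python
def mix_corpora(real_pairs, synthetic_pairs, dev_sources):
    """Merge real and synthetic (src, tgt) pairs, dropping duplicate
    sources and any pair whose source appears in the dev set."""
    seen, mixed = set(dev_sources), []
    for src, tgt in list(real_pairs) + list(synthetic_pairs):
        if src not in seen:
            seen.add(src)
            mixed.append((src, tgt))
    return mixed

pairs = mix_corpora(
    real_pairs=[("hola", "adiu")],
    synthetic_pairs=[("hola", "adiu"), ("gracias", "mercés")],
    dev_sources={"buenos días"},
)
print(pairs)  # → [('hola', 'adiu'), ('gracias', 'mercés')]
```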
## 🛠️ Multilingual Training Setup
We trained the model jointly on Spanish–Occitan and Spanish–Aranese, using either:
- `oci_Latn` as the shared target-language tag, or
- a special token prefix such as `<arn>` or `<oci>` to distinguish the two targets
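If the checkpoint was trained with the prefix-token scheme, the target variety is selected by prepending the token to the Spanish source before tokenization. A minimal sketch (the helper function is hypothetical; only the `<arn>` / `<oci>` tokens come from the training setup above):

```python
# Target-selector tokens used during training (see setup above)
TARGET_PREFIXES = {"arn": "<arn>", "oci": "<oci>"}

def add_target_prefix(text: str, target: str) -> str:
    """Prepend the target-language selector token to the Spanish source."""
    return f"{TARGET_PREFIXES[target]} {text}"

print(add_target_prefix("¿Cómo se encuentra usted hoy?", "arn"))
# → <arn> ¿Cómo se encuentra usted hoy?
```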
## 🚀 Quick Example (Spanish → Aranese)
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; set the NLLB source-language tag for Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the translation
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
Example output:
```
Com se trape vos aué?
```
## 🔍 Intended Uses
- Translate Spanish texts into **Aranese** or **Occitan**
- Research in **low-resource multilingual MT**
- Applications for **language revitalization** or public health communication
## ⚠️ Limitations
- Aranese corpora remain extremely small
- When the same `oci_Latn` tag is used for both Occitan and Aranese, disambiguation may require the special prefix tokens (`<arn>` / `<oci>`)
- Orthographic inconsistency or dialect variation may affect quality
## 📚 Citation
```bibtex
@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}
```
## 👥 Authors
- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea
TIM, University of Geneva