---
license: apache-2.0
language:
  - es
  - oci
  - arg
tags:
  - translation
  - low-resource
  - aranese
  - occitan
  - multilingual
  - NLLB
  - bloomz
  - WMT24
datasets:
  - OPUS
  - PILAR
  - flores_plus
pipeline_tag: translation
library_name: transformers
model-index:
  - name: TIM-UNIGE WMT24 Multilingual Aranese Model
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: FLORES+
          type: flores
        metrics:
          - name: BLEU
            type: BLEU
            value: 30.1
            verified: true
            args:
              target: spa-arn
          - name: ChrF
            type: ChrF
            value: 49.8
            verified: true
            args:
              target: spa-arn
          - name: TER
            type: TER
            value: 71.5
            verified: true
            args:
              target: spa-arn
metrics:
  - sacrebleu
  - ter
  - chrf
paper:
  - name: "TIM-UNIGE: Translation into Low-Resource Languages of Spain for WMT24"
    url: https://doi.org/10.18653/v1/2024.wmt-1.82
---

# TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the [WMT24 Shared Task on Translation into Low-Resource Languages of Spain](https://statmt.org/wmt24/translation-task.html). It is a multilingual translation model that translates from **Spanish** into **Aranese** and **Occitan**, fine-tuned from [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M).

## 🧠 Model Description

- Architecture: NLLB (600M, distilled)
- Fine-tuned with a **multilingual multistage approach**
- Includes transfer from **Occitan** to improve **Aranese** translation
- Supports **Aranese and Occitan** via the `oci_Latn` language tag
- Optional special target-variety tokens were used in training to distinguish the two targets

## 📊 Performance

Evaluated on the **FLORES+ test set**:

| Language  | BLEU | ChrF | TER  |
|-----------|------|------|------|
| Aranese   | 30.1 | 49.8 | 71.5 |
| Aragonese | 61.9 | 79.5 | 26.8 |

- Spanish → Aranese outperforms the Apertium baseline by +1.3 BLEU.
- Spanish → Aragonese outperforms the Apertium baseline by +0.8 BLEU.

A hedged scoring sketch using `sacrebleu` appears at the end of this card.

## 🗂️ Training Data

- **Real parallel data**: OPUS, PILAR (Occitan, Aranese)
- **Synthetic data**:
  - BLOOMZ-generated Aranese sentences (~59k)
  - Forward translations and backtranslations produced with Apertium
- **Final fine-tuning**: FLORES+ dev set (997 segments)

## 🛠️ Multilingual Training Setup

We trained the model on Spanish–Occitan and Spanish–Aranese jointly, using either:

- `oci_Latn` as the shared target-language tag, or
- a special token prefix to distinguish the two varieties (see the sketch after the Quick Example below)

## 🚀 Quick Example (Spanish → Aranese)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; set the NLLB source-language tag to Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5,
)

# Decode and print the translation
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```

Example output:

```
Com se trape vos aué?
```
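
## 🔀 Choosing Between Occitan and Aranese (sketch)

Since both targets share the `oci_Latn` tag, the training setup above mentions an optional special token prefix for disambiguation. The exact token strings are not listed on this card, so the `<arn>` and `<oci>` prefixes below are hypothetical placeholders; substitute the tokens your checkpoint was actually trained with, or leave the prefix empty to rely on the shared tag alone.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text: str, variety_prefix: str = "") -> str:
    """Translate Spanish text, optionally prepending a target-variety prefix."""
    # With the shared oci_Latn tag alone, leave variety_prefix empty.
    source = f"{variety_prefix} {text}".strip()
    inputs = tokenizer(source, return_tensors="pt")
    output_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
        max_length=50,
        num_beams=5,
    )
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

# "<arn>" and "<oci>" are hypothetical placeholder tokens, not confirmed by this card.
print(translate("¿Cómo se encuentra usted hoy?", "<arn>"))  # target: Aranese
print(translate("¿Cómo se encuentra usted hoy?", "<oci>"))  # target: Occitan
```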
## 🔍 Intended Uses

- Translate Spanish texts into **Aranese** or **Occitan**
- Research in **low-resource multilingual MT**
- Applications in **language revitalization** or public health communication

## ⚠️ Limitations

- Aranese corpora remain extremely small
- If the same `oci_Latn` tag is used for both Occitan and Aranese, **disambiguation may require special prompts** (see the sketch after the Quick Example above)
- Orthographic inconsistency and dialectal variation may affect quality

## 📚 Citation

```bibtex
@inproceedings{mutal2024timunige,
  title     = {{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}},
  author    = {Mutal, Jonathan and Ormaechea, Lucía},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  year      = {2024},
  pages     = {862--870}
}
```

## 👥 Authors

- [Jonathan Mutal](https://huggingface.co/jonathanmutal)
- Lucía Ormaechea

TIM, University of Geneva
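
## 📏 Scoring Sketch (FLORES+)

The scores in the Performance table can be recomputed with `sacrebleu` along these lines. This is a minimal sketch, assuming `flores.spa` and `flores.arn` are hypothetical line-aligned files of Spanish sources and Aranese references; the exact FLORES+ split and preprocessing are described in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import sacrebleu

model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# HYPOTHETICAL file names: line-aligned Spanish sources and Aranese references
with open("flores.spa", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("flores.arn", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate in small batches
hypotheses = []
for i in range(0, len(sources), 16):
    batch = tokenizer(sources[i : i + 16], return_tensors="pt", padding=True)
    output_tokens = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
        max_length=128,
        num_beams=5,
    )
    hypotheses.extend(tokenizer.batch_decode(output_tokens, skip_special_tokens=True))

# Corpus-level BLEU, ChrF, and TER, matching the metrics reported above
print("BLEU:", sacrebleu.corpus_bleu(hypotheses, [references]).score)
print("ChrF:", sacrebleu.corpus_chrf(hypotheses, [references]).score)
print("TER :", sacrebleu.corpus_ter(hypotheses, [references]).score)
```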