Preprocessing

  1. Normalisation and tokenisation with the Moses scripts
  2. Truecasing with the model docgWP.tcmodel.[LAN] and the Moses scripts
  3. BPE segmentation with the model model.caesen40k.bpe and subword-nmt
  • Note: no language tag is prepended for multilinguality
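
The three steps above can be sketched as a chain of command lines. This is a minimal sketch, not the authors' actual pipeline: the Moses checkout path `mosesdecoder/scripts` is hypothetical, the use of `normalize-punctuation.perl` for the normalisation step is an assumption, and `[LAN]` is assumed to stand for a language code such as `ca` or `es`. The code only builds the commands; it does not run them.

```python
import shlex
from pathlib import Path
from typing import List

# Hypothetical location of a Moses checkout; adjust to the local setup.
MOSES_SCRIPTS = Path("mosesdecoder/scripts")


def preprocessing_commands(lang: str) -> List[List[str]]:
    """Return the argv of each preprocessing stage, in pipeline order.

    Every command reads text on stdin and writes to stdout, so the
    stages can be chained with pipes.
    """
    tokenizer_dir = MOSES_SCRIPTS / "tokenizer"
    return [
        # Step 1a: punctuation normalisation (assumed script)
        ["perl", str(tokenizer_dir / "normalize-punctuation.perl"), "-l", lang],
        # Step 1b: tokenisation
        ["perl", str(tokenizer_dir / "tokenizer.perl"), "-l", lang],
        # Step 2: truecasing; [LAN] is assumed to be the language code
        ["perl", str(MOSES_SCRIPTS / "recaser" / "truecase.perl"),
         "--model", f"docgWP.tcmodel.{lang}"],
        # Step 3: BPE segmentation with the shared codes file
        ["subword-nmt", "apply-bpe", "-c", "model.caesen40k.bpe"],
    ]


def as_shell_pipeline(lang: str) -> str:
    """Join the stages into a single shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in preprocessing_commands(lang))


if __name__ == "__main__":
    print(as_shell_pipeline("ca"))
```

Since no language tag is prepended, the output of the last stage is what the model would receive as input.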

Training Data

  1. Bilingual es-ca: DOGC, Wikimatrix, OpenSubtitles, JW300, GlobalVoices
  2. Bilingual es-ca: translations of Oscar and Wikipedia produced with systems trained on the data in 1.
  3. Bilingual es-en, ca-en: United Nations, Europarl, Wikimatrix, OpenSubtitles, JW300
  4. Bilingual es-en, ca-en: translations of the missing pairs produced with systems trained on the data in 1.
  • Final training data size for ca/es-en: 44M parallel sentences
  • Fine-tuned on 1.5M real parallel data (without back-translations)

Model

Transformer big with guided alignments. Relevant parameters:

  --beam-size 6
  --normalize 0.6
  --enc-depth 6 --dec-depth 6 --transformer-heads 8
  --transformer-preprocess n --transformer-postprocess da
  --transformer-dropout 0.1
  --label-smoothing 0.1
  --dim-emb 1024 --transformer-dim-ffn 4096
  --transformer-dropout-attention 0.1
  --transformer-dropout-ffn 0.1
  --learn-rate 0.00015 --lr-warmup 8000 --lr-decay-inv-sqrt 8000
  --optimizer-params 0.9 0.998 1e-09
  --clip-norm 5
  --tied-embeddings
  --exponential-smoothing
  --transformer-guided-alignment-layer 1 --guided-alignment-cost mse --guided-alignment-weight 0.1
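
To make the learning-rate flags concrete, the sketch below computes the rate per update. The flag names match those of the Marian NMT toolkit, which this card does not name explicitly. The schedule is therefore an assumption: a linear warmup over the first 8000 updates, followed by inverse-square-root decay starting at update 8000. The function name `learning_rate` is hypothetical.

```python
import math

LEARN_RATE = 0.00015    # --learn-rate
WARMUP_UPDATES = 8000   # --lr-warmup
DECAY_START = 8000      # --lr-decay-inv-sqrt


def learning_rate(update: int) -> float:
    """Assumed schedule: linear warmup, then inverse-square-root decay."""
    if update < 1:
        raise ValueError("update must be >= 1")
    # Ramp linearly from 0 to the peak rate during warmup.
    warmup = min(1.0, update / WARMUP_UPDATES)
    # After the decay start, scale the rate by sqrt(start / update).
    decay = math.sqrt(DECAY_START / update) if update > DECAY_START else 1.0
    return LEARN_RATE * warmup * decay


if __name__ == "__main__":
    for step in (1000, 8000, 32000, 128000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at 0.00015 at update 8000 and halves each time the update count quadruples.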

Evaluation

Test set

https://github.com/PLXIV/Gebiotoolkit/tree/master/gebiocorpus_v2

ca2en

BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 47.8 (μ = 47.8 ± 0.9)

chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 69.9 (μ = 69.9 ± 0.7)

es2en

BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 48.9 (μ = 48.9 ± 0.9)

chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 70.5 (μ = 70.5 ± 0.7)
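
The score lines above appear to be sacreBLEU 2.0.0 signatures with bootstrap resampling enabled. The sketch below splits such a line into its parts. It assumes the usual reading of the abbreviated keys (`#` number of references, `bs` bootstrap resamples, `rs` random seed, `c` case, `e` effective order, `tok` tokenizer, `v` version) and takes the `±` value to be the confidence interval around the bootstrap mean. The function name `parse_score_line` is hypothetical.

```python
import re
from typing import Dict, Union

# Matches the result part, e.g. "47.8 (μ = 47.8 ± 0.9)".
RESULT_RE = re.compile(r"([\d.]+) \(μ = ([\d.]+) ± ([\d.]+)\)")


def parse_score_line(line: str) -> Dict[str, Union[str, float, Dict[str, str]]]:
    """Split a score line into metric name, parameters, and numbers."""
    signature, _, result = line.partition(" = ")
    metric, *fields = signature.split("|")
    # Each field has the form "key:value".
    params = dict(field.split(":", 1) for field in fields)
    match = RESULT_RE.fullmatch(result.strip())
    if match is None:
        raise ValueError(f"unrecognised result: {result!r}")
    score, mean, interval = (float(group) for group in match.groups())
    return {
        "metric": metric,
        "params": params,
        "score": score,      # score on the full test set
        "mean": mean,        # mean over the bootstrap resamples
        "interval": interval,
    }


if __name__ == "__main__":
    line = ("BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0"
            " = 47.8 (μ = 47.8 ± 0.9)")
    print(parse_score_line(line))
```

Parsing the signatures this way makes it easy to check that two reported scores were computed with the same settings before comparing them.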
