---
language:
- ca
- es
- en
tags:
- translation
---
### Preprocessing
1. Normalisation and tokenisation with the Moses scripts
2. Truecasing with the model docgWP.tcmodel.[LAN] and the Moses scripts
3. BPE segmentation with the model model.caesen40k.bpe and subword-nmt
- Note: no language tag is prepended for multilinguality
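
The three steps above can be sketched as a shell pipeline assembled in Python. This is a minimal sketch, not the authors' actual script: the `MOSES_SCRIPTS` path is hypothetical, and it assumes that `[LAN]` in the truecasing model name stands for the language code (for example `ca`). The code only builds the commands; it does not run them.

```python
import shlex
from typing import List

# Hypothetical location of a Moses checkout; adjust to your setup.
MOSES_SCRIPTS = "mosesdecoder/scripts"


def preprocessing_commands(lang: str) -> List[List[str]]:
    """Return the preprocessing stages for one language as argv lists."""
    return [
        # Step 1: punctuation normalisation and tokenisation (Moses).
        ["perl", f"{MOSES_SCRIPTS}/tokenizer/normalize-punctuation.perl", "-l", lang],
        ["perl", f"{MOSES_SCRIPTS}/tokenizer/tokenizer.perl", "-l", lang],
        # Step 2: truecasing; assumes [LAN] is replaced by the language code.
        ["perl", f"{MOSES_SCRIPTS}/recaser/truecase.perl", "--model", f"docgWP.tcmodel.{lang}"],
        # Step 3: BPE segmentation with subword-nmt and the shared codes file.
        ["subword-nmt", "apply-bpe", "-c", "model.caesen40k.bpe"],
    ]


def as_shell_pipeline(lang: str) -> str:
    """Join the stages with pipes so text can be streamed through stdin/stdout."""
    return " | ".join(shlex.join(cmd) for cmd in preprocessing_commands(lang))


print(as_shell_pipeline("ca"))
```

Each stage reads from standard input and writes to standard output, so the printed string can be used as `cat input.ca | <pipeline> > input.bpe.ca`. No language tag is added to the source sentences, in line with the note above.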
### Training Data
1. Bilingual es-ca: DOGC, Wikimatrix, OpenSubtitles, JW300, GlobalVoices
* Bilingual es-ca: translations of Oscar and Wikipedia produced with systems trained on the data in 1.
2. Bilingual es-en, ca-en: United Nations, Europarl, Wikimatrix, OpenSubtitles, JW300
* Bilingual es-en, ca-en: translations of the missing pairs produced with systems trained on the data in 1.
- Final training data size for ca/es-en: 44M parallel sentences
- Fine-tuned with 1.5M real parallel data (without backtranslations)
### Model
Transformer big with guided alignments. Relevant parameters:
- `--beam-size 6`
- `--normalize 0.6`
- `--enc-depth 6 --dec-depth 6 --transformer-heads 8`
- `--transformer-preprocess n --transformer-postprocess da`
- `--transformer-dropout 0.1`
- `--label-smoothing 0.1`
- `--dim-emb 1024 --transformer-dim-ffn 4096`
- `--transformer-dropout-attention 0.1`
- `--transformer-dropout-ffn 0.1`
- `--learn-rate 0.00015 --lr-warmup 8000 --lr-decay-inv-sqrt 8000`
- `--optimizer-params 0.9 0.998 1e-09`
- `--clip-norm 5`
- `--tied-embeddings`
- `--exponential-smoothing`
- `--transformer-guided-alignment-layer 1 --guided-alignment-cost mse --guided-alignment-weight 0.1`
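
To make the learning-rate options concrete, the sketch below computes the rate implied by `--learn-rate 0.00015 --lr-warmup 8000 --lr-decay-inv-sqrt 8000`. It assumes the usual reading of these options (linear warmup over the first 8000 updates, then decay proportional to the inverse square root of the update count); the toolkit's exact implementation may differ in details, and the function name is illustrative.

```python
import math

BASE_LR = 0.00015     # --learn-rate
WARMUP = 8000         # --lr-warmup
DECAY_START = 8000    # --lr-decay-inv-sqrt


def learning_rate(step: int) -> float:
    """Approximate learning rate at a given update (assumed schedule)."""
    # Linear ramp from 0 to the base rate during warmup.
    warmup_factor = min(1.0, step / WARMUP)
    # After DECAY_START updates, scale by sqrt(DECAY_START / step).
    decay_factor = math.sqrt(DECAY_START / max(step, DECAY_START))
    return BASE_LR * warmup_factor * decay_factor


for step in (4000, 8000, 32000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the base value at update 8000 and falls to half of it by update 32000, since sqrt(8000 / 32000) = 0.5.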
## Evaluation
### Test set
https://github.com/PLXIV/Gebiotoolkit/tree/master/gebiocorpus_v2
### ca2en
BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 47.8 (μ = 47.8 ± 0.9)
chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 69.9 (μ = 69.9 ± 0.7)
### es2en
BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 48.9 (μ = 48.9 ± 0.9)
chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 70.5 (μ = 70.5 ± 0.7)
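
The result lines above are in the sacreBLEU signature format: a metric name, `key:value` settings separated by `|`, the score, and a bootstrap mean with its interval. A minimal sketch of a parser for them follows, useful for comparing runs programmatically. The `MetricResult` type and `parse_result` function are illustrative names, not part of sacreBLEU, and the reading of the keys in the comments (`bs` as bootstrap resamples, `rs` as random seed, `tok` as tokenizer, `v` as version) follows sacreBLEU's documented abbreviations.

```python
import re
from typing import Dict, NamedTuple


class MetricResult(NamedTuple):
    metric: str             # e.g. "BLEU" or "chrF2"
    params: Dict[str, str]  # e.g. {"bs": "1000", "rs": "12345", ...}
    score: float            # score on the full test set
    mean: float             # bootstrap mean (μ)
    interval: float         # half-width reported after ±


_PATTERN = re.compile(
    r"^(?P<sig>\S+)\s*=\s*(?P<score>[\d.]+)\s*"
    r"\(μ\s*=\s*(?P<mean>[\d.]+)\s*±\s*(?P<interval>[\d.]+)\)$"
)


def parse_result(line: str) -> MetricResult:
    """Split one result line into metric name, settings and numbers."""
    match = _PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised result line: {line!r}")
    metric, *fields = match.group("sig").split("|")
    # Each field is "key:value"; split once so values may contain ':'.
    params = dict(field.split(":", 1) for field in fields)
    return MetricResult(
        metric,
        params,
        float(match.group("score")),
        float(match.group("mean")),
        float(match.group("interval")),
    )


result = parse_result(
    "BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 47.8 (μ = 47.8 ± 0.9)"
)
print(result.metric, result.score, result.params["tok"])
```

Keeping the full signature next to each score matters because BLEU and chrF values are only comparable when the tokenizer, casing and version settings match.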