---
language: 
- ca 
- es
- en
tags:
- translation
---

### Preprocessing
1. Normalisation and tokenisation with Moses scripts
2. Truecasing with the model docgWP.tcmodel.[LAN] and Moses scripts
3. BPE segmentation with the model model.caesen40k.bpe and subword-nmt
- Note: no language tag is prepended for multilinguality
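
The steps above can be chained as a shell pipeline. The sketch below only assembles the command lines and does not run them. It rests on several assumptions the card does not state: a local Moses checkout under the hypothetical path `mosesdecoder/scripts`, `[LAN]` standing for a language code such as `ca`, and the usual script options (`-l` for the Moses scripts, `--model` for `truecase.perl`, `-c` for `subword-nmt apply-bpe`). The exact options the authors used may differ.

```python
import shlex
from pathlib import PurePosixPath
from typing import List

# Assumption: location of a local Moses checkout (hypothetical path).
MOSES_SCRIPTS = PurePosixPath("mosesdecoder/scripts")


def preprocessing_commands(lang: str,
                           moses: PurePosixPath = MOSES_SCRIPTS) -> List[List[str]]:
    """Return one command per preprocessing stage, in pipeline order."""
    return [
        # Step 1a: punctuation normalisation
        ["perl", str(moses / "tokenizer" / "normalize-punctuation.perl"), "-l", lang],
        # Step 1b: tokenisation
        ["perl", str(moses / "tokenizer" / "tokenizer.perl"), "-l", lang],
        # Step 2: truecasing; assumes [LAN] is replaced by the language code
        ["perl", str(moses / "recaser" / "truecase.perl"),
         "--model", f"docgWP.tcmodel.{lang}"],
        # Step 3: BPE segmentation with the shared codes file
        ["subword-nmt", "apply-bpe", "-c", "model.caesen40k.bpe"],
    ]


def as_shell_pipeline(commands: List[List[str]]) -> str:
    """Join the stages with pipes so text flows from stdin to stdout."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(as_shell_pipeline(preprocessing_commands("ca")))
```

Since no language tag is prepended, the same pipeline applies to every source sentence; only the language code passed to the Moses scripts and the truecasing model changes.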

### Training Data
1. Bilingual es-ca: DOGC, Wikimatrix, OpenSubtitles, JW300, GlobalVoices
* Bilingual es-ca: translations of Oscar and Wikipedia produced with systems trained on 1.
2. Bilingual es-en, ca-en: United Nations, Europarl, Wikimatrix, OpenSubtitles, JW300
* Bilingual es-en, ca-en: translations of the missing pairs produced with systems trained on 1.

- Final training data size for ca/es-en: 44M parallel sentences
- Fine-tuned with 1.5M real parallel data (without backtranslations)

### Model
Transformer big with guided alignments. Relevant parameters:

--beam-size 6 

--normalize 0.6 

--enc-depth 6  --dec-depth 6  --transformer-heads 8

--transformer-preprocess n  --transformer-postprocess da 

--transformer-dropout 0.1 

--label-smoothing 0.1 

--dim-emb 1024  --transformer-dim-ffn 4096 

--transformer-dropout-attention 0.1 

--transformer-dropout-ffn 0.1 

--learn-rate 0.00015 --lr-warmup 8000 --lr-decay-inv-sqrt 8000 

--optimizer-params 0.9 0.998 1e-09 

--clip-norm 5 

--tied-embeddings 

--exponential-smoothing 

--transformer-guided-alignment-layer 1 --guided-alignment-cost mse --guided-alignment-weight 0.1
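
As a rough illustration of the learning-rate flags above, the sketch below computes the rate at a given update. It assumes, without confirmation from this card, that the schedule is a linear warmup over `--lr-warmup` updates followed by inverse square-root decay starting at `--lr-decay-inv-sqrt` updates, which is how Marian NMT defines options with these names. The function and constant names are illustrative only.

```python
import math

LEARN_RATE = 0.00015  # --learn-rate
WARMUP = 8000         # --lr-warmup
DECAY_START = 8000    # --lr-decay-inv-sqrt


def learning_rate(step: int,
                  base: float = LEARN_RATE,
                  warmup: int = WARMUP,
                  decay_start: int = DECAY_START) -> float:
    """Learning rate at update `step` (1-based) under the assumed schedule."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear ramp from 0 up to the base rate during warmup.
    warmup_factor = min(1.0, step / warmup)
    # Inverse square-root decay once `decay_start` updates have passed.
    decay_factor = min(1.0, math.sqrt(decay_start / step))
    return base * warmup_factor * decay_factor


if __name__ == "__main__":
    for step in (1000, 4000, 8000, 32000, 128000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at 0.00015 at update 8000 and falls to half that value by update 32000.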


## Evaluation

### Test set

https://github.com/PLXIV/Gebiotoolkit/tree/master/gebiocorpus_v2

### ca2en
BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 47.8 (μ = 47.8 ± 0.9)

chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 69.9 (μ = 69.9 ± 0.7)

### es2en
BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 48.9 (μ = 48.9 ± 0.9) 

chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 70.5 (μ = 70.5 ± 0.7)
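
The score lines in this section are sacreBLEU signatures: `#` is the number of references, `bs` the number of bootstrap resamples, `rs` the random seed, `c` the casing, `e` the effective-order setting, `tok` the tokeniser, `s` the smoothing method (BLEU) or whitespace handling (chrF), `nc` and `nw` the character and word n-gram orders, and `v` the sacreBLEU version. The value after `±` is the 95% confidence interval around the bootstrap mean `μ`. The sketch below is a small standard-library parser for such lines; the function name and the returned dictionary layout are illustrative choices, not part of sacreBLEU.

```python
import re

# One score line: "<signature> = <score> (μ = <mean> ± <interval>)"
LINE_RE = re.compile(
    r"^\s*(?P<signature>\S+)\s*=\s*(?P<score>[\d.]+)"
    r"\s*\(μ\s*=\s*(?P<mean>[\d.]+)\s*±\s*(?P<ci>[\d.]+)\)\s*$"
)


def parse_score_line(line: str) -> dict:
    """Split a sacreBLEU-style score line into metric, settings and numbers."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a recognised score line: {line!r}")
    # The signature is the metric name followed by "key:value" pairs.
    metric, *pairs = match["signature"].split("|")
    fields = dict(pair.split(":", 1) for pair in pairs)
    return {
        "metric": metric,
        "fields": fields,
        "score": float(match["score"]),
        "mean": float(match["mean"]),
        "ci": float(match["ci"]),
    }


if __name__ == "__main__":
    line = ("BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0"
            " = 48.9 (μ = 48.9 ± 0.9)")
    print(parse_score_line(line))
```

Parsing the signatures this way makes it easy to check that two reported scores were computed with the same settings before comparing them.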