# Examples of Training scripts for Non-autoregressive Machine Translation models

### Non-autoregressive Transformer (NAT, Gu et al., 2017)
Note that the model needs an additional module that performs length prediction (its loss weighted by `--length-loss-factor`) before generating the whole sequence.
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch nonautoregressive_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
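Once training finishes, the checkpoint can be decoded with `fairseq-generate` on the same `translation_lev` task. The command below is a minimal sketch rather than part of the original recipe: the checkpoint path and batch size are placeholders, and `--iter-decode-max-iter 0` is chosen here because the vanilla NAT decodes in a single pass (the iterative models further down raise this value).

```bash
# Illustrative single-pass decoding; checkpoint path and batch size are placeholders.
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```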
### Fast Structured Decoding for Sequence Models (NAT-CRF, Sun et al., 2019)
Note that we implement a low-rank approximated CRF model by setting `--crf-lowrank-approx=32` and `--crf-beam-approx=64`, as described in the original paper. All other settings are the same as for the vanilla NAT model.
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch nacrf_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --word-ins-loss-factor 0.5 \
    --crf-lowrank-approx 32 \
    --crf-beam-approx 64 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
### Non-autoregressive Transformer with Iterative Refinement (iNAT, Lee et al., 2018)
Note that `--train-step` sets the number of refinement iterations used during training, and `--dae-ratio` controls the ratio of denoising auto-encoder training described in the original paper.
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch iterative_nonautoregressive_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --train-step 4 \
    --dae-ratio 0.5 \
    --stochastic-approx \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
### Insertion Transformer (InsT, Stern et al., 2019)
Note that we need to specify the "slot-loss" (uniform or balanced tree) described in the original paper. Here we use `--label-tau` to control the temperature.
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch insertion_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
### Mask Predict (CMLM, Ghazvininejad et al., 2019)
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch cmlm_transformer \
    --noise random_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
### Levenshtein Transformer (LevT, Gu et al., 2019)
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch levenshtein_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
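The iterative models above (iNAT, CMLM, LevT) are decoded with the same `fairseq-generate` setup as the vanilla NAT, but with a refinement budget. The sketch below is illustrative: the checkpoint path is a placeholder, `--iter-decode-max-iter 9` allows up to nine refinement iterations, and `--print-step` reports how many iterations were actually used for each sentence.

```bash
# Illustrative iterative-refinement decoding; adjust --iter-decode-max-iter per model.
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```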