# An example of English to Japanese Simultaneous Translation System
This is an example of training and evaluating a transformer *wait-k* English to Japanese simultaneous text-to-text translation model. A wait-k model first reads k source tokens, then alternates between emitting one target token and reading one more source token, so each target token lags the source by roughly k tokens.
## Data Preparation
This section introduces the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).
For illustration, we only use the following subsets of the available data from the [WMT20 news translation task](http://www.statmt.org/wmt20/translation-task.html), which results in 7,815,391 sentence pairs:
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus
We use the WMT20 development data as the development set. Training a `transformer_vaswani_wmt_en_de_big` model on this amount of data yields 17.3 BLEU with greedy search and 19.7 with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.
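For reference, such an offline model would typically be scored with `fairseq-generate`. Below is a minimal sketch, assuming `${OFFLINE_CKPT}` is a hypothetical path to a trained `transformer_vaswani_wmt_en_de_big` checkpoint and `${WMT20_ENJA_DATA_BIN}` is the binarized data directory prepared below:
```bash
# Hedged sketch: decode the test split with beam search (beam size 10) and
# strip the sentencepiece segmentation; ${OFFLINE_CKPT} is hypothetical.
fairseq-generate ${WMT20_ENJA_DATA_BIN} \
    --path ${OFFLINE_CKPT} \
    --gen-subset test \
    --beam 10 \
    --remove-bpe=sentencepiece
```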
We use the [sentencepiece](https://github.com/google/sentencepiece) toolkit to tokenize the data with a vocabulary size of 32000.
Additionally, we filter out sentences longer than 200 tokens after tokenization.
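For concreteness, the tokenization and filtering might look like the sketch below, assuming the raw parallel text lives at hypothetical paths `${RAW_DIR}/{train,dev,test}.{en,ja}`; corpus concatenation and cleaning are omitted.
```bash
# A minimal sketch, not the exact recipe used for the released model.
# Train a joint English-Japanese SentencePiece model with a 32k vocabulary.
spm_train \
    --input=${RAW_DIR}/train.en,${RAW_DIR}/train.ja \
    --model_prefix=enja_spm \
    --vocab_size=32000

# Tokenize every split on both sides.
for split in train dev test; do
    for lang in en ja; do
        spm_encode --model=enja_spm.model \
            < ${RAW_DIR}/${split}.${lang} \
            > ${DATA_DIR}/${split}.${lang}
    done
done

# Drop training pairs where either side exceeds 200 tokens.
paste ${DATA_DIR}/train.en ${DATA_DIR}/train.ja \
    | awk -F'\t' 'split($1, s, " ") <= 200 && split($2, t, " ") <= 200' \
    > ${DATA_DIR}/train.filtered
cut -f1 ${DATA_DIR}/train.filtered > ${DATA_DIR}/train.en
cut -f2 ${DATA_DIR}/train.filtered > ${DATA_DIR}/train.ja
```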
Assuming the tokenized text data is saved at `${DATA_DIR}`,
we binarize the data with the following command.
```bash
fairseq-preprocess \
    --source-lang en --target-lang ja \
    --trainpref ${DATA_DIR}/train \
    --validpref ${DATA_DIR}/dev \
    --testpref ${DATA_DIR}/test \
    --destdir ${WMT20_ENJA_DATA_BIN} \
    --nwordstgt 32000 --nwordssrc 32000 \
    --workers 20
```
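After preprocessing, `${WMT20_ENJA_DATA_BIN}` should contain the two dictionaries and one `.bin`/`.idx` pair per split and language, roughly:
```bash
ls ${WMT20_ENJA_DATA_BIN}
# dict.en.txt  dict.ja.txt  preprocess.log
# train.en-ja.en.bin  train.en-ja.en.idx  train.en-ja.ja.bin  train.en-ja.ja.idx
# valid.en-ja.*       test.en-ja.*
```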
## Simultaneous Translation Model Training
To train a wait-k (k=10) model, run the following command.
```bash
fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --save-dir ${SAVE_DIR} \
    --simul-type waitk \
    --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --clip-norm 10.0 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584
```
This command is for training on 8 GPUs. Equivalently, the model can be trained on a single GPU with `--update-freq 8`, as sketched below.
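For instance, a single-GPU run that keeps the same effective batch size restricts the visible devices and accumulates gradients over 8 steps; all other flags are unchanged:
```bash
# Single-GPU variant: gradient accumulation over 8 steps matches the
# effective batch size of the 8-GPU command above.
CUDA_VISIBLE_DEVICES=0 fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --update-freq 8 \
    --save-dir ${SAVE_DIR} \
    --simul-type waitk \
    --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --clip-norm 10.0 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584
```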
## Inference & Evaluation
First of all, install [SimulEval](https://github.com/facebookresearch/SimulEval) for evaluation.
```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
```
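A quick, optional sanity check that the install succeeded (assuming the CLI follows the usual argparse conventions, `--help` prints the available options):
```bash
pip show simuleval   # package metadata should be listed
simuleval --help     # usage and option summary should be printed
```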
The following command runs the evaluation.
Assume the source and reference files are `${SRC_FILE}` and `${TGT_FILE}`, the sentencepiece model file for English is saved at `${SRC_SPM_PATH}`, and `${FAIRSEQ}` points to the root of the fairseq repository:
```bash
simuleval \
    --source ${SRC_FILE} \
    --target ${TGT_FILE} \
    --data-bin ${WMT20_ENJA_DATA_BIN} \
    --sacrebleu-tokenizer ja-mecab \
    --eval-latency-unit char \
    --no-space \
    --src-splitter-type sentencepiecemodel \
    --src-splitter-path ${SRC_SPM_PATH} \
    --agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
    --model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores
```
The `--data-bin` should be the same as in the previous sections if you prepared the data from scratch.
If you only want to run the evaluation, a prepared data directory can be found [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz) and a pretrained wait-k (k=10) checkpoint can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt), as sketched below.
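One way to fetch and wire up these artifacts (the extracted directory name depends on the archive layout, so check it after unpacking):
```bash
wget https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz
tar -xzf wmt20_enja_medium_databin.tgz
wget https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt

# Point the variables used by the simuleval command at the downloads.
# The data-bin path assumes the archive extracts to a directory of the same
# name; verify with `tar -tzf wmt20_enja_medium_databin.tgz` and adjust.
export WMT20_ENJA_DATA_BIN=$(pwd)/wmt20_enja_medium_databin
export SAVE_DIR=$(pwd)
export CHECKPOINT_FILENAME=wmt20_enja_medium_wait10_ckpt.pt
```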
The output should look like this:
```json
{
    "Quality": {
        "BLEU": 11.442253287568398
    },
    "Latency": {
        "AL": 8.6587861866951,
        "AP": 0.7863304776251316,
        "DAL": 9.477850951194764
    }
}
```
The latency is measured in characters on the target side (`--eval-latency-unit char`). The translation quality is evaluated with `sacrebleu` using the `MeCab` tokenizer (`--sacrebleu-tokenizer ja-mecab`). `--no-space` indicates that no space is added when merging the predicted words.
If the `--output ${OUTPUT}` option is set, detailed logs and scores will be stored under the `${OUTPUT}` directory.
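The exact file layout under `${OUTPUT}` depends on the SimulEval version; typically a per-sentence log is written alongside the summary scores, e.g.:
```bash
ls ${OUTPUT}
# e.g. instances.log (per-sentence hypotheses and delays) and scores
# (the summary shown above); names may vary across SimulEval versions.
```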