# [DeltaLM](https://arxiv.org/abs/2106.13736)
**Encoder-Decoder Pre-training for Language Generation and Translation**

[DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders.](https://arxiv.org/abs/2106.13736) Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei. CoRR abs/2106.13736.

[mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs.](https://arxiv.org/abs/2104.08692) Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, and Furu Wei. In EMNLP 2021.

- September 2021: DeltaLM ranks first on the [WMT21 multilingual translation task](http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html).
- August 2021: released code and pretrained checkpoints.
---
## Pretrained Models
- [DeltaLM-base](https://deltalm.blob.core.windows.net/deltalm/deltalm-base.pt): #enc-dec=12-6; #hidden=768; #head=12; #FFN=3072 (#parameters: 360M)
- [DeltaLM-large](https://deltalm.blob.core.windows.net/deltalm/deltalm-large.pt): #enc-dec=24-12; #hidden=1024; #head=16; #FFN=4096 (#parameters: 830M)
- [Vocabulary](https://deltalm.blob.core.windows.net/deltalm/dict.txt) and [Sentencepiece model](https://deltalm.blob.core.windows.net/deltalm/spm.model)
- DeltaLM can be fine-tuned to support language generation and translation tasks for **100+ languages**.
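The fine-tuning steps below refer to these files as `/path/to/checkpoint/model.pt`, `dict.txt`, and `spm.model`; a minimal download sketch using the URLs above (the local directory name is only illustrative):
```bash
# Fetch the base checkpoint, vocabulary, and sentencepiece model
# into a local checkpoint directory (path is illustrative).
mkdir -p /path/to/checkpoint && cd /path/to/checkpoint
wget https://deltalm.blob.core.windows.net/deltalm/deltalm-base.pt -O model.pt
wget https://deltalm.blob.core.windows.net/deltalm/dict.txt
wget https://deltalm.blob.core.windows.net/deltalm/spm.model
```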
## Cross-lingual Abstractive Summarization - [Wikilingua](https://arxiv.org/abs/2010.03093)
We evaluate DeltaLM on the Wikilingua cross-lingual abstractive summarization benchmark and report ROUGE scores averaged over the evaluated languages.
| Model | #Params | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-----------|-------------|-----------|-----------|-----------|
| [mBART](https://arxiv.org/abs/2001.08210) | 610M | 34.5 | 12.9 | **28.7** |
| [mT5](https://arxiv.org/abs/2010.11934) | 300M | 27.5 | 8.8 | 22.8 |
| [mT5](https://arxiv.org/abs/2010.11934) | 580M | 31.8 | 11.5 | 26.0 |
| DeltaLM | 360M | **35.3** | **13.4** | **28.7** |
## Setup
```bash
git submodule update --init deltalm/fairseq
cd deltalm/
pip install --editable fairseq/
```
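To confirm the editable install is picked up by Python, a quick check (assuming the `pip install` above completed without errors):
```bash
# Should print the installed fairseq version without raising ImportError.
python -c "import fairseq; print(fairseq.__version__)"
```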
## Fine-tuning
1. Organize the raw data in the following structure:
```
.
+-- /path/to/data/
|   +-- train.src
|   +-- train.tgt
|   +-- valid.src
|   +-- valid.tgt
|   +-- test.src
|   +-- test.tgt
```
*Examples (IWSLT14 German to English)*:
```bash
bash examples/prepare_iwslt14.sh /tmp/iwslt14
```
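Before tokenizing, it is worth checking that each source file and its target are parallel, i.e. have the same number of lines; a small sanity-check sketch over the layout above:
```bash
# Source and target line counts must match for every split.
for split in train valid test; do
  echo "$split: src=$(wc -l < /path/to/data/$split.src) tgt=$(wc -l < /path/to/data/$split.tgt)"
done
```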
2. Tokenize the data using [Sentencepiece](https://github.com/google/sentencepiece):
```bash
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.src > train.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.tgt > train.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.src > valid.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.tgt > valid.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.src > test.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.tgt > test.spm.tgt
```
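Equivalently, the six `spm_encode` calls above can be written as a single loop over splits and sides:
```bash
# Same commands as above, just iterated over the splits and both sides.
for split in train valid test; do
  for side in src tgt; do
    spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece \
      < $split.$side > $split.spm.$side
  done
done
```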
*Examples (IWSLT14 German to English)*:
```bash
bash examples/binary_iwslt14.sh \
    /tmp/iwslt14/iwslt14.tokenized.de-en \
    /tmp/iwslt14/iwslt14.spm \
    /path/to/checkpoint/spm.model
```
3. Binarize the data:
```bash
data_bin=/path/to/data-bin/
python preprocess.py \
    --trainpref train.spm \
    --validpref valid.spm \
    --testpref test.spm \
    --source-lang src --target-lang tgt \
    --destdir $data_bin \
    --srcdict /path/to/checkpoint/dict.txt \
    --tgtdict /path/to/checkpoint/dict.txt \
    --workers 40
```
*Examples (IWSLT14 German to English)*:
```bash
bash examples/binary_iwslt14.sh \
    /tmp/iwslt14/iwslt14.spm \
    /tmp/iwslt14/iwslt14.bin \
    /path/to/checkpoint/dict.txt
```
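After binarization, `$data_bin` should contain the indexed datasets and the copied dictionaries; a quick inspection (the exact file names follow fairseq's usual naming convention, which is an assumption here rather than something stated above):
```bash
ls $data_bin
# Expect roughly: preprocess.log, dict.src.txt, dict.tgt.txt, and
# {train,valid,test}.src-tgt.{src,tgt}.{bin,idx} files.
```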
4. Fine-tuning:
```bash
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
python train.py $data_bin \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 512 --max-target-positions 512 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr $lr \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 400000 \
    --max-epoch 100 \
    --max-tokens $batch_size \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test
```
**Notes:**
- For the large checkpoint, set `--arch deltalm_large`.
- Adjust `--max-tokens` and `--update-freq` to fit your hardware; the recommended total batch size is `4096 * 128` tokens per step (see the worked example after these notes).
- Use `--fp16` for more efficient training on devices with Tensor Cores.
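As a worked example of the batch-size recommendation (the GPU count and flag values below are assumptions for illustration, not the authors' exact setup), the effective batch size is `max-tokens * update-freq * num_gpus`:
```bash
# Assumed setup: 8 GPUs (illustrative).
num_gpus=8
max_tokens=4096
update_freq=16
# Effective tokens per step = max-tokens * update-freq * num_gpus
echo $(( max_tokens * update_freq * num_gpus ))   # 524288 = 4096 * 128
# i.e. pass --max-tokens 4096 --update-freq 16 when training on 8 GPUs.
```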
*Examples (IWSLT14 German to English)*:
```bash
bash examples/train_iwslt14.sh \
    /tmp/iwslt14/iwslt14.bin \
    /tmp/iwslt14/checkpoints \
    /path/to/checkpoint/model.pt
```
5. Evaluation:
```bash
python generate.py $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe=sentencepiece
```
*Examples (IWSLT14 German to English)*:
```bash
bash examples/evaluate_iwslt14.sh \
    /tmp/iwslt14/iwslt14.bin \
    /tmp/iwslt14/checkpoints
```
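To turn the generation output into a corpus-level BLEU score, one common recipe is to redirect the output of `generate.py` to a file, pull the hypotheses and references off its `H-`/`T-` lines, and score them with [sacreBLEU](https://github.com/mjpost/sacrebleu); a sketch under the assumption that sacreBLEU is installed and the default fairseq output format is used:
```bash
python generate.py $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe=sentencepiece > gen.out

# H-* lines are "H-<id> <tab> score <tab> hypothesis"; T-* lines are
# "T-<id> <tab> reference". Each sentence's lines are printed together,
# so the two greps stay in the same order.
grep ^H gen.out | cut -f3- > gen.out.hyp
grep ^T gen.out | cut -f2- > gen.out.ref
sacrebleu gen.out.ref < gen.out.hyp
```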
---
## Citation
If you find this repository useful, please consider citing our work:
```
@article{deltalm,
  title={{DeltaLM}: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders},
  author={Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Alexandre Muzio and Saksham Singhal and Hany Hassan Awadalla and Xia Song and Furu Wei},
  year={2021},
  eprint={2106.13736},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## Acknowledgement
This repository is built using the [Fairseq](https://github.com/pytorch/fairseq) repository.
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using DeltaLM models, please submit a GitHub issue.
For other communications related to DeltaLM, please contact Shuming Ma (`[email protected]`) or [Furu Wei](http://gitnlp.org/) (`[email protected]`).