# [EdgeFormer](https://arxiv.org/abs/2202.07959)
**EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation**

[EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation](https://arxiv.org/abs/2202.07959). Tao Ge and Furu Wei
- March 2022: released the code and pretrained checkpoints.
---
## Pretrained Models
- [EdgeFormer (Adapter-LA)](https://msranlp.blob.core.windows.net/edgeformer/v1/edgeformer_lora32_pretrain_checkpoint_250k.pt): #enc-dec=12-2; #hidden=512; #head=8; #enc-FFN=2048; #dec-FFN=128; #LoRA-r=32 (#parameters: 11M)
- [Vocabulary](https://msranlp.blob.core.windows.net/edgeformer/v1/dict.src.txt) and [SentencePiece model](https://msranlp.blob.core.windows.net/edgeformer/v1/spm2k-fy22.model)
- Currently, EdgeFormer can be fine-tuned to support seq2seq generation in English only.
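
For convenience, the pretrained checkpoint, vocabulary, and SentencePiece model listed above can be fetched with `wget` (a minimal sketch; any download tool works):

```bash
# Download the pretrained EdgeFormer checkpoint, vocabulary, and SentencePiece model
mkdir -p edgeformer_v1 && cd edgeformer_v1
wget https://msranlp.blob.core.windows.net/edgeformer/v1/edgeformer_lora32_pretrain_checkpoint_250k.pt
wget https://msranlp.blob.core.windows.net/edgeformer/v1/dict.src.txt
wget https://msranlp.blob.core.windows.net/edgeformer/v1/spm2k-fy22.model
```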
## Downstream seq2seq tasks
We evaluate EdgeFormer on benchmarks for three popular seq2seq tasks: CoNLL-14 for grammatical error correction (GEC), XSUM for abstractive summarization, and SQuAD-NQG for question generation.

[**CoNLL-14**](https://aclanthology.org/W14-1701.pdf)

| Model | #Params | #FLOPs | F0.5 |
|-----------|-----------|-----------|-----------|
| [Transformer-base](https://arxiv.org/abs/1706.03762) | 44M | 1.8G | 50.1 |
| Pretrained 12+2 [Universal Transformer](https://arxiv.org/abs/1807.03819) | 7.4M | 1.4G | 51.3 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 51.7 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **52.7** |

[**XSUM**](https://arxiv.org/pdf/1808.08745.pdf)

| Model | #Params | #FLOPs | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-----------|-----------|-----------|-----------|-----------|-----------|
| Transformer-base | 44M | 1.8G | 31.2 | 10.7 | 24.9 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 34.4 | 13.4 | 27.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 35.1 | 14.0 | 28.6 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **36.3** | **14.8** | **29.5** |

[**SQuAD-NQG**](https://arxiv.org/abs/1705.00106)

| Model | #Params | #FLOPs | BLEU-4 | METEOR | ROUGE-L |
|-----------|-----------|-----------|-----------|-----------|-----------|
| Transformer-base | 44M | 1.8G | 2.6 | 9.0 | 26.0 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 18.3 | 21.0 | 45.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 18.7 | 21.3 | 46.1 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **19.0** | **21.7** | **46.3** |

## Setup
```bash
pip install --editable ./
```
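
Before fine-tuning, the raw source/target text needs to be encoded with the released SentencePiece model and binarized with `fairseq-preprocess`. A minimal sketch under assumed placeholder file names and `src`/`tgt` language suffixes, reusing the released dictionary for both sides (since the model shares all embeddings):

```bash
# Encode raw text with the released SentencePiece model (file names are placeholders)
for split in train valid test; do
  for lang in src tgt; do
    spm_encode --model=spm2k-fy22.model --output_format=piece \
      < $split.$lang > $split.spm.$lang
  done
done

# Binarize with fairseq, reusing the released vocabulary for source and target
fairseq-preprocess --source-lang src --target-lang tgt \
  --trainpref train.spm --validpref valid.spm --testpref test.spm \
  --srcdict dict.src.txt --tgtdict dict.src.txt \
  --destdir /path/to/binarized/data --workers 8
```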
## Fine-tuning
```bash
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
fairseq-train /path/to/binarized/data \
--restore-file $PRETRAINED_MODEL --reset-lr-scheduler --reset-optimizer --reset-dataloader \
--task translation \
--criterion label_smoothed_cross_entropy \
--arch transformer_edge --encoder-layers 12 --decoder-ffn-embed-dim 128 --lora-r 32 --lora-r-shape 0 \
--share-all-embeddings \
--required-batch-size-multiple 8 \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 1.0 \
--lr-scheduler polynomial_decay \
--lr 0.00015 \
--warmup-updates 8000 \
--total-num-update 100000 \
--max-update 100000 --max-epoch 1000 \
--max-tokens 20000 \
--update-freq 1 \
--log-format simple \
--log-interval 1000 \
--save-interval-updates 5000 \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir /path/to/save/checkpoints \
--ddp-backend legacy_ddp
```
**Notes:**
- Please adjust hyperparameters such as `lr` and `warmup-updates` based on the dataset and task.
- Please adjust `max-tokens` and `update-freq` to fit your experimental environment; their product (times the number of GPUs) determines the effective batch size, as illustrated below.
- Use `--fp16` for more efficient training on devices with Tensor Cores.
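
For example, on a GPU with less memory one might halve `--max-tokens` and double `--update-freq` so that the effective batch size stays the same as in the reference command above (a hypothetical setting, not one reported in the paper):

```bash
# Effective batch size in tokens per update = MAX_TOKENS * UPDATE_FREQ * NUM_GPUS
MAX_TOKENS=10000   # halved from the reference 20000 to fit a smaller GPU
UPDATE_FREQ=2      # doubled so that the product stays unchanged
NUM_GPUS=1
echo $((MAX_TOKENS * UPDATE_FREQ * NUM_GPUS))   # prints 20000
```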
## Evaluation
```bash
fairseq-generate $data_bin \
--path $save_dir/checkpoint_best.pt \
--batch-size 64 --beam 5 --remove-bpe=sentencepiece
```
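
To score the output with the task-specific metrics above, the hypotheses can be extracted from the `fairseq-generate` log. A minimal sketch, where `gen.out` and `hyp.txt` are placeholder file names:

```bash
# Save the generation log, then pull out the hypotheses in sample order.
# fairseq-generate prints hypotheses on lines prefixed with "H-<sample id>".
fairseq-generate $data_bin \
--path $save_dir/checkpoint_best.pt \
--batch-size 64 --beam 5 --remove-bpe=sentencepiece > gen.out
grep ^H gen.out | sort -V | cut -f3- > hyp.txt
```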
---
## Citation
If you find this repository useful, please consider citing our work:
```
@article{ge2022edgeformer,
  title={EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation},
  author={Ge, Tao and Wei, Furu},
  journal={arXiv preprint arXiv:2202.07959},
  year={2022}
}
```
## Acknowledgement
This repository is built using the [Fairseq](https://github.com/pytorch/fairseq) repository.
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using EdgeFormer models, please submit a GitHub issue.
For other communications related to EdgeFormer, please contact [Tao Ge](https://www.microsoft.com/en-us/research/people/tage/) (`[email protected]`) or [Furu Wei](http://gitnlp.org/) (`[email protected]`).