# [EdgeFormer](https://arxiv.org/abs/2202.07959)
**EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation**
[EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation](https://arxiv.org/abs/2202.07959). Tao Ge and Furu Wei
- March 2022: released code and pretrained checkpoints.
---
## Pretrained Models
- [EdgeFormer (Adapter-LA)](https://msranlp.blob.core.windows.net/edgeformer/v1/edgeformer_lora32_pretrain_checkpoint_250k.pt): #enc-dec=12-2; #hidden=512; #head=8; #enc-FFN=2048, #dec-FFN=128, #LoRA-r=32 (#parameters: 11M)
- [Vocabulary](https://msranlp.blob.core.windows.net/edgeformer/v1/dict.src.txt) and [SentencePiece model](https://msranlp.blob.core.windows.net/edgeformer/v1/spm2k-fy22.model) (see the preprocessing sketch below)
- EdgeFormer can currently be fine-tuned for seq2seq generation in English only.
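The fine-tuning recipe below expects fairseq-binarized data. As a minimal sketch (not part of the official instructions), raw parallel text could be encoded with the released SentencePiece model and binarized against the released vocabulary roughly as follows; the file names, language codes (`src`/`tgt`), and paths are placeholders.

```bash
# Sketch only: encode raw text with the released SentencePiece model,
# then binarize with fairseq using the released (shared) vocabulary.
spm_encode --model=spm2k-fy22.model --output_format=piece < train.src > train.spm.src
spm_encode --model=spm2k-fy22.model --output_format=piece < train.tgt > train.spm.tgt
spm_encode --model=spm2k-fy22.model --output_format=piece < valid.src > valid.spm.src
spm_encode --model=spm2k-fy22.model --output_format=piece < valid.tgt > valid.spm.tgt

fairseq-preprocess \
  --source-lang src --target-lang tgt \
  --trainpref train.spm --validpref valid.spm \
  --srcdict dict.src.txt --joined-dictionary \
  --destdir /path/to/binarized/data \
  --workers 8
```

A shared (joined) dictionary is assumed here because the fine-tuning command uses `--share-all-embeddings`.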
## Downstream seq2seq tasks
We evaluate EdgeFormer on the benchmarks of three popular seq2seq tasks: CoNLL-14 for GEC, XSUM for Abstractive Summarization, and SQuAD-NQG for Question Generation.
[**CoNLL-14**](https://aclanthology.org/W14-1701.pdf)
| Model | #Params | FLOPs | F0.5 |
|-----------|-------------|-----------|-----------|
| [Transformer-base](https://arxiv.org/abs/1706.03762) | 44M | 1.8G | 50.1 |
| Pretrained 12+2 [Universal Transformer](https://arxiv.org/abs/1807.03819) | 7.4M | 1.4G | 51.3 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 51.7 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **52.7** |
[**XSUM**](https://arxiv.org/pdf/1808.08745.pdf)
| Model | #Params | FLOPs | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-----------|-------------|-----------|-----------|-----------|-----------|
| Transformer-base | 44M | 1.8G | 31.2 | 10.7 | 24.9 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 34.4 | 13.4 | 27.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 35.1 | 14.0 | 28.6 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **36.3** | **14.8** | **29.5** |
[**SQuAD-NQG**](https://arxiv.org/abs/1705.00106)
| Model | #Params | FLOPs | BLEU-4 | METEOR | ROUGE-L |
|-----------|-------------|-----------|-----------|-----------|-----------|
| Transformer-base | 44M | 1.8G | 2.6 | 9.0 | 26.0|
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 18.3 | 21.0 | 45.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 18.7 | 21.3 | 46.1 |
| Pretrained EdgeFormer | 9.4M | 1.3G | **19.0** | **21.7** | **46.3** |
## Setup
```bash
pip install --editable ./
```
## Fine-tuning
```bash
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
fairseq-train /path/to/binarized/data \
--restore-file $PRETRAINED_MODEL --reset-lr-scheduler --reset-optimizer --reset-dataloader \
--task translation \
--criterion label_smoothed_cross_entropy \
--arch transformer_edge --encoder-layers 12 --decoder-ffn-embed-dim 128 --lora-r 32 --lora-r-shape 0 \
--share-all-embeddings \
--required-batch-size-multiple 8 \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 1.0 \
--lr-scheduler polynomial_decay \
--lr 0.00015 \
--warmup-updates 8000 \
--total-num-update 100000 \
--max-update 100000 --max-epoch 1000 \
--max-tokens 20000 \
--update-freq 1 \
--log-format simple \
--log-interval 1000 \
--save-interval-updates 5000 \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir /path/to/save/checkpoints \
--ddp-backend legacy_ddp
```
**Note:**
- Please adjust hyperparameters such as `lr` and `warmup-updates` based on the dataset and task.
- Please adjust `max-tokens` and `update-freq` to suit your hardware and experimental environment (see the sketch after this list).
- Use `--fp16` for more efficient training on devices with Tensor Cores.
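As a rough guide (a general property of fairseq batching, not specific to EdgeFormer), the effective batch size in tokens is `max-tokens × update-freq × number of GPUs`. On a single smaller GPU, one could keep it near the default of `20000 × 1` by trading tokens per step for gradient-accumulation steps; the values below are illustrative only.

```bash
# Illustrative sanity check of the effective batch size in tokens.
MAX_TOKENS=4000   # per-GPU tokens per step (placeholder for --max-tokens)
UPDATE_FREQ=5     # gradient-accumulation steps (placeholder for --update-freq)
NUM_GPUS=1
echo $(( MAX_TOKENS * UPDATE_FREQ * NUM_GPUS ))   # 20000, matching --max-tokens 20000 --update-freq 1 on one GPU
```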
## Evaluation
```bash
fairseq-generate $data_bin \
--path $save_dir/checkpoint_best.pt \
--batch-size 64 --beam 5 --remove-bpe=sentencepiece
```
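`fairseq-generate` writes its results to stdout. Assuming the output has been redirected to a file (e.g., `> generate.out`, a placeholder name), one common way to recover the hypotheses in sentence order for task-specific scoring (the M2 scorer for CoNLL-14, ROUGE for XSUM, etc.) is the sketch below.

```bash
# Hypothesis lines look like "H-<index>\t<score>\t<text>": strip the "H-" prefix,
# sort numerically by sentence index, and keep only the text column.
grep ^H generate.out | sed 's/^H-//' | sort -n | cut -f3 > hypotheses.txt
```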
---
## Citation
If you find this repository useful, please consider citing our work:
```
@article{ge2022edgeformer,
title={EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation},
author={Ge, Tao and Wei, Furu},
journal={arXiv preprint arXiv:2202.07959},
year={2022}
}
```
## Acknowledgement
This repository is built using the [Fairseq](https://github.com/pytorch/fairseq) repository.
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using EdgeFormer models, please submit a GitHub issue.
For other communications related to EdgeFormer, please contact [Tao Ge](https://www.microsoft.com/en-us/research/people/tage/) (`[email protected]`), [Furu Wei](http://gitnlp.org/) (`[email protected]`).