# AdaLM
**Domain, language and task adaptation of pre-trained models.**
[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.](https://arxiv.org/abs/2106.13474)
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, [ACL 2021](#)
This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks and the [code](https://github.com/microsoft/unilm/tree/master/adalm/incr_bpe) to generate the incremental vocabulary for a specific domain.
### Pre-trained Model
The adapted domain-specific models can be downloaded here:
- ***AdaLM-bio-base*** 12-layer, 768-hidden, 12-heads, 132M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxOqGWQk1u9G4mXf?e=Pa2RGC)
- ***AdaLM-bio-small*** 6-layer, 384-hidden, 12-heads, 34M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxQPKamwrRUelGUJ?e=qtmFHC)
- ***AdaLM-cs-base*** 12-layer, 768-hidden, 12-heads, 124M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxE_1VEP9gHU7mUe?e=XZemIz)
- ***AdaLM-cs-small*** 6-layer, 384-hidden, 12-heads, 30M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxJrUlHJbE4HY9Ev?e=PBaTNy)
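The checkpoints follow the standard BERT layout, so (assuming the downloaded archive contains the usual `config.json`, `vocab.txt`, and `pytorch_model.bin`) they can be loaded directly with Hugging Face `transformers`. A minimal sketch, where the local directory name is a placeholder:
```python
# Minimal sketch: load an unzipped AdaLM checkpoint with Hugging Face transformers.
# "path/to/adalm-bio-base" is a placeholder for wherever you extracted the download.
from transformers import BertConfig, BertTokenizer, BertModel

ckpt_dir = "path/to/adalm-bio-base"                                      # assumed local directory
config = BertConfig.from_pretrained(ckpt_dir)                            # reads config.json
tokenizer = BertTokenizer.from_pretrained(ckpt_dir, do_lower_case=True)  # reads the domain vocab.txt
model = BertModel.from_pretrained(ckpt_dir, config=config)

inputs = tokenizer("aspirin inhibits cyclooxygenase activity", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)                                   # (1, seq_len, 768) for the base model
```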
### Fine-tuning Examples
#### Requirements
Install the requirements:
```bash
pip install -r requirements.txt
```
Add the project to your `PYTHONPATH`:
```bash
export PYTHONPATH=$PYTHONPATH:`pwd`
```
#### Download Fine-tune Datasets
The biomedical downstream tasks can be downloaded from the [BLURB Leaderboard](https://microsoft.github.io/BLURB/). The computer science tasks can be downloaded from [allenai/dont-stop-pretraining](https://github.com/allenai/dont-stop-pretraining).
#### Finetune Classification Task
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/task/data/ # Example: "/path/to/downloaded-data-dir/chemprot/"
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set paths to cache the train & dev features (tokenized; valid only for this tokenizer)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.train.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache
# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
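After training, `$OUTPUT_PATH` holds a standard `transformers` checkpoint (assuming the script saved the model and tokenizer there, as the standard examples do), so a quick sanity check is to reload the classifier and score a sentence. A hedged sketch; the meaning of the predicted label index depends on how the ChemProt labels were indexed during training:
```python
# Minimal inference sketch for the fine-tuned classifier
# (assumes $OUTPUT_PATH contains the saved config, vocabulary, and weights).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

output_dir = "/path/to/save/result_of_finetuning"    # same as $OUTPUT_PATH above
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(output_dir)
model.eval()

text = "The compound inhibits the kinase in a dose-dependent manner."
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())                  # predicted label index
```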
#### Finetune NER Task
To fine-tune on the PICO task, simply replace `run_ner.py` with `run_pico.py`.
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=jnlpba
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set the label file (e.g., the BIO tag set)
export LABEL_FILE=/path/to/label/file
# Set directory to cache the train & dev features (tokenized; valid only for this tokenizer)
export CACHE_DIR=/path/to/cache
# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
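The NER checkpoint can be inspected the same way. The sketch below assumes `run_ner.py` saved the model, tokenizer, and label mapping to `$OUTPUT_PATH`; predictions are produced for every wordpiece, including special tokens, so a real evaluation should align them back to words as the fine-tuning script does:
```python
# Minimal token-classification sketch for the fine-tuned NER model
# (assumes $OUTPUT_PATH holds the saved weights, config, and vocabulary).
import torch
from transformers import BertTokenizer, BertForTokenClassification

output_dir = "/path/to/save/result_of_finetuning"    # same as $OUTPUT_PATH above
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=True)
model = BertForTokenClassification.from_pretrained(output_dir)
model.eval()

words = "Mutations in the BRCA1 gene increase cancer risk".split()
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
# Map prediction indices back to tag names via the saved config (if it was stored).
print([model.config.id2label.get(i, str(i)) for i in pred_ids])
```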
#### Results
**Biomedical**
| | JNLPBA | PICO | ChemProt | Average |
| --------------- | --------- | --------- | --------- | --------- |
| BERT | 78.63 | 72.34 | 71.86 | 74.28 |
| BioBERT | 79.35 | 73.18 | 76.14 | 76.22 |
| PubmedBERT | **80.06** | 73.38 | 77.24 | 76.89 |
| AdaLM-bio-base | 79.46 | **75.47** | **78.41** | **77.74** |
| AdaLM-bio-small | 79.04 | 74.91 | 72.06 | 75.34 |
**Computer Science**
| | ACL-ARC | SCIERC | Average |
| -------------- | --------- | --------- | --------- |
| BERT | 64.92 | 81.14 | 73.03 |
| AdaLM-cs-base | **73.61** | **81.91** | **77.76** |
| AdaLM-cs-small | 68.74 | 78.88 | 73.81 |
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using AdaLM, please submit a GitHub issue.
For other communications related to AdaLM, please contact Shaohan Huang (`[email protected]`) or Furu Wei (`[email protected]`).