# AdaLM
**Domain, language and task adaptation of pre-trained models.**
[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.](https://arxiv.org/abs/2106.13474)
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, [ACL 2021](#)
This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks and the [code](https://github.com/microsoft/unilm/tree/master/adalm/incr_bpe) to generate an incremental vocabulary for a specific domain.
### Pre-trained Model
The adapted domain-specific models can be downloaded below:
- ***AdaLM-bio-base*** 12-layer, 768-hidden, 12-heads, 132M parameters || [OneDrive](https://1drv.ms/u/s!AmcFNgkl1JIngxOqGWQk1u9G4mXf?e=Pa2RGC)
- ***AdaLM-bio-small*** 6-layer, 384-hidden, 12-heads, 34M parameters || [OneDrive](https://1drv.ms/u/s!AmcFNgkl1JIngxQPKamwrRUelGUJ?e=qtmFHC)
- ***AdaLM-cs-base*** 12-layer, 768-hidden, 12-heads, 124M parameters || [OneDrive](https://1drv.ms/u/s!AmcFNgkl1JIngxE_1VEP9gHU7mUe?e=XZemIz)
- ***AdaLM-cs-small*** 6-layer, 384-hidden, 12-heads, 30M parameters || [OneDrive](https://1drv.ms/u/s!AmcFNgkl1JIngxJrUlHJbE4HY9Ev?e=PBaTNy)
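Each download should contain the model weights along with a configuration file and the extended domain vocabulary. The exact file names inside the archives are not guaranteed here, so the sketch below is only an assumed layout; it maps the unpacked files onto the environment variables used by the fine-tuning commands later in this README.
```bash
# Assumed layout after unpacking a checkpoint archive -- file names are illustrative, adjust to what the archive actually contains.
export CKPT_PATH=/path/to/adalm-bio-base                 # model weights (directory or pytorch_model.bin)
export CONFIG_FILE=/path/to/adalm-bio-base/config.json   # model configuration
export VOCAB_FILE=/path/to/adalm-bio-base/vocab.txt      # extended domain-specific vocabulary
```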
### Fine-tuning Examples
#### Requirements
Install the requirements:
```bash
pip install -r requirements.txt
```
Add the project root to your `PYTHONPATH`:
```bash
export PYTHONPATH=$PYTHONPATH:`pwd`
```
#### Download Fine-tune Datasets
The biomedical downstream tasks can be downloaded from the [BLURB Leaderboard](https://microsoft.github.io/BLURB/). The computer science tasks can be downloaded from [allenai](https://github.com/allenai/dont-stop-pretraining).
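Point `DATASET_PATH` (used by the commands below) at the directory of a single task. The exact file names vary by dataset release, so the layout below is only an illustrative assumption; check the documentation that ships with the downloaded data.
```bash
# Illustrative example only -- directory and split file names are assumptions, not prescribed by this repository.
export DATASET_PATH=/path/to/downloaded-data/chemprot/
ls $DATASET_PATH    # expect train / dev / test splits in the format the fine-tuning script reads
```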
#### Finetune Classification Task
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/read/task/data/ # Example: "/path/to/downloaded-data/chemprot/"
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set paths to cache the train & dev features (tokenized features are tokenizer-specific; do not reuse them with a different tokenizer)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache
# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
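When training finishes, the fine-tuned model and the evaluation scores are written to `$OUTPUT_PATH`. A quick way to inspect them is shown below; the `eval_results.txt` file name is an assumption based on the transformers example scripts this code builds on, so adjust it if your version writes a different file.
```bash
# List the fine-tuning outputs (checkpoint, config, logs, evaluation scores).
ls -lh $OUTPUT_PATH
# Print the evaluation metrics, assuming the script wrote them to eval_results.txt.
cat $OUTPUT_PATH/eval_results.txt
```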
#### Finetune NER Task
To fine-tune on the PICO task, replace `finetune/run_ner.py` with `finetune/run_pico.py` in the command below.
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=jnlpba # e.g. the JNLPBA NER task
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set the label file (e.g. the BIO tag set)
export LABEL_FILE=/path/to/label/file
# Set a directory to cache the train & dev features (tokenized features are tokenizer-specific; do not reuse them with a different tokenizer)
export CACHE_DIR=/path/to/cache
# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
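The `--labels` file is a plain-text list with one tag per line. Below is a minimal, hypothetical BIO tag set for a single entity type; the real tag set depends on the dataset (JNLPBA, for example, distinguishes several biomedical entity types), so replace it with the labels that actually occur in your data.
```bash
# Hypothetical label file for BIO tagging of one entity type -- replace with the dataset's real tag set.
cat > labels.txt <<'EOF'
O
B-Entity
I-Entity
EOF
export LABEL_FILE=$(pwd)/labels.txt
```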
#### Results
**Biomedical**
| | JNLPBA | PICO | ChemProt | Average |
| --------------- | --------- | --------- | --------- | --------- |
| BERT | 78.63 | 72.34 | 71.86 | 74.28 |
| BioBERT | 79.35 | 73.18 | 76.14 | 76.22 |
| PubmedBERT | **80.06** | 73.38 | 77.24 | 76.89 |
| AdaLM-bio-base | 79.46 | **75.47** | **78.41** | **77.74** |
| AdaLM-bio-small | 79.04 | 74.91 | 72.06 | 75.34 |
**Computer Science**
| | ACL-ARC | SCIERC | Average |
| -------------- | --------- | --------- | --------- |
| BERT | 64.92 | 81.14 | 73.03 |
| AdaLM-cs-base | **73.61** | **81.91** | **77.76** |
| AdaLM-cs-small | 68.74 | 78.88 | 73.81 |
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using AdaLM, please submit a GitHub issue.
For other communications related to AdaLM, please contact Shaohan Huang (`[email protected]`), Furu Wei (`[email protected]`).