# AdaLM

**Domain, language and task adaptation of pre-trained models.**

[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.](https://arxiv.org/abs/2106.13474)
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, [ACL 2021](#)

This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks, together with the [code](https://github.com/microsoft/unilm/tree/master/adalm/incr_bpe) to generate an incremental vocabulary for a specific domain.
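
The core of the vocabulary adaptation is to train a WordPiece vocabulary on in-domain text and append the subwords that are missing from the original BERT vocabulary. The snippet below is a minimal sketch of that idea using the HuggingFace `tokenizers` package; it is not the `incr_bpe` tool itself, and `domain_corpus.txt` / `bert_vocab.txt` are placeholder file names.

```python
# Minimal sketch of incremental vocabulary extension (illustrative only,
# not the incr_bpe tool). File names are placeholders.
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary on the in-domain corpus.
domain_tokenizer = BertWordPieceTokenizer(lowercase=True)
domain_tokenizer.train(files=["domain_corpus.txt"], vocab_size=30000)

# Load the original (general-domain) BERT vocabulary.
with open("bert_vocab.txt", encoding="utf-8") as f:
    base_vocab = [line.rstrip("\n") for line in f]
base_set = set(base_vocab)

# Append the domain subwords that the base vocabulary does not contain.
domain_vocab = domain_tokenizer.get_vocab()  # token -> id
new_tokens = [tok for tok, _ in sorted(domain_vocab.items(), key=lambda kv: kv[1])
              if tok not in base_set]

with open("extended_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(base_vocab + new_tokens) + "\n")

print(f"base vocab: {len(base_vocab)}, added domain subwords: {len(new_tokens)}")
```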

### Pre-trained Model
The adapted domain-specific models can be downloaded here:
- ***AdaLM-bio-base*** 12-layer, 768-hidden, 12-heads, 132M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxOqGWQk1u9G4mXf?e=Pa2RGC)
- ***AdaLM-bio-small*** 6-layer, 384-hidden, 12-heads, 34M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxQPKamwrRUelGUJ?e=qtmFHC)
- ***AdaLM-cs-base*** 12-layer, 768-hidden, 12-heads, 124M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxE_1VEP9gHU7mUe?e=XZemIz)
- ***AdaLM-cs-small*** 6-layer, 384-hidden, 12-heads, 30M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxJrUlHJbE4HY9Ev?e=PBaTNy)
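
The checkpoints follow the standard BERT layout, so after unpacking they should load directly with a recent version of HuggingFace `transformers`; the directory name below is a placeholder for wherever you extract the archive, which is assumed to contain `config.json`, `vocab.txt` and the model weights.

```python
# Hedged example: load an unpacked AdaLM checkpoint as a plain BERT encoder.
import torch
from transformers import BertModel, BertTokenizer

model_dir = "./adalm-bio-base"  # placeholder: path to the extracted archive
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
model = BertModel.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("aspirin inhibits platelet aggregation", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for the base model
```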

### Fine-tuning Examples
#### Requirements

Install the requirements:

```bash
pip install -r requirements.txt
```

Add the project to your `PYTHONPATH`:

```bash
export PYTHONPATH=$PYTHONPATH:`pwd`
```

#### Download Fine-tune Datasets

The biomedical downstream tasks can be downloaded from the [BLURB leaderboard](https://microsoft.github.io/BLURB/). The computer science tasks can be downloaded from [allenai/dont-stop-pretraining](https://github.com/allenai/dont-stop-pretraining).

#### Finetune Classification Task

```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/read/glue/task/data/            # Example: "/path/to/downloaded-glue-data-dir/mnli/"

# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning

export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test 
export CKPT_PATH=/path/to/your/model/checkpoint

# Set config file
export CONFIG_FILE=/path/to/config/file

# Set vocab file
export VOCAB_FILE=/path/to/vocab/file

# Set paths to cache the tokenized train & dev features (only valid for this tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.train.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache

# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
   --model_type bert --model_name_or_path $CKPT_PATH \
   --config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
   --data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
   --do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
   --max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
   --num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
   --fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir

```
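
After training, `$OUTPUT_PATH` should contain the fine-tuned weights in the usual `transformers` format, so a quick sanity check might look like the sketch below (the path and example sentence are placeholders, and the exact files written depend on `run_classifier.py`).

```python
# Hedged sketch: run the fine-tuned classifier on a single sentence.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

output_path = "/path/to/save/result_of_finetuning"  # $OUTPUT_PATH from the script above
tokenizer = BertTokenizer.from_pretrained(output_path, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(output_path)
model.eval()

inputs = tokenizer("The compound inhibits the kinase.", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted label id:", logits.argmax(dim=-1).item())
```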



#### Finetune NER Task

To fine-tune on the PICO task, simply replace `run_ner.py` with `run_pico.py` in the command below.

```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/           

# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning

export TASK_NAME=jnlpba
# Set path to the model checkpoint you need to test 
export CKPT_PATH=/path/to/your/model/checkpoint

# Set config file
export CONFIG_FILE=/path/to/config/file

# Set vocab file
export VOCAB_FILE=/path/to/vocab/file

# Set label file (e.g. the BIO tag set)
export LABEL_FILE=/path/to/label/file

# Set directory to cache the tokenized train & dev features (only valid for this tokenizer!)
export CACHE_DIR=/path/to/cache

# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
   --model_type bert --model_name_or_path $CKPT_PATH \
   --config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
   --data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
   --do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
   --max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
   --num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
   --fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```

#### Results

**Biomedical**

|                 | JNLPBA    | PICO      | ChemProt  | Average   |
| --------------- | --------- | --------- | --------- | --------- |
| BERT            | 78.63     | 72.34     | 71.86     | 74.28     |
| BioBERT         | 79.35     | 73.18     | 76.14     | 76.22     |
| PubmedBERT      | **80.06** | 73.38     | 77.24     | 76.89     |
| AdaLM-bio-base  | 79.46     | **75.47** | **78.41** | **77.74** |
| AdaLM-bio-small | 79.04     | 74.91     | 72.06     | 75.34     |

**Computer Science**

|                | ACL-ARC   | SCIERC    | Average   |
| -------------- | --------- | --------- | --------- |
| BERT           | 64.92     | 81.14     | 73.03     |
| AdaLM-cs-base  | **73.61** | **81.91** | **77.76** |
| AdaLM-cs-small | 68.74     | 78.88     | 73.81     |


## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

### Contact Information

For help or issues using AdaLM, please submit a GitHub issue.

For other communications related to AdaLM, please contact Shaohan Huang (`[email protected]`), Furu Wei (`[email protected]`).