# AdaLM
**Domain, language and task adaptation of pre-trained models.**
[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.](https://arxiv.org/abs/2106.13474)
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, [ACL 2021](#)
This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks and the [code](https://github.com/microsoft/unilm/tree/master/adalm/incr_bpe) to generate the incremental vocabulary for a specific domain.
### Pre-trained Model
The adapted domain-specific models can be downloaded here:
- ***AdaLM-bio-base*** 12-layer, 768-hidden, 12-heads, 132M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxOqGWQk1u9G4mXf?e=Pa2RGC)
- ***AdaLM-bio-small*** 6-layer, 384-hidden, 12-heads, 34M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxQPKamwrRUelGUJ?e=qtmFHC)
- ***AdaLM-cs-base*** 12-layer, 768-hidden, 12-heads, 124M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxE_1VEP9gHU7mUe?e=XZemIz)
- ***AdaLM-cs-small*** 6-layer, 384-hidden, 12-heads, 30M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxJrUlHJbE4HY9Ev?e=PBaTNy)
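The checkpoints follow the standard BERT layout, so (assuming the downloaded archive contains the usual `config.json`, `vocab.txt`, and `pytorch_model.bin`) they can be loaded directly with Hugging Face `transformers`. A minimal sketch, where the local directory name is a placeholder:
```python
# Minimal sketch: load an unzipped AdaLM checkpoint with Hugging Face transformers.
# "path/to/adalm-bio-base" is a placeholder for wherever you extracted the download.
from transformers import BertConfig, BertTokenizer, BertModel

ckpt_dir = "path/to/adalm-bio-base"                                      # assumed local directory
config = BertConfig.from_pretrained(ckpt_dir)                            # reads config.json
tokenizer = BertTokenizer.from_pretrained(ckpt_dir, do_lower_case=True)  # reads the domain vocab.txt
model = BertModel.from_pretrained(ckpt_dir, config=config)

inputs = tokenizer("aspirin inhibits cyclooxygenase activity", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)                                   # (1, seq_len, 768) for the base model
```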
### Fine-tuning Examples
#### Requirements
Install the requirements:
```bash
pip install -r requirements.txt
```
Add the project to your `PYTHONPATH`:
```bash
export PYTHONPATH=$PYTHONPATH:`pwd`
```
#### Download Fine-tune Datasets
The biomedical downstream tasks can be downloaded from the [BLURB Leaderboard](https://microsoft.github.io/BLURB/). The computer science tasks can be downloaded from [allenai/dont-stop-pretraining](https://github.com/allenai/dont-stop-pretraining).
#### Finetune Classification Task
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/task/data/ # Example: "/path/to/downloaded-data-dir/chemprot/"
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set paths to cache the train & dev features (tokenized; valid only for this tokenizer)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.train.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache
# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
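After training, `$OUTPUT_PATH` holds a standard `transformers` checkpoint (assuming the script saved the model and tokenizer there, as the standard examples do), so a quick sanity check is to reload the classifier and score a sentence. A hedged sketch; the meaning of the predicted label index depends on how the ChemProt labels were indexed during training:
```python
# Minimal inference sketch for the fine-tuned classifier
# (assumes $OUTPUT_PATH contains the saved config, vocabulary, and weights).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

output_dir = "/path/to/save/result_of_finetuning"    # same as $OUTPUT_PATH above
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(output_dir)
model.eval()

text = "The compound inhibits the kinase in a dose-dependent manner."
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())                  # predicted label index
```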
#### Finetune NER Task
To fine-tune on the PICO task, simply replace `run_ner.py` with `run_pico.py`.
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=jnlpba
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set the label file (e.g., the BIO tag set)
export LABEL_FILE=/path/to/label/file
# Set directory to cache the train & dev features (tokenized; valid only for this tokenizer)
export CACHE_DIR=/path/to/cache
# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
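The NER checkpoint can be inspected the same way. The sketch below assumes `run_ner.py` saved the model, tokenizer, and label mapping to `$OUTPUT_PATH`; predictions are produced for every wordpiece, including special tokens, so a real evaluation should align them back to words as the fine-tuning script does:
```python
# Minimal token-classification sketch for the fine-tuned NER model
# (assumes $OUTPUT_PATH holds the saved weights, config, and vocabulary).
import torch
from transformers import BertTokenizer, BertForTokenClassification

output_dir = "/path/to/save/result_of_finetuning"    # same as $OUTPUT_PATH above
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=True)
model = BertForTokenClassification.from_pretrained(output_dir)
model.eval()

words = "Mutations in the BRCA1 gene increase cancer risk".split()
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
# Map prediction indices back to tag names via the saved config (if it was stored).
print([model.config.id2label.get(i, str(i)) for i in pred_ids])
```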
#### Results
**Biomedical**
| | JNLPBA | PICO | ChemProt | Average |
| --------------- | --------- | --------- | --------- | --------- |
| BERT | 78.63 | 72.34 | 71.86 | 74.28 |
| BioBERT | 79.35 | 73.18 | 76.14 | 76.22 |
| PubmedBERT | **80.06** | 73.38 | 77.24 | 76.89 |
| AdaLM-bio-base | 79.46 | **75.47** | **78.41** | **77.74** |
| AdaLM-bio-small | 79.04 | 74.91 | 72.06 | 75.34 |
**Computer Science**
| | ACL-ARC | SCIERC | Average |
| -------------- | --------- | --------- | --------- |
| BERT | 64.92 | 81.14 | 73.03 |
| AdaLM-cs-base | **73.61** | **81.91** | **77.76** |
| AdaLM-cs-small | 68.74 | 78.88 | 73.81 |
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using AdaLM, please submit a GitHub issue.
For other communications related to AdaLM, please contact Shaohan Huang (`[email protected]`) or Furu Wei (`[email protected]`).