# AdaLM
**Domain, language and task adaptation of pre-trained models.**

[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains.](https://arxiv.org/abs/2106.13474)
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, [ACL 2021](#)

This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks, as well as the [code](https://github.com/microsoft/unilm/tree/master/adalm/incr_bpe) to generate an incremental vocabulary for a specific domain.
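The linked `incr_bpe` code builds the domain vocabulary incrementally: the original WordPiece vocabulary is kept and newly mined domain-specific subwords are added on top of it, typically appended so that the original token ids stay valid. The sketch below only illustrates that merge step conceptually; it is not the `incr_bpe` implementation, and all file names are hypothetical.

```python
# Conceptual sketch only, NOT the incr_bpe implementation: merge a base
# WordPiece vocabulary with newly mined domain subwords by appending the
# tokens that are not already present. All file names are hypothetical.
def merge_vocabs(base_vocab_path, domain_vocab_path, out_path):
    with open(base_vocab_path, encoding="utf-8") as f:
        base = [line.rstrip("\n") for line in f]
    seen = set(base)
    with open(domain_vocab_path, encoding="utf-8") as f:
        new_tokens = [tok for tok in (line.rstrip("\n") for line in f)
                      if tok and tok not in seen]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(base + new_tokens) + "\n")
    return len(base), len(new_tokens)

# e.g. merge_vocabs("bert-base-uncased-vocab.txt", "bio-subwords.txt", "adalm-bio-vocab.txt")
```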
### Pre-trained Model
The adapted domain-specific models can be downloaded here:
- ***AdaLM-bio-base*** 12-layer, 768-hidden, 12-heads, 132M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxOqGWQk1u9G4mXf?e=Pa2RGC)
- ***AdaLM-bio-small*** 6-layer, 384-hidden, 12-heads, 34M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxQPKamwrRUelGUJ?e=qtmFHC)
- ***AdaLM-cs-base*** 12-layer, 768-hidden, 12-heads, 124M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxE_1VEP9gHU7mUe?e=XZemIz)
- ***AdaLM-cs-small*** 6-layer, 384-hidden, 12-heads, 30M parameters || [One Drive](https://1drv.ms/u/s!AmcFNgkl1JIngxJrUlHJbE4HY9Ev?e=PBaTNy)
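The checkpoints follow the standard BERT architecture, so they can be loaded with Hugging Face `transformers`. The snippet below is a minimal loading sketch, assuming the unpacked archive contains the usual `config.json`, `vocab.txt`, and `pytorch_model.bin`; the directory path is a placeholder.

```python
# Minimal loading sketch (assumption: the unpacked checkpoint directory holds
# a standard config.json, vocab.txt and pytorch_model.bin; path is hypothetical).
from transformers import BertConfig, BertTokenizer, BertModel

model_dir = "/path/to/AdaLM-bio-base"  # unpacked OneDrive archive

config = BertConfig.from_pretrained(model_dir)                             # reads config.json
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)   # reads vocab.txt
model = BertModel.from_pretrained(model_dir, config=config)                # reads pytorch_model.bin

print(config.num_hidden_layers, config.hidden_size)  # 12, 768 for the base models
```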
### Fine-tuning Examples
#### Requirements
Install the requirements:
```bash
pip install -r requirements.txt
```
Add the project to your PYTHONPATH:
```bash
export PYTHONPATH=$PYTHONPATH:`pwd`
```
#### Download Fine-tune Datasets
The biomedical downstream tasks can be downloaded from the [BLURB Leaderboard](https://microsoft.github.io/BLURB/). The computer science tasks can be downloaded from [allenai/dont-stop-pretraining](https://github.com/allenai/dont-stop-pretraining).
#### Finetune Classification Task
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/classification/task/data/ # Example: "/path/to/chemprot/"
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set paths to cache train & dev features (tokenized; only reuse with the same tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.train.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache
# Set the hyperparameters for the run
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
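For reference, `--warmup_ratio` (`$WM`) is a fraction of the total number of optimization steps, which in turn depends on the dataset size, `$BSZ`, and `$EPOCH`. The arithmetic below is a rough sketch under the assumption that the script follows the usual transformers convention (warmup steps = ratio × total steps); the training-set size is a made-up placeholder.

```python
# Back-of-the-envelope warmup arithmetic; assumes the standard transformers
# convention warmup_steps = warmup_ratio * total_steps.
num_train_examples = 10_000               # placeholder, not a real dataset statistic
bsz, epochs, warmup_ratio = 32, 30, 0.1   # $BSZ, $EPOCH, $WM above

steps_per_epoch = -(-num_train_examples // bsz)   # ceil(10000 / 32) = 313
total_steps = steps_per_epoch * epochs            # 313 * 30 = 9390
warmup_steps = int(warmup_ratio * total_steps)    # 939
print(steps_per_epoch, total_steps, warmup_steps)
```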
#### Finetune NER Task
To fine-tune on the PICO task, simply change `run_ner` to `run_pico`.
```bash
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=jnlpba
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set label file containing the tag set (e.g. BIO tags)
export LABEL_FILE=/path/to/label/file
# Set path to cache train & dev features (tokenized; only reuse with the same tokenizer!)
export CACHE_DIR=/path/to/cache
# Set the hyperparameters for the run
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
```
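NER scripts in the transformers lineage usually expect CoNLL-style input: one `token label` pair per line, blank lines between sentences, and a label file listing one tag per line. Whether `finetune/run_ner.py` uses exactly this format is an assumption here; the snippet below only illustrates that layout.

```python
# Hypothetical illustration of CoNLL-style NER input ("token<space>label" per
# line, blank line between sentences). The exact format expected by
# finetune/run_ner.py may differ; treat this as an assumption.
example = """Interleukin B-protein
receptor I-protein
expression O

IL-2 B-protein
gene O
"""

def read_conll(text):
    """Split CoNLL-style text into (tokens, labels) sentence pairs."""
    sentences, tokens, labels = [], [], []
    for line in text.splitlines():
        if not line.strip():                 # blank line ends a sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        tok, lab = line.rsplit(" ", 1)
        tokens.append(tok)
        labels.append(lab)
    if tokens:
        sentences.append((tokens, labels))
    return sentences

print(read_conll(example))
```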
#### Results
**Biomedical**

| Model | JNLPBA | PICO | ChemProt | Average |
| --------------- | --------- | --------- | --------- | --------- |
| BERT | 78.63 | 72.34 | 71.86 | 74.28 |
| BioBERT | 79.35 | 73.18 | 76.14 | 76.22 |
| PubmedBERT | **80.06** | 73.38 | 77.24 | 76.89 |
| AdaLM-bio-base | 79.46 | **75.47** | **78.41** | **77.74** |
| AdaLM-bio-small | 79.04 | 74.91 | 72.06 | 75.34 |

**Computer Science**

| Model | ACL-ARC | SCIERC | Average |
| -------------- | --------- | --------- | --------- |
| BERT | 64.92 | 81.14 | 73.03 |
| AdaLM-cs-base | **73.61** | **81.91** | **77.76** |
| AdaLM-cs-small | 68.74 | 78.88 | 73.81 |
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using AdaLM, please submit a GitHub issue.
For other communications related to AdaLM, please contact Shaohan Huang (`[email protected]`), Furu Wei (`[email protected]`).