---
license: gpl-3.0
language:
- en
metrics:
- accuracy
base_model: facebook/bart-large
---

# Model Card for ANGEL_pretrained
This model card provides detailed information about the ANGEL_pretrained model, designed for biomedical entity linking.

# Model Details

#### Model Description
- **Developed by:** Chanhwi Kim, Hyunjae Kim, Sihyeon Park, Jiwoo Lee, Mujeen Sung, Jaewoo Kang
- **Model type:** Generative Biomedical Entity Linking Model
- **Language(s):** English
- **License:** GPL-3.0
- **Finetuned from model:** BART-large (Base architecture)

#### Model Sources
- **Repository:** https://github.com/dmis-lab/ANGEL
- **Paper:** https://arxiv.org/pdf/2408.16493

# Direct Use
ANGEL_pretrained is pretrained on the UMLS dataset.
We recommend fine-tuning this model on a downstream dataset rather than using it directly.
If you still want to run the model on a single sample, no preprocessing is required.
Simply execute the run_sample.sh script:

```bash
bash script/inference/run_sample.sh pretrained
```

To modify the sample with your own example, refer to the [Direct Use](https://github.com/dmis-lab/ANGEL?tab=readme-ov-file#direct-use) section in our GitHub repository.
If you're interested in training or evaluating the model, check out the [Fine-tuning](https://github.com/dmis-lab/ANGEL?tab=readme-ov-file#fine-tuning) section and the [Evaluation](https://github.com/dmis-lab/ANGEL?tab=readme-ov-file#evaluation) section.

# Training Details

#### Training Data
The model was pretrained on the UMLS-2020-AA dataset.

#### Training Procedure
- **Positive-only Pre-training:** Initial training using only positive examples, following the standard approach.
- **Negative-aware Training:** Subsequent training incorporated negative examples to improve the model's discriminative capabilities.

# Evaluation

#### Testing Data
The model was evaluated on multiple biomedical datasets, including NCBI-disease, BC5CDR, COMETA, AAP, and MedMentions.
The fine-tuned scores have also been included.

#### Metrics
**Accuracy at Top-1 (Acc@1)**: Measures the percentage of times the model's top prediction matches the correct entity.

### Results

| Model | NCBI-disease | BC5CDR | COMETA | AAP | MedMentions ST21pv | Average |
|---|---|---|---|---|---|---|
| GenBioEL_pretrained | 58.2 | 33.1 | 42.4 | 50.6 | 10.6 | 39.0 |
| ANGEL_pretrained (Ours) | 64.6 | 49.7 | 46.8 | 61.5 | 18.2 | 48.2 |
| GenBioEL_pt_ft | 91.0 | 93.1 | 80.9 | 89.3 | 70.7 | 85.0 |
| ANGEL_pt_ft (Ours) | 92.8 | 94.5 | 82.8 | 90.2 | 73.3 | 86.7 |
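
The reported numbers are Acc@1 scores, and the Average column can be recomputed from the per-dataset values. The snippet below is a minimal sketch, not code from the ANGEL repository: `acc_at_1`, `SCORES`, and the other names are hypothetical, and it assumes that Average is the unweighted mean of the five datasets rounded to one decimal, which is consistent with the reported values.

```python
from statistics import mean


def acc_at_1(predicted_cuis, gold_cuis):
    """Percentage of mentions whose top-ranked predicted CUI equals the gold CUI.

    Hypothetical helper for illustration; the repository's evaluation
    script may handle details such as multiple gold CUIs differently.
    """
    if not gold_cuis or len(predicted_cuis) != len(gold_cuis):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == g for p, g in zip(predicted_cuis, gold_cuis))
    return 100.0 * hits / len(gold_cuis)


# Per-dataset Acc@1 copied from the results table, in the order:
# NCBI-disease, BC5CDR, COMETA, AAP, MedMentions ST21pv.
SCORES = {
    "GenBioEL_pretrained": [58.2, 33.1, 42.4, 50.6, 10.6],
    "ANGEL_pretrained": [64.6, 49.7, 46.8, 61.5, 18.2],
    "GenBioEL_pt_ft": [91.0, 93.1, 80.9, 89.3, 70.7],
    "ANGEL_pt_ft": [92.8, 94.5, 82.8, 90.2, 73.3],
}

# Assumption: Average is the unweighted mean, rounded to one decimal.
averages = {name: round(mean(vals), 1) for name, vals in SCORES.items()}

# Gap between ANGEL and GenBioEL before and after fine-tuning.
gap_pretrained = round(
    averages["ANGEL_pretrained"] - averages["GenBioEL_pretrained"], 1
)
gap_finetuned = round(averages["ANGEL_pt_ft"] - averages["GenBioEL_pt_ft"], 1)

if __name__ == "__main__":
    for name, avg in averages.items():
        print(f"{name}: {avg}")
    print(f"pre-trained gap: {gap_pretrained}, fine-tuned gap: {gap_finetuned}")
```

Under that assumption the recomputed averages match the table, and the gaps come out to 9.2 and 1.7 points.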

- In this table, "pt" refers to pre-training, where the model is trained on a large dataset (UMLS in this case), and "ft" refers to fine-tuning, where the model is further trained on a specific downstream dataset.

In the pre-training phase, **ANGEL** was trained with negative examples: UMLS entities that were similar to a given word based on TF-IDF scores but had different CUIs (Concept Unique Identifiers).
This negative-aware pre-training improved performance on every benchmark, giving an average score of 48.2, which is **9.2** points higher than the GenBioEL pre-trained model's average of 39.0.

The improvement carried over to the fine-tuning phase.
After fine-tuning, ANGEL achieved an average score of 86.7 against GenBioEL's 85.0, a further **1.7** point improvement, and it outperformed GenBioEL on all five datasets.

These results indicate that the negative-aware training introduced by ANGEL not only improves performance after pre-training but also carries over into fine-tuning, helping the model generalize better to unseen data.

# Citation
If you use the ANGEL_pretrained model, please cite:

```bibtex
@article{kim2024learning,
  title={Learning from Negative Samples in Generative Biomedical Entity Linking},
  author={Kim, Chanhwi and Kim, Hyunjae and Park, Sihyeon and Lee, Jiwoo and Sung, Mujeen and Kang, Jaewoo},
  journal={arXiv preprint arXiv:2408.16493},
  year={2024}
}
```

# Contact
For questions or issues, please contact chanhwi_kim@korea.ac.kr.