---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---

# 🍊 SapBERT-Ko-EN

## 1. Intro

**SapBERT** (Self-Alignment Pretraining for BERT) built on a Korean base model. Korean and English medical terms are aligned using KOSTOM, a Korean-English medical terminology dictionary.

References: [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)

## 2. SapBERT-KO-EN

**SapBERT** is a pretraining method that teaches a model to treat the many synonyms of a medical concept as having the same meaning. **SapBERT-KO-EN** aligns Korean and English medical terms so that the model can handle **medical records written in mixed Korean and English**.

β€» Detailed description and training code: [Github](https://github.com/snumin44/SapBERT-KO-EN)

## 3. Training

The base model and hyperparameters used for training are listed below. The Threshold, Scale Positive Sample, and Scale Negative Sample values parameterize the contrastive training objective (see the sketch in Appendix A at the end of this card).

- Model: klue/bert-base
- Epochs: 1
- Batch Size: 64
- Max Length: 64
- Dropout: 0.1
- Pooler: 'cls'
- Eval Step: 100
- Threshold: 0.8
- Scale Positive Sample: 1
- Scale Negative Sample: 60

SapBERT-KO-EN can be applied to a specific task by **fine-tuning** it further (a minimal sketch is given in Appendix B at the end of this card).

## 4. Example

β€» English terms are mostly tokenized letter by letter.
β€» Terms that refer to the same disease receive relatively high similarity scores.

```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

query = 'κ°„κ²½ν™”'

targets = [
    'liver cirrhosis',
    'κ°„κ²½λ³€',
    'liver cancer',
    'κ°„μ•”',
    'brain tumor',
    'λ‡Œμ’…μ–‘'
]

# Encode the query; the [CLS] pooler output serves as the term embedding.
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Compare the query against each candidate term.
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
```

## Citing

```
@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month={jun},
    year={2021}
}
```
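
## Appendix A. Training Objective (sketch)

The Threshold, Scale Positive Sample, and Scale Negative Sample hyperparameters in Section 3 match the parameters of the multi-similarity loss used in the original SapBERT recipe. The sketch below shows that loss over a batch of term embeddings, assuming the same formulation; the hard-pair mining step of the full SapBERT objective is omitted, so treat this as illustrative rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels,
                          scale_pos=1.0, scale_neg=60.0, threshold=0.8):
    """Multi-similarity loss over pairwise cosine similarities.
    scale_pos / scale_neg / threshold correspond to the 'Scale Positive
    Sample', 'Scale Negative Sample', and 'Threshold' values in Section 3.
    Hard-pair mining is omitted for brevity."""
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                                  # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-concept mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask, neg_mask = same & ~eye, ~same

    losses = []
    for i in range(len(labels)):
        pos, neg = sim[i][pos_mask[i]], sim[i][neg_mask[i]]
        if pos.numel() == 0 or neg.numel() == 0:
            continue                                     # no valid pairs for this anchor
        pos_term = torch.log1p(torch.exp(-scale_pos * (pos - threshold)).sum()) / scale_pos
        neg_term = torch.log1p(torch.exp(scale_neg * (neg - threshold)).sum()) / scale_neg
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean()

# Toy usage: labels mark which rows embed the same concept.
emb = torch.randn(6, 768)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
print(multi_similarity_loss(emb, labels))
```

Intuitively, positives (synonyms of the same concept) are pushed above the threshold and negatives below it, with the large negative scale (60) penalizing hard negatives much more sharply than easy ones.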
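
## Appendix B. Fine-tuning Sketch

Section 3 notes that SapBERT-KO-EN can be adapted to a specific task with subsequent fine-tuning. Below is a minimal sketch assuming a toy binary classification task; the texts, labels, learning rate, and step count are illustrative placeholders, not part of the released training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = 'snumin44/sap-bert-ko-en'
tokenizer = AutoTokenizer.from_pretrained(model_path)
# A freshly initialized classification head is added on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

# Toy data: placeholder texts and labels for illustration only.
texts = ['κ°„κ²½ν™” liver cirrhosis', 'λ‡Œμ’…μ–‘ brain tumor']
labels = torch.tensor([0, 1])

inputs = tokenizer(texts, padding=True, truncation=True, max_length=64,
                   return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for step in range(3):                       # a few toy optimization steps
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f'step {step}: loss = {outputs.loss.item():.4f}')
```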