---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---

# SapBERT-KO-EN

## 1. Intro

This is a **SapBERT** (Self-Alignment Pretraining for BERT) model built on a Korean base model.

Korean and English medical terms were aligned using KOSTOM, a Korean-English medical terminology dictionary.

Reference: [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)

## 2. SapBERT-KO-EN

**SapBERT** is a pretraining method that trains a model to treat the many synonymous terms for the same medical concept as having the same meaning.

**SapBERT-KO-EN** aligns Korean and English medical terms so that it can handle **medical records written in a mix of Korean and English**.

※ For a detailed description and the training code, see [GitHub](https://github.com/snumin44/SapBERT-KO-EN).

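Concretely, self-alignment pretraining pulls together the embeddings of terms that share a concept ID and pushes apart all other pairs. The sketch below illustrates the multi-similarity loss described in the SapBERT paper; it omits the online hard-pair mining of the original code, and the function name, the `concept_ids` argument, and the `alpha`/`beta`/`lam` defaults are illustrative assumptions, not the exact training configuration.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, concept_ids, alpha=2.0, beta=50.0, lam=0.5):
    """Simplified multi-similarity loss: terms that share a concept ID
    (e.g. a Korean term and its English synonym) are pulled together,
    and all other pairs are pushed apart."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T                                    # pairwise cosine similarity
    same = concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    losses = []
    for i in range(sim.size(0)):
        pos = sim[i][same[i] & ~eye[i]]              # synonyms of term i
        neg = sim[i][~same[i]]                       # all non-synonyms
        if pos.numel() == 0 or neg.numel() == 0:
            continue
        pos_term = torch.log1p(torch.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean()
```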
## 3. Training

The base model and the hyperparameters used for training are as follows (a short sketch illustrating these settings comes after the list):

- Model: klue/bert-base
- Epochs: 1
- Batch Size: 64
- Max Length: 64
- Dropout: 0.1
- Pooler: 'cls'
- Eval Step: 100
- Threshold: 0.8
- Scale Positive Sample: 1
- Scale Negative Sample: 60

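As a rough illustration of how these settings are used at encoding time (reading `Max Length` as the tokenizer truncation length and `Pooler: 'cls'` as taking the `[CLS]` token embedding; both are assumptions about the setup, not confirmed internals):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
model = AutoModel.from_pretrained('klue/bert-base')

# Tokenize a Korean-English synonym pair with the listed max length.
batch = tokenizer(['간경화', 'liver cirrhosis'],
                  padding=True, truncation=True, max_length=64,
                  return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch)

# 'cls' pooling: use the hidden state of the first ([CLS]) token.
embeddings = outputs.last_hidden_state[:, 0]
```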
SapBERT-KO-EN can be applied to a specific task by **fine-tuning** it in a subsequent step.

※ English terms are mostly tokenized letter by letter.

※ Similarity between terms referring to the same disease is scored relatively higher.

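The first point can be verified directly with the tokenizer (a quick sanity check, not part of the original example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('snumin44/sap-bert-ko-en')
# English words largely fall back to per-letter subword pieces.
print(tokenizer.tokenize('liver cirrhosis'))
```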
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Korean query term: 간경화 (liver cirrhosis).
query = '간경화'

targets = [
    'liver cirrhosis',
    '간경변',        # liver cirrhosis
    'liver cancer',
    '간암',          # liver cancer
    'brain tumor',
    '뇌종양'         # brain tumor
]

# Embed the query using the pooler ([CLS]) output.
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Embed each target the same way and compare by cosine similarity.
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```

```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
```

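As the second note above predicts, the terms naming the same disease as the query '간경화' (liver cirrhosis), targets 0 and 1, receive the highest scores, while the unrelated brain-tumor terms, targets 4 and 5, score lowest.
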
## Citing

```
@inproceedings{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={4228--4238},
  month=jun,
  year={2021}
}
```