---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---
# SapBERT-Ko-EN
## 1. Intro
This is a **SapBERT** (Self-alignment pretraining for BERT) built on a Korean language model.
Korean and English medical terms were aligned using KOSTOM, a Korean-English medical terminology dictionary.
References: [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)
## 2. SapBERT-KO-EN
**SapBERT** is a pretraining method that teaches the model to treat the many synonymous medical terms as the same concept.
**SapBERT-KO-EN** aligns Korean and English medical terms so that it can handle **medical records written in mixed Korean and English**.
※ Detailed description and training code: [Github](https://github.com/snumin44/SapBERT-KO-EN)
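Intuitively, self-alignment pretraining pulls the embeddings of different surface forms of one concept together. The snippet below is only a toy illustration of such Korean-English synonym groups; the concept IDs and data layout are hypothetical and do not reflect the actual KOSTOM format.
```python
from itertools import combinations

# Illustrative Korean-English synonym groups (IDs and grouping are hypothetical,
# not the actual KOSTOM format): every surface form in a group names one concept.
synonym_groups = {
    'concept_1': ['간경화', '간경변', 'liver cirrhosis'],   # liver cirrhosis
    'concept_2': ['간암', 'liver cancer'],                  # liver cancer
    'concept_3': ['뇌종양', 'brain tumor'],                 # brain tumor
}

# Self-alignment pretraining treats pairs within a group as positives
# (embeddings pulled together) and pairs across groups as negatives.
positive_pairs = [pair for terms in synonym_groups.values()
                  for pair in combinations(terms, 2)]
print(positive_pairs[:3])
```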
## 3. Training
The base model and hyperparameters used for training are as follows.
- Model : klue/bert-base
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
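The Threshold, Scale Positive Sample and Scale Negative Sample values suggest a Multi-Similarity-style metric-learning loss, as in the original SapBERT. The sketch below shows how such a loss could use these values; it is an assumption about the objective, written from scratch, and omits the hard-pair mining used in the original SapBERT code, so check the linked training repository for the exact implementation.
```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, thresh=0.8, scale_pos=1.0, scale_neg=60.0):
    """Multi-Similarity loss (Wang et al., 2019) over a batch of term embeddings.

    `labels` holds the concept ID of each term, so Korean/English synonyms of the
    same concept count as positives. `thresh`, `scale_pos` and `scale_neg` are
    assumed to correspond to the Threshold / Scale Positive / Scale Negative
    hyperparameters listed above.
    """
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                       # pairwise cosine similarities
    n = emb.size(0)
    total = emb.new_zeros(())

    for i in range(n):
        pos_mask = labels == labels[i]
        pos_mask[i] = False                   # exclude the anchor itself
        neg_mask = labels != labels[i]
        pos_sim, neg_sim = sim[i][pos_mask], sim[i][neg_mask]
        if pos_sim.numel() == 0 or neg_sim.numel() == 0:
            continue
        # Positives below the threshold are pulled up; negatives above it are pushed down.
        pos_term = torch.log1p(torch.exp(-scale_pos * (pos_sim - thresh)).sum()) / scale_pos
        neg_term = torch.log1p(torch.exp(scale_neg * (neg_sim - thresh)).sum()) / scale_neg
        total = total + pos_term + neg_term

    return total / n
```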
SapBERT-KO-EN can be applied to a specific task by further **fine-tuning** (a minimal fine-tuning sketch is given at the end of this card).
The example below computes the cosine similarity between a Korean query term and mixed Korean/English target terms.
※ English terms are mostly tokenized letter by letter.
※ Terms referring to the same disease receive relatively high similarity scores.
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Korean query term ('간경화' = liver cirrhosis) and mixed Korean/English targets.
query = '간경화'

targets = [
    'liver cirrhosis',
    '간경변',        # liver cirrhosis
    'liver cancer',
    '간암',          # liver cancer
    'brain tumor',
    '뇌종양'         # brain tumor
]

# Encode the query and take the pooled [CLS] representation.
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Compare the query embedding with each target embedding.
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
``` |
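Because SapBERT-KO-EN is a standard BERT encoder checkpoint, it can also be loaded as the backbone of a downstream model and fine-tuned. The sketch below shows one hypothetical setup with a sequence-classification head; the task, example texts, labels, and training details are illustrative only and not part of this repository.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = 'snumin44/sap-bert-ko-en'

# Load the SapBERT-KO-EN encoder with a freshly initialized classification head.
# The number of labels and the example texts/labels below are illustrative.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

texts = ['간경화 소견 보임', 'no evidence of liver cirrhosis']   # hypothetical examples
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='pt')

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

# One illustrative optimization step; a real setup would iterate over a DataLoader.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```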