---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---
# 🍊 SapBERT-Ko-EN
## 1. Intro
ν•œκ΅­μ–΄ λͺ¨λΈμ„ μ΄μš©ν•œ **SapBERT**(Self-alignment pretraining for BERT)μž…λ‹ˆλ‹€.
ν•œΒ·μ˜ 의료 μš©μ–΄ 사전인 KOSTOM을 μ‚¬μš©ν•΄ ν•œκ΅­μ–΄ μš©μ–΄μ™€ μ˜μ–΄ μš©μ–΄λ₯Ό μ •λ ¬ν–ˆμŠ΅λ‹ˆλ‹€.
μ°Έκ³ : [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)
## 2. SapBERT-KO-EN
**SapBERT** is a pretraining method that maps the many synonyms of a medical concept to the same meaning.
**SapBERT-KO-EN** aligns Korean and English medical terms so that it can handle **medical records written in mixed Korean and English**.
β€» Detailed description and training code: [Github](https://github.com/snumin44/SapBERT-KO-EN)
## 3. Training
λͺ¨λΈ ν•™μŠ΅μ— ν™œμš©ν•œ 베이슀 λͺ¨λΈ 및 ν•˜μ΄νΌ νŒŒλΌλ―Έν„°λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.
- Model : klue/bert-base
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
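The `Threshold` and `Scale Positive/Negative Sample` values above look like the parameters of the multi-similarity loss used in the original SapBERT code. Below is a minimal sketch under that assumption; the tensor names `sim`, `pos_mask`, and `neg_mask` are illustrative, not taken from the actual training code.

```python
import torch

def multi_similarity_loss(sim, pos_mask, neg_mask,
                          scale_pos=1.0, scale_neg=60.0, thresh=0.8):
    """Multi-similarity loss over one batch of term embeddings.

    sim:      (B, B) cosine-similarity matrix between all terms in the batch
    pos_mask: (B, B) 0/1 mask marking synonym pairs (same concept code)
    neg_mask: (B, B) 0/1 mask marking non-synonym pairs
    """
    # Pull synonym pairs above the threshold ...
    pos_term = torch.log1p(
        (torch.exp(-scale_pos * (sim - thresh)) * pos_mask).sum(dim=1)
    ) / scale_pos
    # ... and push non-synonym pairs below it.
    neg_term = torch.log1p(
        (torch.exp(scale_neg * (sim - thresh)) * neg_mask).sum(dim=1)
    ) / scale_neg
    return (pos_term + neg_term).mean()
```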
SapBERT-KO-EN can be adapted to a specific task by running a subsequent **fine-tuning** step on top of it.
β€» English terms are mostly processed as alphabet-level subwords (see the tokenizer sketch below).
β€» Terms referring to the same disease receive relatively high similarity scores, as the Python example below shows.
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

query = 'κ°„κ²½ν™”'  # Korean for 'liver cirrhosis'

targets = [
    'liver cirrhosis',
    'κ°„κ²½λ³€',    # liver cirrhosis
    'liver cancer',
    'κ°„μ•”',      # liver cancer
    'brain tumor',
    'λ‡Œμ’…μ–‘'     # brain tumor
]

# Embed the query using the [CLS] pooler output.
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Embed each target and score it against the query.
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
```
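For more than a handful of terms, the same comparison can be done in one batched forward pass. A minimal variant, reusing `model` and `tokenizer` from the example above:

```python
import torch
import torch.nn.functional as F

texts = [query] + targets
features = tokenizer(texts, padding=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**features, return_dict=True)

# L2-normalize so the dot product equals cosine similarity.
embeddings = F.normalize(outputs.pooler_output, dim=1)
scores = embeddings[0] @ embeddings[1:].T
print(scores)  # similarities between the query and each target
```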