---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---
# 🍊 SapBERT-Ko-EN

## 1. Intro

ν•œκ΅­μ–΄ λͺ¨λΈμ„ μ΄μš©ν•œ **SapBERT**(Self-alignment pretraining for BERT)μž…λ‹ˆλ‹€.    
ν•œΒ·μ˜ 의료 μš©μ–΄ 사전인 KOSTOM을 μ‚¬μš©ν•΄ ν•œκ΅­μ–΄ μš©μ–΄μ™€ μ˜μ–΄ μš©μ–΄λ₯Ό μ •λ ¬ν–ˆμŠ΅λ‹ˆλ‹€.     
μ°Έκ³ : [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)   

## 2. SapBERT-KO-EN
**SapBERT** is a pretraining method that teaches the encoder to treat the many synonyms for a medical concept as the same entity.  
**SapBERT-KO-EN** aligns Korean and English medical terms so that the model can handle **medical records written in mixed Korean and English**.

β€» μžμ„Έν•œ μ„€λͺ… 및 ν•™μŠ΅ μ½”λ“œ: [Github](https://github.com/snumin44/SapBERT-KO-EN)

## 3. Training


λͺ¨λΈ ν•™μŠ΅μ— ν™œμš©ν•œ 베이슀 λͺ¨λΈ 및 ν•˜μ΄νΌ νŒŒλΌλ―Έν„°λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

- Model : klue/bert-base
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
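
The `Threshold` and the two `Scale` values have the form of multi-similarity loss hyperparameters, the objective used in the original SapBERT work. The sketch below shows how such values could enter that loss; it is illustrative only, omits the online hard-pair mining of the actual training code, and the tensor shapes and variable names are assumptions.

```python
import torch

def multi_similarity_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                          scale_pos: float = 1.0, scale_neg: float = 60.0,
                          threshold: float = 0.8) -> torch.Tensor:
    """embeddings: (batch, dim) [CLS] vectors; labels: (batch,) concept IDs."""
    emb = torch.nn.functional.normalize(embeddings, dim=-1)
    sim = emb @ emb.T                                  # cosine-similarity matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos_mask = same & ~self_mask                       # same concept, different term
    neg_mask = ~same                                   # different concepts

    # Positives are pulled above the threshold, negatives pushed below it.
    pos_term = torch.log1p(
        (torch.exp(-scale_pos * (sim - threshold)) * pos_mask).sum(dim=1)) / scale_pos
    neg_term = torch.log1p(
        (torch.exp(scale_neg * (sim - threshold)) * neg_mask).sum(dim=1)) / scale_neg
    return (pos_term + neg_term).mean()
```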

SapBERT-KO-EN can be adapted to a specific downstream task by running a subsequent **fine-tuning** step on top of it, as in the sketch below.
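
A minimal fine-tuning setup might attach a classification head to the checkpoint; the task and label count below are illustrative assumptions, not part of this model card.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Hypothetical downstream task: binary classification of term pairs.
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

# From here, train with the usual Hugging Face Trainer or a plain PyTorch loop.
```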

β€» μ˜μ–΄ μš©μ–΄μ˜ 경우 λŒ€λΆ€λΆ„ μ•ŒνŒŒλ²³ λ‹¨μœ„λ‘œ μ²˜λ¦¬ν•©λ‹ˆλ‹€.    
β€» λ™μΌν•œ μ§ˆλ³‘μ„ κ°€λ¦¬ν‚€λŠ” μš©μ–΄ κ°„μ˜ μœ μ‚¬λ„λ₯Ό μƒλŒ€μ μœΌλ‘œ 크게 ν‰κ°€ν•©λ‹ˆλ‹€.

```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

query = 'κ°„κ²½ν™”'

targets = [
    'liver cirrhosis',
    'κ°„κ²½λ³€',
    'liver cancer',
    'κ°„μ•”',
    'brain tumor',
    'λ‡Œμ’…μ–‘'
]

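# Encode the query and take the [CLS] pooler output as its embedding.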
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

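# Encode each target term and compare it with the query by cosine similarity.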
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
```

## Citing
```
@inproceedings{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
	pages={4228--4238},
	month = jun,
	year={2021}
}
```