Commit 9bc1a41 by snumin44 (verified) · 1 Parent(s): cf9439c

Update README.md

Files changed (1)
  1. README.md +62 -3
README.md CHANGED
@@ -13,9 +13,68 @@ base_model:
  Korean and English terms were aligned using KOSTOM, a Korean-English medical terminology dictionary.
  References: [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)
 
- ## 1. SapBERT-KO-EN
+ ## 2. SapBERT-KO-EN
  **SapBERT** is a pre-training method for treating the many synonyms of a medical term as having the same meaning.
- **SapBERT-KO-EN** aligns Korean and English medical terms in order to process medical records written in mixed Korean and English.
+ **SapBERT-KO-EN** aligns Korean and English medical terms in order to process **medical records written in mixed Korean and English**.
 
- [Github](https://github.com/snumin44/SapBERT-KO-EN)
+ ※ For details: [Github](https://github.com/snumin44/SapBERT-KO-EN)
 
+ ## 3. Training
+
+
+ The base model and hyperparameters used to train the model are as follows.
+
+ - Model : klue/bert-base
+ - Epochs : 1
+ - Batch Size : 64
+ - Max Length : 64
+ - Dropout : 0.1
+ - Pooler : 'cls'
+ - Eval Step : 100
+ - Threshold : 0.8
+ - Scale Positive Sample : 1
+ - Scale Negative Sample : 60
+
+ ※ English terms are, for the most part, processed letter by letter.
+
+ ```python
+ import numpy as np
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the model and its tokenizer from the Hugging Face Hub.
+ model_path = 'snumin44/sap-bert-ko-en'
+ model = AutoModel.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ query = '간경화'  # liver cirrhosis
+
+ targets = [
+     'liver cirrhosis',
+     '간경변',  # liver cirrhosis (synonym of the query)
+     'liver cancer',
+     '간암',  # liver cancer
+     'brain tumor',
+     '뇌종양'  # brain tumor
+ ]
+
+ # Embed the query with the model's pooled output.
+ query_feature = tokenizer(query, return_tensors='pt')
+ query_outputs = model(**query_feature, return_dict=True)
+ query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()
+
+ def cos_sim(A, B):
+     # Cosine similarity between two 1-D vectors.
+     return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
+
+ # Embed each target the same way and compare it with the query.
+ for idx, target in enumerate(targets):
+     target_feature = tokenizer(target, return_tensors='pt')
+     target_outputs = model(**target_feature, return_dict=True)
+     target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
+     similarity = cos_sim(query_embeddings, target_embeddings)
+     print(f"Similarity between query and target {idx}: {similarity:.4f}")
+ ```
+ ```
+ Similarity between query and target 0: 0.7145
+ Similarity between query and target 1: 0.7186
+ Similarity between query and target 2: 0.6183
+ Similarity between query and target 3: 0.6972
+ Similarity between query and target 4: 0.3929
+ Similarity between query and target 5: 0.4260
+ ```
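
The README example compares the query with one target at a time. When there are many candidate terms, the same cosine similarity can be computed for all of them in one step and the candidates sorted by score. The following is a minimal sketch of that idea, not code from the repository: `rank_by_cosine` is a hypothetical helper, and the three-dimensional toy vectors are placeholders for embeddings obtained from the model as in the example above.

```python
import numpy as np

def rank_by_cosine(query_vec, target_vecs, labels):
    """Return (label, similarity) pairs sorted from most to least similar.

    Hypothetical helper for illustration; not part of the SapBERT-KO-EN repository.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    target_vecs = np.asarray(target_vecs, dtype=float)
    # Normalise once so a single matrix-vector product yields every cosine.
    q = query_vec / np.linalg.norm(query_vec)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = t @ q
    # Negate so that argsort returns indices in descending similarity order.
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order]

# Toy vectors standing in for real model embeddings (assumed values).
labels = ['term_a', 'term_b', 'term_c']
toy_targets = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_by_cosine([1.0, 0.0, 0.0], toy_targets, labels)

for label, sim in ranking:
    print(f"{label}: {sim:.4f}")
```

With real embeddings, `target_vecs` would be the stacked `pooler_output` rows of the targets and `labels` the target strings themselves.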