Commit 9438498 by snumin44 (1 parent: a87b6b2)

Create README.md

---
license: mit
datasets:
- kakaobrain/kor_nli
- kakaobrain/kor_nlu
- klue/klue
language:
- ko
metrics:
- spearmanr
- pearsonr
pipeline_tag: sentence-similarity
---

# 🍊 SimCSE-KO

## 1. Intro

This is a **Korean SimCSE (RoBERTa, unsupervised)** model.
It was trained with newly written code rather than the Princeton NLP code.
It can be used to judge the semantic relatedness of two sentences by computing their cosine similarity.

- Github: [https://github.com/snumin44/SimCSE-KO](https://github.com/snumin44/SimCSE-KO)
- Original Code: [https://github.com/princeton-nlp/SimCSE](https://github.com/princeton-nlp/SimCSE)

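
The cosine-similarity comparison mentioned in the intro can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's own inference code: `cosine_similarity` and `embed_cls` are hypothetical helper names, `model_name` must be replaced with this model's Hub identifier (not stated here), and [CLS] pooling with a maximum length of 64 is assumed from the experiment settings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def embed_cls(sentences, model_name):
    """Encode sentences and return their [CLS] vectors as lists of floats.

    Assumption: the checkpoint loads with AutoModel/AutoTokenizer and the
    first token's final hidden state serves as the sentence embedding.
    """
    # Imported lazily so cosine_similarity works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout at inference time

    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Position 0 of the final hidden states is the [CLS] token.
    return outputs.last_hidden_state[:, 0].tolist()
```

For example, `a, b = embed_cls([sentence_1, sentence_2], model_name)` followed by `cosine_similarity(a, b)` gives a score between -1 and 1, where a higher score suggests the sentences are more closely related.
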
## 2. Experiment Settings

- Model: klue/roberta-base
- Dataset: Korean Wiki Text 1M (unsupervised training), KorSTS-dev (evaluation)
- Epochs: 1
- Max length: 64
- Batch size: 256
- Learning rate: 5e-5
- Dropout: 0.1
- Temperature: 0.05
- Pooler: cls
- Hardware: 1 A100 GPU

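
To illustrate where the temperature (0.05) and dropout (0.1) settings enter, here is a minimal sketch of the unsupervised SimCSE objective described by Gao et al. (2021): each sentence is encoded twice with different dropout masks, the two views form a positive pair, and the other sentences in the batch act as negatives. The function name `simcse_loss` is hypothetical, and the pure-Python form is for illustration only. It is not the training code from the repository.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(z1, z2, temperature=0.05):
    """In-batch contrastive loss over two views of the same sentences.

    z1[i] and z2[i] are embeddings of sentence i from two forward passes
    with different dropout masks; z2[j] for j != i serve as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(z1):
        # Temperature-scaled similarities of the anchor to every second view.
        logits = [_cosine(anchor, other) / temperature for other in z2]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(z1)
```

A lower temperature sharpens the softmax, so the loss penalizes hard negatives more strongly. The loss approaches zero when each sentence is far more similar to its own second view than to any other sentence in the batch.
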
## 3. Performance

## (1) KorSTS-test
|Model|AVG|Cosine Pearson|Cosine Spearman|Euclidean Pearson|Euclidean Spearman|Manhattan Pearson|Manhattan Spearman|Dot Pearson|Dot Spearman|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|SimCSE-BERT-KO<br>(unsup)|72.85|73.00|72.77|72.96|72.92|72.93|72.86|72.80|72.53|
|SimCSE-BERT-KO<br>(sup)|85.98|86.05|86.00|85.88|86.08|85.90|86.08|85.96|85.89|
|SimCSE-RoBERTa-KO<br>(unsup)|**75.79**|**76.39**|**75.57**|**75.71**|**75.52**|**75.65**|**75.42**|**76.41**|**75.63**|
|SimCSE-RoBERTa-KO<br>(sup)|83.06|82.67|83.21|83.22|83.27|83.24|83.28|82.54|83.03|82.92|

## (2) Klue-dev
|Model|AVG|Cosine Pearson|Cosine Spearman|Euclidean Pearson|Euclidean Spearman|Manhattan Pearson|Manhattan Spearman|Dot Pearson|Dot Spearman|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|SimCSE-BERT-KO<br>(unsup)|65.27|66.27|64.31|66.18|64.05|66.00|63.77|66.64|64.93|
|SimCSE-BERT-KO<br>(sup)|83.96|82.98|84.32|84.32|84.30|84.28|84.20|83.00|84.29|
|SimCSE-RoBERTa-KO<br>(unsup)|**80.78**|**81.20**|**80.35**|**81.27**|**80.36**|**81.28**|**80.40**|**81.13**|**80.26**|
|SimCSE-RoBERTa-KO<br>(sup)|85.31|84.14|85.64|86.09|85.68|86.04|85.65|83.94|85.30|

## Citing
```
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
@article{ham2020kornli,
   title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
   author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
   journal={arXiv preprint arXiv:2004.03289},
   year={2020}
}
@misc{park2021klue,
   title={KLUE: Korean Language Understanding Evaluation},
   author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
   year={2021},
   eprint={2105.09680},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}
```