indiejoseph committed
Commit 8c3360b · verified · 1 Parent(s): 93ca945

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +65 -39
  2. model.safetensors +1 -1
README.md CHANGED
@@ -4,37 +4,39 @@ tags:
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
- - dataset_size:5749
  - loss:CosineSimilarityLoss
  base_model: hon9kon9ize/bert-large-cantonese-nli
  widget:
- - source_sentence: 有個男人彈緊豎琴。
  sentences:
- - 我鍾意將一心多用諗做快速轉換工作。
- - 有個女人彈緊結他。
- - 正如你原先個問題所講:維基百科嗰段嘢無話約瑟夫·P·甘迺迪買咗約翰·F·甘迺迪嘅勝選。
- - source_sentence: 一艘大型白色郵輪浮喺水上。
  sentences:
- - 有個女人喺台上唱歌。
- - 有個男人吹緊笛。
- - 一艘大型郵輪浮喺水上。
- - source_sentence: 個男人送緊蛋糕。
  sentences:
- - 有個女人剝緊蝦。
- - 有個男人係度調味鵪鶉。
- - 一架紅色雙層巴士喺一條擠迫嘅街道上。
- - source_sentence: 我哋相對宇宙共同靜止參考系嘅速度係……每秒大約 371 公里,方向係朝住獅子座。」
  sentences:
- - 間廳擺咗啲啡色嘅傢俬同埋一台平面電視。
- - 冇一樣嘢係「靜止」嘅,除非係相對某啲其他物體先至係。
- - 一個人喺水邊蹓狗。
- - source_sentence: 恆星喺恆星形成區形成,而恆星形成區本身係由分子雲演變而來。
  sentences:
- - 一個金毛小朋友喺屋企前面吹緊喇叭表演,佢細佬喺隔離睇緊。
- - 一架四驅車泥濘路面度行緊。
- - 「可能喺銀河系以外都存在好似我哋呢個太陽系嘅星系。」
  datasets:
- - hon9kon9ize/yue-stsb
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  metrics:
@@ -51,10 +53,10 @@ model-index:
  type: sts-dev
  metrics:
  - type: pearson_cosine
- value: 0.825800689249711
  name: Pearson Cosine
  - type: spearman_cosine
- value: 0.8262408727980405
  name: Spearman Cosine
  - task:
  type: semantic-similarity
@@ -64,16 +66,17 @@ model-index:
  type: sts-test
  metrics:
  - type: pearson_cosine
- value: 0.7926469146205422
  name: Pearson Cosine
  - type: spearman_cosine
- value: 0.7911184368054363
  name: Spearman Cosine
  ---

  # SentenceTransformer based on hon9kon9ize/bert-large-cantonese-nli

- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [hon9kon9ize/bert-large-cantonese-nli](https://huggingface.co/hon9kon9ize/bert-large-cantonese-nli) on the [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb) dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details
 
@@ -83,8 +86,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [h
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
- - **Training Dataset:**
- - [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb)
  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->
 
@@ -121,9 +123,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
  sentences = [
- '恆星喺恆星形成區形成,而恆星形成區本身係由分子雲演變而來。',
- '「可能喺銀河系以外都存在好似我哋呢個太陽系嘅星系。」',
- '一架四驅車泥濘路面度行緊。',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -170,8 +172,8 @@ You can finetune this model on your own dataset.

  | Metric | sts-dev | sts-test |
  |:--------------------|:-----------|:-----------|
- | pearson_cosine | 0.8258 | 0.7926 |
- | **spearman_cosine** | **0.8262** | **0.7911** |

  <!--
  ## Bias, Risks and Limitations
@@ -212,12 +214,32 @@ You can finetune this model on your own dataset.
  }
  ```

  ### Evaluation Dataset

- #### yue-stsb

- * Dataset: [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb) at [40cea5d](https://huggingface.co/datasets/hon9kon9ize/yue-stsb/tree/40cea5d8e9d1aeb1498816d90d1e417bafcc96a8)
- * Size: 1,500 evaluation samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
  * Approximate statistics based on the first 1000 samples:
  | | sentence1 | sentence2 | score |
@@ -370,8 +392,12 @@ You can finetune this model on your own dataset.
  ### Training Logs
  | Epoch | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
  |:------:|:----:|:-------------:|:---------------:|:-----------------------:|:------------------------:|
- | 2.2222 | 100 | 0.0299 | 0.0312 | 0.8262 | - |
- | 4.0 | 180 | - | - | - | 0.7911 |


  ### Framework Versions
 
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
+ - dataset_size:16729
  - loss:CosineSimilarityLoss
  base_model: hon9kon9ize/bert-large-cantonese-nli
  widget:
+ - source_sentence: 啲狗喺雪入面玩緊。
  sentences:
+ - 呢個係我成日覺得對一年級學生好有幫助嘅例子。
+ - 兩隻狗喺沙灘到玩緊。
+ - 喺Linux系統,我用Bibble,雖然有啲缺點,但係依家得呢個係比較專業嘅選擇。
+ - source_sentence: 個女人整緊蛋。
  sentences:
+ - 一班老人家圍住張飯枱影相。
+ - 有個男人向個女人唱歌。
+ - 個女人係度食嘢。
+ - source_sentence: 一架電單車泊喺一幅畫滿城市景觀塗鴉嘅牆邊。
  sentences:
+ - 夜晚,一架電單車泊喺一幅城市壁畫隔離。
+ - 一隻黑白相間嘅狗喺藍色嘅水到游水。
+ - 個細路仔頭髮豎晒起,係咁碌落藍色滑梯。
+ - source_sentence: 有個男人孭住隻狗同埋一艘獨木舟。
  sentences:
+ - 隻狗孭住個男人喺獨木舟到。
+ - 我見我對孖仔就係咁:細路仔學說話嗰陣,都會自己發明啲獨特嘅方言。
+ - 「出汗就係出汗,你真係控制唔到。」
+ - source_sentence: 一個細路女同一個細路仔喺度睇書。
  sentences:
+ - 個女人孭住個BB。
+ - 有個男人彈緊結他。
+ - 一個大啲嘅小朋友玩緊公仔,望住窗外。
  datasets:
+ - hon9kon9ize/yue-stsb
+ - sentence-transformers/stsb
+ - C-MTEB/STSB
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  metrics:
 
  type: sts-dev
  metrics:
  - type: pearson_cosine
+ value: 0.7983233550249502
  name: Pearson Cosine
  - type: spearman_cosine
+ value: 0.7996394101125816
  name: Spearman Cosine
  - task:
  type: semantic-similarity
 
  type: sts-test
  metrics:
  - type: pearson_cosine
+ value: 0.7637579307526682
  name: Pearson Cosine
  - type: spearman_cosine
+ value: 0.7604840209490058
  name: Spearman Cosine
  ---

  # SentenceTransformer based on hon9kon9ize/bert-large-cantonese-nli

+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [hon9kon9ize/bert-large-cantonese-nli](https://huggingface.co/hon9kon9ize/bert-large-cantonese-nli) on the [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb), [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) and [C-MTEB/STSB](https://huggingface.co/datasets/C-MTEB/STSB) datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+

  ## Model Details
 
 
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->
 
 
  model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
  sentences = [
+ '一個細路女同一個細路仔喺度睇書。',
+ '一個大啲嘅小朋友玩緊公仔,望住窗外。',
+ '有個男人彈緊結他。',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
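
For quick reference alongside the snippet above, here is a minimal sketch of turning those embeddings into pairwise similarity scores. It assumes sentence-transformers v3+ (where `SentenceTransformer.similarity` is available) and keeps the card's placeholder model id; swap in the published repository id before running.

```python
from sentence_transformers import SentenceTransformer

# Placeholder id taken from the card; replace with the published model id.
model = SentenceTransformer("sentence_transformers_model_id")

sentences = [
    "一個細路女同一個細路仔喺度睇書。",
    "一個大啲嘅小朋友玩緊公仔,望住窗外。",
    "有個男人彈緊結他。",
]
embeddings = model.encode(sentences)  # shape: (3, 1024)

# Pairwise cosine similarities between the three sentences (3 x 3 matrix).
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

The diagonal is 1.0 (each sentence against itself); higher off-diagonal values indicate closer meaning, which is what the STS training objective optimises for.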
 

  | Metric | sts-dev | sts-test |
  |:--------------------|:-----------|:-----------|
+ | pearson_cosine | 0.7983 | 0.7638 |
+ | **spearman_cosine** | **0.7996** | **0.7605** |
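
For context, `pearson_cosine` and `spearman_cosine` are the Pearson and Spearman rank correlations between the model's cosine similarities and the gold STS scores. The sketch below shows that computation; the sentence pairs reuse widget examples from this card, while the gold scores and the model id are illustrative placeholders rather than values from the dataset.

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Illustrative (sentence1, sentence2, gold score) triples; the gold scores
# are invented for this sketch, not taken from yue-stsb.
pairs = [
    ("有個男人彈緊結他。", "有個女人彈緊結他。", 0.6),
    ("一艘大型白色郵輪浮喺水上。", "一艘大型郵輪浮喺水上。", 0.9),
    ("個男人送緊蛋糕。", "一架紅色雙層巴士喺一條擠迫嘅街道上。", 0.0),
]

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id
emb1 = model.encode([s1 for s1, _, _ in pairs])
emb2 = model.encode([s2 for _, s2, _ in pairs])
gold = [score for _, _, score in pairs]

# Cosine similarity for each pair, then correlation against the gold scores.
pred = [float(cos_sim(a, b)) for a, b in zip(emb1, emb2)]
print("pearson_cosine: ", pearsonr(pred, gold)[0])
print("spearman_cosine:", spearmanr(pred, gold)[0])
```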
 
  <!--
  ## Bias, Risks and Limitations
 
  }
  ```

+ * Size: 16,729 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+ | | sentence1 | sentence2 | score |
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------|
+ | type | string | string | float |
+ | details | <ul><li>min: 5 tokens</li><li>mean: 20.29 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 20.36 tokens</li><li>max: 76 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.52</li><li>max: 1.0</li></ul> |
+ * Samples:
+ | sentence1 | sentence2 | score |
+ |:----------------------------------------------------------------|:---------------------------------------------------------|:------------------|
+ | <code>奧巴馬登記咗參加奧巴馬醫保。 </code> | <code>美國人爭住喺限期前登記參加奧巴馬醫保計劃,</code> | <code>0.24</code> |
+ | <code>Search ends for missing asylum-seekers</code> | <code>Search narrowed for missing man</code> | <code>0.28</code> |
+ | <code>檢察官喺五月突然轉軚,要求公開驗屍報告,因為有利於辯方嘅康納·彼得森驗屍報告部分內容已經洩露畀媒體。</code> | <code>佢哋要求公開驗屍報告,因為彼得森腹中胎兒嘅驗屍報告中,對辯方有利嘅部分已經洩露俾傳媒。</code> | <code>0.8</code> |
+ * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
+ ```json
+ {
+ "loss_fct": "torch.nn.modules.loss.MSELoss"
+ }
+ ```
+
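
The loss block above pairs `CosineSimilarityLoss` with `torch.nn.MSELoss`: the model is trained so that the cosine similarity of the two sentence embeddings matches the gold score. A minimal sketch of that setup with the classic `model.fit` API follows, reusing the three samples from the table above; the batch size and warmup steps are illustrative, not the card's actual training configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base model named in the card's metadata.
model = SentenceTransformer("hon9kon9ize/bert-large-cantonese-nli")

# The three (sentence1, sentence2, score) samples shown in the table above.
train_examples = [
    InputExample(texts=["奧巴馬登記咗參加奧巴馬醫保。", "美國人爭住喺限期前登記參加奧巴馬醫保計劃,"], label=0.24),
    InputExample(texts=["Search ends for missing asylum-seekers", "Search narrowed for missing man"], label=0.28),
    InputExample(texts=["檢察官喺五月突然轉軚,要求公開驗屍報告,因為有利於辯方嘅康納·彼得森驗屍報告部分內容已經洩露畀媒體。",
                        "佢哋要求公開驗屍報告,因為彼得森腹中胎兒嘅驗屍報告中,對辯方有利嘅部分已經洩露俾傳媒。"], label=0.8),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss regresses cos(u, v) onto the label; MSELoss is its default loss_fct.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=10)
```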
  ### Evaluation Dataset

+ #### Unnamed Dataset

+
+ * Size: 4,458 evaluation samples
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
  * Approximate statistics based on the first 1000 samples:
  | | sentence1 | sentence2 | score |
 
392
  ### Training Logs
393
  | Epoch | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
394
  |:------:|:----:|:-------------:|:---------------:|:-----------------------:|:------------------------:|
395
+ | 0.7634 | 100 | 0.0549 | 0.0403 | 0.7895 | - |
396
+ | 1.5267 | 200 | 0.027 | 0.0368 | 0.7941 | - |
397
+ | 2.2901 | 300 | 0.0187 | 0.0349 | 0.7968 | - |
398
+ | 3.0534 | 400 | 0.0119 | 0.0354 | 0.8004 | - |
399
+ | 3.8168 | 500 | 0.0076 | 0.0359 | 0.7996 | - |
400
+ | 4.0 | 524 | - | - | - | 0.7605 |
401
 
402
 
403
  ### Framework Versions
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9c1d5859a25e08b31120af6f1f2ababea81234d1ef546b566846773dc944d964
  size 1304182568

  version https://git-lfs.github.com/spec/v1
+ oid sha256:2bb63b9fa78a3638d714e24acbe696ea02301ea7dfbba5a232a65f11cdaac6c1
  size 1304182568