Upload folder using huggingface_hub
README.md +65 -39
model.safetensors +1 -1
README.md
CHANGED
```diff
@@ -4,37 +4,39 @@ tags:
 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
-- dataset_size:
+- dataset_size:16729
 - loss:CosineSimilarityLoss
 base_model: hon9kon9ize/bert-large-cantonese-nli
 widget:
-- source_sentence:
+- source_sentence: 啲狗喺雪入面玩緊。
   sentences:
-  -
-  -
-  -
-- source_sentence:
+  - 呢個係我成日覺得對一年級學生好有幫助嘅例子。
+  - 兩隻狗喺沙灘到玩緊。
+  - 喺Linux系統,我用Bibble,雖然有啲缺點,但係依家得呢個係比較專業嘅選擇。
+- source_sentence: 個女人整緊蛋。
   sentences:
-  -
-  -
-  -
-- source_sentence:
+  - 一班老人家圍住張飯枱影相。
+  - 有個男人向個女人唱歌。
+  - 個女人係度食嘢。
+- source_sentence: 一架電單車泊喺一幅畫滿城市景觀塗鴉嘅牆邊。
   sentences:
-  -
-  -
-  -
-- source_sentence:
+  - 夜晚,一架電單車泊喺一幅城市壁畫隔離。
+  - 一隻黑白相間嘅狗喺藍色嘅水到游水。
+  - 個細路仔頭髮豎晒起,係咁碌落藍色滑梯。
+- source_sentence: 有個男人孭住隻狗同埋一艘獨木舟。
   sentences:
-  -
-  -
-  -
-- source_sentence:
+  - 隻狗孭住個男人喺獨木舟到。
+  - 我見我對孖仔就係咁:細路仔學說話嗰陣,都會自己發明啲獨特嘅方言。
+  - 「出汗就係出汗,你真係控制唔到。」
+- source_sentence: 一個細路女同一個細路仔喺度睇書。
   sentences:
-  -
-  -
-  -
+  - 個女人孭住個BB。
+  - 有個男人彈緊結他。
+  - 一個大啲嘅小朋友玩緊公仔,望住窗外。
 datasets:
 - hon9kon9ize/yue-stsb
+- sentence-transformers/stsb
+- C-MTEB/STSB
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 metrics:
```
```diff
@@ -51,10 +53,10 @@ model-index:
       type: sts-dev
     metrics:
     - type: pearson_cosine
-      value: 0.
+      value: 0.7983233550249502
       name: Pearson Cosine
     - type: spearman_cosine
-      value: 0.
+      value: 0.7996394101125816
       name: Spearman Cosine
   - task:
       type: semantic-similarity
```
```diff
@@ -64,16 +66,17 @@ model-index:
       type: sts-test
     metrics:
     - type: pearson_cosine
-      value: 0.
+      value: 0.7637579307526682
       name: Pearson Cosine
     - type: spearman_cosine
-      value: 0.
+      value: 0.7604840209490058
       name: Spearman Cosine
 ---
 
 # SentenceTransformer based on hon9kon9ize/bert-large-cantonese-nli
 
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [hon9kon9ize/bert-large-cantonese-nli](https://huggingface.co/hon9kon9ize/bert-large-cantonese-nli) on the [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb) dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [hon9kon9ize/bert-large-cantonese-nli](https://huggingface.co/hon9kon9ize/bert-large-cantonese-nli) on the [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb), [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) and [C-MTEB/STSB](https://huggingface.co/datasets/C-MTEB/STSB) datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
 
 ## Model Details
 
```
```diff
@@ -83,8 +86,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [h
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 1024 dimensions
 - **Similarity Function:** Cosine Similarity
-- **Training Dataset:**
-    - [yue-stsb](https://huggingface.co/datasets/hon9kon9ize/yue-stsb)
+<!-- - **Training Dataset:** Unknown -->
 <!-- - **Language:** Unknown -->
 <!-- - **License:** Unknown -->
 
```
```diff
@@ -121,9 +123,9 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("sentence_transformers_model_id")
 # Run inference
 sentences = [
-    '
-    '
-    '
+    '一個細路女同一個細路仔喺度睇書。',
+    '一個大啲嘅小朋友玩緊公仔,望住窗外。',
+    '有個男人彈緊結他。',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
```
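The updated snippet embeds three of the widget sentences and stops at `embeddings.shape`. As a follow-up, the embeddings can be turned into a pairwise similarity matrix with `sentence_transformers.util.cos_sim`; this is a minimal sketch that keeps the card's placeholder model id, so swap in the published repo id before running:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder id copied from the card; replace with the published repo id.
model = SentenceTransformer("sentence_transformers_model_id")

sentences = [
    '一個細路女同一個細路仔喺度睇書。',      # "a little girl and a little boy are reading a book"
    '一個大啲嘅小朋友玩緊公仔,望住窗外。',  # "an older child plays with a doll, looking out the window"
    '有個男人彈緊結他。',                    # "a man is playing guitar"
]
embeddings = model.encode(sentences)  # numpy array of shape (3, 1024)

# Pairwise cosine similarities; the diagonal is 1.0 by construction.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 3x3 tensor
```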
```diff
@@ -170,8 +172,8 @@ You can finetune this model on your own dataset.
 
 | Metric              | sts-dev    | sts-test   |
 |:--------------------|:-----------|:-----------|
-| pearson_cosine      | 0.         | 0.         |
-| **spearman_cosine** | **0.**     | **0.**     |
+| pearson_cosine      | 0.7983     | 0.7638     |
+| **spearman_cosine** | **0.7996** | **0.7605** |
 
 <!--
 ## Bias, Risks and Limitations
```
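The restored pearson/spearman values are sentence-transformers' standard semantic-similarity metrics. Here is a sketch of how the sts-dev column can be recomputed with `EmbeddingSimilarityEvaluator`, assuming sentence-transformers v3+, the placeholder model id again, and that `hon9kon9ize/yue-stsb` follows the stsb layout shown elsewhere in this diff (a `validation` split with `sentence1`/`sentence2`/`score` columns):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Assumed split/column names, mirroring sentence-transformers/stsb.
dev = load_dataset("hon9kon9ize/yue-stsb", split="validation")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev["sentence1"],
    sentences2=dev["sentence2"],
    scores=dev["score"],  # gold scores, already normalized to [0, 1]
    name="sts-dev",
)
print(evaluator(model))  # {'sts-dev_pearson_cosine': ..., 'sts-dev_spearman_cosine': ...}
```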
````diff
@@ -212,12 +214,32 @@ You can finetune this model on your own dataset.
 }
 ```
 
+* Size: 16,729 training samples
+* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | sentence1 | sentence2 | score |
+  |:--------|:----------|:----------|:------|
+  | type    | string    | string    | float |
+  | details | <ul><li>min: 5 tokens</li><li>mean: 20.29 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 20.36 tokens</li><li>max: 76 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.52</li><li>max: 1.0</li></ul> |
+* Samples:
+  | sentence1 | sentence2 | score |
+  |:----------|:----------|:------|
+  | <code>奧巴馬登記咗參加奧巴馬醫保。</code> | <code>美國人爭住喺限期前登記參加奧巴馬醫保計劃,</code> | <code>0.24</code> |
+  | <code>Search ends for missing asylum-seekers</code> | <code>Search narrowed for missing man</code> | <code>0.28</code> |
+  | <code>檢察官喺五月突然轉軚,要求公開驗屍報告,因為有利於辯方嘅康納·彼得森驗屍報告部分內容已經洩露畀媒體。</code> | <code>佢哋要求公開驗屍報告,因為彼得森腹中胎兒嘅驗屍報告中,對辯方有利嘅部分已經洩露俾傳媒。</code> | <code>0.8</code> |
+* Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
+  ```json
+  {
+      "loss_fct": "torch.nn.modules.loss.MSELoss"
+  }
+  ```
+
 ### Evaluation Dataset
 
-#### 
+#### Unnamed Dataset
 
 
-* Size: 
+* Size: 4,458 evaluation samples
 * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
 * Approximate statistics based on the first 1000 samples:
 | | sentence1 | sentence2 | score |
````
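The `Loss` block records `CosineSimilarityLoss` parameterized with `torch.nn.modules.loss.MSELoss`, which is that loss's default `loss_fct`: the model regresses the cosine of the two sentence embeddings onto the gold score. A minimal training sketch under the same assumptions as above (the actual run's hyperparameters are not part of this diff):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("hon9kon9ize/bert-large-cantonese-nli")

# Assumed stsb-style columns: sentence1, sentence2, score (float in [0, 1]);
# the trainer picks up the "score" column as the label automatically.
train = load_dataset("hon9kon9ize/yue-stsb", split="train")

# MSE between cos(u, v) and the gold score, the default loss_fct,
# matching the JSON shown in the diff above.
loss = losses.CosineSimilarityLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train, loss=loss)
trainer.train()
```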
```diff
@@ -370,8 +392,12 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch  | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
 |:------:|:----:|:-------------:|:---------------:|:-----------------------:|:------------------------:|
-
-
+| 0.7634 | 100  | 0.0549        | 0.0403          | 0.7895                  | -                        |
+| 1.5267 | 200  | 0.027         | 0.0368          | 0.7941                  | -                        |
+| 2.2901 | 300  | 0.0187        | 0.0349          | 0.7968                  | -                        |
+| 3.0534 | 400  | 0.0119        | 0.0354          | 0.8004                  | -                        |
+| 3.8168 | 500  | 0.0076        | 0.0359          | 0.7996                  | -                        |
+| 4.0    | 524  | -             | -               | -                       | 0.7605                   |
 
 
 ### Framework Versions
```
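One sanity check the log allows: epoch 4.0 lands on step 524, i.e. 131 steps per epoch, and with the 16,729 training samples declared in the metadata that implies a batch size of roughly 128. This assumes a single-dataset loader and no gradient accumulation, neither of which the diff shows:

```python
# Inferred from the training-log table; assumptions noted above.
samples = 16_729      # dataset_size from the card metadata
total_steps = 524     # final step in the log
epochs = 4            # "4.0" in the Epoch column

steps_per_epoch = total_steps / epochs   # 131.0
print(samples / steps_per_epoch)         # ~127.7, consistent with batch size 128
```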
model.safetensors
CHANGED
```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:2bb63b9fa78a3638d714e24acbe696ea02301ea7dfbba5a232a65f11cdaac6c1
 size 1304182568
```
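Because the weights file goes through Git LFS, the `oid sha256:` in the new pointer is the digest of the actual `model.safetensors` blob, so a downloaded copy can be verified against it. A small sketch:

```python
import hashlib

# The oid from the updated LFS pointer above.
EXPECTED = "2bb63b9fa78a3638d714e24acbe696ea02301ea7dfbba5a232a65f11cdaac6c1"

h = hashlib.sha256()
with open("model.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

assert h.hexdigest() == EXPECTED, "checksum mismatch"
print("OK: file matches the LFS pointer oid")
```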