Upload folder using huggingface_hub

- README.md +87 -36
- model.safetensors +1 -1
README.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 base_model: aubmindlab/bert-base-arabertv02
 datasets: []
-language: [
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
@@ -9,19 +9,47 @@ tags:
 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
-- dataset_size:
 - loss:MatryoshkaLoss
 - loss:MultipleNegativesRankingLoss
 ---

-#

-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02).
-It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search,
-paraphrase mining, text classification, clustering, and more.
-
-The model is based on a sample from the `akhooli/arabic-triplets-1m-curated-sims-len` dataset. This is an early test version. Do not use while the model name has the
-word `test`.

 ## Model Details
@@ -68,9 +96,9 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("sentence_transformers_model_id")
 # Run inference
 sentences = [
-    '
-    '
-    '
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
@@ -125,19 +153,19 @@ You can finetune this model on your own dataset.
 #### Unnamed Dataset

-* Size:
 * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
 * Approximate statistics based on the first 1000 samples:
-  | | anchor
-  |
-  | type | string
-  | details | <ul><li>min: 4 tokens</li><li>mean:
 * Samples:
-  | anchor
-  |
-  | <code
-  | <code
-  | <code
 * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
 ```json
 {
@@ -165,19 +193,19 @@ You can finetune this model on your own dataset.
 #### Unnamed Dataset

-* Size:
 * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
 * Approximate statistics based on the first 1000 samples:
-  | | anchor | positive
-  |
-  | type | string | string
-  | details | <ul><li>min: 4 tokens</li><li>mean:
 * Samples:
-  | anchor
-  |
-  | <code
-  | <code
-  | <code
 * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
 ```json
 {
@@ -207,6 +235,7 @@ You can finetune this model on your own dataset.
 - `per_device_train_batch_size`: 16
 - `per_device_eval_batch_size`: 16
 - `learning_rate`: 2e-05
 - `warmup_ratio`: 0.1
 - `fp16`: True
 - `batch_sampler`: no_duplicates
@@ -230,7 +259,7 @@ You can finetune this model on your own dataset.
 - `adam_beta2`: 0.999
 - `adam_epsilon`: 1e-08
 - `max_grad_norm`: 1.0
-- `num_train_epochs`:
 - `max_steps`: -1
 - `lr_scheduler_type`: linear
 - `lr_scheduler_kwargs`: {}
@@ -327,9 +356,31 @@ You can finetune this model on your own dataset.
 </details>

 ### Training Logs
-| Epoch | Step
-|
-|

 ### Framework Versions
 ---
 base_model: aubmindlab/bert-base-arabertv02
 datasets: []
+language: []
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:

 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
+- dataset_size:75000
 - loss:MatryoshkaLoss
 - loss:MultipleNegativesRankingLoss
+widget:
+- source_sentence: رجل ينظر إلى ما يبدو أنه قطع من الورق المقوى وامرأة في المطبخ.
+  sentences:
+  - زوج وزوجته يتزلجان على الجبال السويسرية
+  - ما هو الكتاب الجيد للقراءة؟
+  - رجل يحدق في امرأة في المطبخ
+- source_sentence: الكلب الرمادي يركض على جانب بركة بينما الكلب الأصفر يقفز إلى البركة.
+  sentences:
+  - الكلاب تأكل عشائها الليلي
+  - هناك كلبان بالخارج بالقرب من حمام السباحة
+  - كيف تصنع زجاج بيركس؟
+- source_sentence: كيف يمكننا كسب المال من يوتيوب؟
+  sentences:
+  - كيف يمكنني كسب المال من خلال اليوتيوب؟
+  - فتى يرمي حقيبة.
+  - هل يمكن لشخص متحول جنسياً أن يعود إلى جنسه السابق بعد جراحة تغيير الجنس؟
+- source_sentence: كيف يحصل المرء على رقم هاتف فتاة بسرعة؟
+  sentences:
+  - امرأة تتسوق في سوق المزارعين
+  - كيف تحصل على رقم هاتف فتاة؟
+  - كيف يمكنني التخلص من حب الشباب؟
+- source_sentence: ما هو نوع الدهون الموجودة في الأفوكادو
+  sentences:
+  - حوالي 15 في المائة من الدهون في الأفوكادو مشبعة ، مع كل كوب واحد من الأفوكادو
+    المفروم يحتوي على 3.2 جرام من الدهون المشبعة ، وهو ما يمثل 16 في المائة من DV
+    البالغ 20 جراما. تحتوي الأفوكادو في الغالب على دهون أحادية غير مشبعة ، مع 67
+    في المائة من إجمالي الدهون ، أو 14.7 جراما لكل كوب مفروم ، يتكون من هذا النوع
+    من الدهون.
+  - امرأة تستمتع برائحة شايها في الهواء الطلق.
+  - يمكن أن يؤدي ارتفاع مستوى الدهون الثلاثية ، وهي نوع من الدهون (الدهون) في الدم
+    ، إلى زيادة خطر الإصابة بأمراض القلب ، ويمكن أن يؤدي توفير مستوى مرتفع من الدهون
+    الثلاثية ، وهي نوع من الدهون (الدهون) في الدم ، إلى زيادة خطر الإصابة بأمراض
+    القلب. مرض.
 ---

+# SentenceTransformer based on aubmindlab/bert-base-arabertv02

+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

 ## Model Details
 model = SentenceTransformer("sentence_transformers_model_id")
 # Run inference
 sentences = [
+    'ما هو نوع الدهون الموجودة في الأفوكادو',
+    'حوالي 15 في المائة من الدهون في الأفوكادو مشبعة ، مع كل كوب واحد من الأفوكادو المفروم يحتوي على 3.2 جرام من الدهون المشبعة ، وهو ما يمثل 16 في المائة من DV البالغ 20 جراما. تحتوي الأفوكادو في الغالب على دهون أحادية غير مشبعة ، مع 67 في المائة من إجمالي الدهون ، أو 14.7 جراما لكل كوب مفروم ، يتكون من هذا النوع من الدهون.',
+    'يمكن أن يؤدي ارتفاع مستوى الدهون الثلاثية ، وهي نوع من الدهون (الدهون) في الدم ، إلى زيادة خطر الإصابة بأمراض القلب ، ويمكن أن يؤدي توفير مستوى مرتفع من الدهون الثلاثية ، وهي نوع من الدهون (الدهون) في الدم ، إلى زيادة خطر الإصابة بأمراض القلب. مرض.',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
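The snippet above only prints the embedding shape; in practice the returned vectors are compared with cosine similarity. A minimal NumPy sketch of that comparison step — the toy 3-dimensional vectors here are stand-ins for the model's real 768-dimensional output:

```python
import numpy as np

# Toy stand-ins for model.encode(...) output; real vectors are 768-dimensional.
embeddings = np.array([
    [0.1, 0.9, 0.2],   # anchor
    [0.2, 0.8, 0.3],   # paraphrase of the anchor
    [0.9, 0.1, 0.0],   # unrelated sentence
])

# Cosine similarity: L2-normalize each row, then take pairwise dot products.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T

print(similarity.shape)                     # (3, 3)
# The paraphrase scores higher against the anchor than the unrelated sentence.
print(similarity[0, 1] > similarity[0, 2])  # True
```

`SentenceTransformer` models also expose a built-in `model.similarity(...)` helper that computes the same matrix directly from two batches of embeddings.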
 #### Unnamed Dataset

+* Size: 75,000 training samples
 * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
 * Approximate statistics based on the first 1000 samples:
+  |         | anchor | positive | negative |
+  |:--------|:-------|:---------|:---------|
+  | type    | string | string   | string   |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 12.88 tokens</li><li>max: 58 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 13.74 tokens</li><li>max: 126 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 13.38 tokens</li><li>max: 146 tokens</li></ul> |
 * Samples:
+  | anchor | positive | negative |
+  |:-------|:---------|:---------|
+  | <code>هل تشاجر (سي إس لويس) و (جي آر آر تولكين) ، إن كان الأمر كذلك، فما هو السبب؟</code> | <code>هل صحيح أن (سي إس لويس) و (تولكين) تشاجرا؟</code> | <code>ما هي أفضل الكتب للدراسة في الجامعة؟</code> |
+  | <code>ما هي اعراض فقر الدم؟</code> | <code>ما هي اعراض الانيميا؟</code> | <code>كيف احضر كيكة العسل؟</code> |
+  | <code>من ستصوت له، دونالد ترامب أم هيلاري كلينتون؟</code> | <code>هل تؤيدون دونالد ترامب أم هيلاري كلينتون؟ ولماذا؟</code> | <code>كيف أتغلب على إدمان المواد الإباحية؟</code> |
 * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
 ```json
 {
 #### Unnamed Dataset

+* Size: 25,000 evaluation samples
 * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
 * Approximate statistics based on the first 1000 samples:
+  |         | anchor | positive | negative |
+  |:--------|:-------|:---------|:---------|
+  | type    | string | string   | string   |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 12.6 tokens</li><li>max: 70 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 14.82 tokens</li><li>max: 239 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 13.78 tokens</li><li>max: 128 tokens</li></ul> |
 * Samples:
+  | anchor | positive | negative |
+  |:-------|:---------|:---------|
+  | <code>نعم, نعم, أو رأيت "تشيما بارا ديسي"</code> | <code>نعم ، أو "تشيما بارا ديسي" كانت تلك التي شاهدتها</code> | <code>أنا لم أرى "تشيما بارا ديسي".</code> |
+  | <code>رجل وامرأة يجلسان على الشاطئ بينما تغرب الشمس</code> | <code>هناك رجل وامرأة يجلسان على الشاطئ</code> | <code>إنهم يشاهدون شروق الشمس</code> |
+  | <code>كيف أسيطر على غضبي؟</code> | <code>ما هي أفضل طريقة للسيطرة على الغضب؟</code> | <code>كيف أعرف إن كانت زوجتي تخونني؟</code> |
 * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
 ```json
 {
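MatryoshkaLoss, used for both splits above, trains the embedding so that nested prefixes of the vector remain usable on their own. A minimal NumPy sketch of how such an embedding is typically shrunk at inference time — the 768-dimensional vector here is a random stand-in, not real model output:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize; Matryoshka-trained
    vectors are meant to stay meaningful under exactly this truncation."""
    head = np.asarray(vec, dtype=float)[:dim]
    return head / np.linalg.norm(head)

full = np.random.default_rng(42).normal(size=768)  # stand-in for a 768-dim embedding
small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

The dimensions worth truncating to are the ones listed in the loss's `matryoshka_dims` parameter; anything else was not optimized for.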
 - `per_device_train_batch_size`: 16
 - `per_device_eval_batch_size`: 16
 - `learning_rate`: 2e-05
+- `num_train_epochs`: 5
 - `warmup_ratio`: 0.1
 - `fp16`: True
 - `batch_sampler`: no_duplicates

 - `adam_beta2`: 0.999
 - `adam_epsilon`: 1e-08
 - `max_grad_norm`: 1.0
+- `num_train_epochs`: 5
 - `max_steps`: -1
 - `lr_scheduler_type`: linear
 - `lr_scheduler_kwargs`: {}
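With `per_device_train_batch_size: 16`, the MultipleNegativesRankingLoss listed in the tags treats the 15 other positives in each batch as negatives for every anchor. A schematic NumPy version of that in-batch scoring step, on a toy batch of 4 — an illustration of the idea, not the library's implementation:

```python
import numpy as np

def in_batch_scores(anchors, positives, scale=20.0):
    """Score every anchor against every positive in the batch; the diagonal
    (each anchor paired with its own positive) is the correct class."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                       # (batch, batch) cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = float(-np.log(np.diag(probs)).mean())     # cross-entropy on the diagonal
    return probs, loss

# Toy batch: each positive is a lightly perturbed copy of its anchor.
rng = np.random.default_rng(0)
anchors = np.eye(4, 8)
positives = anchors + 0.05 * rng.normal(size=(4, 8))
probs, loss = in_batch_scores(anchors, positives)
print(probs.shape)  # (4, 4)
```

This is why the `no_duplicates` batch sampler above matters: a duplicated anchor in a batch would put a second "correct" answer among the supposed negatives.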
 </details>

 ### Training Logs
+| Epoch  | Step  | Training Loss | loss   |
+|:------:|:-----:|:-------------:|:------:|
+| 0.2133 | 500   | 1.4163        | 0.3134 |
+| 0.4266 | 1000  | 0.3306        | 0.1912 |
+| 0.6399 | 1500  | 0.2263        | 0.1527 |
+| 0.8532 | 2000  | 0.1818        | 0.1297 |
+| 1.0666 | 2500  | 0.1658        | 0.1167 |
+| 1.2799 | 3000  | 0.1139        | 0.1040 |
+| 1.4932 | 3500  | 0.0808        | 0.1018 |
+| 1.7065 | 4000  | 0.0692        | 0.0959 |
+| 1.9198 | 4500  | 0.058         | 0.0958 |
+| 2.1331 | 5000  | 0.0653        | 0.0882 |
+| 2.3464 | 5500  | 0.0503        | 0.0912 |
+| 2.5597 | 6000  | 0.0338        | 0.0970 |
+| 2.7730 | 6500  | 0.0363        | 0.0906 |
+| 2.9863 | 7000  | 0.0375        | 0.0856 |
+| 3.1997 | 7500  | 0.0401        | 0.0879 |
+| 3.4130 | 8000  | 0.031         | 0.0848 |
+| 3.6263 | 8500  | 0.0255        | 0.0938 |
+| 3.8396 | 9000  | 0.0239        | 0.0858 |
+| 4.0529 | 9500  | 0.0305        | 0.0840 |
+| 4.2662 | 10000 | 0.0281        | 0.0833 |
+| 4.4795 | 10500 | 0.0174        | 0.0840 |
+| 4.6928 | 11000 | 0.0216        | 0.0882 |
+| 4.9061 | 11500 | 0.022         | 0.0866 |

 ### Framework Versions
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ee308a99b75411cbc36588efb0b0a39c698668b9d5a9cdf2afd8fcd82bdb2f44
 size 540795752