wasjaip committed on
Commit
7bbdcb4
1 Parent(s): 2f7abdd

Upload 10 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,197 +1,80 @@
  ---
- language:
- - multilingual
- - af
- - sq
- - am
- - ar
- - hy
- - as
- - az
- - eu
- - be
- - bn
- - bs
- - bg
- - my
- - ca
- - ceb
- - zh
- - co
- - hr
- - cs
- - da
- - nl
- - en
- - eo
- - et
- - fi
- - fr
- - fy
- - gl
- - ka
- - de
- - el
- - gu
- - ht
- - ha
- - haw
- - he
- - hi
- - hmn
- - hu
- - is
- - ig
- - id
- - ga
- - it
- - ja
- - jv
- - kn
- - kk
- - km
- - rw
- - ko
- - ku
- - ky
- - lo
- - la
- - lv
- - lt
- - lb
- - mk
- - mg
- - ms
- - ml
- - mt
- - mi
- - mr
- - mn
- - ne
- - no
- - ny
- - or
- - fa
- - pl
- - pt
- - pa
- - ro
- - ru
- - sm
- - gd
- - sr
- - st
- - sn
- - si
- - sk
- - sl
- - so
- - es
- - su
- - sw
- - sv
- - tl
- - tg
- - ta
- - tt
- - te
- - th
- - bo
- - tr
- - tk
- - ug
- - uk
- - ur
- - uz
- - vi
- - cy
- - wo
- - xh
- - yi
- - yo
- - zu
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
- - transformers
- license: apache-2.0
  ---
- # LaBSE_geonames_v1 (ru)

- ## Description
- This is a PyTorch port of the [LaBSE](https://tfhub.dev/google/LaBSE/1) model, which maps text in 109 languages into a shared vector space.
- This model is used to find geographic names similar to an input query.
- A query is encoded into the vector space, the nearest embeddings are searched, and the results are returned in a convenient format (a minimal sketch of this flow follows the usage example below).
- ## Installation
- To use this model, installing the `sentence-transformers` package is recommended:

  ```
  pip install -U sentence-transformers
  ```

- ## Usage
- Example usage of the model:

  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('wasjaip/LaBSE_geonames_v1')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
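A minimal sketch of the geoname lookup flow described above, using the stock `util.semantic_search` helper from sentence-transformers. The list of place names, the query, and `top_k` are illustrative assumptions; the repository ships only the model, not a retrieval wrapper:

```python
# Hypothetical illustration of the search flow; the geonames list, query,
# and top_k below are assumptions, not part of this repository.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('wasjaip/LaBSE_geonames_v1')

# Pre-encode a corpus of place names. Embeddings are L2-normalized by the
# model's Normalize module, so cosine similarity equals the dot product.
geonames = ["Moscow", "Moskva", "Novosibirsk", "Saint Petersburg", "Kazan"]
corpus_emb = model.encode(geonames, convert_to_tensor=True)

# Encode the query and retrieve the closest names.
query_emb = model.encode("Москва", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

for hit in hits:
    print(geonames[hit["corpus_id"]], round(hit["score"], 3))
```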
- ## Model Evaluation
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/LaBSE)

- ## Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
-   (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
-   (3): Normalize()
- )
- ```

- More details and the publication describing LaBSE are available at [LaBSE](https://tfhub.dev/google/LaBSE/1)
- # LaBSE_geonames_v1 (en)
- This is a port of the [LaBSE](https://tfhub.dev/google/LaBSE/1) model to PyTorch. It can be used to map 109 languages to a shared vector space.

- ## Usage (Sentence-Transformers)

- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
- pip install -U sentence-transformers
  ```

- Then you can use the model like this:

- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('wasjaip/LaBSE_geonames_v1')
- embeddings = model.encode(sentences)
- print(embeddings)
  ```

- ## Evaluation Results

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/LaBSE)

  ## Full Model Architecture
@@ -206,6 +89,4 @@ SentenceTransformer(

  ## Citing & Authors

- Have a look at [LaBSE](https://tfhub.dev/google/LaBSE/1) for the respective publication that describes LaBSE.
 
  ---
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
+
  ---
 
+ # {MODEL_NAME}
+
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+ <!--- Describe your model here -->
+
+ ## Usage (Sentence-Transformers)

+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

+ Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

+ model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
+ ## Evaluation Results

+ <!--- Describe how your model was evaluated -->

+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
+ ## Training
+ The model was trained with the parameters:

+ **DataLoader**:

+ `torch.utils.data.dataloader.DataLoader` of length 9661 with parameters:
  ```
+ {'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```

+ **Loss**:

+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```

+ Parameters of the fit()-Method:
+ ```
+ {
+     "epochs": 6,
+     "evaluation_steps": 5000,
+     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
+     "max_grad_norm": 1,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 2e-05
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 100,
+     "weight_decay": 0.01
+ }
  ```
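For orientation, a minimal sketch of how these parameters plug into the standard sentence-transformers training loop. The training pairs, the omission of the evaluator, and the output path are placeholder assumptions, not the actual GeoNames training data:

```python
# Hypothetical reconstruction of the training setup from the parameters
# above; train_examples and output_path are placeholders, not the real data.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/LaBSE")

# Positive pairs, e.g. alternate spellings of the same place name.
train_examples = [
    InputExample(texts=["Moscow", "Moskva"]),
    InputExample(texts=["Saint Petersburg", "Sankt-Peterburg"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives with the scale shown above; cos_sim is the default.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=6,
    warmup_steps=100,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
    output_path="LaBSE_geonames_v1",
)
```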
  ## Full Model Architecture

  ## Citing & Authors

+ <!--- Describe where people can find more information -->
 
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "/root/.cache/torch/sentence_transformers/sentence-transformers_LaBSE/",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.36.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 501153
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.0.0",
+     "transformers": "4.7.0",
+     "pytorch": "1.9.0+cu102"
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d8033b1f372db516d5abb6dbc48057e812be116c73e55e4a4298ba55e073a12
+ size 1883730160
modules.json ADDED
@@ -0,0 +1,26 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Dense",
+     "type": "sentence_transformers.models.Dense"
+   },
+   {
+     "idx": 3,
+     "name": "3",
+     "path": "3_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
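modules.json wires the checkpoint into the four-stage sentence-transformers pipeline listed in the README architecture: BERT encoder, CLS-token pooling, a 768→768 Dense layer with Tanh, and L2 normalization. A rough, illustrative sketch of what that stack computes follows; the randomly initialized `dense` here only stands in for the trained `2_Dense` weights, and `SentenceTransformer(...)` loads and runs all four modules for you:

```python
# Illustrative sketch of the four-module forward pass declared above; the
# untrained `dense` is a placeholder for the 2_Dense module, so use
# SentenceTransformer for real embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wasjaip/LaBSE_geonames_v1")
bert = AutoModel.from_pretrained("wasjaip/LaBSE_geonames_v1")
dense = torch.nn.Linear(768, 768)  # placeholder for the trained 2_Dense layer

inputs = tokenizer(["Moscow"], padding=True, truncation=True,
                   max_length=256, return_tensors="pt")
with torch.no_grad():
    token_emb = bert(**inputs).last_hidden_state  # (0) Transformer
    cls = token_emb[:, 0]                         # (1) Pooling: CLS token
    sent = torch.tanh(dense(cls))                 # (2) Dense + Tanh
    sent = F.normalize(sent, p=2, dim=1)          # (3) Normalize
print(sent.shape)  # torch.Size([1, 768])
```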
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92262b29204f8fdc169a63f9005a0e311a16262cef4d96ecfe2a7ed638662ed3
+ size 13632172
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
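A quick, hedged sanity check of this configuration, assuming only what the file above declares (a `BertTokenizer` with `[CLS]` = 101 and `[SEP]` = 102 in `added_tokens_decoder`):

```python
# Quick check of the tokenizer config above; the ids 101 and 102 come from
# added_tokens_decoder ([CLS] and [SEP]).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("wasjaip/LaBSE_geonames_v1")
ids = tok("Moscow")["input_ids"]
print(ids[0], ids[-1])  # 101 102 — every encoding is wrapped in [CLS] ... [SEP]
print(tok.tokenize("Москва"))  # do_lower_case=false keeps the original casing
```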
vocab.txt ADDED
The diff for this file is too large to render. See raw diff