Luke Merrick committed
Commit 286cfcf · 1 Parent(s): ca5f86e
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": true,
+ "pooling_mode_mean_tokens": false,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
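For reference, this configuration enables CLS-token pooling only. A minimal sketch of what that pooling step does, assuming a plain Hugging Face `AutoModel` forward pass (model name as used in the README below):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of CLS-token pooling as configured above (pooling_mode_cls_token: true).
tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-e5-base")
model = AutoModel.from_pretrained("Snowflake/snowflake-arctic-e5-base")

tokens = tokenizer(["passage: The Data Cloud!"], return_tensors="pt")
with torch.inference_mode():
    last_hidden_state = model(**tokens).last_hidden_state  # shape: (batch, seq_len, 768)
embeddings = last_hidden_state[:, 0]  # take the [CLS] token rather than averaging over tokens
```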
README.md CHANGED
@@ -1,3 +1,109 @@
- ---
- license: apache-2.0
- ---
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - arctic
+ license: cc-by-nc-4.0
+ ---
+
+ # E5 Base, Arctic Edition
+
+ This model is the result of the Arctic Embed [walkthrough example](https://github.com/snowflakedb/ArcticTraining/blob/main/projects/arctic_embed/examples/finetune_models/README.md) for training embedding models using the [open-source Arctic Embed codebase](https://github.com/snowflakedb/ArcticTraining/blob/main/projects/arctic_embed/). In the walkthrough, we fine-tune the [`e5-base-unsupervised`](https://huggingface.co/intfloat/e5-base-unsupervised) model using an improved dataset that leverages modern hard-negative mining practices and includes three additional high-quality retrieval datasets beyond those used in the original E5 fine-tuning pipeline.
+
+ | Model | BEIR Score (nDCG@10) | CLEF English (nDCG@10) |
+ |:--------------------|-----------------------:|-------------------------:|
+ | e5-base-v2 | 50.19 | 45.38 |
+ | arctic-e5-base | 54.70 | 52.77 |
+ | gte-base-en-v1.5 | 54.02 | 47.91 |
+ | arctic-embed-m-v1.0 | 54.89 | 47.62 |
+ | arctic-embed-m-v2.0 | 55.38 | 54.06 |
+
+ **NOTE: This model was trained as an example and heavily leverages in-domain datasets from the data sources used by the BEIR benchmark. Though it performs well on the CLEF English dataset, it may be substantially overfit to the domains of the BEIR benchmark and may not generalize well to certain applications.**
+
+ ## Usage
+
+
+ ### Using Sentence Transformers
+
+ You can use the sentence-transformers package to run this model, as shown below.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("Snowflake/snowflake-arctic-e5-base")
+
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)
+
+ scores = query_embeddings @ document_embeddings.T
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
+ Produces:
+ ```
+ Query: what is snowflake?
+ 0.2747492 The Data Cloud!
+ 0.19998045 Mexico City of Course!
+ Query: Where can I get the best tacos?
+ 0.29974818 Mexico City of Course!
+ 0.2344071 The Data Cloud!
+ ```
+
+ ### Using Hugging Face Transformers
+
+ You can use the transformers package to run the model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion (not mean pooling) and use the standard E5 query and document prefixes shown below.
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-e5-base')
+ model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-e5-base')
+ model.eval()
+
+ query_prefix = 'query: '
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ queries_with_prefix = ["{}{}".format(query_prefix, q) for q in queries]
+ query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ document_prefix = 'passage: '
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+ documents_with_prefix = ["{}{}".format(document_prefix, d) for d in documents]
+ document_tokens = tokenizer(documents_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ # Compute token embeddings and keep only the CLS token embedding for each text
+ with torch.inference_mode():
+     query_embeddings = model(**query_tokens)[0][:, 0]
+     document_embeddings = model(**document_tokens)[0][:, 0]
+
+
+ # Normalize embeddings
+ query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
+ document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)
+
+ scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
+
+ ## License
+
+
+ Arctic is licensed under the [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.
+
+ <img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=6ad53892-f1e7-4d3a-a135-60ca6264a7aa" />
biencoder_config.json ADDED
@@ -0,0 +1,3 @@
+ {
+ "pooling": "first_token"
+ }
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "_name_or_path": "intfloat/e5-base-unsupervised",
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "classifier_dropout": null,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 0,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.47.0",
+ "type_vocab_size": 2,
+ "use_cache": true,
+ "vocab_size": 30522
+ }
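Since `torch_dtype` above is `bfloat16`, the checkpoint weights are stored in half precision. A hedged sketch of loading the encoder in that stored precision (loading in float32, the usual transformers default, works just as well):

```python
import torch
from transformers import AutoModel

# Sketch: load the BERT encoder in the checkpoint's stored bfloat16 precision.
model = AutoModel.from_pretrained(
    "Snowflake/snowflake-arctic-e5-base", torch_dtype=torch.bfloat16
)
```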
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "__version__": {
+ "sentence_transformers": "2.7.0.dev0",
+ "transformers": "4.39.3",
+ "pytorch": "2.1.0+cu121"
+ },
+ "prompts": {
+ "query": "query: ",
+ "document": "passage: "
+ },
+ "default_prompt_name": "document"
+ }
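The `prompts` and `default_prompt_name` entries above are what let the README's sentence-transformers example work without manual prefixes. A minimal sketch of how they are applied (behavior as documented for recent sentence-transformers releases):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-e5-base")

# prompt_name="query" selects the "query: " prefix from the prompts map above.
query_embeddings = model.encode(["what is snowflake?"], prompt_name="query")

# With no prompt_name, default_prompt_name="document" applies the "passage: " prefix.
document_embeddings = model.encode(["The Data Cloud!"])
```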
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:172d385d66a8e09d15b4185f4224c55a4118837ae84103f32e0072a1cabf45f6
+ size 218986928
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ },
+ {
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
+ }
+ ]
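The three modules listed above form the pipeline Transformer → Pooling → Normalize. As a rough sketch, the same pipeline could be assembled by hand with the sentence-transformers module API (pooling mode and sequence length taken from the other configs in this commit):

```python
from sentence_transformers import SentenceTransformer, models

# Sketch: rebuild the module pipeline described in modules.json explicitly.
transformer = models.Transformer("Snowflake/snowflake-arctic-e5-base", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
```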
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 512,
+ "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+ "cls_token": "[CLS]",
+ "mask_token": "[MASK]",
+ "pad_token": "[PAD]",
+ "sep_token": "[SEP]",
+ "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "100": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "101": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "102": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "103": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "clean_up_tokenization_spaces": false,
+ "cls_token": "[CLS]",
+ "do_lower_case": true,
+ "extra_special_tokens": {},
+ "mask_token": "[MASK]",
+ "model_max_length": 512,
+ "pad_token": "[PAD]",
+ "sep_token": "[SEP]",
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff