---
license: other
license_name: open-aleph-license
license_link: LICENSE
---

# Model Card for Pharia-1-Embedding-4608-control

This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control is built on top of Pharia-1-LLM-7B-control.
For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
instruction fine-tuning and resource usage, we refer to the model card of Pharia-1-LLM-7B-control.
Because it was trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized
embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
Furthermore, it shows good cross-lingual performance, so instructions and the text to be embedded can be written
in different languages. The finetuning was always performed using English instructions.


## Model Overview

- **Developed by:** Aleph Alpha Research
<!--- **Funded by [optional]:** [More Information Needed]-->
<!--- **Shared by [optional]:** [More Information Needed]-->
- **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
  instruction-tuning (inspired by the approach of GritLM).
- **Language(s) (NLP):** Trained on English, German, French, Spanish.
<!--- **License:** [More Information Needed]-->
<!--- **Finetuned from model [optional]:** [More Information Needed]-->


### Model Description


|Model                           |Embedding Size|Description|
|--------------------------------|--------------|-----------|
|Pharia-1-Embedding-4608-control |4608|Pharia-1-Embedding-4608-control is an Embedding model optimized for German, French and Spanish and designed for customizable embeddings at runtime via instructions (prompts)|

<!-- Provide a longer summary of what this model is. -->

### Model Access

We provide access to our models through the channels listed below.
- On-premise installation: Our customers are supplied with our full LLM and embedding model stack, including model weights
and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment.
We provide our customers with open access to our full model checkpoint, including weights and code, for commercial use.
Please refer to the changelog for updates to the models served. We do not deprecate officially released versions
of old model generations when we release newer versions, so users can continue to access previously released models.
No prompt data is stored when using our systems, which means that we do not
collect PII (personally identifiable information) for any of our public API users, as detailed in our Terms & Conditions.
We do not log user inputs to the models. We do not train on user data.
- **Note:** The same models are made available to users regardless of their geographic location
or input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply.
The same offering is provided to all countries within and outside the European Union, provided no legal restrictions apply.


<!-- Provide the basic links for the model. 

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
-->

### Intended Use

Pharia-1-Embedding-4608-control is intended to be deployed as a component of AI systems or applications.
Use-cases and the model's capabilities include, but are not limited to: information retrieval, semantic search, re-ranking and clustering.


#### Out-of-Scope Use

Pharia-1-Embedding-4608-control is not to be used for illegal or unlawful actions of any kind and with any illegal 
or unlawful content. This includes in particular prohibited activities such as engaging in terrorism, 
violence, human trafficking, illegal distribution of materials to minors, sexual solicitation, any other 
criminal activities, harassment, discrimination, creating or promoting malicious code or activities risking death or harm, 
including those related to military or nuclear applications, and activities not in compliance with sanction regimes, 
technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards. 
The utilization of our technology is always governed by, and may be limited in accordance with, 
our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
our dedicated contact address [email protected].

Customers and partners can also use our
[ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.


### Use limitations

Beyond the risks & limitations stated in
the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
Pharia-1-Embedding-4608-control has been optimized for embedding
computation only. Therefore, we do not recommend using it for text generation purposes.

## How to Use

### Use with scaling inference code base

To perform inference with the original model files, you’ll first need to install the 
[Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file. 
After installation, download the model weights and use the Scaling inference module to load the 
checkpoint, vocabulary, and configuration files.

```
from pathlib import Path
from torch.nn import CosineSimilarity
from scaling.transformer.inference import TransformerInferenceModule
MODEL_PATH = "/path/to/model"
inference_model = TransformerInferenceModule.from_checkpoint(
    checkpoint_dir=Path(MODEL_PATH),
)
# embed the query:
query = "Which country is Galileo from?"
query_embeddings = inference_model.encode_queries(query, convert_to_tensor=True)
print(f"Type of embeddings: {type(query_embeddings)},\n\
       shape of query embeddings: {query_embeddings.shape}")
# embed the documents:
document_1 =  "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
document_embeddings_1 = inference_model.encode_corpus(document_1, convert_to_tensor=True)
document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
document_embeddings_2 = inference_model.encode_corpus(document_2, convert_to_tensor=True)
# customized embeddings steering the query:
instruction = "Represent the question about TV shows to find a paragraph that answers it."
steered_query_embeddings = inference_model.encode_queries(query, 
                                                          instruction=instruction,
                                                          convert_to_tensor=True)
# compute similarity between steered query and both documents
cossim = CosineSimilarity(dim=1, eps=1e-6)
sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
print("Steered embedding causes higher similarity of query to TV show:")
print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
```

### Explanation of the instruct embedding code example

Pharia-1-Embedding-4608-control is useful whenever the similarity or relevance between text fragments needs to be
estimated, for example in information retrieval, semantic search, re-ranking and clustering.
We use the task of information retrieval as a guiding example where we assume the
following query: “Which country is Galileo from?” and two documents:
- Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql.
- Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy.

Source: Wikipedia

For our guiding example we assume the context of this use-case is a question-answering system for movies and TV shows.

**Step 1:**

Embed the Query
```
"input": "Which country is Galileo from?"
```
→ Embedding: ```[-0.6780134, 0.61449033, 0.102911085, ...]```

**Step 2:**

Embed the Documents

```"input": "Galileo is a German television program series ..."```

→ Embedding: ```[-0.36119246, 0.7793595, -0.38735497, ...]```

```"input": "Galileo di Vincenzo Bonaiuti de' Galilei ..."```

→ Embedding: ```[-0.25108248, 1.0496024, -0.20945309, ...]```

**Step 3:**

Compare the similarity

A typical similarity measure between vectors is cosine similarity. Higher numbers
indicate more similar vectors and by extension capture the concept of relevance.
In a RAG application these scores determine the ranking during the retrieval step.
In this example, we obtain the following cosine similarities:
- Query vs. German TV show: ~0.661
- Query vs. Italian polymath: ~0.757

This implies that the paragraph about the Italian polymath would be ranked higher than the paragraph
about the German TV show, which is the one we’re interested in.
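
The comparison can be reproduced in code. Below is a minimal sketch that reuses the `query_embeddings`,
`document_embeddings_1` and `document_embeddings_2` tensors from the code example above; the exact values depend on
the checkpoint and may differ slightly from the numbers quoted here.

```
# Minimal sketch of Step 3, reusing the tensors computed in the earlier code example.
from torch.nn import CosineSimilarity

cossim = CosineSimilarity(dim=1, eps=1e-6)
sim_tv_show = cossim(document_embeddings_1, query_embeddings).item()
sim_polymath = cossim(document_embeddings_2, query_embeddings).item()
print(f"Query vs. German TV show:   {round(sim_tv_show, 3)}")   # ~0.661
print(f"Query vs. Italian polymath: {round(sim_polymath, 3)}")  # ~0.757
```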

#### Customized Embeddings

To further improve performance you can use instructions to steer the model. Instructions can help the model 
understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case. 
In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher 
than the paragraph about the Italian polymath.

**Step 1:**

Embed the Query with an Instruction

```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```

```"input": "Which country is Galileo from?"```

→ Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```

**Step 2:**

Compare the similarity

We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
- Query vs. German TV show: ~0.632
- Query vs. Italian polymath: ~0.512

These new cosine similarities imply that the ranking has indeed changed and the paragraph about the German TV show is
**now more relevant**. This shows that instructions can help the model understand nuances in the data better
and ultimately lead to embeddings that are more useful for your use-case.

#### Tips on using the model

- First try and ideally evaluate the model on your data without instructions to see whether performance aligns with your expectations out-of-the-box
- If you decide to use an instruction with the aim of further boosting performance, we suggest using this template as a guideline
  * ```Template: Represent the [X] to find a [Y] that [describe how the X and Y relate]```
  * Examples
    1. Represent the newspaper paragraph to find a newspaper paragraph with the same topic
    2. Represent the sentence to find another sentence with the same meaning
- In cases where the two texts to compare are different in nature (e.g. query and document) – also called “asymmetric” – we suggest first adding an instruction to query texts only. Again, try and ideally evaluate the model in this setting. Then, if your aim is to further boost performance, we suggest adding instructions to document texts as well, with [X] and [Y] flipped accordingly (see the sketch below).
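
The sketch below illustrates this asymmetric setup, building on the earlier code example (it assumes the `inference_model` object from above). Whether `encode_corpus` accepts an `instruction` argument in the same way as `encode_queries` is an assumption here, so treat the document-side instruction as an optional second step and verify it against the Scaling library documentation.

```
# Sketch of the asymmetric setup from the tips above, reusing `inference_model`
# from the earlier code example.
# Query-side instruction, following the suggested template:
query_instruction = "Represent the question to find a paragraph that answers it."
query_embeddings = inference_model.encode_queries(
    "Which country is Galileo from?",
    instruction=query_instruction,
    convert_to_tensor=True,
)

# Optional second step: also steer the document embeddings, with [X] and [Y] flipped.
# NOTE: passing `instruction` to encode_corpus is an assumption; check the Scaling
# library documentation before relying on it.
document_instruction = "Represent the paragraph to find a question that it answers."
document_embeddings = inference_model.encode_corpus(
    "Galileo di Vincenzo Bonaiuti de' Galilei ...",
    instruction=document_instruction,
    convert_to_tensor=True,
)
```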

















## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]