peralp24 committed 5f23e11 (verified) · Parent(s): 8c2d5e5

Update README.md

Files changed (1)
  1. README.md +52 -1
README.md CHANGED
@@ -145,8 +145,59 @@ For our guiding example we assume the context of this use-case is a Question-Ans
  **Step 1:**

  Embed the Query
+ ```
  "input": "Which country is Galileo from?"
- → Embedding: [-0.6780134, 0.61449033, 0.102911085, ...]
+ ```
+ → Embedding: ```[-0.6780134, 0.61449033, 0.102911085, ...]```
+
+ **Step 2:**
+
+ Embed the Documents
+ "input": "Galileo is a German television program series ..."
+ → Embedding: ```[-0.36119246, 0.7793595, -0.38735497, ...]```
+ "input": "Galileo di Vincenzo Bonaiuti de' Galilei ..."
+ → Embedding: ```[-0.25108248, 1.0496024, -0.20945309, ...]```
+
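To make Steps 1 and 2 concrete, here is a minimal, self-contained sketch of the request payloads involved. The `"input"` field mirrors the examples above; how the payloads are actually sent to the model is omitted, since that depends on your client.

```python
import json

# The request bodies used in Step 1 and Step 2 are plain JSON objects with a
# single "input" field; the model returns one embedding vector per request.
query_request = {"input": "Which country is Galileo from?"}
document_requests = [
    {"input": "Galileo is a German television program series ..."},
    {"input": "Galileo di Vincenzo Bonaiuti de' Galilei ..."},
]

# Serialised form of what an embedding endpoint would receive:
print(json.dumps(query_request, ensure_ascii=False))
for request in document_requests:
    print(json.dumps(request, ensure_ascii=False))

# Each response is a list of floats such as [-0.6780134, 0.61449033, 0.102911085, ...]
```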
+ **Step 3:**
+
+ Compare the similarity
+ A typical similarity measure between vectors is cosine similarity. Higher numbers
+ indicate more similar vectors and by extension capture the concept of relevance.
+ In a RAG application these scores determine the ranking during the retrieval step.
+ In this example, we obtain the following cosine similarities:
+ Query vs. German TV show: ~0.661
+ Query vs. Italian polymath: ~0.757
+ This implies that the paragraph about the Italian polymath would be ranked higher than the paragraph
+ about the German TV show, which is the one we’re interested in.
+
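As an illustration of the ranking logic, the sketch below computes cosine similarity with NumPy. The embeddings above are truncated, so placeholder three-dimensional vectors are used here; the printed scores (and possibly their ordering) will therefore differ from the ~0.661 and ~0.757 obtained with the full vectors.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for the (truncated) embeddings shown above.
query_emb = np.array([-0.6780134, 0.61449033, 0.102911085])
tv_show_emb = np.array([-0.36119246, 0.7793595, -0.38735497])
polymath_emb = np.array([-0.25108248, 1.0496024, -0.20945309])

scores = {
    "German TV show": cosine_similarity(query_emb, tv_show_emb),
    "Italian polymath": cosine_similarity(query_emb, polymath_emb),
}

# Rank documents by similarity, highest first; this is the retrieval ranking in a RAG setup.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```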
+ #### Customized Embeddings
+
+ To further improve performance, you can use instructions to steer the model. Instructions can help the model
+ understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
+ In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV show higher
+ than the paragraph about the Italian polymath.
+ **Step 1:**
+ Embed the Query with an Instruction
+ ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
+ ```"input": "Which country is Galileo from?"```
+ → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
+ **Step 2:**
+ Compare the similarity
+ We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
+ Query vs. German TV show: ~0.632
+ Query vs. Italian polymath: ~0.512
+ These new cosine similarities imply that the ranking has indeed changed and the paragraph about the German TV show is now more relevant. This shows that instructions can help the model understand nuances in the data better and ultimately lead to embeddings that are more useful for your use-case.
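For illustration, this is how the instructed query request from Step 1 could be assembled; the `"instruction"` and `"input"` fields mirror the example above, and the document requests are left unchanged so their embeddings stay the same.

```python
import json

# The instructed query request carries both fields shown above; document
# requests stay as plain {"input": ...} payloads, so their embeddings are unchanged.
instructed_query_request = {
    "instruction": "Represent the question about TV shows to find a paragraph that answers it.",
    "input": "Which country is Galileo from?",
}

print(json.dumps(instructed_query_request, ensure_ascii=False, indent=2))
# The model then returns a different query embedding, e.g. [-0.6310919, 1.4309896, ...],
# which is compared against the unchanged document embeddings exactly as in Step 3 above.
```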
+ #### Tips on using the model
+
+ - First try and ideally evaluate the model on your data without instructions to see whether performance aligns with your expectations out of the box.
+ - If you decide to use an instruction with the aim of further boosting performance, we suggest using this template as a guideline:
+   - Template: Represent the [X] to find a [Y] that [describe how X and Y relate]
+   - Examples:
+     - Represent the newspaper paragraph to find a newspaper paragraph with the same topic
+     - Represent the sentence to find another sentence with the same meaning
+ - In cases where the two texts to compare are different in nature (e.g. query and document), also called “asymmetric”, we suggest first adding an instruction to query texts only. Again, try and ideally evaluate the model in this setting. Then, if your aim is to further boost performance, we suggest adding instructions to document texts as well, with [X] and [Y] flipped accordingly (see the sketch below).
+
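To make the asymmetric setup concrete, the sketch below fills the template for both the query side and the document side of a query/document pair. The exact wording of the two instructions is an illustrative guess based on the template, not an officially recommended prompt.

```python
# Filling the instruction template for an asymmetric (query vs. document) use case.
# Both strings follow "Represent the [X] to find a [Y] that [describe how X and Y relate]";
# the exact wording here is illustrative only.
TEMPLATE = "Represent the {x} to find a {y} that {relation}"

# Query side: add the instruction to queries first and evaluate.
query_instruction = TEMPLATE.format(
    x="question about TV shows",
    y="paragraph",
    relation="answers it",
)

# Document side (optional second step): [X] and [Y] are flipped accordingly.
document_instruction = TEMPLATE.format(
    x="paragraph about TV shows",
    y="question",
    relation="it answers",
)

query_request = {"instruction": query_instruction, "input": "Which country is Galileo from?"}
document_request = {"instruction": document_instruction, "input": "Galileo is a German television program series ..."}

print(query_request)
print(document_request)
```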