Why does local inference differ from the API?
I am computing Jina v2 embeddings both via the transformers / sentence_transformers Python libraries and via the API (see https://jina.ai/embeddings/). With transformers I can run code along the lines of the model card:
from transformers import AutoModel
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings_1 = model.encode(sentences)
or
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings_2 = model.encode(sentences)
and the resulting embeddings_1 and embeddings_2 match.
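(For concreteness, by "match" I mean that a check along these lines passes; the tolerance is just an illustrative choice:)
import numpy as np
# each encode() call returns one vector per input sentence
assert np.allclose(embeddings_1, embeddings_2, atol=1e-5)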
However, if I use the Jina API, e.g. via
import requests
url = 'https://api.jina.ai/v1/embeddings'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer jina_123456...' # visit https://jina.ai/embeddings/ for an API key
}
data = {
'input': sentences,
'model': 'jina-embeddings-v2-base-en' # note that the model name matches
}
response = requests.post(url, headers=headers, json=data)
embeddings_3 = response.json()["data"][0]["embedding"]
the resulting embeddings_3 differs from the other two arrays, by around 2e-4 in absolute value on average. I see this discrepancy with both CPU and GPU runtimes.
What am I doing wrong? I also posted this very question on https://stackoverflow.com/questions/77875253/why-does-local-inference-differ-from-the-api-when-computing-jina-embeddings
A good catch. Actually, in our API we run the model's forward pass in half precision (fp16) for better cost-efficiency. I believe this is the source of the inconsistency.
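For a rough sense of scale (just an illustration, not our serving code): fp16 keeps far fewer mantissa bits than fp32, so every intermediate activation in the forward pass is rounded at roughly the 1e-3 relative level, and those errors accumulating across the layers would be consistent with a final gap on the order of 2e-4.
import numpy as np
# machine epsilon: the relative rounding granularity of each format
print(np.finfo(np.float16).eps)  # ~9.8e-04
print(np.finfo(np.float32).eps)  # ~1.2e-07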
Hi @numb3r3 , thanks for your prompt reply, that would explain it!
Just to be 100% sure, is there a way to prove this, e.g. is there code I can run with available model weights to match API results exactly?
Having half-precision models running locally would be neat, as we'd get
- a local equivalent of the Jina API
- the same performance/efficiency gains that you are enjoying at Jina
Hopefully it's not a cheeky ask :)
You can easily enjoy the fp16 optimization by running:
from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, dtype=torch.float16)
embeddings_1 = model.encode(sentences)
By the way, if you are interested in a private deployment of our API with all related optimizations such as fp16 support, I wanted to let you know that we also offer Jina AI through AWS SageMaker. This allows for optimized, on-premises deployment of our 8k embedding models. You can find more details here: https://jina.ai/news/jina-ai-8k-embedding-models-hit-aws-marketplace-for-on-prem-deployment/
Hey, thanks again (and sorry for cross-posting on StackOverflow; I'll make sure the conclusion of this discussion ends up there as well).
I tried this following your advice:
from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token="hf_my_authorized_token", torch_dtype=torch.float16)
embeddings_1 = model.encode(sentences)
i.e. slightly modifying the from_pretrained call, because dtype (as in your original response) is not an expected kwarg for JinaBertModel.__init__().
However, the snippet above throws RuntimeError: "LayerNormKernelImpl" not implemented for 'Half', which is the same error I get with
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token = "hf_my_authorized_token")
model.half()
embeddings_1 = model.encode(sentences)
using transformers==4.35.2. So this approach doesn't seem to work for me :/
Thanks a ton also for the AWS suggestion, it's not the cloud provider I am using atm but it's good to know!
Hello,
To effectively utilize fp16 precision with CUDA for the model, can you try the following snippet?
from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)
Let me know if this works for you or if there's anything else I can assist with!
Hi @ziniuyu , thank you! Indeed your snippet runs without problems (it just requires pip install accelerate), and it's great to be able to run at half precision!
Still,
from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)
will give me embeddings that do not match what I get from the same model via https://api.jina.ai/v1/embeddings (you can try that yourself). The difference is again on the order of 2e-4: at fp16 I get embeddings that differ slightly from the full-precision ones, but the numbers still aren't the same as the API's. So I couldn't verify whether @numb3r3 's explanation for the discrepancy is the right one, and I wouldn't close this yet.
Is the API using a slightly different checkpoint/version maybe?
@davidefiocco I don't think so, we're using the same model, at least for jina-v2.
Hi @bwang0911 , thanks again for your answer. Here's a reproducible example:
from transformers import AutoModel
import torch
import requests
import numpy as np
sentences = ['How is the weather today?']
url = 'https://api.jina.ai/v1/embeddings'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer jina_123456...' # change it here
}
data = {
'input': sentences,
'model': 'jina-embeddings-v2-base-en' # note that the model name matches that of the HF model below
}
response = requests.post(url, headers=headers, json=data)
embeddings_api = response.json()["data"][0]["embedding"]
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_hf = model.encode(sentences)
print(np.abs(embeddings_api - embeddings_hf).mean())
The average absolute difference comes out around 0.0003: close, but still not as close as I would have expected.
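(A complementary check, to tell numerical noise from a genuinely different model, would be to look at the cosine similarity of the two vectors rather than their absolute differences; a sketch using the variables from the snippet above:)
api = np.asarray(embeddings_api, dtype=np.float32).ravel()
hf = np.asarray(embeddings_hf, dtype=np.float32).ravel()
cosine = np.dot(api, hf) / (np.linalg.norm(api) * np.linalg.norm(hf))
print(cosine)  # a value very close to 1 would point to precision noise rather than a different checkpoint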
from sentence_transformers import SentenceTransformer
sentences = 'This framework generates embeddings for each sentence'
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings = model.encode(sentences)
print(embeddings)
Why is this message being displayed when I run the code above? Do I need to train the model before using it, or is it already trained to convert text into vectors?
# use the latest sentence_transformers: pip install -U sentence_transformers
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
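To be clear, no training is needed: the checkpoint is already trained to turn text into vectors. A minimal sketch of how you could sanity-check that (the sentences are just arbitrary examples):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantically close sentences should score high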