Why does local inference differ from the API?

#38
by davidefiocco - opened

I am computing Jina v2 embeddings via Python libraries (transformers and sentence-transformers) and via the API (see https://jina.ai/embeddings/).

With transformers, I can run code along the lines of the model card:

from transformers import AutoModel

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings_1 = model.encode(sentences)

or

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings_2 = model.encode(sentences)

and the resulting embeddings_1 and embeddings_2 match.
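
To check, I compare the arrays directly (assuming numpy is installed, and reusing embeddings_1 and embeddings_2 from the snippets above):

import numpy as np

# the two local snippets should agree to floating-point tolerance
print(np.allclose(embeddings_1, embeddings_2, atol=1e-6))  # True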

However, if I use the Jina API, e.g. via

import requests

url = 'https://api.jina.ai/v1/embeddings'

headers = {
  'Content-Type': 'application/json',
  'Authorization': 'Bearer jina_123456...' # visit https://jina.ai/embeddings/ for an API key
}

data = {
  'input': sentences,
  'model': 'jina-embeddings-v2-base-en' # note that the model name matches
}

response = requests.post(url, headers=headers, json=data)
embeddings_3 = response.json()["data"][0]["embedding"]

embeddings_3 differs from the other two arrays by a small amount, around 2e-4 in absolute value on average. I see this discrepancy with both CPU and GPU runtimes.
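
Concretely, this is roughly how I measure the discrepancy (a sketch reusing embeddings_1 and embeddings_3 from the snippets above):

import numpy as np

# mean absolute difference between the API embedding and the local one
diff = np.abs(np.asarray(embeddings_3) - embeddings_1[0])
print(diff.mean())  # ~2e-4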

What am I doing wrong? I also posted this very question on https://stackoverflow.com/questions/77875253/why-does-local-inference-differ-from-the-api-when-computing-jina-embeddings

Jina AI org

Good catch. In our API we run the model's forward pass in half precision (fp16) for better cost efficiency. I believe this is the source of the inconsistency.
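
As a rough sanity check of the magnitude: fp16 keeps about 3 decimal digits of precision, so a single fp32 -> fp16 cast of embedding-sized values introduces round-off on the order of 1e-5 per component, and error accumulated across many fp16 layers can plausibly reach a few 1e-4. A minimal sketch (the 0.05 scale is an arbitrary stand-in for typical component magnitudes):

import numpy as np

# round-off introduced by a single fp32 -> fp16 cast
rng = np.random.default_rng(0)
x = rng.normal(scale=0.05, size=768).astype(np.float32)
err = np.abs(x - x.astype(np.float16).astype(np.float32))
print(err.mean())  # on the order of 1e-5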

Hi @numb3r3, thanks for your prompt reply; that would explain it!

Just to be 100% sure, is there a way to prove this, e.g. is there code I can run with available model weights to match API results exactly?

Having half-precision models running locally would be neat, as we'd get

  1. a local equivalent of the Jina API
  2. the same performance/efficiency gains that you are enjoying at Jina

Hopefully it's not a cheeky ask :)

Jina AI org

You can easily enable the fp16 optimization:

from transformers import AutoModel
import torch

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, dtype=torch.float16)
embeddings_1 = model.encode(sentences)

By the way, if you are interested in a private deployment of our API with all related optimizations, such as fp16 support: we also offer Jina AI through AWS SageMaker, which allows for optimized on-premises deployment of our 8k embedding models. You can find more details here: https://jina.ai/news/jina-ai-8k-embedding-models-hit-aws-marketplace-for-on-prem-deployment/

Hey, thanks again (and sorry for cross-posting on Stack Overflow; I'll make sure the conclusion of this discussion ends up there as well).

I tried this following your advice:

from transformers import AutoModel
import torch

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token="hf_my_authorized_token", torch_dtype=torch.float16)
embeddings_1 = model.encode(sentences)

i.e. slightly modifying the from_pretrained call because dtype (as in your original response) is not an expected kwarg for JinaBertModel.__init__().

However, the snippet above throws a RuntimeError: "LayerNormKernelImpl" not implemented for 'Half', which is the same error I get with

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token = "hf_my_authorized_token")
model.half()
embeddings_1 = model.encode(sentences)

using transformers==4.35.2. So this approach doesn't seem to work for me :/

Thanks a ton also for the AWS suggestion; it's not the cloud provider I'm using atm, but it's good to know!

Jina AI org
edited Jan 31

Hello,

To use fp16 you need to run on CUDA: the "LayerNormKernelImpl" not implemented for 'Half' RuntimeError appears because PyTorch ships no half-precision LayerNorm kernel for CPU. Can you try the following snippet?

from transformers import AutoModel
import torch

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)

Let me know if this works for you or if there's anything else I can assist with!

Hi @ziniuyu, thank you!

Indeed, your snippet runs without problems (it just requires pip install accelerate), and it's great to be able to run at half precision!

Still

from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)

will give me embeddings that do not match what I get from the same model via https://api.jina.ai/v1/embeddings (you can try this yourself). The difference is again on the order of 2e-4. In other words, fp16 gives me embeddings slightly different from the full-precision ones, but the numbers are still not those of the API. So I couldn't verify whether @numb3r3's explanation for the discrepancy is the right one, and I wouldn't close this yet.
Is the API perhaps using a slightly different checkpoint/version?

Jina AI org

@davidefiocco I don't think so; we're using the same model, at least for jina-v2. Keep in mind that bit-exact fp16 results are generally hard to reproduce across setups anyway: the numbers depend on the GPU, the kernel and attention implementation, and batching, so a residual difference at the 1e-4 level doesn't by itself point to a different checkpoint.

Hi @bwang0911, thanks again for your answer. Here's a reproducible example:

from transformers import AutoModel
import torch
import requests
import numpy as np

sentences = ['How is the weather today?']

url = 'https://api.jina.ai/v1/embeddings'

headers = {
  'Content-Type': 'application/json',
  'Authorization': 'Bearer jina_123456...' # change it here
}

data = {
  'input': sentences,
  'model': 'jina-embeddings-v2-base-en' # note that the model name matches that of the HF model below
}

response = requests.post(url, headers=headers, json=data)
embeddings_api = response.json()["data"][0]["embedding"]

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_hf = model.encode(sentences)

print(np.abs(embeddings_api - embeddings_hf).mean())

The printed mean absolute difference is around 0.0003: close, but still not as close as I would have expected.
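
For completeness, the two vectors are still nearly parallel (a quick check, reusing embeddings_api and embeddings_hf from the snippet above):

# cosine similarity between the API and local embeddings
a = np.asarray(embeddings_api, dtype=np.float32)
b = np.asarray(embeddings_hf, dtype=np.float32).ravel()
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # very close to 1.0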

from sentence_transformers import SentenceTransformer

sentences = 'This framework generates embeddings for each sentence'

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings = model.encode(sentences)

print(embeddings)

[screenshot of a warning message shown when loading the model]

Why is this message being displayed? Do I need to train the model to use it, or is it already trained to convert the text into vectors?

Jina AI org

@shivpatel117

# use the latest sentence-transformers: pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

No training is needed: the checkpoint is already trained, and trust_remote_code=True simply lets sentence-transformers load the model's custom implementation from the Hub.

It works. @bwang0911 Appreciate the help!

bwang0911 changed discussion status to closed
