Jina Clip V2: Inconsistent Embeddings
A concerning observation: changing the Jina embedding batch size slightly changes the embeddings.
emb(A): a1
emb([A,A]): [a2,a2]
emb([A,A,A]): [a3,a3,a3]
emb([A,B]): [a4, b4]
a1, a2, a3, and a4 are very similar, but not identical. The cosine similarities among a1, a2, a3, and a4 are close to 1.
This happens with/without xformers.
I don't think attention causes this, but does anyone have an idea? Is this expected?
The embeddings compared are the full 1024-dimensional vectors and are not normalized.
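For reference, the comparison was along these lines (a minimal sketch; a1 and a4 here are random placeholders for the tensors above, not actual model outputs):

import torch
import torch.nn.functional as F

# Placeholder 1024-dim embeddings standing in for a1 and a4 above (not normalized)
a1 = torch.randn(1024)
a4 = a1 + 1e-4 * torch.randn(1024)  # slightly perturbed copy, for illustration only

print(F.cosine_similarity(a1, a4, dim=0))  # close to 1
print(torch.equal(a1, a4))                 # False
print((a1 - a4).abs().max())               # small but non-zero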
Hey @lssatvik, I need to take a closer look! Can you share a code snippet so I can reproduce this?
from transformers import AutoModel
import torch

# Load the model on CPU
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True).to("cpu")
img = "https://www.shutterstock.com/image-photo/beautiful-sunset-wave-vibrant-translucent-600nw-2255651435.jpg"

# Encode the same image with batch size 1 and with batch size 2
x = model.encode_image([img], batch_size=1, convert_to_numpy=False, normalize_embeddings=False)
y = model.encode_image([img, img], batch_size=2, convert_to_numpy=False, normalize_embeddings=False)

# Compare the first embedding from each call -- prints False here
print(torch.equal(x[0], y[0]))
transformers==4.47.1
torch==2.5.1
Top-level versions, in case it is an issue with dependencies.
Hey @lssatvik, after taking a closer look, I would say this is expected. The model weights are in bf16, so when the model is moved to CPU they are cast to fp32, and this causes slight variations in the embeddings when the batch size changes. If you run in bf16 or fp16, the embeddings should be identical.
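For instance, a sketch of keeping the weights in bf16 on GPU (only torch_dtype and the device change relative to the snippet above):

import torch
from transformers import AutoModel

# Load the checkpoint weights in bf16 instead of the default fp32 cast
model = AutoModel.from_pretrained(
    "jinaai/jina-clip-v2", trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")
print(next(model.parameters()).dtype)  # torch.bfloat16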
But @gmastrapas, my understanding is that the conversion from bf16 to fp32 is done at the vector level, not the batch level. Even if there are variations, those variations would not depend on the other vectors in the batch, no?
bf_16_to_fp_32([x,x]) must be equal to [bf_16_to_fp_32(x), bf_16_to_fp_32(x)].
In my sample code, I haven't converted them to fp32. They remain bf16 tensors stored on CUDA. They are not the same even at this stage, so the conversion couldn't be the cause.
In the sample code, you are moving the model to the CPU. BFloat16 and Float16 are not supported on CPU, so when moving to CPU the model weights automatically go to Float32.
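A quick way to check what the weights actually end up as (a sketch that just inspects a parameter dtype before and after the move):

from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)
print(next(model.parameters()).dtype)  # dtype as loaded
model = model.to("cpu")
print(next(model.parameters()).dtype)  # dtype after the move to CPU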
import torch

# Two random bf16 vectors, cast to fp32 individually vs. as one concatenated batch
a = torch.randn(1, 512, dtype=torch.bfloat16)
b = torch.randn(1, 512, dtype=torch.bfloat16)
x = a.to(torch.float32)
y = b.to(torch.float32)
z = torch.concat((a, b)).to(torch.float32)

# The cast is element-wise, so both comparisons print True
print(torch.equal(x.flatten(), z[0].flatten()), torch.equal(y.flatten(), z[1].flatten()))
I can't see batch size causing a difference in the output here. Could you share some resource or code that demonstrates this phenomenon of different output depending on batch size?
Didn't see your earlier comment. I tested entirely on GPU (cuda==12.4).
It's still unequal.
In the initial sample code, the model ran on CPU, agreed, but the predictions were done afterwards. All predictions are happening in fp32, and this still should not depend on the batch, but solely on the vector.
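As a side check of that assumption, here is a toy sketch with a plain linear layer (nothing Jina-specific; whether the two outputs match bitwise can depend on the GPU and which kernels get selected for each batch size):

import torch

torch.manual_seed(0)
lin = torch.nn.Linear(1024, 1024).to("cuda")
x = torch.randn(1, 1024, device="cuda")

with torch.no_grad():
    out1 = lin(x)                    # the row on its own (batch size 1)
    out2 = lin(x.repeat(2, 1))[:1]   # the same row inside a batch of 2

print(torch.equal(out1, out2), (out1 - out2).abs().max())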
Can you check your model.dtype when on CUDA?
That is a bit strange. For me, in bf16 and fp16 the tensors are equal, and when using fp32 there is a max abs difference of tensor(4.2915e-06, device='cuda:0').
What GPU are you using? Are you using xformers for the image encoder? Can you calculate the max absolute difference across dtypes using torch.abs(x[0] - y[0]).max()?
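Something along these lines would help (same encode_image calls as in your snippet; the dtype loop is just for illustration):

import torch
from transformers import AutoModel

img = "https://www.shutterstock.com/image-photo/beautiful-sunset-wave-vibrant-translucent-600nw-2255651435.jpg"

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    model = AutoModel.from_pretrained(
        "jinaai/jina-clip-v2", trust_remote_code=True, torch_dtype=dtype
    ).to("cuda")
    x = model.encode_image([img], batch_size=1, convert_to_numpy=False, normalize_embeddings=False)
    y = model.encode_image([img, img], batch_size=2, convert_to_numpy=False, normalize_embeddings=False)
    # Report whether the embeddings match exactly and the max absolute difference
    print(dtype, torch.equal(x[0], y[0]), torch.abs(x[0] - y[0]).max())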