Introduction
Based on dunzhang/stella_en_1.5B_v5 and google/siglip-so400m-patch14-384.
It can encode both text and images.
The training code, data, and report will be released soon.
The core training code will be integrated into the RAG-Retrieval library (https://github.com/NLPJCL/RAG-Retrieval) in the near future. (Stars are welcome!)
This work was accomplished during my free time; please allow it a little time.
Here's a short introduction to the training method:
The core idea of jasper and stella is distillation: the student model learns to reproduce the teacher model's vectors. The training process of jasper has 4 stages:
Stage 1&2: Distill from teacher vectors. In the jasper model the teacher models are nvidia/NV-Embed-v2 and dunzhang/stella_en_1.5B_v5 (Stage 1 and Stage 2 freeze different parameters); a rough sketch of this loss is given below.
Stage 3: MRL training. I made some modifications to MRL to enable training on unsupervised text; see the Matryoshka-style variant in the loss sketch below.
Stage 4: Alignment between jasper token embeddings from an image's detailed caption and vision embeddings from google/siglip-so400m-patch14-384.
I use an AdaptiveAvgPool2d to adjust the number and dimension of the vision tokens; this method needs no additional parameters. A minimal sketch of this resize follows.
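To make the stage-4 resize concrete, here is a minimal sketch. Only the use of AdaptiveAvgPool2d is from the description above; the target token count and dimension are illustrative assumptions, not values from the report.

import torch
import torch.nn as nn

# siglip-so400m-patch14-384 yields 729 vision tokens of width 1152.
# The target sizes below are assumed for illustration only.
vision_tokens = torch.randn(1, 729, 1152)    # (batch, num_tokens, dim)
pool = nn.AdaptiveAvgPool2d((64, 1536))      # (target_tokens, target_dim), no learnable weights
aligned = pool(vision_tokens)                # -> (1, 64, 1536)
print(aligned.shape)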
The point of distillation is to achieve better results with smaller models, or to serve as a form of pre-training; it is not a way to hit the top of the leaderboards. I have in fact reached first place on MTEB (Chinese and English), but I will not release those two models; as I said before, that would be meaningless, and they generalise poorly.
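As a rough illustration of the stage-1/2 objective, here is a minimal distillation-loss sketch, with a Matryoshka-style variant for stage 3. The cosine loss, the truncation dims, and truncating the teacher vector are my assumptions; the forthcoming report will describe the actual losses.

import torch
import torch.nn.functional as F

def distill_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Cosine-distance distillation: pull student vectors toward teacher vectors.
    # Assumes both are (batch, dim) after any projection to a common dimension.
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def mrl_distill_loss(student, teacher, dims=(256, 512, 1024)):
    # Matryoshka-style variant: apply the loss on truncated vector prefixes so
    # shorter embeddings remain usable. It needs no labels, so it can run on
    # unsupervised text. The dims and teacher truncation are assumptions.
    return torch.stack([distill_loss(student[:, :d], teacher[:, :d]) for d in dims]).mean()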
Usage
import torch
from sentence_transformers import SentenceTransformer
DOC1 = """
Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere.
Blue is scattered more than other colors because it travels as shorter, smaller waves. This is why we see a blue sky most of the time.
Closer to the horizon, the sky fades to a lighter blue or white.
"""
DOC2 = """
When choosing colors, you can consider the following factors:
Color theory: Understand how colors work together and how they can evoke different reactions.
Color psychology: Consider how colors affect emotions, behaviors, and responses.
Brand identity: Colors can convey meaning and information about a brand.
Mood: Consider the mood you want to create. For example, brighter colors can feel cheerful, while cooler colors can be calming.
Space: Consider the size of the space and the amount of natural light it receives. Dark colors can make a room feel smaller, while light colors can make it feel larger.
Color wheel: Use the color wheel to identify primary, secondary, and tertiary colors.
Color combinations: Decide how to best complement your preferred color with others.
Color palette: Limit your color palette to a main color and one or two additional colors.
60-30-10 rule: Use a primary color 60% of the time, a secondary color 30% of the time, and an accent color 10% of the time
"""
if __name__ == "__main__":
    # load model
    use_gpu = False
    model_name = "infgrad/jasper_en_vision_language_v1"
    model = SentenceTransformer(
        model_name,
        trust_remote_code=True,
        device="cpu" if not use_gpu else "cuda",
        model_kwargs={
            "torch_dtype": torch.bfloat16 if use_gpu else torch.float32,
            "attn_implementation": "sdpa"
        },
        # vector_dim must be one of 12288, 1024, 512 or 256; 1024 is recommended
        # set is_text_encoder to True if you do not encode images
        config_kwargs={"is_text_encoder": False, "vector_dim": 1024},
    )
    # We can reduce the max_seq_length from the default of 2048 for faster encoding
    model.max_seq_length = 1024

    # data
    q_list = [
        "Why the sky is blue?",
        "how to choose suitable color",
    ]
    doc_list = [
        DOC1,
        [{"type": "image_path", "content": "./assets/img1.png"}, {"type": "text", "content": "Hope this image helps!"}],
        DOC2,
        [{"type": "image_path", "content": "./assets/img2.png"}],
    ]
    q_vecs = model.encode(q_list, prompt_name="s2p_query")
    doc_vecs = model.encode(doc_list)

    # calculate similarity
    similarities = model.similarity(q_vecs, doc_vecs)
    print(similarities)
    # the output is:
    # tensor([[0.7775, 0.7594, 0.2429, 0.2187],
    #         [0.3226, 0.3054, 0.7421, 0.5484]])
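For text-only retrieval, the config comments above suggest flipping is_text_encoder. A hedged variant, reusing the imports above and assuming the flag works as the comment describes:

    # Text-only variant (assumption: is_text_encoder=True skips the vision tower)
    model_text = SentenceTransformer(
        "infgrad/jasper_en_vision_language_v1",
        trust_remote_code=True,
        device="cpu",
        model_kwargs={"torch_dtype": torch.float32, "attn_implementation": "sdpa"},
        config_kwargs={"is_text_encoder": True, "vector_dim": 1024},
    )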
Evaluation on MTEB
script: ./scripts/evaluate_en_mteb/run_evaluate_mteb.py
License
This model should not be used for any commercial purpose!
Evaluation results
MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported):
- accuracy: 95.727
- f1: 89.255
- f1_weighted: 95.856
- ap: 67.156
- ap_weighted: 67.156
- main_score: 95.727
MTEB AmazonCounterfactualClassification (en), test set (self-reported):
- accuracy: 93.776
- f1: 90.758
- f1_weighted: 93.974
- ap: 74.888