Model Card for CoMP-SigLIP-So400M

This is a vision foundation model (VFM) that supports native-resolution image inputs, continually pre-trained from SigLIP.

Model Sources

How to Get Started with the Model

Install the GitHub repository, then use the code below to get started with the model.

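from io import BytesIO

import requests
import torch
from PIL import Image
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.vision_encoder import CoMPSiglipVisionModel

model_path = "SliMM-X/CoMP-SigLIP-So400M"

# Load the vision encoder without the merger head and cast it to bfloat16.
model = CoMPSiglipVisionModel.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# Image.open() cannot read a URL directly, so download the bytes first.
image_url = "https://slimm-x.github.io/comp/figs/teaser.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image_input = Image.open(BytesIO(response.content))

# The processor keeps the native resolution and returns the patch grid
# (image_grid_thw) alongside the pixel values.
inputs = processor(
    images=image_input,
    return_tensors="pt",
)

inputs = inputs.to("cuda")
with torch.no_grad():
    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)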
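
Because the model takes images at native resolution, the number of patches (and so the size of the output features) varies with the image size, which is why the processor passes image_grid_thw alongside the pixel values. The sketch below only illustrates that relationship. The patch size of 14 and the round-to-nearest rule are assumptions made for illustration, and patch_grid and num_patches are hypothetical helpers, not part of the slimm package. The processor's actual resizing logic may differ.

```python
def patch_grid(width, height, patch_size=14):
    """Return an illustrative (t, h, w) patch grid for one still image.

    Assumptions (not taken from the model card): patch_size=14 and
    rounding each side to the nearest whole number of patches.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    grid_h = max(1, round(height / patch_size))
    grid_w = max(1, round(width / patch_size))
    # t is 1 because a single image has one temporal frame.
    return (1, grid_h, grid_w)


def num_patches(grid):
    """Total number of patches described by a (t, h, w) grid."""
    t, h, w = grid
    return t * h * w


if __name__ == "__main__":
    for width, height in [(224, 224), (640, 480), (1024, 768)]:
        grid = patch_grid(width, height)
        print((width, height), grid, num_patches(grid))
```

Under these assumptions, a larger image yields a larger grid and therefore more patch features, whereas a fixed-resolution encoder would produce the same count for every input.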

Citation

BibTeX:

@article{comp2025,
      title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models}, 
      author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
      year={2025},
      journal={arXiv preprint arXiv:2503.18931}, 
}