Same as https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2 with two changes:

  • increase max resolution to 980 x 980 (instead of 384 x 384) by interpolating the position embeddings
  • implement the strategy in NaViT to allow a/ variable resoltion images, b/ aspect ratio preserved images

These changes only apply to the vision tower. No changes to the text tower. Implementation is fully backward compatible to https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2 -> just don't specify the patch_attention_mask

Usage:

import torch
from modeling_siglip import SiglipVisionModel

DEVICE = torch.device("cuda:0")
PATCH_SIZE = 14

pixel_values = torch.randn(2, 3, 28, 42, dtype=torch.bfloat16, device=DEVICE)
pixel_attention_mask = [
    [
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,
        [1] * 14 + [1] * 14  + [1] * 14,

        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
        [0] * 14 + [0] * 14  + [0] * 14,
    ],
    [
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,

        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
        [1] * 14 + [1] * 14  + [0] * 14,
    ],
]
pixel_attention_mask = torch.tensor(pixel_attention_mask, dtype=torch.bool, device=DEVICE)
patches_subgrid = pixel_attention_mask.unfold(
    dimension=1, size=PATCH_SIZE, step=PATCH_SIZE
).unfold(dimension=2, size=PATCH_SIZE, step=PATCH_SIZE)
patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()

model = SiglipVisionModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit", _flash_attn_2_enabled=True)
model.train()
model.vision_model.to(DEVICE, dtype=torch.bfloat16)

output = model.vision_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
Downloads last month
5,547
Safetensors
Model size
883M params
Tensor type
F32
Β·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Spaces using HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit 8