Tags: Zero-Shot Image Classification, Safetensors, clip

Long-CLIP ViT-L/14 finetune: SAE-informed adversarial training


The original CLIP model accepts at most 77 input tokens, but its effective length is only about 20 tokens; see the original Long-CLIP paper for details. HunyuanVideo demo:

69 tokens, normal scene:

  • Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.

52 tokens, out-of-distribution (OOD) scene, showing improved consistency and prompt-following despite the OOD concept:

  • In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.
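
To make the token budgets above concrete, here is a minimal sketch of how a CLIP-style text encoder packs a prompt into a fixed context window. It is an illustration under stated assumptions, not this model's actual preprocessing. The helper names `pack_prompt` and `dropped_tokens` are hypothetical. The start/end token ids are those of the OpenAI CLIP vocabulary. The 248-token window used for comparison is the context length reported for Long-CLIP and is assumed here rather than taken from this card. It is also not stated whether the 69 and 52 token counts include the two special tokens; the sketch treats them as content tokens only.

```python
from typing import List

# Special-token ids from the OpenAI CLIP vocabulary (assumed to apply here).
SOT_ID = 49406  # start-of-text
EOT_ID = 49407  # end-of-text
PAD_ID = 0      # zero padding, as in the reference CLIP tokenizer


def pack_prompt(content_ids: List[int], context_length: int = 77) -> List[int]:
    """Wrap content tokens in SOT/EOT, truncate to fit, and pad to context_length."""
    if context_length < 2:
        raise ValueError("context_length must leave room for SOT and EOT")
    budget = context_length - 2            # two slots are reserved for SOT/EOT
    kept = content_ids[:budget]            # anything past the budget is lost
    packed = [SOT_ID] + kept + [EOT_ID]
    return packed + [PAD_ID] * (context_length - len(packed))


def dropped_tokens(n_content: int, context_length: int = 77) -> int:
    """Number of content tokens that would be truncated away."""
    return max(0, n_content - (context_length - 2))


if __name__ == "__main__":
    # Placeholder ids standing in for a real tokenizer's output.
    for n in (69, 52, 100):
        print(n, "content tokens:",
              "dropped at 77 =", dropped_tokens(n, 77),
              "| dropped at 248 =", dropped_tokens(n, 248))
```

Under these assumptions both demo prompts fit in a 77-token window, so the point of the demo is effective length (how much of the prompt the encoder actually uses) rather than hard truncation.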


Model size: 428M params (Safetensors, tensor type F32)