# 🖼️ FastViT-HD Image Encoder
A Hugging Face-compatible wrapper around the FastViT-HD vision backbone from
*FastVLM: Efficient Vision Encoding for Vision-Language Models* (Apple, CVPR 2025).
This repo exposes only the image encoder (no text tower, no projection head) so you can plug it into any downstream pipeline that needs per-image embeddings.

## ✨ What you get

- 3072-D global embedding for any input resolution (default 1024 × 1024).
- Runs out of the box with `transformers`.
- Much faster than vanilla ViT-L/14 for high-res images (see the original paper).

| Variant | #Params (enc.) | Output dim | Patch size | Global pool |
|---|---|---|---|---|
| FastViT-HD (this) | ~272 M | 3072 | 64 | Yes |
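The patch size of 64 pins down how many tokens you get back at a given resolution. A minimal sketch of the arithmetic (our own helper, assuming the token grid is simply the input size divided by the patch size, which matches the 256 tokens reported in the quick start below):

```python
# Token-grid arithmetic for FastViT-HD (patch size 64).
# Assumes the token grid is input resolution // patch size, which matches
# the 16 x 16 = 256 tokens reported for 1024 x 1024 input below.
def num_patch_tokens(height: int, width: int, patch: int = 64) -> int:
    return (height // patch) * (width // patch)

print(num_patch_tokens(1024, 1024))  # 256
print(num_patch_tokens(512, 512))    # 64
```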
## 🚀 Quick start

```bash
conda create --name fast-vit-hd python=3.10
conda activate fast-vit-hd
pip install torch torchvision transformers timm pillow
```
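Before downloading weights it can be worth a quick environment check, since the shipped config and processor JSONs target the `transformers` ≥ 4.48 schema (see Implementation details below); this snippet is our own addition, not part of the repo:

```python
# Optional environment check (our addition, not part of the repo).
# The shipped config / processor JSONs target transformers >= 4.48.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)  # expect >= 4.48
print("cuda available:", torch.cuda.is_available())
```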
Then run the following code to get 3072-D patch embeddings for your image:
```python
from transformers import AutoModel, AutoImageProcessor
import torch
from PIL import Image

device = "cuda"  # or "cpu" / "mps"

model = AutoModel.from_pretrained(
    "kevin510/fast-vit-hd", trust_remote_code=True
).to(device).eval()

processor = AutoImageProcessor.from_pretrained(
    "kevin510/fast-vit-hd", trust_remote_code=True
)

img = Image.open("your_image.jpg").convert("RGB")
px = processor(img, do_center_crop=False, return_tensors="pt")["pixel_values"].to(device)  # (1, 3, 1024, 1024)

with torch.no_grad():
    emb = model(px)

print(emb.shape)  # (1, D, 3072)
```
D is the number of patch tokens; with a patch size of 64, a 1024 × 1024 input yields D = 16 × 16 = 256.
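If you want a single vector per image instead of per-patch tokens, mean pooling over the token dimension is a reasonable default; this pooling step is our suggestion, not something the repo prescribes:

```python
# Continuing from the quick-start snippet: collapse the D patch tokens
# into one 3072-D vector per image. Mean pooling is our default here,
# not something the repo prescribes.
global_emb = emb.mean(dim=1)                                    # (1, 3072)
global_emb = torch.nn.functional.normalize(global_emb, dim=-1)  # unit norm, handy for cosine similarity
print(global_emb.shape)                                         # torch.Size([1, 3072])
```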
## 🛠️ Implementation details

- **Wrapper:** `FastViTImageEncoder` extends `PreTrainedModel`; we keep the original `GlobalPool2D` head but replace the classifier with a 3072 × 3072 identity-mapped projection (see the sketch after this list).
- **Weights:** lifted from Apple's Stage-3 checkpoint `llava-fastvithd_0.5b_stage3/fast_vit/fast_vit.pth`.
- **Config / processor:** the JSONs follow the current `transformers` ≥ 4.48 schema.
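For orientation, here is a minimal, runnable sketch of the wrapper pattern described above. Everything except `FastViTImageEncoder`, `PreTrainedModel`, the 3072-dim output, and the identity-mapped projection is our assumption (notably the stand-in backbone and the `FastViTConfig` name); the real `mci.py` differs in detail:

```python
# Hedged sketch of the wrapper pattern, NOT the actual mci.py implementation.
# The tiny stand-in backbone and FastViTConfig are our own; only the
# PreTrainedModel subclassing and the identity-initialised 3072 x 3072
# projection come from the description above.
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig

class FastViTConfig(PretrainedConfig):  # hypothetical config class
    model_type = "fastvit_hd"

    def __init__(self, hidden_size: int = 3072, patch_size: int = 64, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.patch_size = patch_size

class FastViTImageEncoder(PreTrainedModel):
    config_class = FastViTConfig

    def __init__(self, config: FastViTConfig):
        super().__init__(config)
        # Stand-in for the real FastViT-HD trunk: one conv that maps each
        # 64 x 64 patch to a hidden_size-dim token.
        self.backbone = nn.Conv2d(3, config.hidden_size,
                                  kernel_size=config.patch_size,
                                  stride=config.patch_size)
        # Classifier replaced by an identity-initialised linear projection.
        self.proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        with torch.no_grad():
            self.proj.weight.copy_(torch.eye(config.hidden_size))

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values)       # (B, 3072, H/64, W/64)
        feats = feats.flatten(2).transpose(1, 2)  # (B, D, 3072)
        return self.proj(feats)

# Shape check with random pixels:
enc = FastViTImageEncoder(FastViTConfig())
print(enc(torch.randn(1, 3, 1024, 1024)).shape)   # torch.Size([1, 256, 3072])
```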
## 📚 Citation

```bibtex
@inproceedings{fastvlm2025,
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  author    = {Vasu, Pavan Kumar Anasosalu and Faghri, Fartash and Li, Chun-Liang and others},
  booktitle = {CVPR},
  year      = {2025}
}
```
If you find this wrapper useful, please consider citing the upstream work above.
## ⚖️ License

The `mci.py` implementation is licensed under Apple's LICENSE; it is a modified version of the original `mci.py` from the FastVLM repo. The underlying weights inherit the license Apple provides in LICENSE_MODEL; review that file before use. All other code in this repo is licensed under Apache 2.0.
## 🙏 Acknowledgements

- Original FastViT implementation and checkpoints by Apple ML Research: see https://github.com/apple/ml-fastvlm.
- Wrapper inspired by the CLIP / SigLIP integrations in 🤗 Transformers.