## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.

Made possible with Geometric Parametrization (GmP):

```
"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
```
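
To make the decomposition concrete, here is a minimal PyTorch sketch of what a `GeometricLinear` layer could look like. The class name matches the printout above, but the code itself is an illustration of the idea, not the exact implementation from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of GmP: re-parametrize a pre-trained nn.Linear into a
    per-output-row magnitude 'r' and a unit direction 'theta'.
    Assumed layout, not the repo's verbatim code."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                      # [out, in]
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))          # radial: row norms
        self.theta = nn.Parameter(w / w.norm(dim=1, keepdim=True))  # angular: unit rows
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompose the full weight; re-normalizing theta keeps it a pure direction
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

Training `r` and `theta` as separate parameters lets the optimizer adjust a weight vector's magnitude and direction independently, which is the intuition behind the "preserves directionality and magnitude" note above.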

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning - thus, it can be used in the same manner as any state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder via the [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗
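
That back-conversion amounts to folding each `r`/`theta` pair into a standard `.weight` tensor. A minimal sketch, assuming the parameter layout shown above (the helper name is hypothetical):

```python
import torch

def gmp_to_weight(state_dict: dict) -> dict:
    """Fold GmP 'r'/'theta' pairs back into standard '.weight' tensors.
    Hypothetical helper; key names follow the layout printed above."""
    converted = {}
    for key, value in state_dict.items():
        if key.endswith(".theta"):
            base = key[: -len(".theta")]
            r = state_dict[base + ".r"]                    # per-row magnitude
            direction = value / value.norm(dim=1, keepdim=True)
            converted[base + ".weight"] = r * direction    # recomposed weight
        elif key.endswith(".r"):
            continue  # consumed together with its '.theta' partner
        else:
            converted[key] = value
    return converted
```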

** For details on training and the eval behind those numbers, or to fine-tune the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
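
Outside of ComfyUI, loading the checkpoint should look roughly like this, assuming the upstream Long-CLIP repo's `longclip.load` / `longclip.tokenize` interface (file paths and names are placeholders):

```python
import torch
from PIL import Image
from model import longclip  # from the Long-CLIP repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/long-clip-gmp-ft.pt", device=device)

# Long-CLIP accepts prompts up to 248 tokens instead of CLIP's 77
text = longclip.tokenize(
    ["A photo of a cat sleeping on a warm windowsill in the afternoon sun."]
).to(device)
image = preprocess(Image.open("image.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```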

```
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)