## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.

Made possible with Geometric Parametrization (GmP):

```
"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
```
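
To make the decomposition concrete, here is a minimal PyTorch sketch of what a `GeometricLinear` layer could look like. The class name matches the printout above, but the code itself is an illustration of the idea, not the exact implementation from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of GmP: re-parametrize a pre-trained nn.Linear into a
    per-output-row magnitude 'r' and a unit direction 'theta'.
    Assumed layout, not the repo's verbatim code."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                      # [out, in]
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))          # radial: row norms
        self.theta = nn.Parameter(w / w.norm(dim=1, keepdim=True))  # angular: unit rows
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompose the full weight; re-normalizing theta keeps it a pure direction
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

Training `r` and `theta` as separate parameters lets the optimizer adjust a weight vector's magnitude and direction independently, which is the intuition behind the "preserves directionality and magnitude" note above.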

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning - thus, it can be used in the same manner as any state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder via the [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗
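
That back-conversion amounts to folding each `r`/`theta` pair into a standard `.weight` tensor. A minimal sketch, assuming the parameter layout shown above (the helper name is hypothetical):

```python
import torch

def gmp_to_weight(state_dict: dict) -> dict:
    """Fold GmP 'r'/'theta' pairs back into standard '.weight' tensors.
    Hypothetical helper; key names follow the layout printed above."""
    converted = {}
    for key, value in state_dict.items():
        if key.endswith(".theta"):
            base = key[: -len(".theta")]
            r = state_dict[base + ".r"]                    # per-row magnitude
            direction = value / value.norm(dim=1, keepdim=True)
            converted[base + ".weight"] = r * direction    # recomposed weight
        elif key.endswith(".r"):
            continue  # consumed together with its '.theta' partner
        else:
            converted[key] = value
    return converted
```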

** For details on training and the eval behind those numbers, or to fine-tune the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
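
Outside of ComfyUI, loading the checkpoint should look roughly like this, assuming the upstream Long-CLIP repo's `longclip.load` / `longclip.tokenize` interface (file paths and names are placeholders):

```python
import torch
from PIL import Image
from model import longclip  # from the Long-CLIP repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/long-clip-gmp-ft.pt", device=device)

# Long-CLIP accepts prompts up to 248 tokens instead of CLIP's 77
text = longclip.tokenize(
    ["A photo of a cat sleeping on a warm windowsill in the afternoon sun."]
).to(device)
image = preprocess(Image.open("image.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```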

```
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)