---
datasets:
- SPRIGHT-T2I/spright_coco
base_model: BeichenZhang/LongCLIP-L
---
## A fine-tune of Long-CLIP - original model: [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L)
- ❤️ this CLIP? [Help feed it](https://ko-fi.com/zer0int) if you can. Besides data, CLIP eats time & DE's expensive electricity. TY! 🤗
- Want to feed it yourself? All code for fine-tuning and much more is on [my GitHub](https://github.com/zer0int).
----
- # Note for using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion: 
- Get the ComfyUI Long-CLIP nodes here: [https://github.com/SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP)
- If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗
----
# 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀

```
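from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# With an unmodified Transformers install, this load fails (see below).
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)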
```
# ❌ Error: the model's 248 text positions do not match the 77 tokens defined in the Transformers library

# 👇
# Option 1 (simple & worse):
Truncate to 77 tokens by loading with:
`CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`

```
# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉
```
# 👇
# Option 2 (edit Transformers) 💖 RECOMMENDED 💖:

- 👉 Find the line that says `max_position_embeddings=77,` in `[System Python]/site-packages/transformers/models/clip/configuration_clip.py`
- 👉 Change to: `max_position_embeddings=248,`

# Now, in your inference code, for text:
- `text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)`
- or:
- `text_input = processor([your-prompt-or-prompts-as-usual], padding=True)`

```
# Resulting Cosine Similarities for 248 tokens padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
```
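For reference, the numbers above are cosine similarities between one image embedding and each text embedding. Below is a minimal, dependency-free sketch of that comparison. The 3-d vectors and the helper names `cosine_similarity` and `rank_prompts` are made up for illustration; they are not CLIP outputs or part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_prompts(image_embedding, text_embeddings):
    """Return (prompt, similarity) pairs sorted from best to worst match."""
    scores = {
        prompt: cosine_similarity(image_embedding, vec)
        for prompt, vec in text_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy vectors, invented for illustration only (real CLIP embeddings are much larger).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "photo of a cat": [0.8, 0.2, 0.1],
    "picture of a dog": [0.1, 0.9, 0.3],
}
ranking = rank_prompts(image_embedding, text_embeddings)
for prompt, score in ranking:
    print(f"{prompt}: {score:.4f}")
```

With real embeddings, the prompt at the top of the ranking is the model's best match for the image.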

----
## Update 12/AUG/2024:
New *BEST* model, trained with a custom loss that uses label smoothing.
The gain is small on a large, diverse, good-quality dataset, but big relative gains are possible for an overfit-prone fine-tune (small batch size, 1 GPU, narrow dataset of e.g. 'sneakers', etc.).
Fine-tune your model with the provided code for GmP-Smooth: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
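
To illustrate what label smoothing does in general, here is a rough, dependency-free sketch of cross-entropy against a smoothed target. It is NOT the loss from the repository: the function name `smoothed_cross_entropy`, the default `smoothing=0.1`, and the choice to spread the leftover probability evenly over the non-target classes are all assumptions made for this example.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The true class gets probability 1 - smoothing; the remaining mass is
    spread evenly over the other classes. Needs at least two classes.
    """
    n = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_sum for x in logits]

    off_value = smoothing / (n - 1)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        weight = 1.0 - smoothing if i == target else off_value
        loss -= weight * log_p
    return loss

# One row of an image-to-text similarity matrix (hypothetical logits);
# index 0 is the matching caption.
row = [10.0, 0.0]
print(smoothed_cross_entropy(row, target=0, smoothing=0.0))
print(smoothed_cross_entropy(row, target=0, smoothing=0.1))
```

The smoothed loss stays noticeably above zero even for a very confident prediction, which discourages the model from pushing logits to extremes on a small or narrow dataset.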


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/l3FYkaicihqXv5D9wLDAF.png)

----

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.


Made possible with Geometric Parametrization (GmP):

```

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)

```
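
The decomposition described above can be sketched in a few lines. This is a simplified illustration in plain Python lists, not the repository's `GeometricLinear` implementation: the helper names `to_geometric` and `to_weight` are hypothetical, and it assumes the norm is taken per output row and that no row is all zeros.

```python
import math

def to_geometric(weight):
    """Split each row of a weight matrix into (r, theta).

    r[i] is the L2 norm of row i; theta[i] is that row scaled to unit length.
    """
    r = [math.sqrt(sum(w * w for w in row)) for row in weight]
    theta = [[w / norm for w in row] for row, norm in zip(weight, r)]
    return r, theta

def to_weight(r, theta):
    """Rebuild the weight matrix: row i = r[i] * theta[i] / ||theta[i]||."""
    weight = []
    for norm, direction in zip(r, theta):
        length = math.sqrt(sum(t * t for t in direction))
        weight.append([norm * t / length for t in direction])
    return weight

# Toy 2x2 "pre-trained" weight, invented for illustration.
weight = [[3.0, 4.0], [0.0, 2.0]]
r, theta = to_geometric(weight)      # magnitude and direction, stored separately
restored = to_weight(r, theta)       # conversion back to a plain .weight matrix

# Rescaling theta leaves the rebuilt weight unchanged, because theta is
# normalized again inside to_weight; only its direction matters.
rescaled = to_weight(r, [[2.0 * t for t in row] for row in theta])
```

During fine-tuning, `r` and `theta` would be the trainable parameters; `to_weight` corresponds to the conversion back to `.weight` once training is done.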

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/OqhNxW-D9c58mkZyUQlL_.png)

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder using [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗

** For details on training and those numbers / the eval, or for just fine-tuning the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)

```
@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)