Inference with HF
datasets:
- SPRIGHT-T2I/spright_coco
---

## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

----
# 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👇

```
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```
# ❌ Error due to a mismatch with the 77-token limit defined in the Transformers library

# 👇
# Option 1 (simple & worse):
Truncate to 77 tokens:

`CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`
```
# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 👎
```
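For reference, a minimal sketch of how the text side could be truncated to 77 tokens under Option 1; the prompts and the `return_tensors` choice are illustrative assumptions, not from the original card:

```
# Option 1 sketch: tokenize with truncation to CLIP's default 77 tokens
# (`processor` loaded as shown above)
texts = ["photo of a cat", "picture of a dog", "cat", "dog"]
text_input = processor(text=texts, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
```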
# 👇
# Option 2 (edit Transformers) 👍 RECOMMENDED 👍:

- Find the line that says `max_position_embeddings=77,` in `[System Python]/site-packages/transformers/models/clip/configuration_clip.py`
- Change it to: `max_position_embeddings=248,` (see the quick check below)
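After editing the file and restarting Python, a quick way to confirm the new default is picked up; a sketch, assuming `CLIPTextConfig` is imported from your patched install:

```
from transformers import CLIPTextConfig

# Should print 248 once the edited default is in effect
print(CLIPTextConfig().max_position_embeddings)
```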
# Now, in your inference code, for text:

- `text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)`
- or:
- `text_input = processor([your-prompt-or-prompts-as-usual], padding=True)`
```
# Resulting cosine similarities for 248 tokens, padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
```
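Putting the pieces together, a minimal end-to-end sketch for Option 2 (after the `max_position_embeddings=248` edit). The image path and the four prompts are placeholder assumptions, and the normalized dot product below is one common way to get cosine similarities; it is not necessarily how the numbers above were produced:

```
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)

texts = ["photo of a cat", "picture of a dog", "cat", "dog"]  # placeholder prompts
image = Image.open("cat.jpg")                                  # placeholder image path

inputs = processor(text=texts, images=image, return_tensors="pt",
                   padding="max_length", max_length=248).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Normalize the projected embeddings, then dot product = cosine similarity per prompt
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # expect the cat prompts to score highest for a cat photo
```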
----

## Update 12/AUG/2024:

New *BEST* model, custom loss with label smoothing.