---
datasets:
- SPRIGHT-T2I/spright_coco
---
## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

----
# 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 🚨

```
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```
# ❌ Error due to a mismatch with the 77-token limit defined in the Transformers library

# 👇
# Option 1 (simple & worse):
Truncate to 77 tokens by loading with:
`CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`
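
For clarity, a minimal sketch of what Option 1 looks like end to end; the `truncation`, `max_length`, and `return_tensors` arguments are standard `transformers` usage added for illustration, not part of the original note (the prompts mirror the comparison below):

```
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# Load despite the 248-vs-77 position-embedding mismatch described above:
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)
processor = CLIPProcessor.from_pretrained(model_id)

# Prompts are padded/truncated to the library default of 77 tokens:
text_input = processor(text=["photo of a cat", "picture of a dog", "cat", "dog"],
                       padding="max_length", max_length=77, truncation=True,
                       return_tensors="pt")
```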

```
# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 👎
```
# 👇
# Option 2 (edit Transformers) 👍 RECOMMENDED 👍:

- 📝 Find the line that says `max_position_embeddings=77,` in `[System Python]/site-packages/transformers/models/clip/configuration_clip.py`
- 📝 Change it to: `max_position_embeddings=248,`
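
To verify the edit took effect, a quick sanity check (an illustrative addition, not from the original note; it assumes the text config reports 248 positions after the change):

```
from transformers import CLIPModel

# Re-load after editing configuration_clip.py and inspect the text tower's limit:
model = CLIPModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
print(model.config.text_config.max_position_embeddings)  # should now be 248
```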

# Now, in your inference code, for text:
- `text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)`
- or:
- `text_input = processor([your-prompt-or-prompts-as-usual], padding=True)`

```
# Resulting Cosine Similarities for 248 tokens padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
```
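
For completeness, a minimal end-to-end sketch of how image-vs-prompt cosine similarities like the ones above can be computed (assumes the Option 2 edit has been applied; the `cat.jpg` path and the explicit normalization are illustrative additions, not from the original note):

```
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)

prompts = ["photo of a cat", "picture of a dog", "cat", "dog"]
image = Image.open("cat.jpg")  # hypothetical example image (ground truth: a cat photo)

text_input = processor(text=prompts, padding="max_length", max_length=248, return_tensors="pt").to(device)
image_input = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    text_emb = model.get_text_features(**text_input)
    image_emb = model.get_image_features(**image_input)

# L2-normalize, then the dot product is the cosine similarity per prompt:
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # the cat prompts should score highest for a cat image
```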

----
## Update 12/AUG/2024:
New *BEST* model, custom loss with label smoothing.