zer0int committed · Commit 0568c73 · verified · 1 Parent(s): 6ddee75

Inference with HF

Files changed (1):
  1. README.md +39 -0
README.md CHANGED
@@ -3,6 +3,45 @@ datasets:
 - SPRIGHT-T2I/spright_coco
 ---
 ## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.
+ ----
+
+ # 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀
+
+ ```
+ from transformers import CLIPModel, CLIPProcessor
+
+ model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
+
+ model = CLIPModel.from_pretrained(model_id)
+ processor = CLIPProcessor.from_pretrained(model_id)
+ ```
+ # ❌ This raises an error due to a mismatch with the 77 tokens defined in the Transformers library
+
+ # 👇
+ # Option 1 (simple & worse):
+ Truncate to 77 tokens:
+ `CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`
+
+ ```
+ # Cosine similarities for 77 tokens are WORSE:
+ # tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
+ tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉
+ ```
+ # 👇
+ # Option 2 (edit Transformers) 💖 RECOMMENDED 💖:
+
+ - 👉 Find the line that says `max_position_embeddings=77,` in `[System Python]/site-packages/transformers/models/clip/configuration_clip.py`
+ - 👉 Change it to: `max_position_embeddings=248,` (a quick sanity check follows below)
+
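+ A minimal sanity check, assuming the edit above has been made: the loaded text config should now report the longer context.
+
+ ```
+ from transformers import CLIPModel
+
+ model = CLIPModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
+ # Expected to print 248 after the edit (the stock library default is 77)
+ print(model.config.text_config.max_position_embeddings)
+ ```
+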
+ # Now, in your inference code, for text:
+ - `text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)`
+ - or:
+ - `text_input = processor([your-prompt-or-prompts-as-usual], padding=True)`
+
+ (A full end-to-end sketch follows after the results below.)
+
+ ```
+ # Resulting Cosine Similarities for 248 tokens padded:
+ # tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
+ tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
+ ```
+
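+ A minimal end-to-end sketch (not the exact script behind the numbers shown), assuming Option 2 has been applied; `cat.jpg` is a placeholder for a local test image, and the prompts mirror the comparison above:
+
+ ```
+ import torch
+ from PIL import Image
+ from transformers import CLIPModel, CLIPProcessor
+
+ model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = CLIPModel.from_pretrained(model_id).to(device)
+ processor = CLIPProcessor.from_pretrained(model_id)
+
+ texts = ["photo of a cat", "picture of a dog", "cat", "dog"]
+ image = Image.open("cat.jpg")  # placeholder: use any local image
+
+ inputs = processor(text=texts, images=image, padding="max_length", max_length=248, return_tensors="pt").to(device)
+
+ with torch.no_grad():
+     out = model(**inputs)
+
+ # Normalize the projected embeddings so the dot product is a cosine similarity
+ img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
+ txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
+ print(img_emb @ txt_emb.T)  # 1x4 cosine similarities; the cat prompts should score highest for a cat photo
+ ```
+ (`padding="max_length"` pads every prompt to the full 248 positions; `padding=True` only pads to the longest prompt in the batch.)
+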
 ----
 ## Update 12/AUG/2024:
 New *BEST* model, custom loss with label smoothing.