Does this fix the 20 token problem?

#16
by ppbrown - opened

I know about the 77-token vs. LongCLIP 248-token thing.
However, regarding the original CLIP-L and SD1.5, ChatGPT makes claims along the lines that it has an EFFECTIVE token limit of about 20 tokens; that is to say, tokens 21-77 have little effect on the output.
Allegedly, this was due to bad training.

So, my question is whether this version bumps up the effective limit closer to 70?
It would be cool to effectively have a "medium CLIP" that, unlike LongCLIP, does not require any code changes.
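
For what it's worth, here is roughly how I'd try to probe that "effective limit" myself. This is a minimal sketch, assuming the `transformers` library, with `openai/clip-vit-large-patch14` and the prompts as placeholders; you'd swap in whichever text encoder you actually want to test. If the cosine similarity stays very close to 1.0 even though the appended detail lands past roughly token 20, that detail is effectively being ignored:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Placeholder checkpoint; substitute the finetune under test.
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    # Pad/truncate to the usual 77 positions, return the normalized pooled text embedding.
    inputs = tokenizer(text, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

base = "a photo of a red fox sitting on a mossy rock in a misty forest at dawn"
# The extra clause below should start past roughly token 20.
extended = base + ", and there is a bright orange box in the background"

cos = (embed(base) @ embed(extended).T).item()
print(f"cosine(base, extended) = {cos:.4f}")  # ~1.0 would mean the late tokens barely register
```

(CLIP text embeddings tend to be close together anyway, so you'd want a control, e.g. the same extra clause inserted near the start, to judge what "barely moves" means.)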

ChatGPT's claim is valid; see the paper about Long-CLIP for details: https://arxiv.org/abs/2403.15378

And, it's more likely that the middle of the prompt gets "lost", rather than the end. So, let's say you have some long prompt of 60 tokens:

"a photo of a {imagine more details}, and there's an orange box in the background, hyper realistic octane render mesmerizing cryengine pbr" < -- concept of "orange box in the background" is most likely to be lost.

The beginning and end of the prompt are most likely to be preserved, while the middle gets 'overlooked' / dropped. This is also an issue with LLMs, which is what the "needle in a haystack" benchmarks measure (google that if you're interested in the details).
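
If you'd rather see it than take my word for it, a rough needle-in-a-haystack-style check could look like the sketch below (the prompt text and the checkpoint name are just placeholders): embed the same prompt with the "orange box" detail placed in the middle vs. at the end, and compare each against a short probe text that only mentions the orange box. A noticeably lower score for the middle placement is what "lost in the middle" looks like.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"  # placeholder; use the encoder you care about
tokenizer = CLIPTokenizer.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

probe = "an orange box"
# Same content, but the "orange box" detail sits in the middle vs. at the end.
middle = ("a photo of a quiet city street at night, and there's an orange box in the "
          "background, hyper realistic octane render mesmerizing cryengine pbr")
end = ("a photo of a quiet city street at night, hyper realistic octane render "
       "mesmerizing cryengine pbr, and there's an orange box in the background")

for name, prompt in [("middle", middle), ("end", end)]:
    score = (embed(prompt) @ embed(probe).T).item()
    print(f"orange box in the {name}: cosine to probe = {score:.4f}")
```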

I didn't measure the exact limit for my models, but I would assume it is not significantly different from the original CLIP with regard to this particular issue.
So I'd hypothesize that a 'medium CLIP' without any architectural changes is unlikely, no matter how good the model is in terms of objective performance (e.g. zero-shot), unfortunately.

I'm confused by what seems like contradictory information.
Extrapolating a bit from what you wrote, I think you are saying that, on the one hand, your GmP finetunes will still suffer from the ~20-token effective limit.

On the other hand, the whole premise of LongCLIP is that it allows more than 77 tokens to be used. But in the broader picture, if you just drop LongCLIP into an existing model, the rest of the model is "without any architectural changes".
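
For concreteness, my understanding is that the "code changes" come from the text encoder's learned position table: stock CLIP-L has exactly 77 rows baked in, while Long-CLIP stretches it to 248 per the paper, so anything in a pipeline that hard-codes 77 has to be touched. A quick way to see where that 77 lives (a sketch using the stock OpenAI checkpoint as an example):

```python
from transformers import CLIPTextModel

# Stock CLIP-L text encoder: the learned position-embedding table has exactly
# 77 rows, so the 77-token window is part of the architecture itself.
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
print(text_encoder.text_model.embeddings.position_embedding.weight.shape)
# torch.Size([77, 768]) -- a Long-CLIP checkpoint ships a 248-row table instead,
# which is why it can't be dropped in without touching the loading/tokenizer code.
```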

It seems like the conclusion from your post here would be that dropping LongCLIP into existing models would be a waste of time too.
But it isn't, is it?
