LTX Clip.

#2 by Evados - opened

Is it possible to create a GGUF for the PixArt-XL-2-1024-MS text encoder?
I’m not sure why, but this text encoder seems to perform better for animating scenes. With the T5XXL, the scenes appear more static.

That's the same T5XXL model though, isn't it? I remember testing it and it being 1:1 iirc.
If you have a workflow plus whatever you're comparing it against, I can retest it; I think I still have the original PixArt ones somewhere.
The GGUF files here were converted from the raw FP32 Google T5-XXL files.
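If you want to verify that yourself, a comparison along these lines should work (rough sketch, untested; the repo IDs and the text_encoder subfolder are assumptions based on the HF pages, and loading two FP32 T5-XXL encoders takes a lot of RAM):

```python
# Rough sketch: compare the T5-XXL encoder bundled with PixArt against the
# standalone Google T5-XXL weights on the same prompt.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
ids = tok("a cat walking through tall grass", return_tensors="pt").input_ids

def embed(repo, **kw):
    enc = T5EncoderModel.from_pretrained(repo, torch_dtype=torch.float32, **kw)
    enc.eval()
    with torch.no_grad():
        return enc(input_ids=ids).last_hidden_state

a = embed("google/t5-v1_1-xxl")
b = embed("PixArt-alpha/PixArt-XL-2-1024-MS", subfolder="text_encoder")
print(torch.max(torch.abs(a - b)))  # should be ~0 if the weights really are 1:1
```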

Ok, thank you for the response.
I will take a look and try to see if this issue occurs again during my testing. If it does, I will share the workflow to reproduce it.
When I use the GGUF Q4 models or the T5-XXL GGUF, my video gets a noisy frame at the beginning. Do you experience this issue as well?
If I switch to the normal model, I don’t get this noisy frame at the start.

You should update ComfyUI-GGUF, this was fixed some time ago.

Ok, I have updated the GGUF custom node, but I still get the noisy frame when using I2V mode.
It seems fine when I use T2V mode.
I tried using the VAE from the original model and the CLIP from Pixart, but when I use the GGUF model, I still get some strange noise in the video.
I also tried loading the CLIP in GGUF and tested multiple versions of the model, but the noise persists.
For example, if I use the original LTX model and the T5 XXL CLIP in GGUF, everything seems to work fine, and I don’t get the noise at the beginning.
So, the noise issue in I2V mode appears to be present with all GGUF models, which suggests something is wrong in the code of the node that loads the model.

https://www.youtube.com/watch?v=kNbj0_Gwl5k
In the video, you can see that at around the 5:20 mark I switch to the GGUF models.
After that, you can observe the difference.
Sorry if the processing is slow on my computer. I’m using multiple models in a single workflow, and I’m reaching some limits. Additionally, the video recording seems to significantly slow down GPU processing.
At the same time, I’d like to thank you for all your work. I use GGUF models a lot for Flux and SD 3.5, and it’s a great advantage to have them.

Here’s a simpler workflow that demonstrates the problem.
This way, you can check if it happens for you as well, thanks.
https://www.mediafire.com/file/gfq1pg1hi9ttngk/Dave_Gravel_TestBugWithGGUFModel.zip

If I use the original model with the GGUF clip like this, it works correctly.
As I mentioned before, it seems to be an issue with the model node.
[Screenshot 2024-12-13 151725.png]

Neat, thanks for these. I was able to reproduce it, could you test if this commit fixes it?

https://github.com/city96/ComfyUI-GGUF/commit/65a7c895bb0ac9547ba2f89d55fbdb609aa2bfe7

Thank you very much, it seems to be working very well now.
If you get the chance, take some time to try the PixArt model. From what I understand, it's supposed to be based on T5XXL as well, but it seems to produce higher quality and better movement compared to all the other T5XXL models I've tried.
I’m not sure if this difference is due to compression or something else, but it’s quite noticeable in the final video result. The difference seems even more apparent in T2V mode.
Thanks again for the fix, it’s greatly appreciated.

[Screenshot 2024-12-13 191635.png]

No problem, glad it's resolved.

As for the text encoder, where did you get that file from...? The one highlighted in your screenshot is, I think, the actual PixArt model file from here, which doesn't have a text encoder in it.

And the one below it would only load one half of the text encoder since I don't think comfy has split model loading.

Did you check the console for missing/leftover keys while loading?
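If you want to rule that out, you could merge the two shards into a single file that the normal loader can read. Rough sketch, untested, assuming the usual HF shard naming from that repo (the output filename is made up):

```python
# Merge split safetensors shards into one file for single-file loaders.
from safetensors.torch import load_file, save_file

shards = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

state_dict = {}
for shard in shards:
    state_dict.update(load_file(shard))  # shard keys don't overlap

save_file(state_dict, "t5xxl_pixart_merged.safetensors")
```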

I cloned these files into my ComfyUI\models\text_encoders\PixArt-XL-2-1024-MS folder: https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS/tree/main
If I'm not mistaken, when I load the clip, it loads the model-00001-of-00002.safetensors and model-00002-of-00002.safetensors files.

I'm not sure where the difference comes from, but it's quite significant. The result isn't always better, but it often seems more appropriate. Sorry again if it takes time in my video; since I'm loading multiple models, it occasionally fills up almost all of my memory.
https://www.youtube.com/watch?v=fpFQIisCMKE

Oh wait, that's with the LTX Clip Model Loader and not the base CLIP loader. That's from this custom node set, right?

I guess in that case it's not a model weight difference but an implementation difference between base comfy and that custom node set. My guess would be that it's the tokenizer.

At a glance, the LTX one truncates your prompt at 128 tokens, so the rest is cut off: code
Comfy seems to pad to a minimum length without any max limit: code (the comment at the end even says "#pad to 128?")
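As a minimal illustration of the two behaviors (this is just a sketch with the HF tokenizer, not the actual node code):

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
prompt = "a very detailed prompt " * 50  # long enough to exceed 128 tokens

# LTX-style: hard cap at 128 tokens, everything past that is silently dropped.
ltx_ids = tok(prompt, truncation=True, max_length=128).input_ids

# Comfy-style: pad up to a minimum length, but never truncate.
comfy_ids = tok(prompt, padding="max_length", max_length=128).input_ids

print(len(ltx_ids), len(comfy_ids))  # 128 vs. the full prompt length
```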

One thing you could do is verify that the issue is reproducible without the GGUF nodes (i.e. by using the built-in CLIP Loader set to ltx), then open an issue on the main ComfyUI repo.

I tried the normal clip loader with multiple T5xxl models, and the results are pretty much the same as with GGUF. To me, it looks like the GGUF and non-GGUF versions produce identical results.

The difference seems to occur only when I use the LTXV Clip loader.
As for the prompt being truncated at 128 tokens: yes, this is probably the cause of the issue or difference.

Unfortunately, I can't load PixArt with the normal clip loader, so I can't test the difference with it.
If the prompt gets truncated, it can definitely make a big difference, haha.

Thanks a lot for the GGUF fix—I'm really happy to see it working correctly now!

Edit: I reduced the token count in my prompt, and now the PixArt and GGUF models are giving pretty much the same results.
Thanks a lot for pointing this out—now I see it's not a real issue.
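For anyone else running into this, a quick way to check whether a prompt fits in the 128-token window (rough sketch; the tokenizer repo is an assumption):

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
n_tokens = len(tok("your prompt here").input_ids)
print(n_tokens, "tokens:", "fits" if n_tokens <= 128 else "will be truncated")
```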
