T5XXL-Unchained
It's unnecessary to extend the T5 vocab.
The link you shared is just a method for extending the T5 embedding vocabulary with a random tensor,
and those new embeddings would have to be thoroughly trained first.
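For concreteness, here is roughly what that kind of vocab extension boils down to. This is a minimal numpy sketch, not the linked method's actual code; the function name, init scale, and sizes are my own illustrative choices:

```python
import numpy as np

def extend_embedding(weight, n_new, std=0.02, seed=0):
    """Append n_new randomly initialized rows to an embedding table.

    The old rows are kept verbatim; the new rows are just noise,
    which is why they carry no meaning until thoroughly trained.
    """
    rng = np.random.default_rng(seed)
    new_rows = rng.normal(scale=std, size=(n_new, weight.shape[1]))
    return np.concatenate([weight, new_rows], axis=0)

# stand-in for T5's embedding table (real one is 32128 x 4096 for T5-XXL)
old = np.zeros((32128, 8))
new = extend_embedding(old, n_new=100)
print(new.shape)  # (32228, 8)
```

The point of the post is that those 100 fresh rows start as random vectors, so the encoder produces garbage for the new tokens until a full training pass teaches them something.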
The method is also redundant, because MMDiT is already partially a language model (it has both a language-expert and an image-expert stream),
so finetuning only the MMDiT is sufficient.
Besides, the T5 architecture is numerically unstable: it's based on the old transformer architecture that predates numerically stable QK norm and attention scaling, and I don't want to exacerbate that problem.
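To illustrate the QK-norm point, here is a minimal numpy sketch of the general technique (this is not T5's actual code; the shapes and scales are my own, chosen to exaggerate the effect):

```python
import numpy as np

def attention_logits(q, k, qk_norm=False, eps=1e-6):
    """Scaled dot-product attention logits, optionally with QK norm.

    QK norm RMS-normalizes queries and keys before the dot product,
    which bounds the logit magnitude no matter how large activations
    grow during training. T5 predates this trick (it also omits the
    1/sqrt(d) scaling inside attention, relying on weight init instead),
    which is part of the instability the post refers to.
    """
    if qk_norm:
        q = q / (np.sqrt(np.mean(q**2, axis=-1, keepdims=True)) + eps)
        k = k / (np.sqrt(np.mean(k**2, axis=-1, keepdims=True)) + eps)
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d)

rng = np.random.default_rng(0)
# simulate activations that have drifted to a large scale mid-training
q = rng.normal(scale=50.0, size=(4, 64))
k = rng.normal(scale=50.0, size=(4, 64))

print(np.abs(attention_logits(q, k)).max())               # huge logits: softmax saturates
print(np.abs(attention_logits(q, k, qk_norm=True)).max()) # bounded by sqrt(d)
```

With QK norm the logits are capped at sqrt(d) by Cauchy-Schwarz, so the softmax can't collapse into a one-hot spike regardless of activation scale; without it, the logits grow with the square of the activation magnitude.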
What CLIP/Text Encode would you recommend with Chroma if you wanted to not use T5?
I'm asking in case you are looking for better text understanding / prompt adherence in a Chroma workflow.
I also realise prompt engineering plays a crucial role; I just want to make sure I'm not missing something newer and better suited.
Photo-realism is one use case I think it would help with. I use Chroma and it's very good. I really appreciate your work, thanks (:
What CLIP/Text Encode would you recommend with Chroma if you wanted to not use T5?
the one it's trained on?! :D ... if you are an English professor and someone switches your tongue for a French one, all that high-capacity brain power will only produce gibberish ^^