Why does my image look like a mess at CFG: 5?
Basically when I set the CFG to ~5, which is the recommended value as far as I know, the image looks really messy and noisy. If you look at the buildings in the background, they look crooked and broken.
When I increase the CFG value to 12, the image becomes much more clear, the skyscrapers make sense and have proper walls and a logical structure, and the overall image quality becomes fantastic by comparison. The problem is that I get weird discoloration on the right side of the image.
What am I doing wrong? I'm linking my images as a reference, hopefully the metadata is not removed and you guys can load them into comfy.
I'm using a CONCAT node to merge two parts of a prompt, and I tried removing it to see if that was the problem, and it does get a bit more coherent, but it's still nowhere near as good as CFG: 12
CFG: 5, messy and broken buildings
CFG: 12, much better visual quality, but note the broken colors on the right
CFG: 5 without CONCAT node, more coherent, but nowhere near as good as CFG: 12
Are you using the model in BF16 format? If you aren't, that may help, provided you can fit it in your GPU. If you can't use BF16 or are using regular FP8, a scaled implementation of FP8 may also help instead of regular FP8.
Also, if you're using ComfyUI, you may want to experiment with CFGRenorm, CFGNorm, and/or the Tangential Damping CFG node, all of which are native to ComfyUI and may help at those higher CFG scales.
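For anyone wondering what norm-based fixes for high CFG are getting at, here is a minimal sketch in plain Python. It is not the code of the CFGNorm or CFGRenorm nodes. The function names, the toy vectors, and the "clamp to the conditional prediction's norm" rule are all assumptions, chosen only to show why a high CFG scale inflates the prediction and how a rescale could pull it back.

```python
import math

def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def l2_norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def renorm_to_cond(guided, cond, eps=1e-8):
    """Hypothetical rescale: shrink the guided output so its norm
    does not exceed the norm of the conditional prediction."""
    factor = min(1.0, l2_norm(cond) / (l2_norm(guided) + eps))
    return [g * factor for g in guided]

# Toy predictions standing in for the model outputs (made-up values).
cond = [1.0, 2.0, 2.0]
uncond = [0.5, 1.0, 1.0]

guided = cfg_combine(cond, uncond, scale=12.0)
renormed = renorm_to_cond(guided, cond)
```

In this toy case the guided vector at scale 12 is several times longer than the conditional prediction, and the rescale brings it back to the same length while keeping its direction.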
Edit: checking metadata n stuff now, will edit again once I check 'em.
Edit 2: Another thing I'd recommend is trying a different sampler; dpmpp_2m or Gradient Estimation may work better for you at lower CFG scales. Euler is good and standard, but since it is practically a simple lerp, the tricks the other samplers use could help. Gradient Estimation is the closest to Euler (it's a lerp extrapolated beyond a weight of 1.0) and would provide a better experience, I'd think. There are also the RL models (available in this repo or in Lode's debug repo; there may be more / newer ones soon), which provide a far more coherent image at lower CFG scales, in my opinion.
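To make the "Euler is a lerp" remark concrete, here is a small scalar sketch. It assumes the usual sigma-parameterised sampling step and is not copied from ComfyUI. The default gamma of 2.0 and the function names are assumptions for illustration.

```python
def lerp(a, b, w):
    """Linear interpolation: w=0 gives a, w=1 gives b, w>1 extrapolates past b."""
    return a + w * (b - a)

def to_d(x, sigma, denoised):
    """Slope of the sampling trajectory at noise level sigma."""
    return (x - denoised) / sigma

def euler_step(x, sigma, sigma_next, denoised):
    """Plain Euler step from sigma to sigma_next."""
    return x + to_d(x, sigma, denoised) * (sigma_next - sigma)

def gradient_estimation_step(x, sigma, sigma_next, denoised, old_d, gamma=2.0):
    """Euler-like step whose slope is a lerp from the previous slope to the
    current one; gamma > 1 extrapolates beyond the current slope."""
    d = to_d(x, sigma, denoised)
    d_bar = lerp(old_d, d, gamma)
    return x + d_bar * (sigma_next - sigma), d

# Toy scalar values (made up) to compare the two steps.
x, sigma, sigma_next, denoised = 10.0, 4.0, 2.0, 2.0

x_euler = euler_step(x, sigma, sigma_next, denoised)
# The same Euler step written as a lerp between the denoised estimate and x.
x_euler_as_lerp = lerp(denoised, x, sigma_next / sigma)
x_ge, d = gradient_estimation_step(x, sigma, sigma_next, denoised, old_d=1.0)
```

With gamma set to 1.0 the second function reduces to the plain Euler step, which is why the two samplers are described as close relatives.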
By "using in BF16 format" do you mean the weight_dtype? I have it set to default. I have a 4070 Ti
I tried the various methods you mentioned, some of them do improve the coherence at CFG: 5, and the CFGNorm does remove the weird color at CFG: 12, but the image quality is nowhere close to what I get with regular CFG: 12
Hi @CSM360,
You may want to try the Skimmed CFG extension.
I didn't test it properly, but it allows setting super high CFG (like 32, 64, even 100) without burning / overcooking.
It should also allow you to use the SDE versions of DPM++ samplers, but once again I don't have much experience with it.
@CSM360, I did some quick tests for you:
- Using Skimmed CFG's linear interpolation dual scales at 5.0 / 5.0 and a KSampler CFG at 12-96 seems to improve things a bit. I didn't tweak the parameters, but it can be interesting.
- Removing CONCAT does indeed improve things.
- What improved things quite a lot: removing the word "canyon" from your prompt. It seems the model interprets this as a rock canyon (sorry, I'm not a native English speaker).
Here is what I got without Skimmed CFG, at CFG 5.0, without CONCAT and with "canyon" removed:
I hope it helps.
Edit: and here is the same, but this time with Skimmed CFG at 5.0 / 5.0 and a KSampler CFG at 32:
Another interesting finding: in your prompt, "aesthetic 10" could possibly be "aesthetic:10". But, as someone wrote in another thread, it may be useful for photos, not for anime style (but I don't know myself, I only do photos).
Writing "aesthetic:10" dramatically changes the image, even with the word "canyon" restored: no more broken buildings.
So here is a 3rd image (no CONCAT, no Skimmed CFG, CFG 5.0, with "canyon" and "aesthetic:10"):
Now, if I remove "aesthetic:10" and keep "canyon", I get something not bad, but with a few issues in the buildings.
Here is a 4th (and last) image, same as above with both "aesthetic:10" and "canyon" removed:
It may not be the style you expect, but I hope this gives you some useful paths to experiment with.
Edit: did some checks, I may be wrong about the "aesthetic:xx" syntax, so take it with a pinch of salt. But it surely influences results, and "aesthetic:10" and "aesthetic:3" do generate different images. TBH I never use it myself.
It seems you can also write "aesthetic10"; to be tested.
"aesthetic:10" and "aesthetic 10" aren't really different things. I read someone saying you can use : at the end of single words without () to still create emphasis (so "laugh:10" and "(laugh:10)" would be the same thing), but that seems to be false. It just interprets the : as a normal written letter/token/whathaveyou.
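To illustrate the difference being described, here is a toy parser in Python. It is a hypothetical, simplified sketch and not ComfyUI's actual prompt parser. The regular expression and function name are assumptions. It only shows how a parser that looks for parentheses would give "(laugh:10)" a weight while leaving a bare "laugh:10" as ordinary text.

```python
import re

# Hypothetical simplified pattern for "(text:weight)" style emphasis.
EMPHASIS = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")

def parse_prompt(prompt):
    """Split a prompt into (text, weight) chunks.
    Text outside a "(text:weight)" group keeps the default weight 1.0."""
    chunks = []
    pos = 0
    for m in EMPHASIS.finditer(prompt):
        if m.start() > pos:
            # Plain text before the emphasis group, colon included if present.
            chunks.append((prompt[pos:m.start()], 1.0))
        chunks.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        chunks.append((prompt[pos:], 1.0))
    return chunks

with_parens = parse_prompt("(laugh:10)")
without_parens = parse_prompt("laugh:10")
```

Under these assumptions the parenthesised form yields the word with weight 10, while the bare form stays a single plain-text chunk that still contains the colon and the number.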
Thanks for the suggestions everyone! Some of them gave decent results, but ultimately none of the solutions matched the CFG: 12 image quality
So I did the dumbest solution imaginable. I'm just using CFG: 12 and cropping out the broken pixels
Hey if it works it works huh