---
base_model:
- stabilityai/stable-diffusion-xl-base-1.0
---

# bigASP V2.5

## EXPERIMENTAL

This is a highly experimental model in the bigASP family.

## Usage

![ComfyUI Workflow](./ComfyUI_00389_.png)

Drag this image into ComfyUI to load the example workflow. I've done limited testing so far on running the model, so take that workflow and the following advice with a grain of salt.

* **NOTE**: V2.5 is not a standard SDXL model! It's a Flow Matching SDXL! In the example workflow, the ModelSamplingSD3 node is _very_ important to get Comfy to use the model correctly. As for other software beyond Comfy, I don't yet know if they can be coaxed in a similar fashion.
* First, CFG. You can use the model with or without Perturbed Attention Guidance (PAG). The example workflow includes it; PAG should generally be set around 2.0. When using PAG, decrease your CFG lower than usual: settings between 2 and 5 seem to work okay. If you bypass PAG (connect the ModelSampling node directly to KSampler), I've generally found CFG should be between 4 and 6. As usual, with CFG you're trading off quality against generation diversity. Anything above 7, at least on photorealistic gens, starts to cook the image too much.
* For the negative prompt, I've tried leaving it blank, "low quality", and a longer universal negative prompt with things like "deformed", etc. Of those three options, just "low quality" performed the best for me.
* For the sampler, euler is the only one I have found that works reliably.
* For the schedule, keeping the ModelSampling shift parameter at 1.0 and setting the schedule to "beta" works well and provides a good balance between the quality of the image's structure and its details. Alternatively, you can use the normal schedule and set shift to between 3.0 and 12.0; I've found 6.0 to work best overall, with little benefit from going higher. This configuration gives better overall structure, but details suffer because the model spends more time on the low frequencies (see the sketch after this list). So it depends on what you're trying to generate, your preferences, etc. PAG helps with image structure as well, so enabling PAG and using the beta schedule with shift=1.0 is a nice all-around option.
* As for prompting, I don't yet know what works best. The model was trained on captions generated by JoyCaption Beta One in various modes, so it should understand a variety of ways of prompting. It was also trained on tag strings. I would lean toward well written, natural language captions for now. If you're prompting for NSFW things, try to use more vanilla/clinical terms like those used by chatbots, rather than slang.
* Images in the training dataset were assessed for overall quality, and their quality rating was included during training. So you can prompt with: worst quality, low quality, normal quality, high quality, best quality, and masterpiece quality. Including the quality in your prompt is not _required_, and it can be included anywhere in the prompt. I would caution against using masterpiece quality for now, since that tends to skew the model towards drawings rather than photoreal.
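For intuition on what the shift parameter is doing, here's a minimal sketch of the SD3-style timestep shift. I'm assuming the standard formula here; ComfyUI's actual implementation may differ in details.

```python
# Minimal sketch of the SD3-style timestep shift, assuming the usual formula
# sigma' = shift * sigma / (1 + (shift - 1) * sigma). Not ComfyUI's actual code.
import numpy as np

def shift_sigmas(sigmas: np.ndarray, shift: float) -> np.ndarray:
    """Remap flow-matching noise levels (1.0 = pure noise, 0.0 = clean image)."""
    return shift * sigmas / (1 + (shift - 1) * sigmas)

steps = np.linspace(1.0, 0.05, 10)  # 10 evenly spaced noise levels
print(np.round(shift_sigmas(steps, 1.0), 2))  # shift=1.0: schedule unchanged
print(np.round(shift_sigmas(steps, 6.0), 2))  # shift=6.0: most steps pushed toward high noise
```

With shift=6.0 most of the steps land near sigma=1.0, which is why structure improves (more time spent on low frequencies) while fine detail gets softer.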
## Model Details (What the Fuck is this thing!?)

bigASP V2.5 is an intermediary experiment I wanted to run while I prepared for V3. Like V2, it is a large scale finetune from SDXL, and it follows most of V2's approach except for these bizarre tweaks:

1. Dataset expanded from 6.7M to 13M images, by including additional sources as well as, significantly, anime. Why? Because I wanted to see if the model could dual purpose for both photoreal gens and anime gens. Also I wanted to see if photoreal gens could benefit from the more varied concepts present in anime/furry/etc images.
2. Training length expanded from 40M samples to 150M samples. One of my theories for the temperamental behavior of V2 is the scale of the training. Models like Pony v6 had a much larger amount of training; 150M samples brings V2.5 within the ballpark of that class of model.
3. Using JoyCaption Beta One instead of the Pre-Alphas used in V2. Beta One has more modes, and I tweaked its language and verbiage quite a bit, so hopefully V2.5 is less sensitive to prompts.
4. Swapping SDXL's training objective over to Rectified Flow Matching, like more modern models (e.g. Flux, Chroma, etc). This was done for two reasons. One, Flow Matching makes higher quality generations. And two, it allowed me to ditch SDXL's broken noise schedule. That latter bit greatly enhances the model's ability to control the overall structure of generations, resulting in fewer mangled-mess generations and extra limbs. It also allows V2.5 to generate more dynamic range, from very dark images to very bright images. (A rough sketch of the objective follows this list.)
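For reference, here's a minimal sketch of a rectified flow matching training step. This is the generic textbook formulation with hypothetical names (`model`, `x0`, `cond`), not the actual bigASP training code.

```python
# Generic rectified flow matching loss; `model`, `x0`, and `cond` are
# hypothetical stand-ins, not the actual bigASP training code.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    x1 = torch.randn_like(x0)                      # pure-noise endpoint
    t = torch.rand(x0.shape[0], device=x0.device)  # per-sample timestep in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # straight-line path between image and noise
    v_target = x1 - x0                             # constant velocity along that path
    v_pred = model(xt, t, cond)                    # network predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Because t=1 is exactly pure noise, there's no non-zero terminal SNR to work around, which (as I understand it) is the "broken noise schedule" issue in stock SDXL and why dynamic range improves.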
Overall it's a very silly experiment, but I wanted to use it as a way to get my feet wet for V3. The upshot is that I'm now far more familiar with the dynamics of flow matching and training at the 100M+ sample scale, and my training scripts now support multi-node training. I trained V2.5 on four nodes, each with 8x H100 SXM5 GPUs, for a total of 32 GPUs. That significantly speeds up the rate at which I can train models.

## Performance/Results

* V2.5 is indeed an incremental "improvement" over V2. V2 had lots of issues with reliability (lots of mangled-mess gens) and was very sensitive to prompts. So far I've found 2.5 to be a bit better in that regard, spitting out a decent gen 1 in 8 times versus V2's 1 in 40. Faces and hands are also better with the right settings (namely using PAG), though far, far from perfect. It's still temperamental at times, depending on what you prompt for, but a little better than V2.
* I think expanding the dataset was overall a win, giving the model more flexibility even for strictly photoreal gens. But V2.5 is generally _terrible_ at anime/drawings/etc. So I clearly need to work on that half of the dataset more going into V3.
* The expanded training definitely helped make the model more reliable and generalize a bit better. For example, it can generate unseen things like a bison wearing a spacesuit.
* Swapping over to Flow Matching definitely helped the dynamic range and lighting a _lot_ and improved overall quality. Though I've found Flow Matching also tends to produce overly contrasty images, requiring prompting tricks to tone it down.
* The structure of images is improved with Flow Matching and the non-broken noise schedule, though at least for V2.5 it has resulted in a trade-off between the quality of image structure and the quality of fine details. This is easiest to see when running the model with the normal schedule and adjusting shift from 1.0 to 6.0. At 1.0, the model generates a lot of extra limbs, mangled messes, etc., but details are good. At 6.0, overall structure is generally great, but the images are softer and more plasticky. The beta schedule is a kind of middle ground, though I don't think it's _quite_ right. I tried making my own schedule that provides a better trade-off, but it was only marginally better than beta.
* I noticed a rather significant improvement in the quality of gens when including "low quality" in the negative prompt versus leaving the negative blank. Not just in aesthetics, but also in overall image structure, which I find quite odd.

At this point I might do another blast of training on top of V2.5 with these tweaks:

* Unfreezing the text encoders and training them lightly (they were left completely frozen for this experiment so far)
* Adding garbage and mangled images to the dataset with the label "worst quality"

I suspect I can get a further boost in prompt adherence and generation quality by training the text encoders. What I've noticed between training V2.5 with the encoders frozen, versus V2 with one of the encoders unfrozen, is that the model gets a quick initial boost early in training when it can train the text encoders. My hypothesis is that the Unet is dumping some of its workload onto the text encoder. While that might provide a boost to image quality, it might also hurt the model's ability to be flexible and generalize. I think doing a quick 10M-sample additional run, unfreezing only the first couple of layers of the encoders, will strike a nice balance. And given the odd improvements caused by including "low quality" in the negative, I think explicitly adding garbage and mangled images to the dataset might actually help the model.

## Support

Want to support more dumbass experiments like this, bigASP v3, and JoyCaption?

https://ko-fi.com/fpgaminer