codehappy committed · verified
Commit 2be87a5 · 1 Parent(s): 94b860e

update readme for epoch 18

Files changed (1)
  1. README.md +18 -7
README.md CHANGED
@@ -9,10 +9,10 @@ base_model:
 A latent diffusion model (LDM) geared toward illustration, style composability, and sample variety. Addresses a few deficiencies in the SDXL base model; feels more like an SD 1.x with better resolution and much better prompt adherence.

 * Architecture: SD XL (base model is v1.0)
- * Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 18,000,000 steps (at epoch 17, batch size 4).

 Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image
- has from 3 to 17 different captions which are used interchangeably during training. There are 9.3 million images and 62 million captions in the dataset.

 The model is substantially better than the base SDXL model at producing images that look like film photographs, any kind of cartoon art, or old artist styles. It's also
 heavily tuned toward personal aesthetic preference.
@@ -36,17 +36,25 @@ genre ("pop art", "advertising", "pixel art"), source ("wikiart", "library of co
 **Aesthetic labelling:** All images in the Puzzle Box dataset have been scored by multiple IQA models. There are also over 700,000 human paired image preferences. This data is combined to label especially high- or low-aesthetic images. Aesthetic breakpoints are chosen
 on a per-style/genre tag basis (the threshold for "pixel art" is different from that for "classical oil painting").

- Training is broken into three phases: in the first phase, all images (regardless of aesthetic score) are used in training. In the second phase, bottom quartile-labelled
 images are removed from training. In the final phase, *only* images tagged as top quartile aesthetics are trained.

 **Other nifty tricks used:** Some less common techniques used in training Puzzle Box XL include:

 - *Attention masks*: constructed for images to exclude background or portions of the image not mentioned in captions/important to image content; we only update blocks that are not masked off.
- - *Lores-to-hires*: I save compute by training at lower resolution (512px) until the model learns new concepts satisfactorily, then training at higher resolution (768px).
- This allows later checkpoints to generate 1+ megapixel images without tiling or stuttering, while greatly speeding up earlier stages of training.

 Model checkpoints currently available:

 - from epoch 17, **18000k** training steps, 06 July 2025
 - from epoch 16, **16950k** training steps, 05 May 2025
 - from epoch 15, **15800k** training steps, 08 March 2025
@@ -54,15 +62,16 @@ Model checkpoints currently available:
 - from epoch 13, **11930k** training steps, 15 August 2024
 - from epoch 12, **10570k** training steps, 21 June 2024

- *Which checkpoint's best?* Later checkpoints have better aesthetics and better prompt adherence at higher resolution and lower CFG scale, but are also more 'opinionated'; longer conditioning may be necessary to get the generation as you like it. In particular, the latest checkpoints are trained mostly on consensus captions, which are highly accurate but also quite long. Earlier checkpoints may give larger sample variety on short conditioning, which (at lower resolution) may make them useful drafting models: searching for good noise seeds, etc. Earlier checkpoints may also be better for merging with other LDMs based on SD XL.

 This model has been trained carefully on top of the SDXL base, with a widely diverse training set at low learning rate. Accordingly, it should *merge* well with most other
 LDMs built off SDXL base. (Merging LDMs built off the same base is a form of transfer learning; you can add Puzzle Box concepts to other SDXL models this way. Spherical
 interpolation is best.)

- The U-Net self-attention layers are the layers most modified by the continued pretrain; comparing those layers to SD XL 1.0, the correlation is:
 | Epoch | Date | R-squared |
 | ----- | ---------- | --------- |
 | 17 | 2025-07-06 | 97.705% |
 | 16 | 2025-05-05 | 97.917% |
 | 15 | 2025-03-08 | 98.312% |
@@ -70,4 +79,6 @@ The U-Net self-attention layers are the layers most modified by the continued pr
 | 13 | 2024-08-05 | 98.876% |
 | 12 | 2024-06-21 | 99.167% |

 (For reference, Pony-family models, which are also based on SD XL 1.0 but are trained at much higher LR, paving over the model, are around 40%, and Playground-derived models, which are trained on the SD XL architecture from static, are below 25%.)
 
 A latent diffusion model (LDM) geared toward illustration, style composability, and sample variety. Addresses a few deficiencies in the SDXL base model; feels more like an SD 1.x with better resolution and much better prompt adherence.

 * Architecture: SD XL (base model is v1.0)
+ * Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 19,300,000 steps (at epoch 18, batch size 4). See below for more details.

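In concrete terms, "U-Net fully unfrozen, all-parameter continued pretraining" means something like the sketch below (using the Hugging Face diffusers API); the optimizer and the exact LR here are illustrative, not the actual training configuration:

```python
import torch
from diffusers import UNet2DConditionModel

# Start from the SDXL 1.0 base U-Net and continue pretraining every parameter.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.requires_grad_(True)   # fully unfrozen: no layers are excluded from the update
unet.train()

# Every parameter goes to the optimizer; the LR stays within roughly 3e-8 to 3e-7.
optimizer = torch.optim.AdamW(unet.parameters(), lr=3e-7)
```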
 Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image
+ has from 3 to 24 different captions which are used interchangeably during training. There are approximately 12 million images and 78 million captions in the dataset.

 The model is substantially better than the base SDXL model at producing images that look like film photographs, any kind of cartoon art, or old artist styles. It's also
 heavily tuned toward personal aesthetic preference.
 
 **Aesthetic labelling:** All images in the Puzzle Box dataset have been scored by multiple IQA models. There are also over 700,000 human paired image preferences. This data is combined to label especially high- or low-aesthetic images. Aesthetic breakpoints are chosen
 on a per-style/genre tag basis (the threshold for "pixel art" is different from that for "classical oil painting").

+ **Staged training:** Each epoch is broken into phases. Up to epoch 15, there were three phases: in the first phase, all images (regardless of aesthetic score) are used in training. In the second phase, bottom quartile-labelled
 images are removed from training. In the final phase, *only* images tagged as top quartile aesthetics are trained.

+ In later epochs, a form of curriculum training is used: a complexity proxy is calculated for every image in the dataset, and the epoch begins with all (non-bottom-quartile) images of below-median complexity. The second phase uses non-bottom-quartile images of any complexity. The third and fourth (short) phases use solely consensus labels, with and without the maximum entropy restriction.
+
+ Epoch length was determined by the original size of the training set, and the best checkpoint that emerges after model soup experimentation is released.
+
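Roughly, the phase schedule for a recent epoch can be sketched as below; the per-record fields (`aesthetic_quartile`, `complexity`, `caption_kind`) are illustrative names only, and the maximum-entropy restriction on the final phases is omitted:

```python
import zlib

def complexity_proxy(image_bytes: bytes) -> float:
    """Compression ratio as a cheap complexity proxy: harder-to-compress images score higher."""
    return len(zlib.compress(image_bytes)) / len(image_bytes)

def in_phase(record: dict, phase: int, median_complexity: float) -> bool:
    """Decide whether a dataset record participates in a given phase of the epoch."""
    if phase == 1:        # below-median complexity, bottom aesthetic quartile excluded
        return record["aesthetic_quartile"] > 1 and record["complexity"] <= median_complexity
    if phase == 2:        # any complexity, bottom aesthetic quartile still excluded
        return record["aesthetic_quartile"] > 1
    if phase in (3, 4):   # short final phases: consensus-captioned images only
        return record["caption_kind"] == "consensus"
    raise ValueError(f"unknown phase: {phase}")
```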
  **Other nifty tricks used:** Some less common techniques used in training Puzzle Box XL include:

+ - *Data augmentation/conditional dropout*: taking inspiration from GAN-space, transformations are applied (with some probability) to both the images and their labels during training. For example, an image might be converted to grayscale, rotated, or blurred; a booru-style caption will have its order of tags randomized, and an English caption might have its sentences re-ordered. Labels may also be dropped out, wholly or partially. This helps the model generalize and avoid overfitting. (See the sketch after this list.)
+ - *Mixed data*: my computer vision datasets contain extremely high-quality captions (written by expert humans, or consensus captions drawn from a dozen+ human/machine captions) and quite low-quality captions (output by a super-fast but brain-damaged bag of classifiers, etc.) They also take different forms: there are captions written in normal English, in other human languages, or booru-style captions that are just lists of tags applying to the image. *All* captions for a given image are candidates to be used as labels in supervised training. In any epoch, the high-quality captions are given a higher probability of being chosen as the label, but the short or bizarre captions are candidates as well. This permits the model to learn to respond to different prompting styles, rather than always expecting the verbose, detailed consensus style.
+ - *Curriculum training*: compression ratio is used as a proxy for sample complexity. Samples with a low complexity measure are trained early in the epoch, before samples with a high complexity measure. This improves the speed of model convergence. It also identifies the noisiest, most troublesome, and difficult-to-learn samples in the training set, which can be singled out for improved labels.
+ - *Synthetic data*: besides rendered images of text and raytracer output, the training set contains hundreds of thousands of Puzzle Box's own generations. This reinforces desired aesthetics and improves the quality of generations on the fringe of the model's capabilities.
 - *Attention masks*: constructed for images to exclude background or portions of the image not mentioned in captions/important to image content; we only update blocks that are not masked off.
+ - *Lores-to-hires*: I save compute by training at lower resolution (512px) until the model learns new concepts satisfactorily, then training at higher resolution (768px). This allows later checkpoints to generate 1+ megapixel images without tiling or stuttering, while greatly speeding up earlier stages of training.
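As a sketch of the caption handling behind the *Data augmentation/conditional dropout* and *Mixed data* items above: each image carries several candidate captions of different kinds and qualities, one is sampled per training step, and the chosen label may then be perturbed or dropped. The record format, weights, and probabilities below are hypothetical, not the actual pipeline:

```python
import random

# Hypothetical per-image record: (caption text, sampling weight, kind).
# Consensus/expert captions get larger weights than quick machine captions.
captions = [
    ("a watercolor painting of a farmhouse beside a river at dusk", 4.0, "consensus"),
    ("watercolor, farmhouse, river, dusk",                          1.0, "booru"),
]

def pick_label(captions, p_drop_whole=0.05, p_shuffle_tags=0.5):
    """Sample one caption, weighted toward high-quality captions, then apply
    label-side augmentation / conditional dropout."""
    if random.random() < p_drop_whole:
        return ""                                   # whole-label dropout (unconditional step)
    texts, weights, kinds = zip(*captions)
    i = random.choices(range(len(texts)), weights=weights, k=1)[0]
    text, kind = texts[i], kinds[i]
    if kind == "booru" and random.random() < p_shuffle_tags:
        tags = [t.strip() for t in text.split(",")]
        random.shuffle(tags)                        # booru captions: randomize tag order
        text = ", ".join(tags)
    return text
```

Image-side transforms (grayscale conversion, rotation, blur) are applied probabilistically in the data loader in the same spirit.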

 Model checkpoints currently available:

+ - from epoch 18, **19300k** training steps, 03 October 2025
 - from epoch 17, **18000k** training steps, 06 July 2025
 - from epoch 16, **16950k** training steps, 05 May 2025
 - from epoch 15, **15800k** training steps, 08 March 2025

 - from epoch 13, **11930k** training steps, 15 August 2024
 - from epoch 12, **10570k** training steps, 21 June 2024

+ *Which checkpoint's best?* You probably just want the latest checkpoint, unless you're interested in "model soup" merging approaches. Later checkpoints have better aesthetics and better prompt adherence at higher resolution and lower CFG scale, but are also more 'opinionated'; longer conditioning may be necessary to get the generation as you like it. In particular, the latest checkpoints are trained mostly on consensus captions, which are highly accurate but also quite long. Earlier checkpoints may give larger sample variety on short conditioning, which (at lower resolution) may make them useful drafting models: searching for good noise seeds, etc. Earlier checkpoints may also be better for merging with other LDMs based on SD XL.

 This model has been trained carefully on top of the SDXL base, with a widely diverse training set at low learning rate. Accordingly, it should *merge* well with most other
 LDMs built off SDXL base. (Merging LDMs built off the same base is a form of transfer learning; you can add Puzzle Box concepts to other SDXL models this way. Spherical
 interpolation is best.)

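A tensor-by-tensor spherical interpolation (slerp) merge of two SDXL-based checkpoints can be sketched like this; the 0.5 blend ratio is just a default, not a recommendation:

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, treated as flat vectors."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    omega = torch.acos(torch.clamp(
        torch.dot(a_flat / (a_flat.norm() + eps), b_flat / (b_flat.norm() + eps)), -1.0, 1.0))
    if omega.abs() < eps:                           # nearly parallel: plain lerp is fine
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)

def slerp_merge(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    """Merge two state dicts built off the same base, slerping every shared tensor."""
    return {k: slerp(v, sd_b[k], t) if k in sd_b else v for k, v in sd_a.items()}
```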
+ The U-Net attention layers are the layers most modified by the continued pretrain; comparing those layers to SD XL 1.0, the correlation is:
 | Epoch | Date | R-squared |
 | ----- | ---------- | --------- |
+ | 18 | 2025-10-03 | 97.426% |
 | 17 | 2025-07-06 | 97.705% |
 | 16 | 2025-05-05 | 97.917% |
 | 15 | 2025-03-08 | 98.312% |

 | 13 | 2024-08-05 | 98.876% |
 | 12 | 2024-06-21 | 99.167% |

+ High correlation indicates minimal mode collapse/catastrophic forgetting of base-model learnings; i.e. "what SDXL can do, this model can still do." The improvements in generation quality and prompt adherence are clear even while cleaving close to the original model.
+
 (For reference, Pony-family models, which are also based on SD XL 1.0 but are trained at much higher LR, paving over the model, are around 40%, and Playground-derived models, which are trained on the SD XL architecture from static, are below 25%.)
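For anyone who wants to run the same comparison against their own checkpoints, one plausible way to compute it is the squared Pearson correlation over the concatenated attention weights of two U-Net state dicts; selecting layers by the substring "attn" and the exact correlation definition are assumptions here, not necessarily how the table above was produced:

```python
import torch

def attention_r_squared(sd_base: dict, sd_tuned: dict) -> float:
    """Squared Pearson correlation between the flattened attention weights of two U-Nets."""
    keys = [k for k in sd_base if "attn" in k and k in sd_tuned]
    x = torch.cat([sd_base[k].flatten().float() for k in keys])
    y = torch.cat([sd_tuned[k].flatten().float() for k in keys])
    xc, yc = x - x.mean(), y - y.mean()
    r = torch.dot(xc, yc) / (xc.norm() * yc.norm())
    return float(r ** 2)
```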