Check out my blog!
You can use 6 numbers to fully describe the style of an (anime) image!
What is it and what can it do?
Many diffusion models, however, use artist tags to control the style of their output images. I am really not a fan of that, for three reasons:
- Many artists share very similar styles, making many artist tags redundant.
- Some artists have more than one distinct art style in their works. A basic example: sketches vs. finished images.
- Prone to content bleeding. If the artist behind the tag you choose draws a lot of recurring content, that content is very likely to bleed into your output even though you never prompted for it.
One way to overcome this is to use a style embedding model. It's a model which takes in images of arbitrary sizes and outputs a style vector for each image. The style vector lives in an N-dimensional space, and is essentially just a list of numbers with a length of N. Each number in the list corresponds to a specific style element the input image has.
Images with similar styles should have similar embeddings (low distance), while images with different styles will have embeddings that are far apart (high distance).
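As a quick illustration of what "low distance" vs. "high distance" means in practice, here is a minimal sketch that compares some 6-dimensional style vectors. The vectors and helper functions below are made-up placeholders, not outputs of the actual model:

```python
# Minimal sketch (not from the repository) of comparing style embeddings.
# The 6-dimensional vectors below are made-up placeholders.
import numpy as np

def style_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two style vectors."""
    return float(np.linalg.norm(a - b))

def style_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two style vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for three images: two sketches and one painting.
sketch_a   = np.array([ 0.91, -0.22,  0.05,  1.30, -0.47,  0.12])
sketch_b   = np.array([ 0.88, -0.18,  0.01,  1.24, -0.52,  0.09])
painting_c = np.array([-1.05,  0.77,  0.63, -0.40,  0.95, -0.81])

print(style_distance(sketch_a, sketch_b))    # small: similar style
print(style_distance(sketch_a, painting_c))  # large: different style
```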
The included Python files give minimal usage examples. minimal_script.py provides the minimal code for running an image through the network and obtaining an output, while gallery_review.py contains the code I used to generate the visualisations and clustering.
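For a rough idea of what such a forward pass looks like, here is a hedged sketch only; refer to minimal_script.py for the actual usage. The checkpoint filename, the normalisation values, and the assumption that the network is a plain torch.nn.Module returning a 6-dimensional vector are all mine, not taken from the repository:

```python
# Illustrative sketch only; see minimal_script.py for the real usage.
# The checkpoint path, the normalisation, and the assumption that the
# network is a plain nn.Module returning a 6-dim vector are assumptions.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                       # arbitrary input size is kept
    transforms.Normalize(mean=[0.5, 0.5, 0.5],   # assumed normalisation
                         std=[0.5, 0.5, 0.5]),
])

model = torch.load("style_embedding_v3.pt", map_location="cpu")  # hypothetical file
model.eval()

image = Image.open("example.png").convert("RGB")
batch = preprocess(image).unsqueeze(0)           # shape: (1, 3, H, W)

with torch.no_grad():
    style_vector = model(batch)                  # shape: (1, 6)

print(style_vector.squeeze(0).tolist())          # the 6 style numbers
```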
Training data is here.
Training Hyperparameters
For the current version (v3):
Training was done using PyTorch Lightning.
- lr = 0.0001
- weight_decay = 0.0001
- AdEMAMix optimizer
- ExponentialLR scheduler, with a gamma of 0.99, applied every epoch.
- Batch size of 1, with accumulate_grad_batches of 16.
- For every anchor image, 16 positive images and 16 negative images are used.
- Trained for 15 epochs on 2 A100 GPUs, for a total of 3434 optimizer updates (see the sketch after this list for how these settings fit together).
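To make the recipe concrete, here is a hedged Lightning sketch of this setup, not the actual training code. The backbone, the triplet margin, and the use of AdamW as a stand-in for AdEMAMix (which does not ship with core PyTorch) are my assumptions; the learning rate, weight decay, gamma, epochs, gradient accumulation, and 2-GPU layout follow the numbers listed above:

```python
# Hedged sketch of the training setup described above, not the actual code.
# Backbone, margin, and AdamW (standing in for AdEMAMix) are assumptions;
# lr, weight_decay, gamma, epochs, accumulation, and devices follow the list.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class StyleEmbedder(pl.LightningModule):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone                         # assumed to output 6-dim vectors
        self.loss_fn = nn.TripletMarginLoss(margin=0.2)  # margin is a guess

    def training_step(self, batch, batch_idx):
        # Each batch: 1 anchor, 16 positives, 16 negatives (per the text).
        anchor, positives, negatives = batch
        a = self.backbone(anchor)                        # (1, 6)
        p = self.backbone(positives)                     # (16, 6)
        n = self.backbone(negatives)                     # (16, 6)
        loss = self.loss_fn(a.expand_as(p), p, n)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # AdamW stands in for AdEMAMix here.
        opt = torch.optim.AdamW(self.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "interval": "epoch"}}

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    max_epochs=15,
    accumulate_grad_batches=16,  # 16 anchor batches per optimizer update
)
# trainer.fit(StyleEmbedder(backbone=...), train_dataloaders=...)
```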