Typical anime image style can be described in a 6-dim vector

Published August 2, 2025

After training an image embedding model on anime images from the Danbooru dataset, I noticed something interesting: the intrinsic dimension of the output is consistently around 6!

What does that mean? It means you can use just 6 numbers to fully describe the style of a typical anime image!

In this article, I will cover:

  1. Why it is useful
  2. Training configuration
  3. How the styles are distributed
  4. What each of the 6 numbers means in an image


Why is this useful & why I wanted to do this

Initially, my goal was to train a small diffusion model from scratch using the Danbooru dataset, to familiarize myself with various diffusion model concepts.

Many diffusion models, though, choose to use artist tags to control the style of the output images. This is especially common in anime-focused models, like many SDXL finetunes.

I am really not a fan of that, for three reasons:

  1. Many artists share very similar styles, making many artist tags redundant.
  2. Some artists have more than one distinct art style in their works. A basic example: sketches vs. finished images.
  3. Prone to content bleeding. If the artist whose tag you choose draws lots of repeating content, it's very likely that content will bleed into your output even though you didn't prompt for it.

Sure, for the third issue, you can work around it with a negative prompt. But sometimes even that isn't effective enough. It would be easier if the style could be controlled by something more direct.

So, mostly inspired by the PonyV7 model, I landed on the idea of using a style embedding model.

What does that mean? It means a model that takes in images of arbitrary sizes and outputs a style vector for each image. The style vector lives in an N-dimensional space and is essentially just a list of numbers of length N. Each number in the list corresponds to a specific style element of the input image.

If we can somehow inject this style vector alongside the training image into the diffusion model, we could in theory obtain a diffusion model that uses this style vector to determine what the output style would be like, instead of relying on the artist tags.

What, and how many, style elements are there? I didn't know yet; that was for the training to figure out. But they would probably be things like thick vs. thin lines, realistic vs. cartoon, messy vs. clean, and so on.

So I dug through Hugging Face, trying to find whether anyone had ever had a similar thought. Surprisingly, I couldn't find any model that does this for anime images.

That means I would have to build my own model.

At around the same time, it turned out that training a diffusion model of acceptable quality was too much for the graphics cards I own. So I ditched that idea and focused purely on the style embedding model. Hopefully, it will be useful for other people, like whoever is reading this article right now!

Training Configuration

Training a style embedding model is HARD! It's a much harder task than classification and is very susceptible to overfitting. No wonder I couldn't find other people doing this.

After countless trials, I finally found a configuration that produced a well-fitted, small model for this task.

Dataset

Preparing the dataset is probably the most time-consuming step of many deep learning tasks. Initially I tried simple contrastive learning pairs, then triplets (https://lilianweng.github.io/posts/2021-05-31-contrastive/). Those datasets were only 6-8k samples in size; even with all the data augmentations I used, they either didn't work or overfitted severely.

Triplets: [Anchor, Positive, Negative]. The positive is an image with a style similar to the anchor image, while the negative is an image with a style dissimilar to the anchor image.

Some research revealed that I would probably need 10k+ triplets for effective learning. I definitely didn't want to do that much manual labeling.

However, we are using the Danbooru dataset. It has two nice properties we can take advantage of:

  1. Images from the same artist are likely to share a similar style.
  2. Images from different artists are more likely to be dissimilar in style.

Using these properties, I came up with a solution that dynamically constructs new triplets based on artist-level annotations, which massively reduced the manual annotation work.

The training ground truth was obtained using this two-part approach. Part 1:

  1. Find artists with at least 5 images and at most 60 images.
  2. Order them alphabetically.
  3. One by one, inspect their portfolio. If the artist has a distinct and stable style, manually remove the images that are too dissimilar to the style of other images of this artist. Then record this artist and all their remaining works into a text file.

By doing so, I obtained a text file that looks like this:

artist1: 1.webp 2.webp 3.webp 4.webp 5.webp
artist2: 6.webp 7.webp 8.webp 9.webp 10.webp
...

Each line represents a group of images that I know would have similar styles to each other.

I skipped artists that don't have a stable and distinct style. I also skipped artists with overly common styles once I had already recorded a dozen of them.

517 artists were recorded as my training set (excluding the skipped artists).

Part 2 (using the data from Part 1):

  1. Pick a random artist (a) that doesn't satisfy condition A. Then pick another random artist (b).
  2. Compare their portfolios to decide whether their styles are "similar", "dissimilar" or "I cannot tell".
  3. (a) remains unchanged, replace (b) with a new random artist.
  4. Repeat steps 2 and 3 until (a) satisfies condition A.
  5. Repeat steps 1 to 4 until all artists satisfy condition A.

Condition A: This artist has at least 4 other artists who are marked as "dissimilar" to them.

By doing so, I obtained a second text file that looks like this:

artist1 vs artist2: 1
artist3 vs artist4: 0
artist5 vs artist6: NA
...

1 means similar, 0 means dissimilar, and NA means I cannot tell.

A total of 2406 similar/dissimilar pairs were collected (most of which are dissimilar).

During the labelling, I tried to be as objective as possible, which means I only labelled a pair as "similar" if I was very certain the two artists have similar styles, or "dissimilar" if I was very certain they were different.

It won't be "perfect", though, since style is a complex thing and everyone has a different definition of it.

The two text files I created can be obtained here.

Anyway, with the above text files created, we can then dynamically generate triplets using this process (a code sketch follows the list):

  1. Choose a random image as the anchor image
  2. Find the artist this image belongs to
  3. Choose random image(s) from artists that have similar styles, forming the positive image(s)
  4. Choose random image(s) from artists that have dissimilar styles, forming the negative image(s)

Note that an artist always has a similar style to itself.
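Here is a minimal sketch of that sampling process, assuming the two text files follow the formats shown above. The file paths, function names, and parsing details are my own placeholders, not the actual training code; the real pipeline also applies the augmentations described below.

import random
from collections import defaultdict

def load_groups(path):
    """Parse 'artist: 1.webp 2.webp ...' lines into {artist: [image files]}."""
    groups = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            artist, images = line.strip().split(":", 1)
            groups[artist.strip()] = images.split()
    return groups

def load_pairs(path):
    """Parse 'artistA vs artistB: 1/0/NA' lines into similar/dissimilar lookups."""
    similar, dissimilar = defaultdict(set), defaultdict(set)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            pair, label = line.rsplit(":", 1)
            a, b = (s.strip() for s in pair.split("vs"))
            label = label.strip()
            if label == "1":
                similar[a].add(b); similar[b].add(a)
            elif label == "0":
                dissimilar[a].add(b); dissimilar[b].add(a)
            # 'NA' pairs carry no usable information and are skipped
    return similar, dissimilar

def sample_triplet(groups, similar, dissimilar, n_pos=16, n_neg=16):
    # 1. choose a random anchor image, 2. look up its artist
    all_images = [(img, artist) for artist, imgs in groups.items() for img in imgs]
    anchor, artist = random.choice(all_images)
    # an artist always has a similar style to itself
    pos_artists = list(similar[artist]) + [artist]
    # condition A guarantees every artist has at least 4 dissimilar artists
    neg_artists = list(dissimilar[artist])
    positives = [random.choice(groups[random.choice(pos_artists)]) for _ in range(n_pos)]
    negatives = [random.choice(groups[random.choice(neg_artists)]) for _ in range(n_neg)]
    return anchor, positives, negatives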

What about the validation set? I manually annotated some triplets during my earlier approaches, and those are used as the validation set now.

Preprocessing

Since the network is capable of taking arbitrarily sized input, there is no need for padding. I only used a rather simple image augmentation:

import random

import torch.nn as nn
from torchvision.transforms import v2

def closest_interval(img, interval=8):
    # Center-crop so that height and width are multiples of `interval`
    c, h, w = img.shape
    new_h = h - (h % interval) if h % interval != 0 else h
    new_w = w - (w % interval) if w % interval != 0 else w
    h_start = (h - new_h) // 2
    w_start = (w - new_w) // 2
    new_h, new_w = max(new_h, interval), max(new_w, interval)
    return img[:, h_start:h_start + new_h, w_start:w_start + new_w]


class RandomSizeTransform(nn.Module):
    def __init__(self, smallest_ratio, size_range):
        super(RandomSizeTransform, self).__init__()
        self.smallest_ratio = smallest_ratio
        self.size_range = size_range

    def forward(self, img):
        c, h, w = img.shape
        # Random crop keeping between `smallest_ratio` and 100% of each side
        ratio = random.uniform(self.smallest_ratio, 1)
        target_h, target_w = int(h * ratio), int(w * ratio)
        h_start, w_start = random.randint(0, h-target_h), random.randint(0, w-target_w)
        img = img[:, h_start:h_start+target_h, w_start:w_start+target_w]
        # adj_size is a small resizing helper (not shown here) that rescales
        # the image to the randomly chosen target size
        target_size = random.randint(*self.size_range)
        img = adj_size(img, size=target_size)
        return closest_interval(img)

transforms = v2.Compose([
    RandomSizeTransform(0.8, (1024, 1400)),
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomVerticalFlip(p=0.5),
    v2.ColorJitter(0.3, 0.3, 0.2, 0.2),
    v2.RandomGrayscale(p=0.2),
])

After the transformations, the images then get scaled to between -1 and +1: img = 2*(img/255)-1

Note that I use random grey-scaling here. I personally think a greyscaled version of an image has the same style as the original image. If you are training your own image embedding model, you might want to disable that.

Loss Function

My goal here is to have embeddings with similar styles close together, while embeddings with dissimilar styles far apart.

The obvious choice is the triplet loss, which tries to make neg_dis at least K greater than pos_dis:

loss = max(K - (neg_dis - pos_dis), 0)

where pos_dis is the distance between the anchor sample and the positive sample, and neg_dis is the distance between the anchor sample and the negative sample.

K is the margin distance; it can be any positive number, and I went with 3 here.

However, plain triplet loss is rather inefficient here. Because we don't do any padding and our model takes variable input sizes, we can only use a batch size of 1, so at each training step the network would only see 3 images and 1 triplet. An obvious way to improve training efficiency is to utilize more samples during each training step.

So I designed this:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSampleTripletLossAllPairs(nn.Module):
    def __init__(self, margin=1.0, p=2):
        """
        Considers all positive-negative pairs.

        Args:
            margin: Margin for triplet loss
            p: The norm degree for pairwise distance
        """
        super(MultiSampleTripletLossAllPairs, self).__init__()
        self.margin = margin
        self.p = p

    def calculate_average_norms(self, positive, negative):
        pos_norms = torch.linalg.vector_norm(positive, ord=self.p, dim=1)  # (N,)
        neg_norms = torch.linalg.vector_norm(negative, ord=self.p, dim=1)  # (M,)
        return pos_norms.mean(), neg_norms.mean()

    def forward(self, anchor, positive, negative):
        """
        Args:
            anchor: (1, D) - single anchor embedding
            positive: (N, D) - N positive samples
            negative: (M, D) - M negative samples
        Returns:
            loss value, plus distance and norm statistics for logging
        """
        avg_pos_norm, avg_neg_norm = self.calculate_average_norms(positive, negative)
        anchor, positive, negative = anchor.unsqueeze(0), positive.unsqueeze(0), negative.unsqueeze(0)
        # Compute positive distances (1, 1, N)
        pos_dist = torch.cdist(anchor, positive, p=self.p)
        # Compute negative distances (1, 1, M)
        neg_dist = torch.cdist(anchor, negative, p=self.p)

        # Reshape so broadcasting yields all N x M triplet combinations
        pos_dist = pos_dist.permute(2, 1, 0).squeeze(1)  # (N, 1)
        neg_dist = neg_dist.squeeze(0)  # (1, M)

        # Compute loss for all combinations
        losses = F.relu(pos_dist - neg_dist + self.margin)

        # Reduce with a mean over all N*M triplets
        return losses.mean(), pos_dist.mean(), neg_dist.mean(), avg_pos_norm, avg_neg_norm

For N positive embeddings and M negative embeddings, this effectively gives N*M triplets, which smooths the loss landscape and accelerates training.

To prevent the embeddings from drifting too far away from each other, I also added the embedding norm as a penalty term: loss += torch.linalg.vector_norm(anchor_embd, ord=2, dim=1).mean() * 0.001
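Putting the pieces together, one training step might look roughly like this. This is only a sketch: model, anchor_img, pos_imgs and neg_imgs are placeholders for the embedding network and one dynamically generated triplet batch, and I'm assuming the margin here is the K=3 mentioned above.

import torch

criterion = MultiSampleTripletLossAllPairs(margin=3.0, p=2)

# model: the embedding network; anchor_img: (C, H, W); pos_imgs / neg_imgs: lists of image tensors
anchor_embd = model(anchor_img.unsqueeze(0))                          # (1, D)
pos_embds = torch.cat([model(img.unsqueeze(0)) for img in pos_imgs])  # (N, D)
neg_embds = torch.cat([model(img.unsqueeze(0)) for img in neg_imgs])  # (M, D)

loss, pos_dist, neg_dist, _, _ = criterion(anchor_embd, pos_embds, neg_embds)
# norm penalty keeps the embeddings from spreading arbitrarily far apart
loss = loss + torch.linalg.vector_norm(anchor_embd, ord=2, dim=1).mean() * 0.001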

Network Architecture

Initially, I tried an architecture similar to VGG, where images go through a sequence of convolution and pooling layers before being reduced to a fixed size via AdaptiveAvgPool2d, then flattened and passed through an MLP to obtain the final embedding.

It didn't work so well.

Later I found the Gram matrix. It uses a matrix product to obtain the similarity between every pair of channels, and these similarity values are then fed into the MLP. I replaced the adaptive pooling with a Gram matrix layer.

It has some nice properties:

  1. Spatially invariant
  2. Output has fixed shape
  3. In theory, it is very capable of capturing the style of an image

class CompactGramMatrix(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.in_channels = in_channels
        # Precompute indices for lower triangle (including diagonal)
        self.register_buffer('tril_indices',
                             torch.tril_indices(in_channels, in_channels, offset=0, dtype=torch.int32))

    def forward(self, x):
        """
        Input: (B, C, H, W)
        Output: (B, C*(C+1)//2) compact Gram features
        """
        b, c, h, w = x.size()
        x = x.view(b, c, -1) / ((h * w) ** 0.5)  # Flatten spatial dimensions -> (B, C, H*W), then normalise

        # Compute full Gram matrix (still needed temporarily)
        gram = torch.bmm(x, x.transpose(1, 2))  # (B, C, C)

        # Extract lower triangle including diagonal
        compact_gram = gram[:, self.tril_indices[0], self.tril_indices[1]]  # (B, n_unique)
        return compact_gram

Note that I only use the lower triangle, since the Gram matrix is symmetric: the upper triangle contains exactly the same information as the lower triangle.

And my model looks like this (check the Hugging Face model page for more detail!):

class EmbeddingNetwork(nn.Module):
    def __init__(self):
        super(EmbeddingNetwork, self).__init__()
        self.input_conv = nn.Conv2d(3, 16, 5, padding='same', padding_mode='reflect', bias=False)
        self.conv1 = ResBlock(16, 3, 2)
        self.pool1 = ConvPool(16, 32) # 2
        self.conv2 = ResBlock(32, 3, 2)
        self.pool2 = ConvPool(32, 64) # 4
        ...
        self.gram = CompactGramMatrix(256)
        self.compact = nn.Linear(256*(256+1)//2, 1024, bias=False)
        self.conpactnorm = nn.LayerNorm(1024, elementwise_affine=False)
        self.fc1 = nn.Linear(1024, 1024, bias=False)
        self.fc1norm = nn.LayerNorm(1024, elementwise_affine=False)
        self.act = nn.LeakyReLU(inplace=True)
        ...

    def forward(self, x):
        x = self.input_conv(x)
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        ...
        x = self.gram(x)
        x = self.compact(x)
        x = self.conpactnorm(x)
        x = self.act(self.fc1norm(self.fc1(x)))
        ...

Training Hyperparameters

Training was done using PyTorch Lightning.

lr = 0.0001

weight_decay = 0.0001

AdEMAMix optimizer

ExponentialLR scheduler, with a gamma of 0.99, applied every epoch.

Batch size of 1. accumulate_grad_batches of 16.

With every anchor image, 16 positive images and 16 negative images are used.

Trained for 15 epochs on 2 A100 GPUs, for a total of 3434 optimizer updates.
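For reference, the optimizer and scheduler setup could look something like the sketch below. This is my own illustration, not the actual training script: the module name is hypothetical, and I'm assuming AdEMAMix comes from a third-party package such as pytorch_optimizer; substitute whichever implementation you actually use.

import pytorch_lightning as pl
from torch.optim.lr_scheduler import ExponentialLR
# Assumption: AdEMAMix from the third-party pytorch_optimizer package
from pytorch_optimizer import AdEMAMix


class StyleEmbeddingModule(pl.LightningModule):
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model          # e.g. EmbeddingNetwork()
        self.criterion = criterion  # e.g. MultiSampleTripletLossAllPairs(margin=3.0)

    # training_step would compute the multi-sample triplet loss shown earlier

    def configure_optimizers(self):
        optimizer = AdEMAMix(self.parameters(), lr=1e-4, weight_decay=1e-4)
        # gamma=0.99, stepped once per epoch
        scheduler = ExponentialLR(optimizer, gamma=0.99)
        return {"optimizer": optimizer,
                "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}


# Batch size of 1 (variable image sizes), gradients accumulated over 16 steps
trainer = pl.Trainer(max_epochs=15, accelerator="gpu", devices=2,
                     accumulate_grad_batches=16)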

Findings: Style distribution

I trained 5 models with the exact same configuration, but different output dimensions: 128, 32, 16, 8 and 6 (the 128-dim model was trained first).

128 dimensions should be far more than needed to describe the style of an image. However, by running intrinsic dimension estimation with the skdim package, we can get a sense of what the minimal output dimension would be.

import skdim

# predictions: (num_samples, output_dim) array of embeddings produced by the trained model
estimators = [skdim.id.TwoNN(), skdim.id.CorrInt(), skdim.id.DANCo()]
results = {}

for est in estimators:
    est.fit(predictions)
    results[type(est).__name__] = est.dimension_

print("Intrinsic Dimension Estimates:")
for name, dim in results.items():
    print(f"{name}: {dim:.2f}")

5000 random images from the entire dataset were used for each evaluation. The estimated dimensions are stable across different runs, to within about ±0.02.

Results:

Output Dim   TwoNN method   CorrInt method   DANCo method
128          6.00           5.16             7.98
32           7.29           6.19             9.03
16           5.23           4.49             6.60
8            4.86           4.51             6.00
6            4.87           4.44             6.00

The DANCo method is the most accurate of the three. We can see that the intrinsic dimension should be somewhere between 6 and 8.

Personally, I found that using 6 dimensions gives better explainability.

What does each of the 6 dimensions represent?

Let's have a look at images near the center of the embedding space to get some idea:

(image)
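As a rough sketch of how such galleries can be selected (the actual code I used is in gallery_review.py), assuming embeddings is an (N, 6) array of style vectors and paths the matching list of image files:

import numpy as np

def central_images(embeddings, paths, k=16):
    """Images closest to the centroid of the embedding space."""
    dists = np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1)
    return [paths[i] for i in np.argsort(dists)[:k]]

def extreme_images(embeddings, paths, dim, k=16, high=True):
    """Images with the highest (or lowest) component along one dimension."""
    order = np.argsort(embeddings[:, dim])
    idx = order[-k:] if high else order[:k]
    return [paths[i] for i in idx]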

Dimension 1:

Images that have very high components in Dim 1:

(image)

Images that have very low components in Dim 1: no obvious semantic meaning that I could find, so not shown here.

Dimension 1 seems related to sharp contrast.

Dimension 2:

Images that have very high components in Dim 2:

(image)

Images that have very low components in Dim 2:

(image)

Dimension 2 seems related to the complexity of the drawing.

Dimension 3:

Images that have very high components in Dim 3: no obvious semantic meaning that I could find, so not shown here.

Images that have very low components in Dim 3:

(image)

Dimension 3 seems related to smooth, oil-painting-like rendering.

Dimension 4:

Images that have very high components in Dim 4:

(image)

Images that have very low components in Dim 4: no obvious semantic meaning that I could find, so not shown here.

Dimension 4 seems to have to do with thick lines or blocky colors.

Dimension 5:

Images that have very high components in Dim 5: no obvious semantic meaning that I could find, so not shown here.

Images that have very low components in Dim 5:

(image)

Dimension 5 seems to have to do with complex scenes of landscapes or many characters.

Dimension 6:

Images that have very high components in Dim 6:

(image)

Images that have very low components in Dim 6:

(image)

Dimension 6 seems to have something to do with comics and how shading is applied.

Do they form any clusters?

Originally, I expected that after projecting the embeddings with t-SNE, I would see distinct clusters representing common styles.

However, it doesn't quite seem so. With random images, the styles don't really form significant clusters; they are pretty much uniformly distributed in the embedding space.

(image)

(image)

Then I did some clustering using AgglomerativeClustering, setting the distance_threshold to 32. A total of 116 clusters were formed from the 5000 samples. Here are some images from the same cluster (randomly selected, no cherry-picking):

(images)

Well... I would call that a success!
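For reference, the projection and clustering above can be reproduced with something like this sketch. It assumes embeddings is the (5000, 6) array of style vectors, and whether the clustering runs on the raw embeddings or on the t-SNE projection is my assumption here; check gallery_review.py for the exact setup.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering

# embeddings: (5000, 6) array of style vectors (assumed variable name)
tsne_2d = TSNE(n_components=2).fit_transform(embeddings)  # for the scatter plots above

# n_clusters=None lets the distance threshold decide how many clusters form
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=32)
labels = clustering.fit_predict(embeddings)
print("number of clusters:", labels.max() + 1)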

Try it yourself!

If you want to try this model, you can find it here.

minimal_script.py provides the minimal code for running an image through the network and obtaining an output, while gallery_review.py contains the code I used to generate those visualisations and the clustering.

The training data can be found here.
