HDM-xut-340M-Anime

World's smallest, cheapest anime-style T2I base

Source Code - HDM | Source Code - HDM(ComfyUI) | Model - HDM (Here) | Document - Tech Report

Introduction

HDM (Home-made Diffusion Model) is a project investigating specialized training recipes/schemes for "pretraining a T2I model at home", which requires that the training setup be executable on consumer-level hardware or cheap second-hand server hardware.

Under this constraint, we introduce a new transformer backbone for multi-modal (for example, text-to-image) generative models called "XUT" (Cross-U-Transformer). With a minimalized architecture design plus the TREAD technique, we achieve usable performance with compute that costs at most 650 USD (based on pricing from vast.ai).

Gallery

Quick Start

Read the [Usage Hint] section after installing the model

READ IT READ IT READ IT

  • Use the official gradio UI: https://github.com/KohakuBlueleaf/HDM
    • Follow the README in that repository
  • Use the ComfyUI loader node: https://github.com/KohakuBlueleaf/HDM-ext
    • Install this repository as a ComfyUI custom node
    • Use the HDM-loader node with the provided safetensors file
  • Use the diffusers pipeline:
    1. Install the hdm library from the GitHub repo
      git clone https://github.com/KohakuBlueleaf/HDM
      cd HDM
      # fused: xformers fused swiglu/attention
      # liger: liger-kernel fused swiglu
      # tipo : tipo prompt gen system for official UI
      # win  : Windows-specific (all Windows users should add this)
      # finetune: install lycoris-lora for finetune with lycoris
      pip install -e .[tipo]
      ## For Windows users: pip install -e .[win,tipo]
      
    2. Then use the pipeline from hdm:
      import torch
      from hdm.pipeline import HDMXUTPipeline
      
      torch.set_float32_matmul_precision("high")
      pipeline = (
          HDMXUTPipeline.from_pretrained(
              "KBlueLeaf/HDM-xut-340M-anime",
              trust_remote_code=True
          )
          .to("cuda")
          .to(torch.float16)
      )
      images = pipeline("1girl....", "").images
      

Usage Hint

  • Prompting:
    • Use Danbooru tags + a natural-language description
    • For "tags", use Danbooru tags ONLY.
    • Write tags as a comma-separated sequence, without underscores
      • "1girl long_hair" -> "1girl, long hair"
    • Make the prompt as detailed as possible
      • TIPO is highly recommended here!!!
  • Sampling settings:
    • CFG scale: 2~4
    • Inference steps: at least 8, works best in 16~32
  • For the 1024px model, the following resolutions (WxH) are recommended:
    • 1024xK where K >= 1024
    • WxH where W > H, W*H >= 1024*1024, H <= 1024
    • For example: 1024x1536(1:2), 1440x800(16:9), 1472x736(2:1), 1280x960(4:3)
  • Don't know where to start? New to danbooru-tag-style prompting?
    • Browse danbooru.donmai.us, find an image you like, and copy its tags!
    • Remember to apply the transformation mentioned above.
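The tag transformation above (underscores to spaces, comma-separated) can be sketched as a small helper. This is a minimal illustration, and `normalize_tags` is a hypothetical name, not part of the hdm library:

```python
def normalize_tags(raw: str) -> str:
    """Turn space-separated Danbooru tags into the comma-separated,
    underscore-free form the model expects."""
    return ", ".join(tag.replace("_", " ") for tag in raw.split())

print(normalize_tags("1girl long_hair"))  # -> 1girl, long hair
```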

Example prompt format

1girl, 
izuna (blue archive), blue archive, 

fukamiki kei, tail, halo, skin fang, solo, eyeshadow, medium breasts, short hair, skirt, makeup, hair ornament, pink halo, yellow eyes, fox tail, scarf, pom pom hair ornament, fox girl, river, pleated skirt, leaning forward, animal ears, fang, black hair, black skirt, one side up, nature, pink scarf, sleeveless, miniskirt, open mouth, looking at viewer, fox hair ornament, breasts, animal ear fluff, fox ears, outdoors, wading, pink eyeshadow, forest, pom pom (clothes), smile, waterfall, sunlight, white shirt, sleeveless shirt, sailor collar, shirt, tail raised, partially submerged, blush, water, cowboy shot, blue sailor collar, :d,

This image is a digital illustration by the artist masabodo, known for their detailed and vibrant style. The central character in the artwork is Izuna from the series "Blue Archive," depicted with fox-like features including pointed ears and a fluffy tail. Izuna is portrayed with her characteristic black hair adorned with pink ribbons, wearing a sailor-style outfit that includes a blue collar and skirt. The overall scene exudes a sense of tranquility and grace, capturing a moment of serene beauty in nature.

masterpiece, newest, absurdres

Or with placeholder:

<|special|>, 
<|characters|>, <|series|>, 
<|artist|>, 

<|general content tags|>,

<|natural language|>.

<|quality|>, <|rating|>, <|meta|>

NOTE: In ComfyUI, you SHOULD backslash-escape ALL brackets, for example izuna (blue archive) -> izuna \(blue archive\), to ensure TIPO works correctly and the brackets are passed through to the model.
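A minimal sketch of that escaping step; `escape_brackets` is a hypothetical helper, not part of HDM-ext:

```python
def escape_brackets(prompt: str) -> str:
    """Backslash-escape parentheses so ComfyUI does not interpret them
    as its own prompt syntax and they reach TIPO/the model intact."""
    return prompt.replace("(", "\\(").replace(")", "\\)")

print(escape_brackets("izuna (blue archive)"))  # -> izuna \(blue archive\)
```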

Model Spec

HDM-XUT overview

XUT arch details

Model Setup

  • 343M XUT diffusion
  • 596M Qwen3 Text Encoder (qwen3-0.6B)
  • EQ-SDXL-VAE
  • Supports 1024x1024 or higher resolutions
    • 512px/768px checkpoints provided
  • Sampling Method/Training Objective: Flow Matching
  • Inference Steps: 16~32
  • Hardware Recommendations: any NVIDIA GPU with tensor cores and >=6GB VRAM
  • Minimal Requirements: x86-64 computer with more than 16GB RAM
    • The 512px and 768px models can achieve reasonable speed on CPU

Dataset

  • Danbooru2023 (around 7.6M images, latest id around 8.2M)
  • Curated Pixiv set (around 700k images)
  • Private PVC figure photos (fewer than 50k images)

Training

Stage                            256²           512²                    768²      1024²
Dataset                          Danbooru 2023  Danbooru2023 + extra*   -         curated**
Image Count                      7.624M         8.469M                  -         3.649M
Epochs                           20             5                       1         1
Samples Seen                     152.5M         42.34M                  8.469M    3.649M
Patches Seen                     39B            43.5B                   19.5B     15B
Learning Rate (muP, base_dim=1)  0.5            0.1                     0.05      0.02
Batch Size (per GPU)             128            64                      64        16
Gradient Checkpointing           No             Yes                     Yes       Yes
Gradient Accumulation            4              2                       2         4
Global Batch Size                2048           512                     512       256
TREAD Selection Rate             0.5            0.5                     0.5       0.0
Context Length                   256            256                     256       512
Training Wall Time               174h           120h                    42h       49h
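As a consistency check on the numbers above, global batch size equals per-GPU batch × gradient accumulation × number of GPUs; the figures imply a 4-GPU setup at every stage (an inference from the table, not stated explicitly):

```python
# (per-GPU batch, grad accumulation, global batch) for each training stage
stages = {
    "256": (128, 4, 2048),
    "512": (64, 2, 512),
    "768": (64, 2, 512),
    "1024": (16, 4, 256),
}
for name, (per_gpu, accum, global_bs) in stages.items():
    gpus = global_bs // (per_gpu * accum)
    print(f"{name}px stage: {gpus} GPUs")  # -> 4 at every stage
```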

Limitations

  • Due to the constrained training budget and model size:
    • This model requires highly detailed, well-structured prompts to work well.
      • Users are highly recommended to use TIPO
    • This model is not good at generating fine detail structure or poorly annotated concepts (such as hands or special ornaments)
  • This model can only generate anime-style images for now, due to the dataset choice.
    • A more general T2I model is under training; stay tuned.
  • This model is capable of generating many different anime-like styles, but it is hard to trigger a specific art style stably. It is recommended to try different combinations of content, meta, and artist tags to achieve different art styles.

License

This project is still under development; therefore all models, source code, text, documents, and other media in this project are licensed under CC-BY-NC-SA 4.0 until development is finished.

For any usage that may require a standalone, specialized license, please directly contact [email protected]
