Diffusers

You are viewing the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v0.38.0).

ZImageTransformer2DModel

A Transformer model for image-like data from Z-Image.

class diffusers.ZImageTransformer2DModel

( all_patch_size = (2,), all_f_patch_size = (1,), in_channels = 16, dim = 3840, n_layers = 30, n_refiner_layers = 2, n_heads = 30, n_kv_heads = 30, norm_eps = 1e-05, qk_norm = True, cap_feat_dim = 2560, siglip_feat_dim = None, rope_theta = 256.0, t_scale = 1000.0, axes_dims = [32, 48, 48], axes_lens = [1024, 512, 512] )
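
The defaults above imply a few derived quantities. The following is a minimal sketch of that arithmetic in plain Python. It assumes that the per-head width is `dim // n_heads` and that the `axes_dims` entries are per-axis rotary embedding widths that together fill one head; neither assumption is stated on this page.

```python
# Default constructor values copied from the signature above.
dim = 3840
n_heads = 30
n_kv_heads = 30
axes_dims = [32, 48, 48]

# Assumption: every attention head gets an equal slice of the model width.
head_dim = dim // n_heads

# Assumption: axes_dims lists the rotary embedding width used for each
# positional axis, so the entries together should cover one head.
rope_dim = sum(axes_dims)

# With as many key/value heads as query heads, the defaults do not share
# key/value heads across groups of query heads.
uses_grouped_kv = n_kv_heads != n_heads

print(head_dim, rope_dim, uses_grouped_kv)  # 128 128 False
```

Under these assumptions, changing `dim`, `n_heads`, or `axes_dims` independently would break the match between `head_dim` and `rope_dim`, so the three values are best adjusted together.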

forward

( x: list, t, cap_feats: list, return_dict: bool = True, controlnet_block_samples: dict[int, torch.Tensor] | None = None, siglip_feats: list[list[torch.Tensor]] | None = None, image_noise_mask: list[list[int]] | None = None, patch_size: int = 2, f_patch_size: int = 1 )

Parameters

x (list of torch.Tensor or nested list of torch.Tensor) — Input latents. A flat list when running in standard mode, or a nested list when running in omni mode.
t (torch.Tensor) — Timestep tensor indicating the current denoising step.
cap_feats (list of torch.Tensor or nested list of torch.Tensor) — Caption embeddings computed from the input conditions, such as prompts, and used for conditioning.
return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.
controlnet_block_samples (dict of int to torch.Tensor, optional) — A mapping from block index to tensor. If specified, the tensors are added to the residuals of the transformer blocks.
siglip_feats (list of list of torch.Tensor, optional) — Optional SigLIP image features used as additional conditioning.
image_noise_mask (list of list of int, optional) — Per-image noise masks indicating noisy vs. clean tokens in omni mode.
patch_size (int, optional, defaults to 2) — Spatial patch size used to patchify the input latents.
f_patch_size (int, optional, defaults to 1) — Temporal patch size used to patchify the input latents.
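
To make the effect of `patch_size` and `f_patch_size` concrete, here is a minimal sketch of the shape arithmetic behind patchifying. The helper `patchified_shape` is hypothetical and not part of Diffusers, and the `(channels, frames, height, width)` latent layout is an assumption made for illustration.

```python
def patchified_shape(latent_shape, patch_size=2, f_patch_size=1):
    """Return (num_tokens, token_dim) for a single latent.

    Hypothetical helper. Assumes latent_shape is
    (channels, frames, height, width).
    """
    channels, frames, height, width = latent_shape
    if frames % f_patch_size or height % patch_size or width % patch_size:
        raise ValueError("latent dimensions must be divisible by the patch sizes")
    # One token per non-overlapping (f_patch_size x patch_size x patch_size) block.
    num_tokens = (
        (frames // f_patch_size) * (height // patch_size) * (width // patch_size)
    )
    # Each token flattens every value inside its block, across all channels.
    token_dim = f_patch_size * patch_size * patch_size * channels
    return num_tokens, token_dim


# Illustrative 16-channel, single-frame latent of 128 x 128.
print(patchified_shape((16, 1, 128, 128)))  # (4096, 64)
```

With the default `in_channels = 16`, `patch_size = 2`, and `f_patch_size = 1`, each token carries 64 values under this layout, and the token count grows with the latent's spatial area.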

The ZImageTransformer2DModel forward method.

Flow: patchify -> t_embed -> x_embed -> x_refine -> cap_embed -> cap_refine -> [siglip_embed -> siglip_refine] -> build_unified -> main_layers -> final_layer -> unpatchify
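
The flow line can be read as an ordered list of stages. The sketch below only enumerates the stage names; it assumes the bracketed SigLIP stages run only when SigLIP features are supplied, which is an interpretation of the notation rather than something this page confirms. The function `forward_stages` is hypothetical.

```python
def forward_stages(use_siglip):
    """List the stage names from the flow line, in order.

    Assumption: the bracketed SigLIP stages are optional and run only
    when SigLIP features are passed to the forward method.
    """
    stages = ["patchify", "t_embed", "x_embed", "x_refine", "cap_embed", "cap_refine"]
    if use_siglip:
        stages += ["siglip_embed", "siglip_refine"]
    stages += ["build_unified", "main_layers", "final_layer", "unpatchify"]
    return stages


print(" -> ".join(forward_stages(use_siglip=False)))
print(" -> ".join(forward_stages(use_siglip=True)))
```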

patchify_and_embed

( all_image: list, all_cap_feats: list, patch_size: int, f_patch_size: int )

Patchify for basic mode: single image per batch item.

patchify_and_embed_omni

( all_x: list, all_cap_feats: list, all_siglip_feats: list, patch_size: int, f_patch_size: int, images_noise_mask: list )

Patchify for omni mode: multiple images per batch item with noise masks.
