ZImageTransformer2DModel
A Transformer model for image-like data from Z-Image.
ZImageTransformer2DModel
class diffusers.ZImageTransformer2DModel
( all_patch_size = (2,), all_f_patch_size = (1,), in_channels = 16, dim = 3840, n_layers = 30, n_refiner_layers = 2, n_heads = 30, n_kv_heads = 30, norm_eps = 1e-05, qk_norm = True, cap_feat_dim = 2560, siglip_feat_dim = None, rope_theta = 256.0, t_scale = 1000.0, axes_dims = [32, 48, 48], axes_lens = [1024, 512, 512] )
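The default values in the signature can be sanity-checked with a little arithmetic. The sketch below only reuses numbers from the signature; the interpretation that the entries of axes_dims are per-axis rotary widths that together fill one attention head is an assumption, not something this page states.

```python
# Default values copied from the constructor signature.
dim = 3840
n_heads = 30
axes_dims = [32, 48, 48]

# Width of a single attention head.
head_dim = dim // n_heads

# Assumption: each axis gets its own slice of the head for rotary
# position embeddings, so the slices should add up to head_dim.
rope_dim = sum(axes_dims)

print(head_dim, rope_dim)
```

With the defaults both values come out equal, which is consistent with that reading. If you change dim, n_heads, or axes_dims, it is probably worth repeating the check.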
forward
( x: list, t, cap_feats: list, return_dict: bool = True, controlnet_block_samples: dict[int, torch.Tensor] | None = None, siglip_feats: list[list[torch.Tensor]] | None = None, image_noise_mask: list[list[int]] | None = None, patch_size: int = 2, f_patch_size: int = 1 )
Parameters
- x (list of torch.Tensor or nested list of torch.Tensor) — Input latents. A flat list when running in standard mode, or a nested list when running in omni mode.
- t (torch.Tensor) — Indicates the denoising step.
- cap_feats (list of torch.Tensor or nested list of torch.Tensor) — Conditional caption embeddings (embeddings computed from the input conditions, such as prompts) to use.
- return_dict (bool, optional, defaults to True) — Whether or not to return a models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
- controlnet_block_samples (dict of int to torch.Tensor, optional) — A mapping from block index to tensor. If specified, each tensor is added to the residual of the corresponding transformer block.
- siglip_feats (list of list of torch.Tensor, optional) — Optional SigLIP image features used as additional conditioning.
- image_noise_mask (list of list of int, optional) — Per-image noise masks indicating noisy vs. clean tokens in omni mode.
- patch_size (int, optional, defaults to 2) — Spatial patch size used to patchify the input latents.
- f_patch_size (int, optional, defaults to 1) — Temporal patch size used to patchify the input latents.
The ZImageTransformer2DModel forward method.
Flow: patchify -> t_embed -> x_embed -> x_refine -> cap_embed -> cap_refine -> [siglip_embed -> siglip_refine] -> build_unified -> main_layers -> final_layer -> unpatchify
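To make the patchify step at the start of the flow more concrete, here is a minimal sketch of how patch_size and f_patch_size determine the token count. It assumes each latent has shape (channels, frames, height, width) and that patches are non-overlapping; patch_token_shape is a hypothetical helper, not part of the library, and it ignores any padding the model may apply.

```python
def patch_token_shape(latent_shape, patch_size=2, f_patch_size=1):
    """Return (num_tokens, token_dim) for one latent (hypothetical helper)."""
    # Assumed layout: (channels, frames, height, width).
    channels, frames, height, width = latent_shape
    if frames % f_patch_size or height % patch_size or width % patch_size:
        raise ValueError("latent dimensions must be divisible by the patch sizes")
    # One token per non-overlapping (f_patch_size, patch_size, patch_size) block.
    num_tokens = (
        (frames // f_patch_size) * (height // patch_size) * (width // patch_size)
    )
    # Each token flattens every value inside its block across all channels.
    token_dim = f_patch_size * patch_size * patch_size * channels
    return num_tokens, token_dim


# A single-frame latent with in_channels = 16 and a 128 x 128 spatial grid.
num_tokens, token_dim = patch_token_shape((16, 1, 128, 128))
print(num_tokens, token_dim)
```

Under these assumptions, doubling patch_size cuts the sequence length by a factor of four while making each token four times wider.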
patchify_and_embed
( all_image: list, all_cap_feats: list, patch_size: int, f_patch_size: int )
Patchify for basic mode: single image per batch item.
patchify_and_embed_omni
( all_x: list, all_cap_feats: list, all_siglip_feats: list, patch_size: int, f_patch_size: int, images_noise_mask: list )
Patchify for omni mode: multiple images per batch item with noise masks.
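The nested inputs used in omni mode can be illustrated with a small structural check. This is a sketch only: check_omni_inputs is a hypothetical helper, the strings stand in for latent tensors, and the mask values are placeholders, since this page does not say which value marks a noisy image.

```python
def check_omni_inputs(x, image_noise_mask):
    """Hypothetical check: one mask entry per image in every batch item."""
    if len(x) != len(image_noise_mask):
        raise ValueError("x and image_noise_mask must have the same batch size")
    for i, (images, mask) in enumerate(zip(x, image_noise_mask)):
        if len(images) != len(mask):
            raise ValueError(
                f"batch item {i}: {len(images)} images but {len(mask)} mask entries"
            )
    # Number of images carried by each batch item.
    return [len(images) for images in x]


# Two batch items: the first holds two images, the second holds one.
# Strings are placeholders for latent tensors.
x = [["latent_a", "latent_b"], ["latent_c"]]
image_noise_mask = [[0, 1], [1]]  # placeholder values, one per image

counts = check_omni_inputs(x, image_noise_mask)
print(counts)
```

In standard mode x would instead be a flat list with one latent per batch item, and no mask would be needed.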