Abstract
Pixel Neural Field Diffusion (PixNerd) achieves high-quality image generation in a single-scale, single-stage process without VAEs or complex pipelines, and extends to text-to-image applications with competitive performance.
The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
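To make the patch-wise neural field decoding concrete, here is a minimal PyTorch sketch of the idea. All names, shapes, and the two-layer MLP structure are illustrative assumptions, not the official PixNerd implementation: each diffusion-transformer token parameterizes a tiny coordinate MLP, which is then evaluated at every pixel of its 16×16 patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFieldPatchDecoder(nn.Module):
    """Sketch: a tiny per-patch coordinate MLP whose weights are predicted
    from a transformer token (hypothetical shapes, not the official code)."""

    def __init__(self, token_dim=1024, patch_size=16, hidden=64, out_ch=3):
        super().__init__()
        self.hidden, self.out_ch = hidden, out_ch
        # Hypernetwork head: token -> flattened weights of a 2-layer MLP
        # that maps (x, y) pixel coordinates to `out_ch` output values.
        n_params = (2 * hidden + hidden) + (hidden * out_ch + out_ch)
        self.to_weights = nn.Linear(token_dim, n_params)
        # Normalized pixel coordinates inside one patch, shape (P*P, 2).
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, patch_size),
            torch.linspace(-1.0, 1.0, patch_size),
            indexing="ij",
        )
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, tokens):  # tokens: (B, N, token_dim), one per patch
        B, N, _ = tokens.shape
        h, c = self.hidden, self.out_ch
        w1, b1, w2, b2 = torch.split(
            self.to_weights(tokens), [2 * h, h, h * c, c], dim=-1
        )
        w1 = w1.view(B, N, 2, h)               # first-layer weights per patch
        w2 = w2.view(B, N, h, c)               # second-layer weights per patch
        x = self.coords.expand(B, N, -1, -1)   # (B, N, P*P, 2)
        x = F.silu(x @ w1 + b1.unsqueeze(2))   # (B, N, P*P, hidden)
        return x @ w2 + b2.unsqueeze(2)        # (B, N, P*P, out_ch)

decoder = NeuralFieldPatchDecoder()
tokens = torch.randn(2, 256, 1024)  # (256/16)^2 = 256 patches of a 256x256 image
out = decoder(tokens)               # (2, 256, 256, 3) per-patch pixel predictions
```

Because the decoder is just a hypernetwork head on top of the transformer, it trains jointly with the backbone, which is what makes the single-stage, VAE-free setup possible.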
Community
A new speedy pixel diffusion transformer with a neural field!!
TL;DR: PixNerd replaces the pre-trained VAE of latent diffusion with a patch-wise neural field decoder, giving a single-scale, single-stage, end-to-end pixel-space diffusion transformer. It reaches 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any cascade pipeline or VAE, and PixNerd-XXL/16 scores a competitive 0.73 on GenEval and 80.9 on DPG for text-to-image generation.
Revision of the inference time statistics
| Model | Inference: 1 image (s) | Inference: 1 step (s) | Inference Mem (GB) | Training Speed (s/it) | Training Mem (GB) |
| --- | --- | --- | --- | --- | --- |
| SiT-L/2 (VAE-f8) | 0.51 | 0.0097 | 2.9 | 0.30 | 18.4 |
| Baseline-L/16 | 0.48 | 0.0097 | 2.1 | 0.18 | 18 |
| PixNerd-L/16 | 0.51 | 0.010 | 2.1 | 0.19 | 22 |
Deeply sorry for this mistake: the single-step inference times of SiT-L/2 and Baseline-L were missing a zero (0.097s should have been 0.0097s). The single-step inference times of PixNerd and the baseline are close.
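For reference, single-step numbers like the 0.0097s above are usually measured with warmup iterations and explicit GPU synchronization. Below is a minimal sketch of such a benchmark; the authors' actual timing script is not shown here, and `model(x, t)` is a placeholder call signature:

```python
import time
import torch

@torch.no_grad()
def seconds_per_step(model, x, t, n_warmup=10, n_iters=100):
    """Average single-step latency with warmup and GPU synchronization."""
    for _ in range(n_warmup):
        model(x, t)                # warm up CUDA kernels / caches
    torch.cuda.synchronize()       # drain queued work before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x, t)
    torch.cuda.synchronize()       # wait for all timed iterations
    return (time.perf_counter() - start) / n_iters
```

Skipping the synchronization calls is a common way to end up with numbers that are off by an order of magnitude, since CUDA launches are asynchronous.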
Since the arXiv paper has been updated, I have closed this issue. Feel free to reopen!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think (2025)
- DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning (2025)
- Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation (2025)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025)
- Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation (2025)
- Pyramidal Patchification Flow for Visual Generation (2025)
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer (2025)