Abstract
Pixel Neural Field Diffusion (PixNerd) achieves high-quality image generation in a single-scale, single-stage process without VAEs or complex pipelines, and extends to text-to-image applications with competitive performance.
The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
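To make the patch-wise neural field decoding concrete, here is a minimal PyTorch sketch of the idea. All names, shapes, and the two-layer MLP structure are illustrative assumptions, not the official PixNerd implementation: each diffusion-transformer token parameterizes a tiny coordinate MLP, which is then evaluated at every pixel of its 16×16 patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFieldPatchDecoder(nn.Module):
    """Sketch: a tiny per-patch coordinate MLP whose weights are predicted
    from a transformer token (hypothetical shapes, not the official code)."""

    def __init__(self, token_dim=1024, patch_size=16, hidden=64, out_ch=3):
        super().__init__()
        self.hidden, self.out_ch = hidden, out_ch
        # Hypernetwork head: token -> flattened weights of a 2-layer MLP
        # that maps (x, y) pixel coordinates to `out_ch` output values.
        n_params = (2 * hidden + hidden) + (hidden * out_ch + out_ch)
        self.to_weights = nn.Linear(token_dim, n_params)
        # Normalized pixel coordinates inside one patch, shape (P*P, 2).
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, patch_size),
            torch.linspace(-1.0, 1.0, patch_size),
            indexing="ij",
        )
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, tokens):  # tokens: (B, N, token_dim), one per patch
        B, N, _ = tokens.shape
        h, c = self.hidden, self.out_ch
        w1, b1, w2, b2 = torch.split(
            self.to_weights(tokens), [2 * h, h, h * c, c], dim=-1
        )
        w1 = w1.view(B, N, 2, h)               # first-layer weights per patch
        w2 = w2.view(B, N, h, c)               # second-layer weights per patch
        x = self.coords.expand(B, N, -1, -1)   # (B, N, P*P, 2)
        x = F.silu(x @ w1 + b1.unsqueeze(2))   # (B, N, P*P, hidden)
        return x @ w2 + b2.unsqueeze(2)        # (B, N, P*P, out_ch)

decoder = NeuralFieldPatchDecoder()
tokens = torch.randn(2, 256, 1024)  # (256/16)^2 = 256 patches of a 256x256 image
out = decoder(tokens)               # (2, 256, 256, 3) per-patch pixel predictions
```

Because the decoder is just a hypernetwork head on top of the transformer, it trains jointly with the backbone, which is what makes the single-stage, VAE-free setup possible.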
Community
A new speedy pixel diffusion transformer with a neural field!!
TL;DR: PixNerd replaces the pre-trained VAE of latent diffusion with a patch-wise neural field decoder, giving a single-scale, single-stage, end-to-end pixel-space diffusion transformer. It reaches 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any cascade pipeline or VAE, and PixNerd-XXL/16 scores a competitive 0.73 on GenEval and 80.9 on DPG for text-to-image generation.
Revision of the inference time statistics
| Model | Inference: 1 image (s) | Inference: 1 step (s) | Inference Mem (GB) | Training Speed (s/it) | Training Mem (GB) |
| --- | --- | --- | --- | --- | --- |
| SiT-L/2 (VAE-f8) | 0.51 | 0.0097 | 2.9 | 0.30 | 18.4 |
| Baseline-L/16 | 0.48 | 0.0097 | 2.1 | 0.18 | 18 |
| PixNerd-L/16 | 0.51 | 0.010 | 2.1 | 0.19 | 22 |
Deeply sorry for this mistake: the single-step inference times of SiT-L/2 and Baseline-L were missing a zero (0.097s should have been 0.0097s). The single-step inference times of PixNerd and the baseline are close.
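For reference, single-step numbers like the 0.0097s above are usually measured with warmup iterations and explicit GPU synchronization. Below is a minimal sketch of such a benchmark; the authors' actual timing script is not shown here, and `model(x, t)` is a placeholder call signature:

```python
import time
import torch

@torch.no_grad()
def seconds_per_step(model, x, t, n_warmup=10, n_iters=100):
    """Average single-step latency with warmup and GPU synchronization."""
    for _ in range(n_warmup):
        model(x, t)                # warm up CUDA kernels / caches
    torch.cuda.synchronize()       # drain queued work before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x, t)
    torch.cuda.synchronize()       # wait for all timed iterations
    return (time.perf_counter() - start) / n_iters
```

Skipping the synchronization calls is a common way to end up with numbers that are off by an order of magnitude, since CUDA launches are asynchronous.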
Since the arXiv paper has been updated, I have closed this issue. Feel free to reopen!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think (2025)
- DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning (2025)
- Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation (2025)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025)
- Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation (2025)
- Pyramidal Patchification Flow for Visual Generation (2025)
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer (2025)