arXiv:2508.15769

SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Published on Aug 21 · Submitted by haoningwu on Aug 22

Abstract

AI-generated summary: SceneGen generates multiple 3D assets from a single scene image using a novel framework that integrates local and global scene information, enabling efficient and robust 3D content creation.

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
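For readers who want a concrete picture of the architecture, below is a minimal PyTorch sketch of the single-feedforward-pass design the abstract describes: per-asset (local) tokens from the masked objects are fused with scene-level (global) tokens from visual and geometric encoders via cross-attention, and a position head predicts each asset's relative placement alongside its shape/texture latent. All module names, dimensions, and the 7-DoF pose parameterization here are illustrative assumptions, not the authors' implementation.

# A minimal sketch of the SceneGen-style pipeline described above.
# Everything below (module names, dims, pose format) is an assumption
# made for illustration, not the paper's actual code.
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Fuses per-asset (local) tokens with scene-level (global) tokens
    via cross-attention, mirroring the abstract's high-level description."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_tokens, global_tokens):
        # local_tokens: (B*K, N, D); global_tokens: (B*K, N, D)
        fused, _ = self.cross_attn(local_tokens, global_tokens, global_tokens)
        return self.norm(local_tokens + fused)

class SceneGenSketch(nn.Module):
    def __init__(self, dim: int = 768, latent_dim: int = 64):
        super().__init__()
        # Stand-ins for the visual and geometric encoders (e.g. a ViT and a
        # monocular-geometry backbone); here just linear patch projections.
        self.visual_proj = nn.Linear(3 * 16 * 16, dim)
        self.geo_proj = nn.Linear(3 * 16 * 16, dim)
        self.aggregate = FeatureAggregation(dim)
        self.asset_decoder = nn.Linear(dim, latent_dim)  # per-asset shape/texture latent
        self.position_head = nn.Linear(dim, 7)           # assumed pose: translation (3) + quaternion (4)

    def forward(self, image_patches, masked_patches):
        # image_patches:  (B, N, 768)    flattened 16x16 RGB patches of the scene
        # masked_patches: (B, K, N, 768) patches for each of the K masked assets
        B, K, N, _ = masked_patches.shape
        global_tokens = self.visual_proj(image_patches) + self.geo_proj(image_patches)
        local_tokens = self.visual_proj(masked_patches.flatten(0, 1))
        fused = self.aggregate(local_tokens, global_tokens.repeat_interleave(K, dim=0))
        pooled = fused.mean(dim=1).view(B, K, -1)
        # One feedforward pass yields all asset latents and their relative poses.
        return self.asset_decoder(pooled), self.position_head(pooled)

model = SceneGenSketch()
img = torch.randn(1, 196, 768)       # one scene image, 14x14 patches
masks = torch.randn(1, 3, 196, 768)  # three masked assets
latents, poses = model(img, masks)   # single feedforward pass
print(latents.shape, poses.shape)    # torch.Size([1, 3, 64]) torch.Size([1, 3, 7])

Because the aggregation simply attends over whatever global tokens are supplied, concatenating tokens from several views is one plausible way to realize the multi-image extension mentioned in contribution (iii), even with single-image training.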

Community

Paper author · Paper submitter

Project Page: https://mengmouxu.github.io/SceneGen/
Paper: https://arxiv.org/abs/2508.15769
Code: https://github.com/Mengmouxu/SceneGen

We are organizing our code, data, and checkpoints and will gradually open-source them in the near future. Please stay tuned, and feel free to reach out for discussion!
