arxiv:2506.18839

4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Published on Jun 18

· Submitted by

Authors:

Abstract

A new framework combines 4D video modeling and 3D reconstruction using a unified architecture with sparse attention patterns, achieving superior visual quality and reconstruction.

AI-generated summary

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.

View arXiv page View PDF Add to collection

Community

ashmrz

Paper submitter about 18 hours ago

4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.18839 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.18839 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.18839 in a Space README.md to link it from this page.

4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Abstract

Community

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 1