# VMem: Consistent Video Scene Generation with Surfel-Indexed View Memory

📖 Project Page | 🖥️ GitHub | 🤗 Hugging Face | 📑 Paper

## Model Details

VMem is a plug-and-play memory mechanism for consistent autoregressive scene and novel-view video generation.
The key idea is to anchor past frames to surface elements (surfels) in a global memory. At each generation step, the target camera pose queries this memory for the most relevant past views, which are combined (via Plücker-line embeddings and a VAE) with noise to synthesize the next frame. The generated frame is then written back into the memory, yielding long-term geometric consistency while keeping compute low.
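
The sketch below illustrates the retrieval idea in simplified form: surfels remember which past frame they came from, and the target camera pose selects the past frames whose surfels it sees most. All names (`SurfelMemory`, `retrieve`, `write_back`), the pinhole projection, and the visibility vote are hypothetical simplifications for illustration, not the repository's actual API; the Plücker-line/VAE conditioning step is omitted.

```python
# Illustrative sketch of a surfel-indexed view memory (not the VMem implementation).
import numpy as np

class SurfelMemory:
    def __init__(self):
        self.positions = np.empty((0, 3))            # surfel centers in world space
        self.frame_ids = np.empty((0,), dtype=int)   # past frame each surfel came from

    def write_back(self, points_world: np.ndarray, frame_id: int) -> None:
        """Anchor a newly generated frame: store its back-projected points as surfels."""
        self.positions = np.vstack([self.positions, points_world])
        self.frame_ids = np.concatenate(
            [self.frame_ids, np.full(len(points_world), frame_id)]
        )

    def retrieve(self, K: np.ndarray, w2c: np.ndarray,
                 image_size: tuple, top_k: int = 4) -> list:
        """Return ids of past frames whose surfels are most visible from the target pose."""
        if len(self.positions) == 0:
            return []
        # Transform surfels into the target camera frame and project with a pinhole model.
        homo = np.hstack([self.positions, np.ones((len(self.positions), 1))])
        cam = (w2c @ homo.T).T[:, :3]
        in_front = cam[:, 2] > 1e-6
        uv = (K @ cam[in_front].T).T
        uv = uv[:, :2] / uv[:, 2:3]
        h, w = image_size
        visible = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        # Vote: frames contributing the most visible surfels are the most relevant references.
        votes = np.bincount(self.frame_ids[in_front][visible],
                            minlength=int(self.frame_ids.max()) + 1)
        ranked = np.argsort(-votes)
        return [int(f) for f in ranked[:top_k] if votes[f] > 0]
```

At each autoregressive step, one would call `retrieve` with the next camera pose to pick reference frames for conditioning the generator, then `write_back` with points lifted from the newly generated frame, so memory cost stays bounded by the surfel map rather than the full frame history.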

- **Developed by:** Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
- **Affiliation:** University of Oxford
- **First released:** arXiv pre-print, 2025
- **Model type:** Generative CV (diffusion / autoregressive latent generator with external surfel memory)
- **Modality:** Images → video (RGB); camera pose conditioning
- **License:** Apache-2.0

## Direct Use

- Novel-view video generation from a single image or a short camera sweep.
- Scene roaming in AR/VR: moving a virtual camera through a captured room while preserving layout and textures.
- Video editing / completion where long-term geometry must stay stable.