L4GM: Large 4D Gaussian Reconstruction Model
Paper | Project Page | Code
We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second.
Model Overview
Description:
L4GM is a 4D reconstruction model that reconstructs a sequence of 3D Gaussians from a monocular input video within seconds. An additional interpolation model upsamples the 3D Gaussian sequence in time, increasing the output framerate.
This model is for research and development only.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the non-NVIDIA ashawkey/LGM repository.
License/Terms of Use:
CC-BY-NC-SA-4.0
References:
Model Architecture:
L4GM is a modified version of the original LGM model: it adds a temporal attention module after every cross-view attention module in the asymmetric U-Net, which lets the model aggregate temporal information across frames. The model is initialized from the pretrained LGM weights and trained on a multi-view dynamic object dataset.
The interpolation model shares the same architecture as L4GM but is trained with an interpolation objective.
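The interleaving described above can be illustrated with a minimal sketch. This is not the L4GM implementation; it is a toy single-head attention in numpy, and the tensor layout `(frames, views, tokens, channels)` and the helper names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head scaled dot-product self-attention.
    x: (tokens, channels); queries/keys/values are x itself (no projections)."""
    d = x.shape[-1]
    weights = softmax(x @ x.T / np.sqrt(d))
    return weights @ x

def cross_view_then_temporal(features):
    """features: (T, V, N, D) = frames, views, tokens per view, channels.
    Cross-view attention mixes all tokens within one frame; the temporal
    module that follows mixes the same token position across frames."""
    T, V, N, D = features.shape
    out = np.empty_like(features)
    for t in range(T):  # cross-view attention, applied per frame
        out[t] = self_attention(features[t].reshape(V * N, D)).reshape(V, N, D)
    res = np.empty_like(out)
    for v in range(V):  # temporal attention, applied per view/token position
        for n in range(N):
            res[:, v, n] = self_attention(out[:, v, n])
    return res

feats = np.random.default_rng(0).normal(size=(16, 4, 8, 32)).astype(np.float32)
fused = cross_view_then_temporal(feats)
```

The point of the ordering is that cross-view attention never sees other frames; all cross-frame aggregation happens in the temporal modules, which is what the pretrained (per-frame) LGM weights can be extended with.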
Input:
Input Type(s): Video
Input Format(s): RGB sequence
Input Parameters: 4D
Other Properties Related to Input: Input resolution is 256x256. One forward pass typically processes 16 frames.
Output:
Output Type(s): 3D GS sequence
Output Format: frame length x Gaussians-per-frame
Output Parameters: 3D
Other Properties Related to Output: 65,536 3D Gaussians per frame.
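The input/output spec above can be written down as shape bookkeeping for one forward pass. The frame count, resolution, and Gaussian count come from this card; the 14-channel per-Gaussian layout (xyz + opacity + scale + rotation + rgb) follows the LGM convention and is an assumption here, as is the channel ordering of the video tensor.

```python
import numpy as np

T = 16                 # frames per forward pass (from this card)
H = W = 256            # input resolution (from this card)
N_GAUSSIANS = 65_536   # 3D Gaussians per frame (from this card)
C = 14                 # assumed: xyz(3) + opacity(1) + scale(3) + rot(4) + rgb(3)

video = np.zeros((T, 3, H, W), dtype=np.float32)             # monocular RGB input
gaussians = np.zeros((T, N_GAUSSIANS, C), dtype=np.float32)  # 3D GS sequence
print("input:", video.shape, "output:", gaussians.shape)
```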
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
Preferred Operating System(s):
- Linux
Model Version(s):
v1.0
Training, Testing, and Evaluation Datasets:
Training Dataset:
Link: Objaverse
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties: We use 110K animated objects, a subset of Objaverse filtered by motion magnitude. We render each animation from 48 cameras, producing 12M videos in total.
Dataset License(s): Objaverse License
Inference:
Engine: PyTorch
Test Hardware: A100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Citation
@inproceedings{ren2024l4gm,
title={L4GM: Large 4D Gaussian Reconstruction Model},
author={Jiawei Ren and Kevin Xie and Ashkan Mirzaei and Hanxue Liang and Xiaohui Zeng and Karsten Kreis and Ziwei Liu and Antonio Torralba and Sanja Fidler and Seung Wook Kim and Huan Ling},
booktitle={Proceedings of Neural Information Processing Systems (NeurIPS)},
month = {Dec},
year={2024}
}