# Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

This repository contains the model and code for the paper [Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis](https://arxiv.org/abs/2507.23785).

This work presents a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. It introduces a Direct 4DMesh-to-GS Variation Field VAE to encode canonical Gaussian Splats (GS) and their temporal variations into a compact latent space. Building on this, a Gaussian Variation Field diffusion model is trained with a temporal-aware Diffusion Transformer, conditioned on input videos and canonical GS. The model demonstrates superior generation quality and remarkable generalization to in-the-wild video inputs.
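
To make the representation concrete, here is a minimal, hypothetical sketch of what "canonical GS plus a variation field" can look like in code. All class and field names below are illustrative assumptions for exposition, not the repository's actual API:

```python
# Illustrative only: a canonical Gaussian Splat set plus a per-frame
# variation field that offsets Gaussian attributes over time. The paper
# compresses such variations into a compact latent with its VAE; this
# sketch shows the uncompressed idea.
from dataclasses import dataclass
import torch

@dataclass
class CanonicalGS:
    xyz: torch.Tensor       # (N, 3) Gaussian centers
    opacity: torch.Tensor   # (N, 1) per-Gaussian opacity
    scale: torch.Tensor     # (N, 3) anisotropic scales
    rotation: torch.Tensor  # (N, 4) rotation quaternions
    color: torch.Tensor     # (N, C) appearance features

@dataclass
class VariationField:
    delta_xyz: torch.Tensor  # (T, N, 3) per-frame positional offsets
    # Further per-attribute deltas (scale, rotation, ...) could be added.

def gaussians_at_frame(canon: CanonicalGS, var: VariationField, t: int) -> CanonicalGS:
    """Reconstruct the Gaussians of frame t by offsetting the canonical set."""
    return CanonicalGS(
        xyz=canon.xyz + var.delta_xyz[t],
        opacity=canon.opacity,
        scale=canon.scale,
        rotation=canon.rotation,
        color=canon.color,
    )
```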

- Project Page: https://gvfdiffusion.github.io/
- Code: https://github.com/ForeverFancy/GVFDiffusion

## Abstract

We present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content.
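
Reading the abstract as a two-stage pipeline, inference plausibly proceeds as in the sketch below: a temporal-aware DiT denoises a latent conditioned on the input video and the canonical GS, and the VAE decoder maps that latent back to per-frame Gaussian variations. Every module and attribute name here (`dit`, `scheduler`, `vae_decoder`, `latent_dim`) is a placeholder assumption, not the released code:

```python
import torch

def generate_variation_field(video, canonical_gs, dit, scheduler, vae_decoder,
                             num_timesteps=32):
    """Hypothetical sampling loop for the Gaussian Variation Field diffusion
    model; the actual entry point is inference_dpm_latent.py in the repo."""
    latent = torch.randn(1, dit.latent_dim)          # start from Gaussian noise
    cond = {"video": video, "canonical_gs": canonical_gs}
    for t in scheduler.timesteps(num_timesteps):     # e.g. a DPM-style solver
        pred = dit(latent, t, cond)                  # temporal-aware DiT denoiser
        latent = scheduler.step(pred, t, latent)     # one reverse-diffusion step
    return vae_decoder(latent, canonical_gs)         # per-frame GS variations
```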

## Installation and Quick Start

For detailed installation instructions, please refer to the GitHub repository. The commands below provide a quick start:

```bash
# Clone the repository
git clone https://github.com/ForeverFancy/GVFDiffusion.git
cd GVFDiffusion

# Setup environment and dependencies
. ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast

# Run a minimal inference example
accelerate launch --num_processes 1 inference_dpm_latent.py \
    --batch_size 1 --exp_name /path/to/your/output \
    --config configs/diffusion.yml \
    --start_idx 0 --end_idx 2 \
    --txt_file ./assets/in_the_wild.txt \
    --use_fp16 --num_samples 2 --adaptive \
    --data_dir ./assets/ --num_timesteps 32 \
    --download_assets --in_the_wild
```
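
A few of the flags above are worth noting, with the caveat that their exact semantics are defined by the repository's scripts: `--start_idx`/`--end_idx` appear to select which entries of the `--txt_file` input list to process, `--num_samples` and `--num_timesteps` likely control the number of generated candidates per input and the number of diffusion sampling steps, and `--use_fp16` enables half-precision inference.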

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{zhang2025gaussian,
  title={Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis},
  author={Zhang, Bowen and Xu, Sicheng and Wang, Chuxin and Yang, Jiaolong and Zhao, Feng and Chen, Dong and Guo, Baining},
  journal={arXiv preprint arXiv:2507.23785},
  year={2025}
}
```