TL;DR: Generate novel videos along any desired camera trajectory from real or animated videos.



How it Works

Given an input video, we first estimate the camera trajectory and the 3D geometry of the scene with video geometry estimators. By projecting the estimated 3D geometry onto the desired camera trajectory, we obtain the per-frame 2D flow between the input video and the novel video. Instead of directly using the frames rendered from the estimated 3D geometry, we ground the video diffusion model on this geometry by re-aligning the video feature tokens according to the 2D flow.
[Figure: how it works]
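
To make the projection step concrete, below is a minimal sketch (not the released code) of how the per-frame 2D flow can be computed from estimated depth and camera poses: source pixels are unprojected with the estimated depth, transformed into the desired target camera, and re-projected; the displacement between the two pixel positions is the flow. The tensor names (depth, K, src_c2w, tgt_c2w) are illustrative assumptions.

```python
import torch

def flow_from_geometry(depth, K, src_c2w, tgt_c2w):
    """depth: (H, W) estimated depth; K: (3, 3) intrinsics;
    src_c2w / tgt_c2w: (4, 4) camera-to-world poses of the source and target views.
    Returns the 2D flow (H, W, 2) mapping source pixels to the target view."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()               # (H, W, 3)

    # Unproject source pixels to 3D camera coordinates, then to world coordinates.
    cam_pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T) * depth.reshape(1, -1)  # (3, HW)
    cam_pts = torch.cat([cam_pts, torch.ones(1, cam_pts.shape[1])], dim=0)         # (4, HW)
    world_pts = src_c2w @ cam_pts                                                  # (4, HW)

    # Transform into the target camera and project with the same intrinsics
    # (assumes points stay in front of the target camera).
    tgt_pts = torch.linalg.inv(tgt_c2w) @ world_pts                                # (4, HW)
    proj = K @ tgt_pts[:3]                                                         # (3, HW)
    uv = (proj[:2] / proj[2:3].clamp(min=1e-6)).T.reshape(H, W, 2)                 # (H, W, 2)

    return uv - pix[..., :2]                                                       # flow in pixels
```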

The video diffusion model consists of spatial and temporal blocks. We enhance each spatial block by integrating tokens from the video encoder, re-aligned according to the estimated 2D flow, thereby transforming it into a multi-view block: the encoder tokens are concatenated with the key and value inputs of the block's self-attention layers.
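
Below is a minimal sketch of such a multi-view block, assuming a standard PyTorch attention layer; the module and argument names are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn

class MultiViewSpatialBlock(nn.Module):
    """Spatial self-attention augmented with flow-aligned reference tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, ref_tokens):
        # x:          (B, N, C) latent tokens of the frame being denoised
        # ref_tokens: (B, M, C) video-encoder tokens re-aligned by the 2D flow
        q = self.norm(x)
        # Keys/values cover both the frame's own tokens and the geometry-aligned
        # reference tokens, so the block attends across views.
        kv = torch.cat([q, ref_tokens], dim=1)
        out, _ = self.attn(q, kv, kv)
        return x + out
```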

Why incorporate geometry estimation?

While generative models have shown promising results in synthesizing highly realistic novel views of static scenes by training on large-scale multi-view image datasets (3D data), applying this approach to dynamic scenes is challenging due to the limited availability of real-world multi-view videos (4D data). To sidestep this issue, Generative Camera Dolly (GCD) trains the generative model on synthetic 4D data, which we find to be suboptimal for real-world applications due to domain gaps. Instead of taking this data-driven route, we decompose the task into two sub-tasks: (1) temporally consistent geometry estimation and (2) generative video rendering based on the estimated geometry. Incorporating this geometric prior reduces the burden on the generative model, enabling it to focus primarily on enhancing uncertain regions instead of learning full 4D dynamics from scratch, thereby greatly reducing the need for large-scale 4D training data.
[Figure]

To edit camera trajectories in monocular videos, we embed knowledge from video geometry prediction models, e.g., MonST3R, into video generative models, allowing the model to synthesize realistic novel views by filling in occluded regions that the geometry model cannot infer. By incorporating geometric cues for generation, our approach achieves superior performance on novel view video synthesis compared to fully generative approaches, e.g., Generative Camera Dolly.

Spatial-Temporal Factorized Fine-tuning

To further reduce the need for 4D data, we adopt a factorized fine-tuning strategy. Treating the spatial and temporal blocks of our video generative model independently, we train the spatial blocks with multi-view image (3D) data and the temporal blocks with video data. As both 3D and video data are available at scale, training the generative model no longer requires 4D data.
[Figure]
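
A minimal sketch of this training schedule is shown below, assuming the model exposes its spatial and temporal blocks as separate submodules (model.spatial_blocks and model.temporal_blocks are hypothetical names).

```python
def set_trainable(model, batch_type):
    """Freeze one set of blocks depending on the current batch:
    multi-view image (3D) batches update only the spatial blocks,
    monocular video batches update only the temporal blocks."""
    train_spatial = (batch_type == "multiview_3d")
    for p in model.spatial_blocks.parameters():
        p.requires_grad = train_spatial
    for p in model.temporal_blocks.parameters():
        p.requires_grad = not train_spatial

# In the training loop, batches from the two data sources are interleaved:
# for batch_type, batch in loader:
#     set_trainable(model, batch_type)
#     diffusion_loss(model, batch).backward()   # hypothetical loss helper
#     optimizer.step(); optimizer.zero_grad()
```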

Qualitative Comparisons

Quantitative Results

We show quantitative comparisons on the multi-view dynamic datasets Neu3D and ST-NeRF. Our method achieves the best scores in both LPIPS and Frame-Consistency (Frame-Con.), i.e., the CLIP score between each frame of the input video and the corresponding frame of the generated video. Note that we use datasets captured with stationary cameras, as our primary baseline, Generative Camera Dolly, does not officially support input videos from moving cameras, e.g., DyCheck.
[Figure]
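
For reference, below is a minimal sketch of the Frame-Consistency metric, assuming the Hugging Face CLIP implementation; the specific checkpoint is an assumption, not necessarily the one used in our evaluation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_consistency(input_frames, generated_frames):
    """input_frames, generated_frames: lists of PIL images of equal length.
    Returns the mean cosine similarity of CLIP embeddings of corresponding frames."""
    inp = processor(images=input_frames, return_tensors="pt")
    gen = processor(images=generated_frames, return_tensors="pt")
    e_in = model.get_image_features(**inp)
    e_gen = model.get_image_features(**gen)
    e_in = e_in / e_in.norm(dim=-1, keepdim=True)
    e_gen = e_gen / e_gen.norm(dim=-1, keepdim=True)
    return (e_in * e_gen).sum(dim=-1).mean().item()
```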

Application: Generate Additional Data for 4D Reconstruction

As our method can generate novel views from a single video, it can also be used to generate additional data for 4D reconstruction. Here, we show an example of leveraging data generated by our method when training the 4D reconstruction model Shape-of-Motion on the DyCheck dataset. As shown below, incorporating the additional data improves the overall reconstruction quality of the 4D reconstruction model.
[Figure]

BibTeX