While generative models have shown promising results in synthesizing highly realistic novel views in static scenes by training on large-scale multi-view image datasets (3D data), applying this approach to dynamic scenes is challenging due to the limited availability of extensive real-world multi-view videos (4D data).
To sidestep this issue, Generative Camera Dolly (GCD) trains its generative model on synthetic 4D data, which we find to be suboptimal for real-world applications due to the domain gap.
Instead of relying on such a purely data-driven solution, we decompose the task into two sub-tasks: (1) temporally-consistent geometry estimation and (2) generative video rendering based on the estimated geometry.
Incorporating this geometric prior reduces the burden on the generative model, enabling it to focus primarily on enhancing uncertain regions instead of learning full 4D dynamics from scratch, thereby greatly reducing the need for large-scale 4D training data.
To edit camera trajectories in monocular videos, we embed knowledge from video geometry prediction models, e.g., MonST3R, into video generative models, allowing the model to synthesize realistic novel views by filling in occluded regions that the geometry model cannot infer. By incorporating geometric cues into generation, our approach achieves superior performance on novel-view video synthesis compared to fully generative approaches such as Generative Camera Dolly.
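A minimal sketch of this two-stage pipeline is shown below, assuming hypothetical interfaces for the geometry and diffusion models; the helper splat_points_to_view and the conditioning arguments are illustrative placeholders, not the actual implementation.

```python
# Sketch of the two-stage pipeline: geometry estimation, then generative rendering.
# All model interfaces and the splatting helper are assumptions for illustration.
import torch

def synthesize_novel_view_video(frames, target_trajectory,
                                geometry_model, video_diffusion_model):
    """frames: [T, 3, H, W] monocular video;
    target_trajectory: one 4x4 target camera extrinsic per frame."""
    # Stage 1: temporally-consistent geometry estimation.
    # A model such as MonST3R predicts per-frame point maps and source
    # camera poses from the monocular video (assumed API).
    pointmaps, src_poses = geometry_model(frames)

    # Re-project the estimated geometry into each target camera. Regions
    # the geometry model cannot see become holes, recorded in a mask.
    renders, occlusion_masks = [], []
    for t, target_pose in enumerate(target_trajectory):
        rgb, mask = splat_points_to_view(            # hypothetical helper
            pointmaps[t], frames[t], src_poses[t], target_pose)
        renders.append(rgb)
        occlusion_masks.append(mask)
    renders = torch.stack(renders)
    occlusion_masks = torch.stack(occlusion_masks)

    # Stage 2: generative video rendering. The video diffusion model is
    # conditioned on the partial renders and fills in occluded regions
    # (assumed conditioning interface).
    novel_video = video_diffusion_model(cond_frames=renders,
                                        cond_masks=occlusion_masks)
    return novel_video
```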