TL;DR: Generate novel videos along any desired camera trajectory from real or animated videos.



How it Works

Given an input video, we first estimate the camera trajectory and the 3D geometry of the scene with video geometry estimators. By projecting the estimated 3D geometry onto the desired camera trajectory, we obtain the per-frame 2D flow between the input video and the novel video. Instead of directly using the frames rendered from the estimated 3D geometry, we ground the video diffusion model on this geometry by re-aligning the video feature tokens according to the 2D flow.
[Figure: how it works]
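
To make the projection step concrete, below is a minimal sketch (not the released code) of how the per-frame 2D flow can be computed from estimated depth and camera poses: source pixels are unprojected with the estimated depth, transformed into the desired target camera, and re-projected; the displacement between the two pixel positions is the flow. The tensor names (depth, K, src_c2w, tgt_c2w) are illustrative assumptions.

```python
import torch

def flow_from_geometry(depth, K, src_c2w, tgt_c2w):
    """depth: (H, W) estimated depth; K: (3, 3) intrinsics;
    src_c2w / tgt_c2w: (4, 4) camera-to-world poses of the source and target views.
    Returns the 2D flow (H, W, 2) mapping source pixels to the target view."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()               # (H, W, 3)

    # Unproject source pixels to 3D camera coordinates, then to world coordinates.
    cam_pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T) * depth.reshape(1, -1)  # (3, HW)
    cam_pts = torch.cat([cam_pts, torch.ones(1, cam_pts.shape[1])], dim=0)         # (4, HW)
    world_pts = src_c2w @ cam_pts                                                  # (4, HW)

    # Transform into the target camera and project with the same intrinsics
    # (assumes points stay in front of the target camera).
    tgt_pts = torch.linalg.inv(tgt_c2w) @ world_pts                                # (4, HW)
    proj = K @ tgt_pts[:3]                                                         # (3, HW)
    uv = (proj[:2] / proj[2:3].clamp(min=1e-6)).T.reshape(H, W, 2)                 # (H, W, 2)

    return uv - pix[..., :2]                                                       # flow in pixels
```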

The video diffusion model consists of spatial and temporal blocks. We enhance each spatial block by integrating tokens from the video encoder, re-aligned according to the estimated 2D flow, thereby transforming it into a multi-view block: the encoder tokens are concatenated with the key and value inputs of the block's self-attention layers.
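
Below is a minimal sketch of such a multi-view block, assuming a standard PyTorch attention layer; the module and argument names are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn

class MultiViewSpatialBlock(nn.Module):
    """Spatial self-attention augmented with flow-aligned reference tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, ref_tokens):
        # x:          (B, N, C) latent tokens of the frame being denoised
        # ref_tokens: (B, M, C) video-encoder tokens re-aligned by the 2D flow
        q = self.norm(x)
        # Keys/values cover both the frame's own tokens and the geometry-aligned
        # reference tokens, so the block attends across views.
        kv = torch.cat([q, ref_tokens], dim=1)
        out, _ = self.attn(q, kv, kv)
        return x + out
```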

Why incorporate geometry estimation?

While generative models have shown promising results in synthesizing highly realistic novel views of static scenes by training on large-scale multi-view image datasets (3D data), applying this approach to dynamic scenes is challenging due to the limited availability of real-world multi-view videos (4D data). To sidestep this issue, Generative Camera Dolly (GCD) trains the generative model on synthetic 4D data, which we find to be suboptimal for real-world applications due to domain gaps. Instead of taking this data-driven route, we decompose the task into two sub-tasks: (1) temporally consistent geometry estimation and (2) generative video rendering based on the estimated geometry. Incorporating this geometric prior reduces the burden on the generative model, enabling it to focus primarily on enhancing uncertain regions instead of learning full 4D dynamics from scratch, thereby greatly reducing the need for large-scale 4D training data.
[Figure]

To edit camera trajectories in monocular videos, we embed knowledge from video geometry prediction models, e.g., MonST3R, into video generative models, allowing the model to synthesize realistic novel views by filling in occluded regions that the geometry model cannot infer. By incorporating geometric cues for generation, our approach achieves superior performance on novel view video synthesis compared to fully generative approaches, e.g., Generative Camera Dolly.

Spatial-Temporal Factorized Fine-tuning

To further reduce the need for 4D data, we adopt a factorized fine-tuning strategy. Treating the spatial and temporal blocks of our video generative model independently, we train the spatial blocks with multi-view image (3D) data and the temporal blocks with video data. As both 3D and video data are available at scale, training the generative model no longer requires 4D data.
[Figure]
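
A minimal sketch of this training schedule is shown below, assuming the model exposes its spatial and temporal blocks as separate submodules (model.spatial_blocks and model.temporal_blocks are hypothetical names).

```python
def set_trainable(model, batch_type):
    """Freeze one set of blocks depending on the current batch:
    multi-view image (3D) batches update only the spatial blocks,
    monocular video batches update only the temporal blocks."""
    train_spatial = (batch_type == "multiview_3d")
    for p in model.spatial_blocks.parameters():
        p.requires_grad = train_spatial
    for p in model.temporal_blocks.parameters():
        p.requires_grad = not train_spatial

# In the training loop, batches from the two data sources are interleaved:
# for batch_type, batch in loader:
#     set_trainable(model, batch_type)
#     diffusion_loss(model, batch).backward()   # hypothetical loss helper
#     optimizer.step(); optimizer.zero_grad()
```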

Qualitative Comparisons

Quantitative Results

We show quantitative comparisons on the multi-view dynamic datasets Neu3D and ST-NeRF. Our method achieves the best scores in both LPIPS and Frame-Consistency (Frame-Con.), i.e., the CLIP score between each frame of the input video and the corresponding frame of the generated video. Note that we use datasets captured with stationary cameras, as our primary baseline, Generative Camera Dolly, does not officially support input videos from moving cameras, e.g., DyCheck.
[Figure]
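
For reference, below is a minimal sketch of the Frame-Consistency metric, assuming the Hugging Face CLIP implementation; the specific checkpoint is an assumption, not necessarily the one used in our evaluation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_consistency(input_frames, generated_frames):
    """input_frames, generated_frames: lists of PIL images of equal length.
    Returns the mean cosine similarity of CLIP embeddings of corresponding frames."""
    inp = processor(images=input_frames, return_tensors="pt")
    gen = processor(images=generated_frames, return_tensors="pt")
    e_in = model.get_image_features(**inp)
    e_gen = model.get_image_features(**gen)
    e_in = e_in / e_in.norm(dim=-1, keepdim=True)
    e_gen = e_gen / e_gen.norm(dim=-1, keepdim=True)
    return (e_in * e_gen).sum(dim=-1).mean().item()
```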

Application: Generate Additional Data for 4D Reconstruction

As our method can generate novel views from a single video, it can also be used to generate additional data for 4D reconstruction. Here, we show an example of leveraging data generated by our method when training the 4D reconstruction model Shape-of-Motion on the DyCheck dataset. As shown below, incorporating the additional data improves the overall reconstruction quality of the 4D reconstruction model.
[Figure]

BibTeX