TL;DR: 3DScenePrompt is a framework that generates the next video chunk from an arbitrary-length, in-the-wild input video, with precise camera control and scene consistency with the input video.

Teaser

Overall pipeline for generating the next video chunk that follows a user-specified camera trajectory while maintaining scene consistency. Our dual spatio-temporal conditioning combines the last few frames for temporal continuity and the rendered point cloud for spatial consistency.

Overview

Given an action prompt and a static 3D point cloud reconstructed from the input video, our model generates a dynamic video that follows a user-specified camera trajectory. The framework combines temporal conditioning on the last few frames, which ensures motion continuity, with spatial conditioning on rendered views of the static point cloud, which preserves scene geometry.

We present 3DScenePrompt, a framework for scene-consistent, camera-controllable video generation that extends arbitrary-length input videos along user-specified camera trajectories while preserving scene geometry. The key idea is a dual spatio-temporal conditioning strategy that conditions on both temporal cues (last few frames for motion continuity) and spatial cues (geometry-aware views for scene consistency).

Directly reusing spatially adjacent frames can leak past dynamics into the future. To avoid this, we build a 3D scene memory that stores only static geometry, reconstructed via dynamic SLAM and refined with a dynamic masking pipeline that removes moving objects.

From this memory, we render projected static views that act as 3D scene prompts, giving geometrically accurate spatial guidance while temporal conditioning drives natural motion. This enables precise camera control, long-range spatial coherence, and efficient computation.
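To make the idea of a rendered "3D scene prompt" concrete, the following is a minimal sketch of projecting a static point cloud into a target camera with a pinhole model and a nearest-point depth test. It is an illustration under stated assumptions, not the authors' renderer: the function name `render_static_view`, the world-to-camera convention for `R` and `t`, and the one-pixel splatting are all hypothetical choices.

```python
import numpy as np

def render_static_view(points, colors, K, R, t, height, width):
    """Project a static point cloud into a target camera (illustrative sketch).

    points: (N, 3) world-space XYZ; colors: (N, 3) RGB in [0, 1].
    K: (3, 3) intrinsics; R, t: assumed world-to-camera rotation and translation.
    Returns an (H, W, 3) image and an (H, W) boolean validity mask.
    """
    cam = points @ R.T + t                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                 # drop points behind the camera
    cam, colors = cam[in_front], colors[in_front]

    uvw = cam @ K.T                             # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], cam[inside, 2], colors[inside]

    # Depth test: sort near-to-far, then keep the first point seen per pixel.
    order = np.argsort(z)
    flat = (v * width + u)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    image = np.zeros((height, width, 3))
    valid = np.zeros((height, width), dtype=bool)
    image[v[keep], u[keep]] = colors[keep]
    valid[v[keep], u[keep]] = True
    return image, valid
```

Pixels where `valid` is false have no static geometry behind them, so in this sketch they would be the regions a generator has to fill in rather than copy.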

Analysis

Importance of Eliminating Dynamic Regions

Illustration of dynamic masking for static scene extraction.
(a) Without masking, moving objects create ghosting artifacts across frames.
(b) With our dynamic masking pipeline, dynamic elements are identified and removed, yielding clean static-only point clouds.

Dynamic masking plays a crucial role in separating motion from structure in scene-consistent video generation. By removing transient moving objects before reconstructing the 3D scene, it prevents ghosting artifacts and ensures that only persistent static geometry is preserved. This clean static representation enables accurate warping across viewpoints and maintains spatial coherence throughout the generated video.
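As a rough illustration of how a dynamic mask keeps moving objects out of the point cloud, the sketch below lifts only the static pixels of a single frame into world space. The inputs (a per-frame depth map, a boolean dynamic mask, intrinsics `K`, and a camera-to-world pose) and the name `unproject_static` are assumptions made for this example; the actual reconstruction in the work relies on dynamic SLAM.

```python
import numpy as np

def unproject_static(depth, dynamic_mask, K, cam_to_world):
    """Lift the static pixels of one frame to world-space points (sketch).

    depth: (H, W) depth map; dynamic_mask: (H, W) bool, True = moving pixel.
    K: (3, 3) intrinsics; cam_to_world: (4, 4) assumed camera pose.
    Returns an (M, 3) array of static-only points.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                   # pixel row and column grids
    keep = (~dynamic_mask) & (depth > 0)        # static pixels with valid depth
    z = depth[keep]
    x = (u[keep] - K[0, 2]) * z / K[0, 0]       # back-project through the pinhole
    y = (v[keep] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]    # camera -> world coordinates
```

Concatenating the output over all frames would give a static-only cloud; without the `~dynamic_mask` term, a moving object would be lifted once per frame at different positions, which is the ghosting shown in panel (a).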

Dynamic Region Detection Pipeline

Three-stage dynamic masking pipeline.
Motion detection → Tracking → Propagation.

Our Dynamic Masking Pipeline refines motion region detection through a three-stage process to produce complete object-level masks:

(1) Dynamic Thresholding — Optical flow differences detect pixel-level motion.
(2) Backward Tracking — Using CoTracker3, sampled points are tracked backward across all frames to aggregate motion evidence and identify objects that move at any point.
(3) Mask Propagation — Aggregated motion cues in the first frame are propagated to the entire video via SAM2, generating clean masks that remove moving elements while retaining the static background for precise 3D reconstruction.
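The first two stages can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it assumes the optical flow, a camera-induced ("rigid") flow, and point tracks are already available as arrays, it uses a fixed threshold `thresh` in place of the dynamic one, and the function names are hypothetical. The tracker (CoTracker3) and the segmenter (SAM2) are deliberately left out; the second function only returns first-frame point locations that could serve as prompts for stage (3).

```python
import numpy as np

def motion_masks(flow, rigid_flow, thresh=1.0):
    """Stage 1 (sketch): flag pixels whose flow deviates from the rigid flow.

    flow, rigid_flow: (T, H, W, 2) arrays. Returns a (T, H, W) bool array.
    The fixed threshold is an assumption made for this example.
    """
    residual = np.linalg.norm(flow - rigid_flow, axis=-1)
    return residual > thresh

def first_frame_prompts(tracks, visible, masks):
    """Stage 2 (sketch): keep tracks that touch a motion pixel in any frame.

    tracks: (T, N, 2) xy positions per frame; visible: (T, N) bool.
    masks: (T, H, W) bool from stage 1.
    Returns the (M, 2) first-frame positions of the dynamic tracks.
    """
    T, H, W = masks.shape
    x = np.clip(np.round(tracks[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(tracks[..., 1]).astype(int), 0, H - 1)
    frame = np.arange(T)[:, None]               # broadcasts against (T, N)
    hit = masks[frame, y, x] & visible          # motion evidence per frame
    dynamic = hit.any(axis=0)                   # moved at any point in time
    return tracks[0, dynamic]
```

Aggregating with `any` over time is what lets an object that is still in the first frame but moves later be marked as dynamic.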

Results

Evaluation of Spatial and Geometric Consistency

	RealEstate10K				DynPose-100K
Methods	PSNR ↑	SSIM ↑	LPIPS ↓	MEt3R ↓	PSNR ↑	SSIM ↑	LPIPS ↓	MEt3R ↓
DFoT	18.30	0.596	0.308	0.181	12.15	0.304	0.417	0.183
3DScenePrompt (Ours)	20.89	0.717	0.212	0.0408	13.05	0.367	0.381	0.124

Evaluation of spatial and geometric consistency.
We compare DFoT and 3DScenePrompt on RealEstate10K and DynPose-100K. PSNR, SSIM, and LPIPS evaluate spatial consistency, while MEt3R measures geometric consistency.
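For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below assumes images are arrays scaled to [0, 1]; the exact evaluation protocol behind the table (resolution, frame selection, averaging) is not specified here and is not reproduced.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.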

Camera Controllability Evaluation

	DynPose-100K
Methods	mRotErr (°) ↓	mTransErr ↓	mCamMC ↓
MotionCtrl	3.5654	7.8231	9.7834
CameraCtrl	3.3273	9.5989	11.2122
FloVD	3.4811	11.0302	12.6202
AC3D	3.0675	9.7044	11.1634
DFoT	2.3977	8.0866	9.2330
3DScenePrompt (Ours)	2.3772	7.4174	8.6352

Evaluation of camera controllability.
Lower mRotErr, mTransErr, and mCamMC indicate more accurate control over camera pose transitions.
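Rotation and translation errors of this kind are commonly computed per frame as a geodesic angle and a Euclidean distance. The sketch below shows those common definitions under the assumption that ground-truth and estimated poses are already expressed in the same frame and scale; the precise normalization and averaging used for the table, and the definition of mCamMC, are not covered.

```python
import numpy as np

def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_gt, t_pred):
    """Euclidean distance between two translation vectors."""
    return np.linalg.norm(t_gt - t_pred)
```

Averaging these per-frame values over a trajectory would give mean errors of the kind reported above.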

Video Generation Quality Evaluation

	DynPose-100K
Methods	FVD ↓	Overall Score ↑	Subject Consist ↑	Bg Consist ↑	Aesthetic Quality ↑	Imaging Quality ↑	Temporal Flicker ↑	Motion Smooth ↑	Dynamic Degree ↑
MotionCtrl	1017.42	0.5625	0.5158	0.7093	0.3157	0.3149	0.8297	0.8432	0.7900
CameraCtrl	737.05	0.6280	0.6775	0.8238	0.3736	0.3888	0.6837	0.6955	0.9900
FloVD	171.27	0.7273	0.7964	0.8457	0.4722	0.5546	0.7842	0.8364	0.9900
AC3D	281.21	0.7428	0.8360	0.8674	0.4766	0.5381	0.8020	0.8673	1.0000
3DScenePrompt (Ours)	127.48	0.7747	0.8669	0.8727	0.4990	0.5964	0.8551	0.9260	1.0000

Evaluation of video generation quality.
FVD evaluates overall temporal quality (lower is better). VBench++ metrics measure subject and background consistency, perceptual quality, and temporal smoothness (higher is better).

Ablation Study

		DynPose-100K
Methods	Dynamic Mask 𝓜	PSNR ↑	SSIM ↑	LPIPS ↓	MEt3R ↓
Ours (n=1)	✓	13.0207	0.3732	0.3771	0.1248
Ours (n=4)	✓	13.0382	0.3733	0.3758	0.1249
Ours (n=L)	✓	13.0206	0.3631	0.3810	0.1235
Ours (n=7)	✗	12.2304	0.3063	0.3821	0.1349
Ours (n=7)	✓	13.0468	0.3666	0.3812	0.1242

Ablation study on dynamic masking.
Removing the dynamic mask (𝓜) lowers PSNR and SSIM and raises MEt3R, confirming its importance for scene-consistent video generation. Here, n denotes the number of retrieved frames used for spatial conditioning.

Citation

If you use this work or find it helpful, please consider citing:

@misc{lee20253dscenepromptingsceneconsistent,
    title={3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation}, 
    author={JoungBin Lee and Jaewoo Jung and Jisang Han and Takuya Narihira and Kazumi Fukuda and Junyoung Seo and Sunghwan Hong and Yuki Mitsufuji and Seungryong Kim},
    year={2025},
    eprint={2510.14945},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2510.14945}, 
}

3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

arXiv 2025