Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim¹* Minkyeong Jeon¹* Honggyu An¹* Jaewoo Jung¹ Hyunah Ko¹ Jisang Han¹ Hyeonseo Yu¹ Donghwan Shin¹ Sunghwan Hong² Takuya Narihira³ Kazumi Fukuda³ Yuki Mitsufuji³,⁴† Seungryong Kim¹†

¹KAIST AI ²ETH AI Center, ETH Zurich ³SONY AI ⁴Sony Group Corporation

*Co-first authors †Co-advising authors


TL;DR

We propose a feed-forward dynamic reconstruction network that effectively captures the global motion of dynamic scenes. Our method employs a timestamp-conditioned, query-based transformer Gaussian decoder that aggregates geometrically consistent features from multi-frame videos, enabling each Gaussian to model globally coherent motion.

Overview

Existing feed-forward dynamic reconstruction methods predict pixel-wise 3D Gaussians for each frame, suffering from duplicated Gaussians and view-dependent biases that lead to ghost artifacts and occlusion issues. Our method replaces this design with learnable timestamp-conditioned queries that aggregate temporally and geometrically consistent evidence across the entire sequence. This compact bottleneck encourages globally coherent motion modeling and enables robust novel-view synthesis even under large temporal gaps.

To recover high-frequency details while preserving compactness, our framework further incorporates a diffusion-based rendering enhancement module. The same query-to-Gaussian aggregation is then reused for feature lifting, yielding a feed-forward 4D feature field that supports downstream tasks such as point tracking and dynamic scene understanding.

Comparison with Previous Works

Pixel-wise 4DGS [1, 2, 3]: predicts per-pixel Gaussians for each frame. This duplicates Gaussians and produces ghost artifacts at interpolated timestamps, and it introduces a view-dependent bias that underuses temporally distant evidence and leaves occluded regions poorly reconstructed.
Our method: uses a compact query bottleneck to aggregate evidence globally across time and views, yielding more consistent motion with fewer duplicated Gaussians and fewer occlusion failures.
[1] MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
[2] 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos
[3] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
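To make the compactness argument concrete, the short sketch below compares how many Gaussians each design would emit. The frame count, resolution, and query count are hypothetical values chosen only for illustration and are not taken from the paper; the function names are likewise illustrative.

```python
def pixelwise_gaussian_count(num_frames: int, height: int, width: int) -> int:
    """One Gaussian per pixel per input frame, so the count grows with video length."""
    return num_frames * height * width


def query_gaussian_count(num_queries: int) -> int:
    """One Gaussian per learnable query (an assumption), independent of video length."""
    return num_queries


# Hypothetical setting: 8 frames at 256x256 versus 4096 queries.
pixelwise = pixelwise_gaussian_count(num_frames=8, height=256, width=256)
query_based = query_gaussian_count(num_queries=4096)
ratio = pixelwise / query_based

print(pixelwise, query_based, ratio)
```

Under these assumed numbers the pixel-wise design emits 128 times more Gaussians, and the gap widens as more frames are added, whereas the query-based count stays fixed.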

Model Architecture

Our model first extracts timestamp-injected visual features, then uses learnable queries refined by transformer decoding to produce timestamp-conditioned Gaussians. Query tokens gather geometrically coherent signals across frames, and decoded Gaussian positions are modulated by the target timestamp for temporal interpolation. A diffusion-based refinement module improves rendering fidelity, and a feature decoder reuses attention patterns for 4D feature lifting.
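The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sinusoidal timestamp embedding, the additive timestamp injection, the single-head attention without learned projections, and the names `sinusoidal_embedding`, `decode_gaussian_centers`, `w_base`, and `w_motion` are all assumptions made for illustration. Only Gaussian centers are decoded, and the diffusion-based refinement is omitted.

```python
import numpy as np


def sinusoidal_embedding(t: float, dim: int) -> np.ndarray:
    """Encode a scalar timestamp as a dim-dimensional vector (dim must be even)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angles = 2.0 * np.pi * t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, tokens: np.ndarray):
    """Single-head cross-attention with a residual; returns refined queries and weights."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ tokens.T * scale, axis=-1)  # (Q, T*P)
    return queries + attn @ tokens, attn


def decode_gaussian_centers(frame_feats, frame_times, queries, w_base, w_motion, t_target):
    """Hypothetical decoder: queries attend over all frames, then the target
    timestamp modulates the decoded Gaussian centers."""
    T, P, D = frame_feats.shape
    # Timestamp-injected visual features: add each frame's time embedding to its tokens.
    time_emb = np.stack([sinusoidal_embedding(t, D) for t in frame_times])  # (T, D)
    tokens = (frame_feats + time_emb[:, None, :]).reshape(T * P, D)
    # Queries gather evidence from every frame at once.
    refined, attn = cross_attention(queries, tokens)
    # Time-independent center plus a target-timestamp-conditioned offset.
    base = refined @ w_base                                               # (Q, 3)
    offset = (refined + sinusoidal_embedding(t_target, D)) @ w_motion     # (Q, 3)
    return base + offset, attn


rng = np.random.default_rng(0)
T, P, D, Q = 4, 16, 32, 8  # frames, tokens per frame, feature dim, number of queries
feats = rng.normal(size=(T, P, D))
times = np.linspace(0.0, 1.0, T)
queries = rng.normal(size=(Q, D))
w_base = rng.normal(size=(D, 3)) * 0.1
w_motion = rng.normal(size=(D, 3)) * 0.1

centers, attn = decode_gaussian_centers(feats, times, queries, w_base, w_motion, t_target=0.5)

# Feature lifting: reuse the same attention weights to pool another per-token feature map.
semantic_tokens = rng.normal(size=(T * P, 16))
lifted = attn @ semantic_tokens

print(centers.shape, attn.shape, lifted.shape)  # (8, 3) (8, 64) (8, 16)
```

In this sketch the number of decoded Gaussians equals the number of queries regardless of how many frames are supplied, and changing `t_target` moves the centers without re-running attention over the input tokens.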

Quantitative Results

Citation