Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim1* Minkyeong Jeon1* Honggyu An1* Jaewoo Jung1 Hyunah Ko1 Jisang Han1 Hyeonseo Yu1 Donghwan Shin1 Sunghwan Hong2 Takuya Narihira3 Kazumi Fukuda3 Yuki Mitsufuji3,4† Seungryong Kim1†
1KAIST AI 2ETH AI Center, ETH Zurich 3SONY AI 4Sony Group Corporation
*Co-first authors †Co-advising authors
TL;DR

We propose a feed-forward dynamic reconstruction network that effectively captures the global motion of dynamic scenes. Our method employs a timestamp-conditioned, query-based transformer Gaussian decoder that aggregates geometrically consistent features from multi-frame videos, enabling each Gaussian to model globally coherent motion.

Overview

Existing feed-forward dynamic reconstruction methods predict pixel-wise 3D Gaussians for each frame, suffering from duplicated Gaussians and view-dependent biases that lead to ghost artifacts and occlusion issues. Our method replaces this design with learnable timestamp-conditioned queries that aggregate temporally and geometrically consistent evidence across the entire sequence. This compact bottleneck encourages globally coherent motion modeling and enables robust novel-view synthesis even under large temporal gaps.
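As a rough illustration of why a query bottleneck is compact, the arithmetic below compares the Gaussian count of a pixel-wise design with a fixed query budget. The frame count, resolution, and number of queries are hypothetical values chosen for illustration, not figures reported for our model.

```python
def pixelwise_gaussian_count(num_frames: int, height: int, width: int) -> int:
    """One Gaussian per pixel per frame, as in pixel-wise feed-forward designs."""
    return num_frames * height * width


def query_gaussian_count(num_queries: int, gaussians_per_query: int = 1) -> int:
    """A fixed budget set by the number of learnable queries (assumed layout)."""
    return num_queries * gaussians_per_query


# Hypothetical setting: 8 input frames at 256x256 and 16384 queries.
pixelwise = pixelwise_gaussian_count(8, 256, 256)
compact = query_gaussian_count(16384)
ratio = pixelwise / compact

print(pixelwise, compact, ratio)
```

Under these assumed numbers the pixel-wise count grows with every added frame, while the query budget stays fixed regardless of sequence length.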

To recover high-frequency details while preserving compactness, our framework further incorporates a diffusion-based rendering enhancement module. The same query-to-Gaussian aggregation is then reused for feature lifting, yielding a feed-forward 4D feature field that supports downstream tasks such as point tracking and dynamic scene understanding.
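One way to picture the feature lifting step is as a weighted average: if each Gaussian already has attention weights over image tokens, the same weights can pool per-token 2D features onto that Gaussian. The sketch below is a minimal pure-Python illustration under that assumption; `lift_features`, the toy weights, and the 2-channel features are hypothetical and do not reproduce the actual feature decoder.

```python
def lift_features(attention, token_features):
    """Pool per-token 2D features onto each Gaussian with existing attention weights.

    attention: one row per Gaussian; each row sums to 1 over the image tokens.
    token_features: one feature vector per image token.
    """
    channels = len(token_features[0])
    return [
        [sum(w * feat[c] for w, feat in zip(row, token_features)) for c in range(channels)]
        for row in attention
    ]


# Toy attention weights: 2 Gaussians over 3 image tokens (made-up values).
attention = [
    [0.5, 0.5, 0.0],
    [0.0, 0.25, 0.75],
]
# Toy 2-channel token features, e.g. from a 2D backbone (an assumption).
token_features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [4.0, 4.0],
]

gaussian_features = lift_features(attention, token_features)
print(gaussian_features)  # [[0.5, 0.5], [3.0, 3.25]]
```

Because the weights are reused rather than recomputed, each Gaussian's feature is drawn from the same image evidence that shaped its geometry.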

Qualitative Results

Comparison with Previous Works

[Figure 2: Comparison with previous works]

Model Architecture

Our model first extracts timestamp-injected visual features, then uses learnable queries refined by transformer decoding to produce timestamp-conditioned Gaussians. Query tokens gather geometrically coherent signals across frames, and decoded Gaussian positions are modulated by the target timestamp for temporal interpolation. A diffusion-based refinement module improves rendering fidelity, and a feature decoder reuses attention patterns for 4D feature lifting.
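The data flow described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the actual network: the sinusoidal `time_embedding`, the single-head `cross_attend`, and the linear `decode_position` head (base position plus timestamp-scaled displacement) are hypothetical stand-ins for the learned components, and random values replace real image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def time_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestamp (assumed form)."""
    return [math.sin((k + 1) * t) for k in range(dim)]


def inject_timestamp(tokens, t):
    """Add the frame's timestamp embedding to every token of that frame."""
    emb = time_embedding(t, len(tokens[0]))
    return [[x + e for x, e in zip(tok, emb)] for tok in tokens]


def cross_attend(queries, tokens):
    """Single-head cross-attention; returns refined queries and attention weights."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    refined, weights = [], []
    for q in queries:
        w = softmax([dot(q, tok) * scale for tok in tokens])
        weights.append(w)
        refined.append([sum(wi * tok[d] for wi, tok in zip(w, tokens)) for d in range(len(q))])
    return refined, weights


def decode_position(query, target_t):
    """Hypothetical head: dims 0-2 are a base position, dims 3-5 a displacement
    scaled by the target timestamp (a linear stand-in for learned modulation)."""
    base, motion = query[:3], query[3:6]
    return [b + target_t * m for b, m in zip(base, motion)]


random.seed(0)
dim, num_queries, tokens_per_frame = 8, 4, 5
frame_times = [0.0, 0.5, 1.0]

# Tokens from all frames are pooled into one sequence, so each query sees the whole clip.
tokens = []
for t in frame_times:
    frame = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens_per_frame)]
    tokens.extend(inject_timestamp(frame, t))

queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
refined, attn = cross_attend(queries, tokens)

# The same set of Gaussians is evaluated at two target timestamps.
pos_start = [decode_position(q, 0.0) for q in refined]
pos_mid = [decode_position(q, 0.25) for q in refined]
```

The point of the sketch is that the number of decoded Gaussians equals the number of queries, independent of how many frames are attended to, and that a timestamp between input frames (0.25 here) is handled by the same decoding path.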

[Figure: Model architecture]

Quantitative Results

Citation

@article{,
  title={Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction},
  author={Kim, Mungyeom and Jeon, Minkyeong and An, Honggyu and Jung, Jaewoo and Ko, Hyunah and Han, Jisang and Yu, Hyeonseo and Shin, Donghwan and Hong, Sunghwan and Narihira, Takuya and Fukuda, Kazumi and Mitsufuji, Yuki and Kim, Seungryong},
  journal={},
  year={2026}
}