D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes

Jisang Han1*, Honggyu An1*, Jaewoo Jung1*, Takuya Narihira3, Junyoung Seo1,
Kazumi Fukuda3, Chaehyun Kim1, Sunghwan Hong2, Yuki Mitsufuji3,4†, Seungryong Kim1†
1KAIST AI, 2Korea University, 3Sony AI, 4Sony Group Corporation
*Co-first authors, †Co-corresponding authors
arXiv 2025
Given a pair of input views, our D2USt3R accurately establishes dense correspondences not only in static regions but also in dynamic regions, enabling full reconstruction of a dynamic scene via our proposed 4D pointmap. As highlighted by the red dots • in the pointmap, DUSt3R and MonST3R align pointmaps solely based on camera motion, causing corresponding 2D pixels to become misaligned in 3D space. We compare the cross-attention maps, correspondence fields, and depth maps produced by our D2USt3R against those of baseline methods.

Abstract

We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D2USt3R, which regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates spatio-temporal dense correspondence in the proposed 4D pointmaps, enhancing the performance of downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior reconstruction performance across various datasets featuring complex motions.

Motivation

Cross-attention visualization of DUSt3R on static and dynamic scenes. We visualize the attention maps over the source image corresponding to the query point marked in red on the target image, both layer-wise and averaged across all layers. While DUSt3R reliably captures geometric correspondences for 3D reconstruction, it fails to establish correspondences in dynamic regions. This limitation arises because its training signal assumes the scene is static across both frames and can be modeled by rigid camera motion alone, which does not account for object motion.
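For reference, below is a minimal sketch (not the authors' code) of how such layer-wise and layer-averaged cross-attention maps can be produced. It assumes the decoder's cross-attention weights have been collected with forward hooks into `attn_maps`, a list of per-layer tensors of shape [num_heads, N_tgt, N_src]; `query_xy`, `img_hw`, and `patch` are likewise placeholder names.

```python
import torch
import torch.nn.functional as F

def cross_attention_heatmaps(attn_maps, query_xy, img_hw, patch=16):
    """Heatmaps over the source image for one query point on the target image."""
    H, W = img_hw
    h, w = H // patch, W // patch                 # token grid resolution
    qx, qy = query_xy
    q_idx = (qy // patch) * w + (qx // patch)     # token index of the query point

    per_layer = []
    for attn in attn_maps:                        # attn: [num_heads, N_tgt, N_src]
        a = attn.mean(dim=0)[q_idx]               # average heads, take the query row
        a = a.reshape(1, 1, h, w)                 # back onto the source token grid
        a = F.interpolate(a, size=(H, W), mode="bilinear", align_corners=False)
        per_layer.append(a[0, 0])                 # layer-wise map at image resolution

    layer_avg = torch.stack(per_layer).mean(dim=0)  # averaged across all layers
    return per_layer, layer_avg
```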

Comparison of cross-attention on a dynamic scene. We visualize the attention maps of the source and target images from a dynamic scene in the figure above, both layer-wise and averaged across all layers. Although MonST3R is trained on dynamic videos, its training signal remains identical to that of DUSt3R. As a result, it fails to achieve reliable alignment between frames, ultimately limiting its 3D reconstruction performance. In contrast, our D2USt3R successfully establishes correspondences between dynamic frames and is therefore better able to estimate 3D structure under dynamic motion.

Construction of our alignment loss

We propose a pipeline for generating a 4D pointmap while explicitly addressing occlusions caused by dynamic regions. (a) From the input images, we obtain optical flow refined via cycle consistency checks and derive a dynamic mask Mdyn. To align image I2 with image I1, we utilize: 1) the camera pose to align static regions in I2, and 2) optical flow to align dynamic regions. The alignment process is conducted specifically within regions corresponding to the colored pixels. Through this process, we construct a 4D pointmap, enabling alignment in 3D space for all corresponding 2D pixels.
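As a concrete illustration, here is a minimal sketch, under our reading of the figure rather than the released code, of how the alignment target for I2's pointmap in I1's frame could be assembled. All tensor names (pts1_cam1, pts2_cam2, T_21, flow_21, M_dyn, valid) are placeholders for the quantities the caption refers to.

```python
import torch

def build_alignment_target(pts1_cam1, pts2_cam2, T_21, flow_21, M_dyn, valid):
    """
    pts1_cam1 : [H, W, 3] ground-truth pointmap of I1 in I1's camera frame
    pts2_cam2 : [H, W, 3] ground-truth pointmap of I2 in I2's camera frame
    T_21      : [4, 4]    relative pose mapping I2's camera frame into I1's frame
    flow_21   : [H, W, 2] optical flow from I2 to I1 (cycle-consistency filtered)
    M_dyn     : [H, W]    boolean dynamic-region mask on I2
    valid     : [H, W]    boolean mask of pixels that pass the consistency checks
    """
    H, W, _ = pts2_cam2.shape

    # 1) Static regions: align I2's points to I1's frame using the camera pose.
    pts2_h = torch.cat([pts2_cam2, torch.ones_like(pts2_cam2[..., :1])], dim=-1)
    static_target = (pts2_h.reshape(-1, 4) @ T_21.T)[:, :3].reshape(H, W, 3)

    # 2) Dynamic regions: follow the optical flow to fetch, for each pixel of I2,
    #    the 3D point of its corresponding pixel in I1, so that corresponding
    #    2D pixels coincide in 3D despite object motion.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    tgt_x = (xs + flow_21[..., 0]).round().long().clamp(0, W - 1)
    tgt_y = (ys + flow_21[..., 1]).round().long().clamp(0, H - 1)
    dynamic_target = pts1_cam1[tgt_y, tgt_x]

    # 3) Compose the per-pixel target and keep only the supervised (valid) pixels.
    target = torch.where(M_dyn[..., None], dynamic_target, static_target)
    return target, valid
```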

Qualitative results

Pointmap reconstruction


Depth estimation


Correspondences


Optical flow estimation


Quantitative Results

We evaluate our method on depth estimation and camera pose estimation, and additionally measure pointmap alignment accuracy in dynamic regions. We compare against existing state-of-the-art pointmap regression models, specifically DUSt3R, MASt3R, and MonST3R. *: reproduced with the same training data as ours.
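For context, here is a minimal sketch of the Abs Rel and δ < 1.25 depth metrics that are standard for this task, assuming per-image median scale alignment over valid pixels; the exact protocol used to produce the tables below (scaling, masking, multi-frame alignment) may differ.

```python
import numpy as np

def depth_metrics(pred, gt, valid_mask):
    """Abs Rel and delta < 1.25 after per-image median scale alignment."""
    pred, gt = pred[valid_mask], gt[valid_mask]
    pred = pred * np.median(gt) / np.median(pred)        # median scale alignment
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))     # absolute relative error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))                # inlier ratio at 1.25
    return abs_rel, delta1
```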

Multi-frame depth estimation results


Single-frame depth estimation results


Camera pose estimation results


Evaluation of pointmap alignment accuracy on dynamic objects
