arXiv 2026

GARD: Geometry-Aware Representation Denoisingfor Robust Multi-view 3D Reconstruction

Jin Hyeon Kim^1,*

Jaeeun Lee^1,*

Claire Kim¹

Kyoungjin Oh¹

Paul Hyunbin Cho¹

Jaewon Min¹

Yeji Choi¹

Jihye Park²

Hyunhee Park²

Minkyu Park²

Seungryong Kim^1,†

¹KAIST AI²Samsung Electronics

* Equal contribution · † Corresponding author

Paper Code

Figure 1 — GARD framework: degraded views and depths (left) are transformed into restored views, depths, and a clean 3D point cloud (right). — **GARD** restores degraded multi-view inputs jointly in geometry-aware feature space, recovering accurate 3D point clouds and high-quality RGB images simultaneously.

TL;DR

We propose Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model, enabling simultaneous recovery of 3D scene geometry and high-quality RGB images from degraded multi-view inputs. Our method outperforms existing pixel-space and VAE-based latent-space restoration approaches, demonstrating the effectiveness of geometry-aware feature denoising for robust multi-view 3D reconstruction.

MOTIVATION

Pixel-Space and VAE-Based Restoration Fall Short for 3D Reconstruction

Feed-forward 3D reconstruction models perform well under clean conditions, but fail under real-world degradations such as motion blur.

Where should restoration happen?

⚠️ Pixel-space restoration: Applying a pixel-space denoiser independently to each view fails to leverage multi-view information and cannot enforce cross-view geometry consistency during restoration. This often leads to view-dependent artifacts and inconsistencies that propagate to the reconstructor, resulting in impaired geometry estimation performance.

⚠️ Multi-view restoration in VAE-based latent space: Restoration in compressed VAE-based latent space can introduce an information bottleneck, discarding fine-grained details and high-frequency structures essential for accurate multi-view correspondence and 3D reconstruction.

💡 GARD: Perform denoising directly in the geometry-aware feature space of a feed-forward reconstructor, preserving both cross-view consistency and visual detail, making it more suitable for restoration than pixel space or compressed VAE latents.

METHOD

Overall Framework

Figure 3 — GARD framework overview: (a) denoiser inserted into frozen multi-view encoder, (b) training with interpolated flow matching + attention alignment loss, (c) multi-view DDT encoder-decoder architecture with global and frame attention layers. — The GARD denoiser is inserted at layer K of a frozen multi-view encoder, restoring degraded representations before they propagate to geometry and RGB decoders.

(a) GARD Framework: The GARD denoiser is learned within the representation space of a frozen multi-view encoder to restore degraded intermediate representations into restored representations before they are propagated through the remaining encoder layers. The restored representations are then decoded by their respective decoders to produce geometry predictions and restored RGB images.

(b) GARD Denoiser Training: The GARD denoiser is optimized using an interpolated flow matching loss together with an attention alignment loss, which jointly learns the mapping from degraded to clean feature representations while preserving geometric consistency through explicit alignment of attention maps.

(c) GARD Denoiser Architecture: The GARD denoiser adopts a multi-view latent diffusion architecture, comprising a DDT encoder and a DDT wide decoder, with global attention layers inserted to enable multi-view modeling, thereby facilitating global context aggregation and reconstruction of high-dimensional multi-view representations.

METHOD

Geometry-Aware Feature Analysis

(a) PCK performance on high-quality (HQ) input images. (b) PCK performance under progressively increasing levels of degradation (mild, moderate, and severe), demonstrating robustness to input corruption.

We evaluate the PCK accuracy of three feature cost volumes under two experimental settings to validate the effectiveness of our proposed denoising space. DA3 features preserve geometric structure more effectively than VAE and DINOv2, achieving higher keypoint correspondence accuracy and stronger robustness to degradation, making them well-suited for representation-level denoising and downstream 3D reconstruction.

Cross-view correspondence visualization of feature cost volumes constructed from VAE, DINOv2, and DA3 feature cost volumes under (a) clean high-quality (HQ) multi-view inputs and (b) progressively increasing degradation levels of degraded multi-view inputs (mild, moderate, and heavy).

DA3: Produce the sharpest and most geometrically consistent correspondences across views under both clean and degraded conditions.

DINOv2: Achieves reasonably accurate matching with less precise and less stable response regions.

VAE: Exhibit highly scattered and ambiguous correspondences under both clean and degraded conditions, reflecting limited geometry-aware representation quality caused by heavy latent compression.

💡

DA3 > DINOv2 ≫ VAE in correspondence accuracy and robustness, maintaining strong geometric consistency as degradation increases.

Evaluation

Quantitative Results

Method	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
HQ Input	96.65	84.68	98.70	86.91	92.95
LQ Input	32.90	61.38	66.43	51.39	71.02
Restormer	26.68	57.68	67.80	75.12	76.12
HI-Diff	16.02	52.56	51.13	52.72	71.61
InstructIR	27.68	53.80	85.91	74.22	68.23
MoCE-IR	28.60	63.25	84.61	65.20	72.99
VRT	30.17	58.98	67.01	48.02	72.67
FMA-Net	18.29	53.81	40.62	38.23	55.40
VAE_MVD	28.70	35.20	60.50	76.50	75.00
GARD (Ours)	67.22	74.68	92.37	84.73	87.45

Method	HiRoom	ETH3D	DTU (Overall ↓)	7Scenes	ScanNet++
HQ Input	84.05	60.81	2.475	45.15	50.25
LQ Input	11.74	37.50	6.611	18.40	24.13
Restormer	11.21	33.97	7.272	27.92	30.45
HI-Diff	8.07	35.10	7.758	25.92	25.83
InstructIR	12.41	33.71	5.563	29.80	26.06
MoCE-IR	10.62	37.15	6.120	26.31	25.97
VRT	9.45	35.14	7.570	19.53	26.72
FMA-Net	9.61	35.65	7.415	13.37	19.66
VAE_MVD	11.26	25.64	7.745	28.16	28.38
GARD (Ours)	18.25	45.79	4.760	36.08	35.77

Method	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
Restormer	17.49	20.97	17.73	21.30	21.50
HI-Diff	17.35	20.45	17.39	19.82	20.68
InstructIR	17.51	20.93	20.38	20.93	21.15
MoCE-IR	17.69	21.00	20.38	20.60	21.19
VRT	17.47	20.82	17.61	19.79	20.83
FMA-Net	17.14	20.65	17.13	18.84	19.98
VAE_MVD	19.76	21.37	20.54	21.74	21.19
GARD (Ours)	21.89	21.88	21.25	22.67	22.19

Method	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
HQ Input	99.4	99.7	95.2	93.7	97.5
LQ Input	78.6	90.4	94.5	81.9	89.5
Restormer	74.1	82.2	94.6	85.0	89.1
HI-Diff	70.2	86.3	93.8	82.2	89.2
InstructIR	71.7	81.3	95.0	83.3	83.4
MoCE-IR	76.8	88.3	95.6	84.5	89.6
VRT	77.9	90.2	94.7	80.5	89.8
FMA-Net	70.8	88.0	94.1	76.8	85.5
VAE_MVD	80.2	95.4	95.9	89.0	93.2
GARD (Ours)	97.2	98.4	96.1	92.7	96.7

💡

GARD achieves the best performance across all datasets on pose estimation, 3D reconstruction, and image restoration, demonstrating that restoring in a geometry-aware feature space preserves both structural fidelity and cross-view consistency.

EVALUATION

Qualitative Results

1. Pose Estimation

1 / 6

2. 3D Reconstruction

1 / 4

3. Image Restoration

1 / 5

4. Depth Estimation

1 / 5

💡

Restoring directly in a geometry-aware feature space preserves both visual fidelity and cross-view consistency, leading to sharper images, more accurate geometry, and improved downstream 3D understanding.

Citation

@misc{kim2026geometryawarerepresentationdenoisingrobust,
              title={Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction},
              author={Jin Hyeon Kim and Jaeeun Lee and Claire Kim and Kyoungjin Oh and Paul Hyunbin Cho and Jaewon Min and Yeji Choi and Jihye Park and Hyunhee Park and Minkyu Park and Seungryong Kim},
              year={2026},
              eprint={2605.26230},
              archivePrefix={arXiv},
              primaryClass={cs.CV},
              url={https://arxiv.org/abs/2605.26230},
            }