arXiv 2026

GARD: Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Jin Hyeon Kim1,*
Jaeeun Lee1,*
Claire Kim1
Kyoungjin Oh1
Paul Hyunbin Cho1
Jaewon Min1
Yeji Choi1
Jihye Park2
Hyunhee Park2
Minkyu Park2
Seungryong Kim1,†
1KAIST AI · 2Samsung Electronics

* Equal contribution · † Corresponding author

Figure 1 — GARD framework: degraded views and depths (left) are transformed into restored views, depths, and a clean 3D point cloud (right).
GARD restores degraded multi-view inputs jointly in geometry-aware feature space, recovering accurate 3D point clouds and high-quality RGB images simultaneously.
TL;DR

We propose Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model, enabling simultaneous recovery of 3D scene geometry and high-quality RGB images from degraded multi-view inputs. Our method outperforms existing pixel-space and VAE-based latent-space restoration approaches, demonstrating the effectiveness of geometry-aware feature denoising for robust multi-view 3D reconstruction.

MOTIVATION

Pixel-Space and VAE-Based Restoration Fall Short for 3D Reconstruction

Feed-forward 3D reconstruction models perform well under clean conditions, but fail under real-world degradations such as motion blur.

Where should restoration happen?

⚠️ Pixel-space restoration: Applying a pixel-space denoiser independently to each view fails to leverage multi-view information and cannot enforce cross-view geometry consistency during restoration. This often leads to view-dependent artifacts and inconsistencies that propagate to the reconstructor, resulting in impaired geometry estimation performance.

⚠️ Multi-view restoration in VAE-based latent space: Restoration in compressed VAE-based latent space can introduce an information bottleneck, discarding fine-grained details and high-frequency structures essential for accurate multi-view correspondence and 3D reconstruction.

💡 GARD: Perform denoising directly in the geometry-aware feature space of a feed-forward reconstructor. This space preserves both cross-view consistency and visual detail, making it better suited to restoration than pixel space or compressed VAE latents.

METHOD

Overall Framework

Figure 3 — GARD framework overview: (a) denoiser inserted into frozen multi-view encoder, (b) training with interpolated flow matching + attention alignment loss, (c) multi-view DDT encoder-decoder architecture with global and frame attention layers.
The GARD denoiser is inserted at layer K of a frozen multi-view encoder, restoring degraded representations before they propagate to geometry and RGB decoders.

(a) GARD Framework: The GARD denoiser is trained in the representation space of a frozen multi-view encoder. It restores the degraded intermediate representations before they pass through the remaining encoder layers. The restored representations are then decoded by their respective decoders to produce geometry predictions and restored RGB images.
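The data flow in (a) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `encode_with_denoiser`, the list-of-floats feature type, and the toy layers are all hypothetical stand-ins, and the only idea taken from the text is that a denoiser is applied after the first K frozen layers and before the remaining ones.

```python
from typing import Callable, List, Sequence

Features = List[float]
Layer = Callable[[Features], Features]


def encode_with_denoiser(
    layers: Sequence[Layer], denoiser: Layer, k: int, x: Features
) -> Features:
    """Run the first k frozen layers, denoise the features, then run the rest."""
    if not 0 <= k <= len(layers):
        raise ValueError("k must lie between 0 and the number of layers")
    for layer in layers[:k]:
        x = layer(x)
    x = denoiser(x)  # restoration happens in feature space, not pixel space
    for layer in layers[k:]:
        x = layer(x)
    return x


# Toy stand-ins (hypothetical): every "layer" adds 1 to each feature,
# and the "denoiser" clamps negative features to 0.
toy_layers = [lambda f: [v + 1.0 for v in f] for _ in range(4)]
toy_denoiser = lambda f: [max(v, 0.0) for v in f]

restored = encode_with_denoiser(toy_layers, toy_denoiser, k=2, x=[-5.0, 0.5])
```

Because the encoder layers stay frozen, only the denoiser would carry trainable parameters in such a setup; the geometry and RGB decoders would both consume the output of the final layer.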

(b) GARD Denoiser Training: The GARD denoiser is optimized with an interpolated flow matching loss together with an attention alignment loss. The flow matching term teaches the denoiser the mapping from degraded to clean feature representations, while the alignment term preserves geometric consistency by explicitly aligning attention maps.
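One plausible reading of the two training terms in (b) is sketched below. The page does not give the exact formulation, so everything here is an assumption: the straight-line interpolation between degraded and clean features, the mean-squared form of the attention alignment term, the weight `weight=0.1`, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def interpolate(x_deg: Sequence[float], x_clean: Sequence[float], t: float) -> Vector:
    """Point on the straight path from degraded (t=0) to clean (t=1) features."""
    return [(1.0 - t) * d + t * c for d, c in zip(x_deg, x_clean)]


def target_velocity(x_deg: Sequence[float], x_clean: Sequence[float]) -> Vector:
    """Constant velocity of the straight path: clean minus degraded."""
    return [c - d for d, c in zip(x_deg, x_clean)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def softmax(row: Sequence[float]) -> Vector:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: Matrix, keys: Matrix) -> Matrix:
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def attention_alignment(attn_pred: Matrix, attn_ref: Matrix) -> float:
    """Mean squared difference between two attention maps (assumed form)."""
    flat_pred = [v for row in attn_pred for v in row]
    flat_ref = [v for row in attn_ref for v in row]
    return mse(flat_pred, flat_ref)


def training_loss(
    velocity_model: Callable[[Vector, float], Vector],
    x_deg: Vector, x_clean: Vector, t: float,
    attn_pred: Matrix, attn_ref: Matrix, weight: float = 0.1,
) -> float:
    """Flow matching term plus weighted attention alignment term."""
    x_t = interpolate(x_deg, x_clean, t)
    v_pred = velocity_model(x_t, t)
    flow_term = mse(v_pred, target_velocity(x_deg, x_clean))
    return flow_term + weight * attention_alignment(attn_pred, attn_ref)


# Toy check: an oracle that outputs the true velocity incurs zero loss.
x_deg, x_clean = [0.0, 2.0], [1.0, 0.0]
attn = attention_map([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
oracle_loss = training_loss(lambda x_t, t: [1.0, -2.0], x_deg, x_clean, 0.3, attn, attn)
```

In a real training loop `t` would be sampled per batch and the attention maps would come from the frozen encoder on clean and restored inputs; the sketch only fixes the shape of the objective.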

(c) GARD Denoiser Architecture: The GARD denoiser adopts a multi-view latent diffusion architecture comprising a DDT encoder and a DDT wide decoder. Global attention layers are inserted to enable multi-view modeling, which facilitates global context aggregation and the reconstruction of high-dimensional multi-view representations.
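The difference between the frame and global attention layers mentioned in (c) can be illustrated with attention masks. This is a sketch under assumptions: tokens are assumed to be flattened view by view, and the function names are hypothetical. It shows only which tokens may attend to which, not how GARD orders or parameterizes its layers.

```python
from typing import List

Mask = List[List[bool]]


def frame_attention_mask(n_views: int, n_tokens: int) -> Mask:
    """Token i may attend to token j only if both belong to the same view."""
    total = n_views * n_tokens
    return [
        [(i // n_tokens) == (j // n_tokens) for j in range(total)]
        for i in range(total)
    ]


def global_attention_mask(n_views: int, n_tokens: int) -> Mask:
    """Every token may attend to every token of every view."""
    total = n_views * n_tokens
    return [[True] * total for _ in range(total)]


frame_mask = frame_attention_mask(n_views=2, n_tokens=2)
global_mask = global_attention_mask(n_views=2, n_tokens=2)
```

Frame attention keeps the cost at `n_views * n_tokens**2` allowed pairs, while global attention allows all `(n_views * n_tokens)**2` pairs, which is what lets information flow across views.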

METHOD

Geometry-Aware Feature Analysis

(a) PCK performance on high-quality (HQ) input images. (b) PCK performance under progressively increasing levels of degradation (mild, moderate, and severe), demonstrating robustness to input corruption.

We evaluate the PCK accuracy of three feature cost volumes under two experimental settings to validate the effectiveness of our proposed denoising space. DA3 features preserve geometric structure more effectively than VAE and DINOv2, achieving higher keypoint correspondence accuracy and stronger robustness to degradation, making them well-suited for representation-level denoising and downstream 3D reconstruction.
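As a rough illustration of how PCK can be computed from a feature cost volume, the sketch below matches each source keypoint to the target location with the highest cosine similarity and counts the match as correct when it lands within a pixel threshold of the ground truth. The function names, the cosine similarity choice, and the fixed pixel threshold are assumptions for illustration; evaluation protocols often set the threshold as a fraction of the image size, and the authors' exact protocol is not given on this page.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(sum(v * v for v in a)) * math.sqrt(sum(v * v for v in b))
    if norm == 0.0:
        return 0.0  # treat all-zero features as uncorrelated
    return sum(p * q for p, q in zip(a, b)) / norm


def cost_volume(
    src_feats: Sequence[Sequence[float]], tgt_feats: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Similarity between every source feature and every target feature."""
    return [[cosine(s, t) for t in tgt_feats] for s in src_feats]


def pck(
    cost: Sequence[Sequence[float]],
    tgt_xy: Sequence[Point],
    gt_xy: Sequence[Point],
    threshold: float,
) -> float:
    """Fraction of source keypoints whose best match lies within `threshold`."""
    correct = 0
    for row, gt in zip(cost, gt_xy):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over targets
        if math.dist(tgt_xy[best], gt) <= threshold:
            correct += 1
    return correct / len(gt_xy)


# Toy example (hypothetical data): four target locations on a 2x2 grid.
tgt_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tgt_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
src_feats = [[0.9, 0.1], [0.1, 0.9]]   # two source keypoints
gt_xy = [(0.0, 0.0), (1.0, 1.0)]       # their true target locations

volume = cost_volume(src_feats, tgt_feats)
strict_pck = pck(volume, tgt_xy, gt_xy, threshold=0.5)
loose_pck = pck(volume, tgt_xy, gt_xy, threshold=1.0)
```

In this toy case the second keypoint is matched one pixel away from its ground truth, so it counts as correct only under the looser threshold.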

Cross-view correspondence visualization of feature cost volumes constructed from VAE, DINOv2, and DA3 features under (a) clean high-quality (HQ) multi-view inputs and (b) degraded multi-view inputs at progressively increasing degradation levels (mild, moderate, and heavy).
1. DA3: Produces the sharpest and most geometrically consistent correspondences across views under both clean and degraded conditions.

2. DINOv2: Achieves reasonably accurate matching, but with less precise and less stable response regions.

3. VAE: Exhibits highly scattered and ambiguous correspondences under both clean and degraded conditions, reflecting limited geometry-aware representation quality caused by heavy latent compression.

💡 DA3 > DINOv2 ≫ VAE in correspondence accuracy and robustness, with DA3 maintaining strong geometric consistency as degradation increases.

EVALUATION

Quantitative Results

| Method | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 96.65 | 84.68 | 98.70 | 86.91 | 92.95 |
| LQ Input | 32.90 | 61.38 | 66.43 | 51.39 | 71.02 |
| Restormer | 26.68 | 57.68 | 67.80 | 75.12 | 76.12 |
| HI-Diff | 16.02 | 52.56 | 51.13 | 52.72 | 71.61 |
| InstructIR | 27.68 | 53.80 | 85.91 | 74.22 | 68.23 |
| MoCE-IR | 28.60 | 63.25 | 84.61 | 65.20 | 72.99 |
| VRT | 30.17 | 58.98 | 67.01 | 48.02 | 72.67 |
| FMA-Net | 18.29 | 53.81 | 40.62 | 38.23 | 55.40 |
| VAEMVD | 28.70 | 35.20 | 60.50 | 76.50 | 75.00 |
| GARD (Ours) | 67.22 | 74.68 | 92.37 | 84.73 | 87.45 |

| Method | HiRoom | ETH3D | DTU (Overall ↓) | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 84.05 | 60.81 | 2.475 | 45.15 | 50.25 |
| LQ Input | 11.74 | 37.50 | 6.611 | 18.40 | 24.13 |
| Restormer | 11.21 | 33.97 | 7.272 | 27.92 | 30.45 |
| HI-Diff | 8.07 | 35.10 | 7.758 | 25.92 | 25.83 |
| InstructIR | 12.41 | 33.71 | 5.563 | 29.80 | 26.06 |
| MoCE-IR | 10.62 | 37.15 | 6.120 | 26.31 | 25.97 |
| VRT | 9.45 | 35.14 | 7.570 | 19.53 | 26.72 |
| FMA-Net | 9.61 | 35.65 | 7.415 | 13.37 | 19.66 |
| VAEMVD | 11.26 | 25.64 | 7.745 | 28.16 | 28.38 |
| GARD (Ours) | 18.25 | 45.79 | 4.760 | 36.08 | 35.77 |

| Method | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| Restormer | 17.49 | 20.97 | 17.73 | 21.30 | 21.50 |
| HI-Diff | 17.35 | 20.45 | 17.39 | 19.82 | 20.68 |
| InstructIR | 17.51 | 20.93 | 20.38 | 20.93 | 21.15 |
| MoCE-IR | 17.69 | 21.00 | 20.38 | 20.60 | 21.19 |
| VRT | 17.47 | 20.82 | 17.61 | 19.79 | 20.83 |
| FMA-Net | 17.14 | 20.65 | 17.13 | 18.84 | 19.98 |
| VAEMVD | 19.76 | 21.37 | 20.54 | 21.74 | 21.19 |
| GARD (Ours) | 21.89 | 21.88 | 21.25 | 22.67 | 22.19 |

| Method | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 99.4 | 99.7 | 95.2 | 93.7 | 97.5 |
| LQ Input | 78.6 | 90.4 | 94.5 | 81.9 | 89.5 |
| Restormer | 74.1 | 82.2 | 94.6 | 85.0 | 89.1 |
| HI-Diff | 70.2 | 86.3 | 93.8 | 82.2 | 89.2 |
| InstructIR | 71.7 | 81.3 | 95.0 | 83.3 | 83.4 |
| MoCE-IR | 76.8 | 88.3 | 95.6 | 84.5 | 89.6 |
| VRT | 77.9 | 90.2 | 94.7 | 80.5 | 89.8 |
| FMA-Net | 70.8 | 88.0 | 94.1 | 76.8 | 85.5 |
| VAEMVD | 80.2 | 95.4 | 95.9 | 89.0 | 93.2 |
| GARD (Ours) | 97.2 | 98.4 | 96.1 | 92.7 | 96.7 |

💡 GARD achieves the best performance among the compared restoration methods on every dataset for pose estimation, 3D reconstruction, and image restoration, demonstrating that restoring in a geometry-aware feature space preserves both structural fidelity and cross-view consistency.
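One way to read the tables is to ask how much of the gap between the degraded (LQ) and clean (HQ) inputs a method closes. The sketch below applies this to the GARD row of the first results table. The "gap recovered" ratio and the helper name are hypothetical summaries introduced here for illustration, not a metric reported by the authors, and the formula assumes a higher-is-better score.

```python
from typing import Dict, Tuple


def gap_recovered(lq: float, restored: float, hq: float) -> float:
    """Fraction of the LQ-to-HQ gap closed by a restoration method."""
    if hq == lq:
        raise ValueError("HQ and LQ scores are identical; the gap is undefined")
    return (restored - lq) / (hq - lq)


# (LQ Input, GARD, HQ Input) per dataset, copied from the first results table.
first_table: Dict[str, Tuple[float, float, float]] = {
    "HiRoom": (32.90, 67.22, 96.65),
    "ETH3D": (61.38, 74.68, 84.68),
    "DTU": (66.43, 92.37, 98.70),
    "7Scenes": (51.39, 84.73, 86.91),
    "ScanNet++": (71.02, 87.45, 92.95),
}

recovered = {name: gap_recovered(*scores) for name, scores in first_table.items()}

for name, fraction in recovered.items():
    print(f"{name}: {fraction:.1%} of the LQ-to-HQ gap recovered")
```

A value of 0 would mean no improvement over the degraded input and a value of 1 would mean parity with the clean input; for lower-is-better columns such as the DTU overall score, the same ratio applies because numerator and denominator both change sign.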

EVALUATION

Qualitative Results

1. Pose Estimation

2. 3D Reconstruction

3. Image Restoration

4. Depth Estimation

💡 Restoring directly in a geometry-aware feature space preserves both visual fidelity and cross-view consistency, leading to sharper images, more accurate geometry, and improved downstream 3D understanding.

Citation

@misc{kim2026geometryawarerepresentationdenoisingrobust,
  title={Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction},
  author={Jin Hyeon Kim and Jaeeun Lee and Claire Kim and Kyoungjin Oh and Paul Hyunbin Cho and Jaewon Min and Yeji Choi and Jihye Park and Hyunhee Park and Minkyu Park and Seungryong Kim},
  year={2026},
  eprint={2605.26230},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.26230},
}