ARXIV 2026

Geometric Latent Diffusion: Repurposing Geometric Foundation Models for Multi-view Diffusion

Wooseok Jang¹, Seonghu Jeon¹, Jisang Han¹, Jinhyeok Choi¹, Minkyung Kwon¹, Seungryong Kim¹, Saining Xie², Sainan Liu³

¹KAIST AI  ²New York University  ³Intel Labs
Paper · arXiv · Code
Figure 1 — Comparison of VAE latent diffusion vs Geometric Latent Diffusion (GLD), with training efficiency plot showing 4.4× faster convergence.
GLD diffuses in geometric feature space, enabling both RGB and geometry decoding with 4.4× faster convergence.
TL;DR

We propose Geometric Latent Diffusion (GLD), a framework that repurposes the feature space of geometric foundation models (e.g., Depth Anything 3) as the latent space for multi-view diffusion. GLD achieves state-of-the-art novel view synthesis with 4.4× faster training convergence than VAE-based approaches — all trained from scratch without any text-to-image pretraining.

Geometric Feature Space as Diffusion Latent

1

Geometry as latent space. The first framework to repurpose a geometric foundation model's feature space as the diffusion latent space: DA3 features encode both high-fidelity RGB (35.41 dB PSNR) and strong cross-view correspondences ($\text{PCK} = 35.98$, surpassing DINOv2's $31.64$).

2

State-of-the-art NVS from scratch. Outperforms all baselines on in-domain benchmarks (Re10K, DL3DV) in both 2D and 3D metrics, and remains competitive out-of-domain — without any text-to-image pretraining.

3

Zero-shot geometry for free. Depth maps and 3D point clouds are decoded directly from synthesized latents via the frozen DA3 decoder, with no additional training required.

Method

A Three-Stage Pipeline

Stage 1 — Can DA3 features reconstruct images?

We first validate that the frozen DA3 encoder's feature space is expressive enough for high-fidelity image generation.

1

RGB decoder training. A ViT-based decoder $D_\text{rgb}$ is trained to reconstruct RGB images from DA3's multi-level features $\{F_l\}$ ($L = 4$ levels), extracted by the frozen encoder.

2

Level-wise dropout. Individual feature levels are randomly masked during training, forcing the decoder to reconstruct from partial inputs and improving robustness.

3

Result: 35.41 dB PSNR on Re10K (4,000 test samples) — competitive with the Stable Diffusion VAE (34.53 dB) and SDXL VAE (34.97 dB), confirming the feature space is suitable for diffusion.
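The level-wise dropout in step 2 can be sketched as follows. This is a minimal illustration under assumed shapes, not the paper's implementation; `levelwise_dropout` and the toy feature shapes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def levelwise_dropout(features, p_drop=0.2, rng=rng):
    """Randomly zero out entire feature levels during decoder training.

    features: list of L arrays, one per encoder level (hypothetical shapes).
    Always keeps at least one level so the decoder has some input.
    """
    L = len(features)
    keep = rng.random(L) >= p_drop           # per-level keep mask
    if not keep.any():                       # never drop every level
        keep[rng.integers(L)] = True
    return [f if k else np.zeros_like(f) for f, k in zip(features, keep)], keep

# toy multi-level features: 4 levels of (tokens, dim)
feats = [np.ones((16, 8)) * (l + 1) for l in range(4)]
masked, keep = levelwise_dropout(feats, p_drop=0.5)
```

Dropping whole levels (rather than individual channels) forces the decoder to reconstruct from partial multi-level input, which is what makes the Stage 2 setup, where only levels up to the boundary are synthesized, feasible.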

Stage 2 — Optimal boundary layer selection

Diffusing all four feature levels is computationally prohibitive, so we identify an optimal boundary layer k: features up to level k are explicitly synthesized by the diffusion model, while deeper features are deterministically derived by propagating through the frozen DA3 encoder.

All models adopt the DiT architecture with flow-matching training, 3D self-attention with RoPE for cross-view reasoning, and Plücker ray conditioning to ensure geometric consistency.
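The Plücker ray conditioning mentioned above can be made concrete with a small sketch. This is a generic construction of per-pixel Plücker coordinates $(\mathbf{d}, \mathbf{o} \times \mathbf{d})$, not the paper's exact conditioning code; the function name and shapes are assumptions.

```python
import numpy as np

def plucker_rays(origins, directions):
    """Per-pixel Plücker coordinates (d, o x d) for camera rays.

    origins:    (..., 3) ray origins (camera center, broadcast per pixel)
    directions: (..., 3) ray directions (need not be normalized)
    The 6-channel embedding is invariant to sliding the origin along
    the ray, so it identifies the ray itself rather than a point on it.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    m = np.cross(origins, d)                 # moment vector o x d
    return np.concatenate([d, m], axis=-1)   # (..., 6)

# two origins on the same ray yield the same Plücker embedding
o1 = np.array([0.0, 0.0, 0.0])
d = np.array([0.0, 0.0, 2.0])
o2 = o1 + 3.0 * d                            # slide the origin along the ray
r1, r2 = plucker_rays(o1, d), plucker_rays(o2, d)
```

Because the embedding depends only on the ray, it gives the diffusion model a camera-pose signal that is consistent across views observing the same scene geometry.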

Stage 3 — Cascaded feature generation

With the boundary set at $k = 1$, two multi-view diffusion models generate the explicit feature levels in a cascaded manner.

1

Cascaded synthesis. $\mathcal{M}_0$ synthesizes level-1 features; $\mathcal{M}_{1 \to 0}$ then generates level-0 features conditioned on the synthesized level-1 output, ensuring cross-level alignment.

2

Feature propagation. Deeper features (levels 2–3) are deterministically derived by passing synthesized level-1 features through the frozen DA3 encoder — no additional diffusion needed.

3

RGB & geometry decoding. The full multi-level feature set is decoded into target views via $D_\text{rgb}$, and into depth maps and camera poses via the frozen DA3 geometric decoder $D_\text{geo}$, zero-shot and without any fine-tuning.
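The three steps above amount to a short inference pipeline, sketched below with stub components. Every function here is a placeholder standing in for a trained model (the two diffusion models, the frozen DA3 encoder tail, and the two decoders); shapes are illustrative only.

```python
import numpy as np

# Stubs standing in for trained components; all names and shapes are hypothetical.
def diffuse_level1(rays):     return np.zeros((rays.shape[0], 32, 64))   # M_0
def diffuse_level0(f1, rays): return np.zeros((rays.shape[0], 32, 64))   # M_{1->0}
def encoder_from_level1(f1):  return [f1 * 0.5, f1 * 0.25]  # frozen DA3 levels 2-3
def decode_rgb(levels):       return np.zeros((levels[0].shape[0], 64, 64, 3))
def decode_geometry(levels):  return np.zeros((levels[0].shape[0], 64, 64))  # depth

def generate_views(rays):
    """Cascaded GLD-style inference sketch for a batch of target views."""
    f1 = diffuse_level1(rays)             # 1. synthesize level-1 latents
    f0 = diffuse_level0(f1, rays)         # 1. level-0, conditioned on level-1
    f2, f3 = encoder_from_level1(f1)      # 2. propagate deeper levels, no diffusion
    levels = [f0, f1, f2, f3]
    return decode_rgb(levels), decode_geometry(levels)   # 3. RGB + geometry

rays = np.zeros((2, 64, 64, 6))           # Plücker ray maps for 2 target views
rgb, depth = generate_views(rays)
```

The key structural point the sketch captures is that only two of the four levels are ever diffused; the rest come for free from the frozen encoder, and geometry decoding reuses the frozen $D_\text{geo}$ unchanged.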

RESULTS

Quantitative Results

Figure — Radar charts of per-benchmark normalized scores (PSNR, SSIM, LPIPS, ATE, RPE_r, RPE_t, Reproj, MEt3R) for MVGenMaster, DINO, VAE, GLD (DA3), and GLD (VGGT); raw values appear in the tables below.
| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 14.56 | 0.442 | 0.460 | 0.281 | 6.93 | 0.592 | 0.637 | 0.375 |
| Matrix3D | ✗ | 13.33 | 0.396 | 0.451 | 0.459 | 9.65 | 0.850 | 0.667 | 0.394 |
| CAMEO | ✗ | 12.32 | 0.371 | 0.567 | 1.143 | 24.76 | 2.149 | 0.706 | 0.404 |
| NVComposer | ✗ | 10.51 | 0.273 | 0.646 | 1.810 | 55.59 | 3.098 | 0.852 | 0.517 |
| CAT3D† | ✗ | 11.82 | 0.335 | 0.594 | 1.346 | 31.73 | 2.473 | 0.746 | 0.435 |
| DINO | ✓ | 14.34 | 0.411 | 0.471 | 0.546 | 13.12 | 1.050 | 0.708 | 0.410 |
| VAE | ✓ | 14.72 | 0.446 | 0.476 | 0.589 | 15.00 | 1.116 | 0.674 | 0.407 |
| GLD (VGGT) | ✓ | 15.25 | 0.434 | 0.436 | 0.188 | 5.23 | 0.426 | 0.634 | 0.386 |
| GLD (DA3) | ✓ | 15.49 | 0.468 | 0.438 | 0.209 | 5.75 | 0.466 | 0.612 | 0.378 |

| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 15.22 | 0.588 | 0.456 | 0.282 | 6.42 | 0.526 | 0.664 | 0.339 |
| Matrix3D | ✗ | 14.49 | 0.580 | 0.448 | 0.413 | 8.93 | 0.638 | 0.666 | 0.344 |
| CAMEO | ✗ | 13.80 | 0.561 | 0.522 | 0.446 | 12.93 | 0.790 | 0.661 | 0.344 |
| NVComposer | ✗ | 11.14 | 0.418 | 0.649 | 0.829 | 42.85 | 1.435 | 0.860 | 0.457 |
| CAT3D† | ✗ | 13.35 | 0.527 | 0.561 | 0.496 | 17.49 | 0.941 | 0.719 | 0.361 |
| DINO | ✓ | 15.63 | 0.601 | 0.448 | 0.345 | 15.59 | 0.719 | 0.721 | 0.319 |
| VAE | ✓ | 15.65 | 0.606 | 0.456 | 0.278 | 8.68 | 0.552 | 0.681 | 0.375 |
| GLD (DA3) | ✓ | 16.36 | 0.630 | 0.431 | 0.211 | 7.07 | 0.444 | 0.673 | 0.328 |
| GLD (VGGT) | ✓ | 16.17 | 0.596 | 0.429 | 0.216 | 7.17 | 0.440 | 0.666 | 0.325 |

| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 14.17 | 0.304 | 0.511 | 0.320 | 10.92 | 0.587 | 0.676 | 0.402 |
| Matrix3D | ✗ | 13.97 | 0.284 | 0.483 | 0.548 | 13.63 | 0.948 | 0.646 | 0.422 |
| CAMEO | ✗ | 11.90 | 0.250 | 0.629 | 1.623 | 48.75 | 3.008 | 0.684 | 0.395 |
| NVComposer | ✗ | 12.52 | 0.217 | 0.637 | 1.622 | 54.26 | 2.703 | 0.767 | 0.526 |
| CAT3D† | ✗ | 11.31 | 0.214 | 0.653 | 1.722 | 54.65 | 3.171 | 0.724 | 0.453 |
| DINO | ✓ | 13.71 | 0.267 | 0.542 | 0.949 | 27.57 | 1.720 | 0.707 | 0.444 |
| VAE | ✓ | 13.94 | 0.274 | 0.548 | 1.221 | 35.34 | 2.200 | 0.674 | 0.449 |
| GLD (DA3) | ✓ | 14.54 | 0.288 | 0.504 | 0.589 | 15.97 | 1.071 | 0.630 | 0.406 |
| GLD (VGGT) | ✓ | 13.57 | 0.265 | 0.529 | 0.596 | 16.58 | 1.190 | 0.654 | 0.394 |

Qualitative Results

Figure 4 — Qualitative comparisons on RealEstate10K, Mip-NeRF 360, and DL3DV.

3D Reconstruction

💡

Depth maps and 3D point clouds are decoded directly from synthesized latents via the frozen DA3 geometric decoder $D_\text{geo}$ without any additional training or fine-tuning.

ABSTRACT

The latent space of diffusion models fundamentally determines their learning efficiency and generation quality. While recent advances in latent-space design have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, yet existing approaches typically operate in a view-independent latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the feature space of a geometric foundation model as the latent space for multi-view diffusion. We show that the features of the geometric foundation model not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. In experiments, GLD outperforms both the VAE and RAE on 2D image quality and 3D consistency metrics, accelerating training by more than $\mathbf{4.4\times}$ compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

CITATION
@article{jang2026repurposing,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Wooseok Jang and Seonghu Jeon and Jisang Han and Jinhyeok Choi and Minkyung Kwon and Seungryong Kim and Saining Xie and Sainan Liu},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}
Built with Project Page Builder by Seonghu Jeon. Page template design inspired by Nerfies, CAMEO, and ViRAL. Thanks to the original authors.