Geometric Latent Diffusion: Repurposing Geometric Foundation Models for Multi-view Diffusion
We propose Geometric Latent Diffusion (GLD), a framework that repurposes the feature space of geometric foundation models (e.g., Depth Anything 3) as the latent space for multi-view diffusion. GLD achieves state-of-the-art novel view synthesis with 4.4× faster training convergence than VAE-based approaches — all trained from scratch without any text-to-image pretraining.
Geometric Feature Space as Diffusion Latent
Geometry as latent space. To our knowledge, the first framework to repurpose a geometric foundation model's feature space as the diffusion latent space: DA3 features encode both high-fidelity RGB (35.41 dB PSNR) and strong cross-view correspondences, surpassing DINOv2.
State-of-the-art NVS from scratch. Outperforms all baselines on in-domain benchmarks (Re10K, DL3DV) in both 2D and 3D metrics, and remains competitive out-of-domain — without any text-to-image pretraining.
Zero-shot geometry for free. Depth maps and 3D point clouds are decoded directly from synthesized latents via the frozen DA3 decoder, with no additional training required.
A Three-Stage Pipeline
Stage 1 — Can DA3 features reconstruct images?
We first validate that the frozen DA3 encoder's feature space is expressive enough for high-fidelity image generation.


RGB decoder training. A ViT-based decoder is trained to reconstruct RGB images from DA3's multi-level features (four levels), extracted by the frozen encoder.
Level-wise dropout. Individual feature levels are randomly masked during training, forcing the decoder to reconstruct from partial inputs and improving robustness.
Result: 35.41 dB PSNR on Re10K (4,000 test samples) — competitive with the Stable Diffusion VAE (34.53 dB) and SDXL VAE (34.97 dB), confirming the feature space is suitable for diffusion.
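The level-wise dropout step can be illustrated with a minimal NumPy sketch. The function name, the drop probability, and the keep-at-least-one-level guard are our assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def levelwise_dropout(features, p=0.3, rng=None):
    """Randomly mask entire feature levels so the RGB decoder must
    reconstruct from partial inputs. `p` is a hypothetical drop rate."""
    rng = rng or np.random.default_rng()
    # Drop each level independently with probability p.
    kept = [np.zeros_like(f) if rng.random() < p else f for f in features]
    # Guard (our assumption): never hand the decoder an all-zero input.
    if all(not k.any() for k in kept):
        kept[0] = features[0]
    return kept

# Toy multi-level features: four levels with different channel widths.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 8, c)) + 1.0 for c in (32, 64, 128, 256)]
masked = levelwise_dropout(feats, p=0.5, rng=rng)
```

Because whole levels are zeroed rather than individual activations, the decoder cannot rely on any single level being present, which is what makes it robust to partial inputs.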
Stage 2 — Optimal boundary layer selection
Diffusing all four feature levels is computationally prohibitive, so we identify an optimal boundary layer k: features up to level k are explicitly synthesized by the diffusion model, while deeper features are deterministically derived by propagating through the frozen DA3 encoder.
All models adopt the DiT architecture with flow-matching training, 3D self-attention with RoPE for cross-view reasoning, and Plücker ray conditioning to ensure geometric consistency.
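Plücker ray conditioning encodes each pixel's viewing ray as a 6-vector (unit direction d and moment o × d), a representation that is independent of which point along the ray is chosen. A minimal NumPy sketch, with our own function name and conventions (the paper's exact parameterization may differ):

```python
import numpy as np

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plücker ray map of shape (h, w, 6).
    K: 3x3 intrinsics; c2w: 4x4 camera-to-world pose."""
    # Pixel centers in homogeneous image coordinates.
    j, i = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([i, j, np.ones_like(i)], axis=-1)        # (h, w, 3)
    # Unproject to camera space, then rotate into world space.
    dirs = pix @ np.linalg.inv(K).T @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)    # unit directions d
    o = np.broadcast_to(c2w[:3, 3], dirs.shape)             # camera center
    m = np.cross(o, dirs)                                   # moment o x d
    return np.concatenate([dirs, m], axis=-1)

K = np.array([[64.0, 0, 32], [0, 64.0, 32], [0, 0, 1]])
rays = plucker_rays(K, np.eye(4), 64, 64)  # identity pose: all moments are zero
```

Concatenating this 6-channel map to the latent tokens gives the diffusion model an explicit, per-pixel description of the target camera, which is what enables geometrically consistent multi-view generation.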

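The flow-matching objective used to train these models can be sketched in its rectified-flow form with a linear interpolant; the concrete schedule and weighting in the paper may differ:

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Regress the constant velocity x1 - x0 along the straight path
    x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1."""
    x0 = rng.standard_normal(x1.shape)                     # noise endpoint
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))  # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * x1                             # linear interpolant
    v_target = x1 - x0                                     # ground-truth velocity
    v_pred = model(xt, t)                                  # network's velocity estimate
    return float(np.mean((v_pred - v_target) ** 2))

# Toy check with a dummy "network" that predicts zero velocity.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 16))  # a toy batch of latent features
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```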
Stage 3 — Cascaded feature generation
With the boundary set at k = 1, two multi-view diffusion models generate the explicit feature levels in a cascaded manner.
Cascaded synthesis. The first diffusion model synthesizes level-1 features; the second then generates level-0 features conditioned on the synthesized level-1 output, ensuring cross-level alignment.
Feature propagation. Deeper features (levels 2–3) are deterministically derived by passing synthesized level-1 features through the frozen DA3 encoder — no additional diffusion needed.
RGB & geometry decoding. The full multi-level feature set is decoded into target views via the RGB decoder trained in Stage 1, and into depth maps and camera poses via the frozen DA3 geometric decoder, zero-shot, without any fine-tuning.
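Putting the Stage 3 steps together, the inference cascade can be written as a plain function over the trained components. All names below are our placeholders standing in for the two diffusion samplers, the frozen DA3 encoder, and the two decoders:

```python
def cascaded_generate(sample_l1, sample_l0, propagate, decode_rgb, decode_geo, rays):
    """Stage-3 cascade: two explicit levels from diffusion, deeper levels
    from the frozen encoder, then zero-shot RGB and geometry decoding."""
    f1 = sample_l1(rays)               # diffusion model 1: level-1 features
    f0 = sample_l0(rays, cond=f1)      # diffusion model 2: level-0, conditioned on f1
    f2, f3 = propagate(f1)             # frozen DA3 encoder derives levels 2-3
    feats = [f0, f1, f2, f3]
    rgb = decode_rgb(feats)            # Stage-1 RGB decoder
    depth, poses = decode_geo(feats)   # frozen DA3 geometric decoder (zero-shot)
    return rgb, depth, poses

# Wiring check with trivial stubs in place of the real components.
rgb, depth, poses = cascaded_generate(
    sample_l1=lambda rays: "f1",
    sample_l0=lambda rays, cond: "f0",
    propagate=lambda f1: ("f2", "f3"),
    decode_rgb=lambda feats: "rgb",
    decode_geo=lambda feats: ("depth", "poses"),
    rays=None,
)
```

Note that only the two samplers involve diffusion; the propagation and both decoding steps are single deterministic forward passes through frozen networks.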
Quantitative Results
| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 14.56 | 0.442 | 0.460 | 0.281 | 6.93 | 0.592 | 0.637 | 0.375 |
| Matrix3D | ✗ | 13.33 | 0.396 | 0.451 | 0.459 | 9.65 | 0.850 | 0.667 | 0.394 |
| CAMEO | ✗ | 12.32 | 0.371 | 0.567 | 1.143 | 24.76 | 2.149 | 0.706 | 0.404 |
| NVComposer | ✗ | 10.51 | 0.273 | 0.646 | 1.810 | 55.59 | 3.098 | 0.852 | 0.517 |
| CAT3D† | ✗ | 11.82 | 0.335 | 0.594 | 1.346 | 31.73 | 2.473 | 0.746 | 0.435 |
| DINO | ✓ | 14.34 | 0.411 | 0.471 | 0.546 | 13.12 | 1.050 | 0.708 | 0.410 |
| VAE | ✓ | 14.72 | 0.446 | 0.476 | 0.589 | 15.00 | 1.116 | 0.674 | 0.407 |
| GLD (VGGT) | ✓ | 15.25 | 0.434 | 0.436 | 0.188 | 5.23 | 0.426 | 0.634 | 0.386 |
| GLD (DA3) | ✓ | 15.49 | 0.468 | 0.438 | 0.209 | 5.75 | 0.466 | 0.612 | 0.378 |
| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 15.22 | 0.588 | 0.456 | 0.282 | 6.42 | 0.526 | 0.664 | 0.339 |
| Matrix3D | ✗ | 14.49 | 0.580 | 0.448 | 0.413 | 8.93 | 0.638 | 0.666 | 0.344 |
| CAMEO | ✗ | 13.80 | 0.561 | 0.522 | 0.446 | 12.93 | 0.790 | 0.661 | 0.344 |
| NVComposer | ✗ | 11.14 | 0.418 | 0.649 | 0.829 | 42.85 | 1.435 | 0.860 | 0.457 |
| CAT3D† | ✗ | 13.35 | 0.527 | 0.561 | 0.496 | 17.49 | 0.941 | 0.719 | 0.361 |
| DINO | ✓ | 15.63 | 0.601 | 0.448 | 0.345 | 15.59 | 0.719 | 0.721 | 0.319 |
| VAE | ✓ | 15.65 | 0.606 | 0.456 | 0.278 | 8.68 | 0.552 | 0.681 | 0.375 |
| GLD (DA3) | ✓ | 16.36 | 0.630 | 0.431 | 0.211 | 7.07 | 0.444 | 0.673 | 0.328 |
| GLD (VGGT) | ✓ | 16.17 | 0.596 | 0.429 | 0.216 | 7.17 | 0.440 | 0.666 | 0.325 |
| Method | Scratch | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE ↓ | RPEr ↓ | RPEt ↓ | Reproj ↓ | MEt3R ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVGenMaster | ✗ | 14.17 | 0.304 | 0.511 | 0.320 | 10.92 | 0.587 | 0.676 | 0.402 |
| Matrix3D | ✗ | 13.97 | 0.284 | 0.483 | 0.548 | 13.63 | 0.948 | 0.646 | 0.422 |
| CAMEO | ✗ | 11.90 | 0.250 | 0.629 | 1.623 | 48.75 | 3.008 | 0.684 | 0.395 |
| NVComposer | ✗ | 12.52 | 0.217 | 0.637 | 1.622 | 54.26 | 2.703 | 0.767 | 0.526 |
| CAT3D† | ✗ | 11.31 | 0.214 | 0.653 | 1.722 | 54.65 | 3.171 | 0.724 | 0.453 |
| DINO | ✓ | 13.71 | 0.267 | 0.542 | 0.949 | 27.57 | 1.720 | 0.707 | 0.444 |
| VAE | ✓ | 13.94 | 0.274 | 0.548 | 1.221 | 35.34 | 2.200 | 0.674 | 0.449 |
| GLD (DA3) | ✓ | 14.54 | 0.288 | 0.504 | 0.589 | 15.97 | 1.071 | 0.630 | 0.406 |
| GLD (VGGT) | ✓ | 13.57 | 0.265 | 0.529 | 0.596 | 16.58 | 1.190 | 0.654 | 0.394 |
Qualitative Results

3D Reconstruction
Depth maps and 3D point clouds are decoded directly from synthesized latents via the frozen DA3 geometric decoder without any additional training or fine-tuning.

The latent space of diffusion models fundamentally determines their learning efficiency and generation quality. While recent advances in latent-space design have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, yet existing approaches typically operate in a view-independent latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the feature space of a geometric foundation model as the latent space for multi-view diffusion. We show that the features of the geometric foundation model not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. In experiments, GLD outperforms both VAE and RAE latents on 2D image quality and 3D consistency metrics, accelerating training by 4.4× compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
@article{jang2026repurposing,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Wooseok Jang and Seonghu Jeon and Jisang Han and Jinhyeok Choi and Minkyung Kwon and Seungryong Kim and Saining Xie and Sainan Liu},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}