ReNoV: Projected Representation Conditioning for High-fidelity Novel View Synthesis

arXiv 2026

Minseop Kwak* Minkyung Kwon* Jinhyeok Choi* Jiho Park Seungryong Kim
KAIST AI
*: Equal contribution †: Corresponding author
TL;DR

ReNoV leverages powerful feature representations as prompts for diffusion-based novel view synthesis. By projecting these representations into 3D, we achieve high-fidelity reconstruction and plausible inpainting.

Abstract

We propose a novel framework for diffusion-based novel view synthesis that leverages external representations as conditions, harnessing their geometric and semantic correspondence properties to improve geometric consistency in generated novel viewpoints. First, we provide a detailed analysis of the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we propose representation-guided novel view synthesis, a methodology we name ReNoV (Representation-guided Novel View synthesis), in which dedicated representation projection modules inject external representations into the diffusion process. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.

Motivation & Analysis

Unlike previous methods, we interpret novel view synthesis as a warping-and-inpainting problem, requiring models to excel at two tasks: accurate reconstruction of visible regions and consistent inpainting of occluded regions. This motivates the search for a conditioning representation that simultaneously encodes semantic awareness and geometric correspondence.

Cross-view Attention Analysis

Cross-view attention maps of the denoising network. A query pixel (blue dot) is chosen in the warped target view. Inpainting: the wheel is absent in the warped view, so attention shifts to the corresponding wheels in the references. Reconstruction: the suitcase edge is visible, so attention concentrates on the geometrically aligned edges.
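The sketch below shows one way such a per-query attention map can be extracted for visualization: average the heads, select the row of the chosen query token, and split the key axis back into per-reference-view grids. The `(heads, Q, K)` layout, the helper name, and the normalization are illustrative assumptions, not the paper's implementation.

```python
import torch

def cross_view_attention_map(attn, query_xy, latent_hw, n_refs):
    """Illustrative extraction of a cross-view attention map for one query pixel.

    attn:      (heads, Q, K) attention weights; Q indexes warped target-view
               tokens, K indexes the tokens of n_refs reference views
               concatenated together (an assumed layout)
    query_xy:  (x, y) token coordinate of the probed query pixel (the blue dot)
    latent_hw: (h, w) token-grid size of a single view
    """
    h, w = latent_hw
    x, y = query_xy
    q_idx = y * w + x                                # flatten (x, y) into a token index
    weights = attn.mean(dim=0)[q_idx]                # average over heads -> (K,)
    per_ref = weights.reshape(n_refs, h, w)          # split keys back into per-view grids
    return [m / (m.max() + 1e-8) for m in per_ref]   # normalized heatmaps for display
```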

Correspondence Capability Probing

We evaluate cross-view similarity to assess geometric correspondence across layers of different visual backbones (VGGT, DA3, DINOv2).


Analysis of visual foundation models. (Left) Geometric correspondence, (Center) Semantic correspondence, & (Right) Local vs. Distant Similarity across feature layers.
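As one concrete way to quantify this layer-wise probing, the sketch below scores how often a query pixel's cosine nearest neighbour in the other view lands near its ground-truth correspondence. The function name, matching radius, and hit-rate metric are assumptions for illustration, not the exact protocol used in the paper.

```python
import torch
import torch.nn.functional as F

def correspondence_score(feat_a, feat_b, matches, radius=2):
    """Probe how well one backbone layer localizes ground-truth correspondences.

    feat_a, feat_b: (C, H, W) features of two views extracted from a single layer
    matches:        (M, 4) ground-truth pixel pairs [ya, xa, yb, xb]
    Returns the fraction of queries whose cosine nearest neighbour in view B
    falls within `radius` pixels of the true match.
    """
    C, H, W = feat_b.shape
    fb = F.normalize(feat_b.reshape(C, -1), dim=0)      # (C, H*W) unit descriptors
    hits = 0
    for ya, xa, yb, xb in matches.long():
        q = F.normalize(feat_a[:, ya, xa], dim=0)       # (C,) query descriptor
        sim = q @ fb                                     # cosine similarity to all of view B
        best = int(sim.argmax())
        by, bx = divmod(best, W)
        hits += int(abs(by - int(yb)) + abs(bx - int(xb)) <= radius)
    return hits / len(matches)
```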

Geometric Correspondence Visualization

Geometric correspondence visualization. Deeper layers of VGGT and DA3-L accurately identify geometrically corresponding locations across views, while DINOv2 often attends to semantically similar but geometrically incorrect regions.

Reconstruction Capability Probing

Building on our geometric correspondence analysis, we evaluate the intermediate features' capabilities for reconstruction and inpainting. The optimal feature representation should encapsulate multi-view semantic and geometric information.

Reconstruction Results

Qualitative results for feature analysis. VGGT features consistently synthesize target-view images with more accurate structure and color compared to other representations.

Method

ReNoV Method Overview

ReNoV Architecture. Given \(N\) reference images, we extract visual features, dense point clouds, and camera poses using an external representation model (e.g., VGGT, DA3, or DINOv2). These components undergo projected representation conditioning, where reference features and point clouds are projected into the target camera frustum to form warped representation and pointmap planes. The reference network aggregates these multi-view inputs by passing them as keys and values to the denoising network. Simultaneously, the denoising network receives the projected feature and pointmap planes as direct conditioning, aggregating reference cues to synthesize the novel view image.
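A minimal PyTorch sketch of this aggregation step is shown below, assuming the reference tokens are injected through cross-attention as keys and values; the block structure, names, and hyperparameters are illustrative rather than the actual ReNoV layers.

```python
import torch
import torch.nn as nn

class ReferenceConditionedBlock(nn.Module):
    """Minimal sketch of the reference aggregation described above.

    Target-view tokens of the denoising network attend to reference-network
    tokens via cross-attention, with the references supplying keys and values.
    The projected feature / pointmap planes are assumed to be injected earlier
    (e.g. concatenated with the network input), which is not shown here.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_tokens, reference_tokens):
        # target_tokens:    (B, Q, dim) tokens of the denoising network
        # reference_tokens: (B, N*K, dim) tokens from the reference network (N views)
        q = self.norm_q(target_tokens)
        kv = self.norm_kv(reference_tokens)
        out, _ = self.attn(q, kv, kv)       # references act as keys and values
        return target_tokens + out          # residual aggregation of reference cues
```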

Projected Representation Conditioning

We incorporate a geometry-driven conditioning mechanism based on warping. Specifically, we project the reference pointmaps \(\{P_1, \dots, P_N\}\) and the corresponding external representation features \(\{T_1, \dots, T_N\}\) into the target viewpoint \(\pi_{\text{tgt}}\). These projected signals provide spatial priors that guide the diffusion model toward higher-quality generation results. The projected pointmap \(\mathcal{P}^{\Pi}_{\text{tgt}}\) serves as a sparse geometric condition.

Given the observed multi-view-consistent nature of geometric external representations, we unproject them into 3D space by anchoring each pixel-level feature to its corresponding 3D coordinate, forming a 3D feature point cloud. This point cloud is then projected into the target view, yielding a spatially aligned warped feature map. The projected features \(T_{\text{tgt}}^{\Pi}\) and projected pointmap \(\mathcal{P}_{\text{tgt}}^{\Pi}\) are provided as input conditions to the denoising network.
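Below is a minimal NumPy sketch of this warping for a single reference view, producing the warped feature plane \(T_{\text{tgt}}^{\Pi}\) and projected pointmap \(\mathcal{P}_{\text{tgt}}^{\Pi}\). It assumes a pinhole target camera and simple nearest-point (z-buffered) splatting; the names and the splatting choice are illustrative rather than the exact ReNoV implementation, and aggregation over multiple references is not shown.

```python
import numpy as np

def project_representation(pointmap, features, K_tgt, R_tgt, t_tgt, out_hw):
    """Sketch of projected representation conditioning for one reference view.

    pointmap: (H, W, 3) per-pixel 3D points of the reference view (world frame)
    features: (H, W, C) external representation features for the same pixels
    K_tgt:    (3, 3) target intrinsics; R_tgt, t_tgt: world-to-camera extrinsics
    Returns the warped feature plane and the projected pointmap in the target view.
    """
    H_out, W_out = out_hw
    C = features.shape[-1]
    pts = pointmap.reshape(-1, 3)                    # 3D anchors of every feature
    feats = features.reshape(-1, C)

    # Move the anchored features into the target camera frame, then apply the
    # pinhole projection to find where each one lands in the target image.
    cam = pts @ R_tgt.T + t_tgt
    z = cam[:, 2]
    keep = z > 1e-6                                  # discard points behind the camera
    cam, z, pts, feats = cam[keep], z[keep], pts[keep], feats[keep]
    uv = cam @ K_tgt.T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    keep = (u >= 0) & (u < W_out) & (v >= 0) & (v < H_out)
    u, v, z, pts, feats = u[keep], v[keep], z[keep], pts[keep], feats[keep]

    warped_feat = np.zeros((H_out, W_out, C), dtype=features.dtype)
    warped_pts = np.zeros((H_out, W_out, 3), dtype=pointmap.dtype)

    # Splat far-to-near so the nearest 3D point wins wherever several points
    # land on the same target pixel (a simple z-buffer stand-in).
    order = np.argsort(-z)
    warped_feat[v[order], u[order]] = feats[order]
    warped_pts[v[order], u[order]] = pts[order]
    return warped_feat, warped_pts
```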

Qualitative Results

DTU (Zero-shot) Comparison

DTU Results

Qualitative results on DTU dataset (Zero-shot evaluation).

RealEstate10K Comparison

RealEstate10K Results

Qualitative results on RealEstate10K dataset.

Quantitative Results

Zero-shot Evaluation on DTU

| Views | Method | PSNR↑ (Far) | SSIM↑ (Far) | LPIPS↓ (Far) | PSNR↑ (Near) | SSIM↑ (Near) | LPIPS↓ (Near) |
|---|---|---|---|---|---|---|---|
| 2-view | PixelSplat | 13.03 | 0.486 | 0.414 | 11.57 | 0.330 | 0.634 |
| 2-view | MVSplat | 12.22 | 0.416 | 0.423 | 13.94 | 0.473 | 0.385 |
| 2-view | NoPoSplat | 11.43 | 0.335 | 0.599 | 11.44 | 0.357 | 0.576 |
| 2-view | FLARE | 13.52 | 0.407 | 0.525 | 13.25 | 0.381 | 0.502 |
| 2-view | LVSM | 15.23 | 0.499 | 0.415 | 15.82 | 0.528 | 0.346 |
| 2-view | ReNoV w/ DINOv2 | 15.13 | 0.599 | 0.304 | 14.70 | 0.602 | 0.277 |
| 2-view | ReNoV w/ VGGT | 15.45 | 0.584 | 0.297 | 15.38 | 0.599 | 0.274 |
| 2-view | ReNoV w/ DA3 | 14.38 | 0.550 | 0.303 | 14.91 | 0.593 | 0.259 |
| 1-view | LucidDreamer | 12.96 | 0.248 | 0.385 | 12.09 | 0.481 | 0.419 |
| 1-view | GenWarp | 8.69 | 0.253 | 0.597 | 9.54 | 0.298 | 0.538 |
| 1-view | ViewCrafter | 14.04 | 0.390 | 0.332 | 13.59 | 0.382 | 0.486 |
| 1-view | ReNoV w/ DINOv2 | 15.02 | 0.579 | 0.328 | 14.13 | 0.574 | 0.304 |
| 1-view | ReNoV w/ VGGT | 14.08 | 0.536 | 0.355 | 13.91 | 0.542 | 0.333 |
| 1-view | ReNoV w/ DA3 | 14.35 | 0.534 | 0.325 | 14.12 | 0.550 | 0.292 |

In-domain Evaluation on RealEstate10K

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| PixelSplat | 14.01 | 0.582 | 0.384 |
| MVSplat | 12.13 | 0.534 | 0.380 |
| NoPoSplat | 14.36 | 0.538 | 0.389 |
| ReNoV w/ VGGT (Ours) | 17.49 | 0.598 | 0.247 |

Ablation Study

| Components | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| (a) Warped image only | 16.55 | 0.559 | 0.260 |
| (b) Warped image + Pointmap | 16.93 | 0.594 | 0.243 |
| (c) Pointmap + VGGT Feature | 17.50 | 0.598 | 0.247 |

Robustness to Degraded Geometry

| Removal % | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 0% (Original) | 12.04 | 0.509 | 0.371 |
| 30% removal | 11.89 | 0.507 | 0.366 |
| 50% removal | 11.92 | 0.507 | 0.367 |

Performance remains stable even with significant point cloud degradation.

Citation

```
@article{kwak2026renov,
  title={Projected Representation Conditioning for High-fidelity Novel View Synthesis},
  author={Kwak, Minseop and Kwon, Minkyung and Choi, Jinhyeok and Park, Jiho and Kim, Seungryong},
  journal={arXiv preprint},
  year={2026}
}
```