arXiv 2026
We propose a novel framework for diffusion-based novel view synthesis that leverages external representations as conditions, harnessing their geometric and semantic correspondence properties to improve geometric consistency in generated novel viewpoints. First, we present a detailed analysis of the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we propose ReNoV (Representation-guided Novel View synthesis), which injects external representations into the diffusion process through dedicated representation projection modules. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
Unlike previous methods, we interpret novel view synthesis as a warping-and-inpainting problem, requiring models to excel at two tasks: accurate reconstruction of visible regions and consistent inpainting of occluded regions. This motivates the search for a conditioning representation that simultaneously encodes semantic awareness and geometric correspondence.
Cross-view attention maps of the denoising network. A query pixel (blue dot) is chosen in the warped target view. Inpainting: the wheel is absent in the warped view, so attention shifts to the corresponding wheels in the references. Reconstruction: the suitcase edge is visible, so attention concentrates on the geometrically aligned edges.
We evaluate cross-view similarity to assess geometric correspondence across layers of different visual backbones (VGGT, DA3, DINOv2).
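For concreteness, below is a minimal sketch of such a cross-view similarity probe. It assumes per-layer feature maps of shape (C, H, W) have already been extracted from a backbone (e.g., VGGT, DA3, or DINOv2); the function names and the pixel-error threshold are illustrative, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def cross_view_correspondence(feat_ref, feat_tgt):
    """For every target pixel, return the flat index of the reference pixel
    with the highest cosine similarity. Both inputs are (C, H, W)."""
    C, H, W = feat_tgt.shape
    ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, H*W)
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, H*W)
    sim = tgt.t() @ ref                                  # (H*W, H*W) cosine similarities
    return sim.argmax(dim=1)

def correspondence_accuracy(pred_idx, gt_idx, W, pix_thresh=3):
    """Fraction of predicted matches within `pix_thresh` pixels of ground truth."""
    pred_xy = torch.stack((pred_idx % W, pred_idx // W), dim=1).float()
    gt_xy = torch.stack((gt_idx % W, gt_idx // W), dim=1).float()
    err = (pred_xy - gt_xy).norm(dim=1)
    return (err <= pix_thresh).float().mean().item()
```

A layer whose argmax matches land near the ground-truth locations is treated as geometrically correspondent; a layer whose matches land on semantically similar but spatially wrong regions is not.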
Analysis of visual foundation models. (Left) Geometric correspondence, (Center) Semantic correspondence, & (Right) Local vs. Distant Similarity across feature layers.
Geometric correspondence visualization. Deeper layers of VGGT and DA3-L accurately identify geometrically corresponding locations across views, while DINOv2 often attends to semantically similar but geometrically incorrect regions.
Building on our geometric correspondence analysis, we evaluate the intermediate features' capabilities for reconstruction and inpainting. The optimal feature representation should encapsulate multi-view semantic and geometric information.
Qualitative results for feature analysis. VGGT features consistently synthesize target-view images with more accurate structure and color compared to other representations.
ReNoV Architecture. Given \(N\) reference images, we extract visual features, dense point clouds, and camera poses using an external representation model (e.g., VGGT, DA3, or DINOv2). These components undergo projected representation conditioning, where reference features and point clouds are projected into the target camera frustum to form warped representation and point-map planes. The reference network aggregates these multi-view inputs by passing them as keys and values to the denoising network. Simultaneously, the denoising network receives the projected feature and point cloud planes as direct conditioning, aggregating reference cues to synthesize the novel view image.
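To illustrate this aggregation, the sketch below shows one way multi-view reference tokens can enter a denoising-network attention layer as keys and values. The module and argument names are assumptions for illustration, not the released ReNoV implementation.

```python
import torch
import torch.nn as nn

class ReferenceCrossAttention(nn.Module):
    """Cross-attention where the target-view latent tokens query the
    concatenated tokens of all N reference views."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, denoise_tokens, ref_tokens_per_view):
        # denoise_tokens: (B, L_tgt, dim) latent tokens of the target view
        # ref_tokens_per_view: list of N tensors, each (B, L_ref, dim)
        ref_tokens = torch.cat(ref_tokens_per_view, dim=1)  # aggregate N views
        out, _ = self.attn(query=denoise_tokens, key=ref_tokens, value=ref_tokens)
        return denoise_tokens + out  # residual update of the target tokens
```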
We incorporate a geometry-driven conditioning mechanism based on warping. Specifically, we project the reference pointmaps \(\{P_1, \dots, P_N\}\) and the corresponding external representation features \(\{T_1, \dots, T_N\}\) into the target viewpoint \(\pi_{\text{tgt}}\). These projected signals provide spatial priors that guide the diffusion model toward higher-quality generations. The projected pointmap \(P^{\Pi}_{\text{tgt}}\) serves as a sparse geometric condition.
Given the observed multi-view consistency of geometric external representations, we unproject them into 3D space by anchoring each pixel-level feature to its corresponding 3D coordinate, forming a 3D feature point cloud. This point cloud is then projected into the target view, yielding a spatially aligned warped feature map. The projected features \(T^{\Pi}_{\text{tgt}}\) and the projected pointmap \(P^{\Pi}_{\text{tgt}}\) are provided as input conditions to the denoising network.
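As a concrete illustration, here is a minimal sketch of this projection step, assuming the reference pointmaps are expressed in world coordinates and the target camera is given by intrinsics \(K\) and world-to-camera extrinsics \((R, t)\). The function name, nearest-pixel z-buffer splatting, and tensor layout are our own simplifications rather than the paper's exact implementation.

```python
import torch

def warp_to_target(pointmaps, features, K, R, t, H, W):
    """pointmaps: (N, H, W, 3) world-space 3D points per reference pixel
    features:  (N, H, W, C) external-representation features
    Returns a warped feature plane (H, W, C) and warped pointmap (H, W, 3)."""
    C = features.shape[-1]
    device = pointmaps.device
    pts = pointmaps.reshape(-1, 3)                 # flatten all reference pixels
    feats = features.reshape(-1, C)

    cam = pts @ R.t() + t                          # world -> target camera frame
    z = cam[:, 2]
    valid = z > 1e-4                               # keep points in front of the camera
    uvw = cam[valid] @ K.t()                       # perspective projection
    u = (uvw[:, 0] / uvw[:, 2]).round().long()
    v = (uvw[:, 1] / uvw[:, 2]).round().long()
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v = u[inside], v[inside]
    z_v = z[valid][inside]
    feats_v = feats[valid][inside]
    pts_v = pts[valid][inside]

    # z-buffer: for each target pixel, keep only the closest projected point
    idx = v * W + u
    zbuf = torch.full((H * W,), float("inf"), device=device)
    zbuf = zbuf.scatter_reduce(0, idx, z_v, reduce="amin")
    win = z_v == zbuf[idx]

    warped_feat = torch.zeros(H * W, C, device=device)
    warped_pmap = torch.zeros(H * W, 3, device=device)
    warped_feat[idx[win]] = feats_v[win]
    warped_pmap[idx[win]] = pts_v[win]
    return warped_feat.reshape(H, W, C), warped_pmap.reshape(H, W, 3)
```

Target pixels that receive no projected point remain zero, marking the occluded regions that the denoising network must inpaint.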
Qualitative results on DTU dataset (Zero-shot evaluation).
Qualitative results on RealEstate10K dataset.
| Views | Method | Pose-free | PSNR↑ (Far) | SSIM↑ (Far) | LPIPS↓ (Far) | PSNR↑ (Near) | SSIM↑ (Near) | LPIPS↓ (Near) |
|---|---|---|---|---|---|---|---|---|
| 2-view | PixelSplat | ✗ | 13.03 | 0.486 | 0.414 | 11.57 | 0.330 | 0.634 |
| | MVSplat | ✗ | 12.22 | 0.416 | 0.423 | 13.94 | 0.473 | 0.385 |
| | NoPoSplat | ✓ | 11.43 | 0.335 | 0.599 | 11.44 | 0.357 | 0.576 |
| | FLARE | ✗ | 13.52 | 0.407 | 0.525 | 13.25 | 0.381 | 0.502 |
| | LVSM | ✗ | 15.23 | 0.499 | 0.415 | 15.82 | 0.528 | 0.346 |
| | ReNoV w/ DINOv2 | ✓ | 15.13 | 0.599 | 0.304 | 14.70 | 0.602 | 0.277 |
| | ReNoV w/ VGGT | ✓ | 15.45 | 0.584 | 0.297 | 15.38 | 0.599 | 0.274 |
| | ReNoV w/ DA3 | ✓ | 14.38 | 0.550 | 0.303 | 14.91 | 0.593 | 0.259 |
| 1-view | LucidDreamer | ✓ | 12.96 | 0.248 | 0.385 | 12.09 | 0.481 | 0.419 |
| | GenWarp | ✓ | 8.69 | 0.253 | 0.597 | 9.54 | 0.298 | 0.538 |
| | ViewCrafter | ✓ | 14.04 | 0.390 | 0.332 | 13.59 | 0.382 | 0.486 |
| | ReNoV w/ DINOv2 | ✓ | 15.02 | 0.579 | 0.328 | 14.13 | 0.574 | 0.304 |
| | ReNoV w/ VGGT | ✓ | 14.08 | 0.536 | 0.355 | 13.91 | 0.542 | 0.333 |
| | ReNoV w/ DA3 | ✓ | 14.35 | 0.534 | 0.325 | 14.12 | 0.550 | 0.292 |
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| PixelSplat | 14.01 | 0.582 | 0.384 |
| MVSplat | 12.13 | 0.534 | 0.380 |
| NoPoSplat | 14.36 | 0.538 | 0.389 |
| ReNoV w/ VGGT (Ours) | 17.49 | 0.598 | 0.247 |
| Components | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| (a) Warped image only | 16.55 | 0.559 | 0.260 |
| (b) Warped image + Pointmap | 16.93 | 0.594 | 0.243 |
| (c) Pointmap + VGGT Feature | 17.50 | 0.598 | 0.247 |
| Point cloud removal | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 0% (original) | 12.04 | 0.509 | 0.371 |
| 30% | 11.89 | 0.507 | 0.366 |
| 50% | 11.92 | 0.507 | 0.367 |
Performance remains stable even with significant point cloud degradation.
@article{kwak2026renov,
title={Projected Representation Conditioning for High-fidelity Novel View Synthesis},
author={Kwak, Minseop and Kwon, Minkyung and Choi, Jinhyeok and Park, Jiho and Kim, Seungryong},
journal={arXiv preprint},
year={2026}
}