We introduce an aligned novel view image and geometry synthesis model, capable of generating novel view images and corresponding geometries at arbitrary target poses from unposed reference images, requiring as little as a single reference image. Our approach leverages cross-modal attention distillation, enabling the spatial attention maps of an image denoising diffusion model to guide the geometry generation diffusion process.


Aligned Novel View Geometry Generation

We present qualitative results: our model performs feed-forward novel view synthesis while generating aligned geometry in a robust and consistent manner, flexibly handling varying numbers of reference images at arbitrary target poses.

Ablation

Qualitative ablations across Co3D scenes show progressive improvements: the naive baseline lacks spatial and geometric consistency; pointmap conditioning enhances depth awareness; mesh-based proximity conditioning adds geometric fidelity; and cross-modal attention distillation yields the best alignment and realism through synergistic image-geometry shared attention.

Architecture


Our method performs cross-modal attention distillation, replacing the spatial attention maps of the geometry denoising U-Net with those of the image denoising U-Net, so that the image generation U-Net learns a more robust representation aligned with the geometry completion task. In turn, the geometry prediction network leverages the rich semantics of the image features to enhance its geometry completion capability.
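As an illustration, the sketch below shows one way such attention-map sharing could be wired in a simplified single-head attention layer; the function names, tensor shapes, and the single-head setup are assumptions for exposition, not the actual implementation.

```python
# Minimal sketch of cross-modal attention distillation (illustrative only).
import torch


def attention_with_probs(q, k, v):
    """Standard scaled dot-product attention; also return the attention map."""
    scale = q.shape[-1] ** -0.5
    probs = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (B, N, N)
    return probs @ v, probs


def attention_with_injected_probs(probs, v):
    """Aggregate this branch's values with attention probabilities from another branch."""
    return probs @ v


# Toy example: the geometry branch reuses the image branch's spatial attention
# map, so both modalities attend to the same spatial locations.
B, N, C = 2, 64, 128
q_img, k_img, v_img = (torch.randn(B, N, C) for _ in range(3))
v_geo = torch.randn(B, N, C)

img_out, img_probs = attention_with_probs(q_img, k_img, v_img)   # image branch
geo_out = attention_with_injected_probs(img_probs, v_geo)        # geometry branch
```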

Abstract

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from the reference images, and formulates novel view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between the generated images and geometry, we propose cross-modal attention distillation, dubbed MoAI, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between sparse points and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion.
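To make the proximity-based mesh conditioning concrete, the following sketch triangulates a per-pixel pointmap on the image grid and drops triangles whose edges span large depth discontinuities, so unreliable geometry does not contribute to the conditioning signal. The pointmap input format, the relative edge threshold, and all names here are assumptions for illustration, not the authors' exact procedure.

```python
# Hedged sketch: build a proximity-filtered mesh from a (H, W, 3) pointmap.
import numpy as np


def pointmap_to_mesh(points, rel_edge_thresh=0.05):
    """Triangulate a pointmap on the pixel grid, filtering overly long edges."""
    H, W, _ = points.shape
    idx = np.arange(H * W).reshape(H, W)

    # Two triangles per pixel quad (top-left, top-right, bottom-left, bottom-right).
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate(
        [np.stack([tl, bl, tr], 1), np.stack([tr, bl, br], 1)], axis=0
    )
    verts = points.reshape(-1, 3)

    # Proximity filter: keep a triangle only if its longest edge is short
    # relative to its distance from the camera (assumed at the origin).
    edge = lambda a, b: np.linalg.norm(verts[a] - verts[b], axis=-1)
    max_edge = np.maximum.reduce(
        [edge(faces[:, 0], faces[:, 1]),
         edge(faces[:, 1], faces[:, 2]),
         edge(faces[:, 0], faces[:, 2])]
    )
    dist = np.linalg.norm(verts[faces].mean(axis=1), axis=-1)
    keep = max_edge < rel_edge_thresh * dist
    return verts, faces[keep]
```

Depth and normal maps rendered from such a filtered mesh would then serve as the conditioning cues described above.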

Citation