arXiv 2025
Multi-view diffusion models have recently established themselves as a powerful paradigm for novel view synthesis, generating diverse images with high visual fidelity. However, the underlying mechanisms that enable these models to maintain geometric consistency across different viewpoints have remained largely unexplored. In this work, we conduct an in-depth analysis of the 3D self-attention layers within these models. We empirically verify that geometric correspondence naturally emerges in specific attention layers during training, allowing the model to attend to spatially corresponding regions across reference and target views.
Despite this emergent capability, our analysis reveals that the implicit correspondence signal is often incomplete and fragile, particularly degrading under scenarios involving complex geometries or large viewpoint changes. Addressing this limitation, we introduce CAMEO (Correspondence-Attention Alignment), a training framework that explicitly supervises the model's attention maps using dense geometric correspondence priors. By applying this supervision to just a single, optimal attention layer (Layer 10), CAMEO significantly enhances the model's structural understanding. Our experiments demonstrate that CAMEO reduces the training iterations required for convergence by 50% while consistently outperforming baseline models in geometric fidelity on challenging datasets such as RealEstate10K and CO3D.
To understand the internal mechanisms governing geometric consistency in multi-view diffusion models, we analyzed the 3D self-attention maps of CAT3D. We discovered that geometric correspondence naturally emerges specifically in Layer 10, serving as a critical signal for view-consistent generation.
Visualizing Attention Maps. Comparison of early layers vs. Layer 10.
Our layer-wise analysis reveals a distinct contrast in behavior. The shallow layers (e.g., Layers 2–6) exhibit significant noise and fail to establish meaningful connections between views. In contrast, Layer 10 (in the CAT3D model) demonstrates a strong emergent ability to capture geometric correspondence. Even without explicit supervision, the attention mechanism in this layer naturally focuses on geometrically corresponding points across different views, acting as the primary internal carrier of geometric information.
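This correspondence behavior can be probed directly from a layer's attention weights: for each target-view query, the reference-view key receiving the most attention gives a putative match. A minimal NumPy sketch (the 16×16 latent grid and the two-view attention layout are assumptions for illustration, not the model's actual shapes):

```python
import numpy as np

H = W = 16  # assumed latent feature grid; actual resolution depends on the model

def cross_view_argmax(attn):
    """attn: (H*W, H*W) cross-view block of a 3D self-attention map,
    rows = target-view queries, cols = reference-view keys, each row a
    softmax distribution. Returns (H*W, 2) (row, col) coordinates of the
    most-attended reference location per query."""
    idx = attn.argmax(axis=1)              # best-matching reference token
    return np.stack([idx // W, idx % W], axis=1)

# toy example: near-identity attention, so each query matches itself
attn = np.eye(H * W) * 0.9 + 0.1 / (H * W)
coords = cross_view_argmax(attn)
print(coords[17])  # token 17 sits at grid cell (1, 1)
```

Overlaying these argmax locations on the reference image is one simple way to produce the kind of correspondence visualization described above.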
Perturbation Analysis. Result of perturbing Layer 10 vs. other layers.
To verify the causal relationship between this emergent correspondence and generation quality, we conducted a perturbation analysis. While perturbing earlier layers leaves the output nearly unchanged, perturbing Layer 10 leads to a complete collapse of the scene's geometric structure. This proves that the correspondence signal in Layer 10 is not just an artifact but a critical component for maintaining view consistency.
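The paper does not specify the exact perturbation used; one plausible instantiation is to inject Gaussian noise into a single layer's attention logits and re-normalize, which destroys the correspondence structure while keeping valid attention distributions. A hedged NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_attention(attn, noise_scale):
    """Add Gaussian noise to the attention logits of one layer and
    re-normalize with softmax (an assumed form of the perturbation)."""
    logits = np.log(attn + 1e-12) + noise_scale * rng.normal(size=attn.shape)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# a sharp, correspondence-like attention map: each query matches itself
n = 64
attn = np.full((n, n), 0.01 / (n - 1))
np.fill_diagonal(attn, 0.99)
precision_before = (attn.argmax(1) == np.arange(n)).mean()
perturbed = perturb_attention(attn, noise_scale=8.0)
precision_after = (perturbed.argmax(1) == np.arange(n)).mean()
print(precision_before, precision_after)  # strong noise wipes out the matches
```

In the actual model this perturbation would be applied via a forward hook on the chosen attention layer during sampling; the point of the sketch is only that corrupting one layer's attention destroys the correspondence signal it carries.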
Quantitative Analysis. (Left) Precision across layers, (Middle) Precision vs. Training Iterations, (Right) Precision under viewpoint rotation.
We quantitatively evaluated the correspondence precision on the NAVI dataset. Our analysis highlights three key findings: (1) correspondence precision peaks sharply at Layer 10, far above all other layers; (2) this correspondence ability emerges progressively over the course of training rather than being present from initialization; and (3) precision degrades as the viewpoint rotation between views increases, exposing the fragility of the purely emergent signal.
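The exact precision metric is not spelled out here; a plausible definition counts a query as correct when its attention argmax lands within a small pixel threshold of the ground-truth correspondence. A sketch under that assumption:

```python
import numpy as np

def correspondence_precision(attn, gt_idx, grid_w, pix_thresh=1):
    """Fraction of queries whose attention argmax lies within `pix_thresh`
    grid cells (Chebyshev distance) of the ground-truth match.
    attn: (N, N) row-stochastic attention; gt_idx: (N,) flat GT indices."""
    pred = attn.argmax(axis=1)
    py, px = pred // grid_w, pred % grid_w
    gy, gx = gt_idx // grid_w, gt_idx % grid_w
    dist = np.maximum(np.abs(py - gy), np.abs(px - gx))
    return float((dist <= pix_thresh).mean())

# toy check: identity attention on a 4x4 grid matches ground truth exactly
attn = np.eye(16)
gt = np.arange(16)
print(correspondence_precision(attn, gt, grid_w=4))  # 1.0
```

Sweeping this metric over layers, training iterations, and viewpoint rotations yields curves of the kind summarized in the figure above.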
CAMEO Framework Overview. We introduce Correspondence-Attention Alignment, a training framework that explicitly supervises the model's attention maps using dense geometric correspondence priors.
CAMEO minimizes the cross-entropy loss between the predicted attention \(A^l_{i,j}\) at the target layer \(l\) and the ground-truth correspondence distribution \(P_{i,j}\), ensuring the model attends to physically correct regions across views: \[ \mathcal{L}_{\text{CAMEO}} = \sum_{i,j} \text{CE}(A^l_{i,j}, P_{i,j}) \]
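This objective reduces to \(-\sum_{i,j} P_{i,j} \log A^l_{i,j}\) when \(P\) is a (possibly soft) correspondence distribution per query. A minimal NumPy sketch (actual training would operate on the framework's tensors at the supervised layer; shapes here are illustrative):

```python
import numpy as np

def cameo_loss(attn, P, eps=1e-12):
    """Cross-entropy between each predicted attention row A^l_{i,:} and its
    ground-truth correspondence distribution P_{i,:}:
    L = -sum_{i,j} P_{i,j} * log(A^l_{i,j})."""
    return float(-(P * np.log(attn + eps)).sum())

n = 8
P = np.eye(n)                                        # one-hot GT correspondences
sharp = np.eye(n) * 0.99 + (1 - np.eye(n)) * 0.01 / (n - 1)
uniform = np.full((n, n), 1.0 / n)
print(cameo_loss(sharp, P) < cameo_loss(uniform, P))  # True: aligned attention scores lower loss
```

Because the supervision touches only one layer's attention probabilities, it adds a single auxiliary term to the standard diffusion objective rather than changing the architecture.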
Simple, Targeted & Model-Agnostic: CAMEO introduces a straightforward supervision strategy by aligning the cross-view attention map with geometric correspondence at a single target layer (Layer 10). This simple yet effective approach propagates geometric awareness throughout the network without architectural redesigns. Furthermore, CAMEO is universally applicable, demonstrating consistent improvements in geometric consistency on both UNet-based and Transformer-based models without compromising generation quality.
@article{kwon2025cameo,
title={CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models},
author={Kwon, Minkyung and Choi, Jinhyeok and Park, Jiho and Jeon, Seonghu and Jang, Jinhyuk and Seo, Junyoung and Kwak, Min-Seop and Kim, Jin-Hwa and Kim, Seungryong},
journal={arXiv preprint arXiv:2512.03045},
year={2025}
}