CAMEO : Correspondence-Attention Alignment for Multi-View Diffusion Models

arXiv 2025

Minkyung Kwon*1 Jinhyeok Choi*1 Jiho Park*1 Seonghu Jeon1 Jinhyuk Jang1 Junyoung Seo1 Minseop Kwak1 Jin-Hwa Kim†2,3 Seungryong Kim†1
1KAIST AI 2NAVER AI Lab 3SNU AIIS
*: Equal contribution †: Co-corresponding authors
TL;DR

We discover that multi-view diffusion models naturally learn 3D geometry in specific attention layers. CAMEO aligns this internal attention with geometric correspondence,
achieving faster convergence and superior geometric consistency.

Abstract

Multi-view diffusion models have recently established themselves as a powerful paradigm for novel view synthesis, generating diverse images with high visual fidelity. However, the underlying mechanisms that enable these models to maintain geometric consistency across different viewpoints have remained largely unexplored. In this work, we conduct an in-depth analysis of the 3D self-attention layers within these models. We empirically verify that geometric correspondence naturally emerges in specific attention layers during training, allowing the model to attend to spatially corresponding regions across reference and target views.

Despite this emergent capability, our analysis reveals that the implicit correspondence signal is often incomplete and fragile, particularly degrading under scenarios involving complex geometries or large viewpoint changes. Addressing this limitation, we introduce CAMEO (Correspondence-Attention Alignment), a training framework that explicitly supervises the model's attention maps using dense geometric correspondence priors. By applying this supervision to just a single, optimal attention layer (Layer 10), CAMEO significantly enhances the model's structural understanding. Our experiments demonstrate that CAMEO reduces the training iterations required for convergence by 50% while consistently outperforming baseline models in geometric fidelity on challenging datasets such as RealEstate10K and CO3D.

Analysis

To understand the internal mechanisms governing geometric consistency in multi-view diffusion models, we analyzed the 3D self-attention maps of CAT3D. We discovered that geometric correspondence naturally emerges specifically in Layer 10, serving as a critical signal for view-consistent generation.

Emergence of Geometry in Layer 10

Visualizing Attention Maps. Comparison of early layers vs. Layer 10.

Our layer-wise analysis reveals a distinct contrast in behavior. The shallow layers (e.g., Layers 2–6) exhibit significant noise and fail to establish meaningful connections between views. In contrast, Layer 10 of CAT3D demonstrates a strong emergent ability to capture geometric correspondence. Even without explicit supervision, the attention mechanism in this layer naturally focuses on geometrically corresponding points across different views, acting as the primary internal carrier of geometric information.
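
To make this kind of inspection concrete, below is a minimal PyTorch sketch of how a cross-view attention block could be sliced out of a joint self-attention map and visualized for a single query token. The tensor layout (two views' tokens concatenated along the sequence axis) and all shapes are illustrative assumptions, not CAT3D's actual implementation.

```python
import torch
import matplotlib.pyplot as plt

def cross_view_attention_map(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Slice the view-i -> view-j block out of a joint self-attention map.

    attn: [num_heads, 2*h*w, 2*h*w], assuming the tokens of two views are
    concatenated along the sequence axis (view i first, then view j).
    """
    attn = attn.mean(dim=0)   # average over heads: [2hw, 2hw]
    hw = h * w
    return attn[:hw, hw:]     # queries from view i, keys from view j: [hw, hw]

def show_query_row(a_ij: torch.Tensor, query_xy: tuple, h: int, w: int) -> None:
    """Visualize where one query token of view i attends inside view j."""
    qx, qy = query_xy
    row = a_ij[qy * w + qx]   # distribution over the hw tokens of view j
    plt.imshow(row.reshape(h, w).detach().cpu().numpy(), cmap="viridis")
    plt.title(f"Attention of query ({qx}, {qy}) over the other view")
    plt.colorbar()
    plt.show()
```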

Causal Role of Layer 10

Perturbation Analysis. Results of perturbing Layer 10 vs. other layers.

To verify the causal relationship between this emergent correspondence and generation quality, we conducted a perturbation analysis. While perturbing earlier layers leaves the output nearly unchanged, perturbing Layer 10 leads to a complete collapse of the scene's geometric structure. This proves that the correspondence signal in Layer 10 is not just an artifact but a critical component for maintaining view consistency.
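
As an illustration of this kind of perturbation, the sketch below adds Gaussian noise to the output of a single attention module via a PyTorch forward hook. The module path `model.blocks[10].attn` is a hypothetical stand-in for the 3D self-attention module of the layer under study, not a path from the analyzed model's code.

```python
import torch

def make_attention_perturb_hook(noise_scale: float = 1.0):
    """Return a forward hook that swaps a module's output for a noised copy."""
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook overrides the module's output.
        return output + noise_scale * torch.randn_like(output)
    return hook

# Hypothetical usage: attach the hook to one layer, sample, then remove it.
# handle = model.blocks[10].attn.register_forward_hook(
#     make_attention_perturb_hook(noise_scale=1.0))
# ...run sampling and compare outputs against the unperturbed model...
# handle.remove()
```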

Quantitative Verification

Quantitative Analysis. (Left) Precision across layers, (Middle) Precision vs. Training Iterations, (Right) Precision under viewpoint rotation.

We quantitatively evaluated the correspondence precision on the NAVI dataset. Our analysis highlights three key findings:

  • Correlation with Quality: As shown in the training dynamics, there is a strong positive correlation between correspondence precision and generation quality (PSNR). As the model learns better correspondence, the visual quality improves, confirming that correspondence underpins synthesis performance.
  • CAMEO Improves Correspondence: Our proposed method, CAMEO, explicitly boosts this correspondence precision, surpassing the baseline and achieving accuracy comparable to dedicated discriminative models like DINOv3.
  • Limitation (Viewpoint Degradation): Despite these strengths, the baseline's implicit correspondence is fragile. As the relative rotation angle increases (e.g., >90°), the precision drops sharply. This "viewpoint degradation" necessitates explicit supervision to maintain robustness in challenging scenarios.
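
For reference, a PCK-style precision of this kind could be computed as in the sketch below: each query's attention argmax is compared against its ground-truth match within a pixel threshold. The exact metric definition and threshold used in the paper may differ; these are assumptions.

```python
import torch

def correspondence_precision(attn_ij: torch.Tensor,
                             gt_match: torch.Tensor,
                             w: int,
                             thresh_px: float = 2.0) -> torch.Tensor:
    """PCK-style precision of attention argmax against GT correspondence.

    attn_ij:  [hw, hw] attention from view-i queries to view-j tokens.
    gt_match: [hw] long tensor; index of the matching view-j token per
              query, with entries < 0 marking queries without a valid match.
    """
    valid = gt_match >= 0
    pred = attn_ij.argmax(dim=-1)        # predicted token per query
    gt = gt_match.clamp(min=0)
    py, px = pred // w, pred % w         # token grid coordinates
    gy, gx = gt // w, gt % w
    dist = ((py - gy) ** 2 + (px - gx) ** 2).float().sqrt()
    return (dist[valid] <= thresh_px).float().mean()
```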

CAMEO

CAMEO Framework Overview. We introduce Correspondence-Attention Alignment, a training framework that explicitly supervises the model's attention maps using dense geometric correspondence priors.

Correspondence-Attention Alignment Loss \(\mathcal{L}_{\text{CAMEO}}\)

  • Cross-View Attention Map \(A^l_{i,j} \in \mathbb{R}^{hw \times hw}\): The attention matrix from view \(i\) to view \(j\) at layer \(l\). For a query token \(x_i\), the row \(A^l_{i,j}(x_i)\) represents the predicted probability distribution over all tokens in view \(j\).
  • Geometric Correspondence Map \(P_{i,j} \in \mathbb{R}^{hw \times hw}\): The ground-truth correspondence derived from 3D pointmaps. For a query \(x_i\), \(P_{i,j}(x_i)\) is a one-hot vector where the entry corresponding to the matched token \(x_j\) is 1 and all others are 0 (see the sketch below for one way to construct it).
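
One plausible way to realize this, sketched below, is to take per-token pointmaps for the two views in a shared world frame and mark, for each query token, its nearest 3D neighbor in the other view, zeroing out queries whose nearest neighbor is too far away (e.g., occluded regions). The matching rule and distance threshold here are assumptions, not the paper's exact procedure.

```python
import torch

def correspondence_from_pointmaps(pts_i: torch.Tensor,
                                  pts_j: torch.Tensor,
                                  max_dist: float = 0.01) -> torch.Tensor:
    """Build a one-hot correspondence map P_ij from per-token 3D pointmaps.

    pts_i, pts_j: [hw, 3] 3D points for each latent token of views i and j,
                  expressed in a shared world frame.
    Returns P_ij: [hw, hw]; row q is one-hot at q's nearest 3D neighbor in
    view j, or all zeros if that neighbor is farther than max_dist
    (e.g., the query is occluded in view j).
    """
    d = torch.cdist(pts_i, pts_j)    # [hw, hw] pairwise 3D distances
    nearest = d.argmin(dim=-1)       # [hw] best match per query
    matched = d.gather(-1, nearest[:, None]).squeeze(-1) <= max_dist
    P = torch.zeros_like(d)
    P[torch.arange(pts_i.shape[0]), nearest] = 1.0
    P[~matched] = 0.0                # drop unreliable matches
    return P
```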

CAMEO minimizes the cross-entropy loss between the predicted attention \(A^l_{i,j}\) and the ground-truth correspondence \(P_{i,j}\), ensuring the model attends to physically correct regions across views: \[ \mathcal{L}_{\text{CAMEO}} = \sum_{i,j} \text{CE}(A^l_{i,j}, P_{i,j}) \]
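
A minimal PyTorch sketch of this loss for a single view pair, assuming the attention rows are already softmax-normalized and that queries without a valid match (all-zero rows of \(P_{i,j}\)) are masked out; the full objective sums this term over all view pairs \((i, j)\):

```python
import torch
import torch.nn.functional as F

def cameo_loss(attn: torch.Tensor, P: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Correspondence-attention alignment loss for one view pair (i, j).

    attn: [hw, hw] cross-view attention A^l_{i,j}; each row is assumed to
          be a normalized probability distribution over view-j tokens.
    P:    [hw, hw] one-hot ground-truth correspondence P_{i,j}; all-zero
          rows mark queries without a valid match and are skipped.
    Cross-entropy against a one-hot target reduces to the negative log of
    the attention weight placed on the matched token.
    """
    valid = P.sum(dim=-1) > 0                 # queries that have a match
    target = P[valid].argmax(dim=-1)          # index of the matched token
    log_attn = torch.log(attn[valid] + eps)   # attention is already softmaxed
    return F.nll_loss(log_attn, target)       # mean over valid queries
```

In training, this term would be added to the standard diffusion loss with some weighting coefficient; the masking rule and any weighting here are illustrative assumptions rather than values taken from the paper.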

Simple, Targeted & Model-Agnostic: CAMEO introduces a straightforward supervision strategy by aligning the cross-view attention map with geometric correspondence at a single target layer (Layer 10). This simple yet effective approach propagates geometric awareness throughout the network without architectural redesigns. Furthermore, CAMEO is universally applicable, demonstrating consistent improvements in geometric consistency on both UNet-based and Transformer-based models without compromising generation quality.

Quantitative Results

Qualitative Results

Ablation Study

Citation

```
@article{kwon2025cameo,
  title={CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models},
  author={Kwon, Minkyung and Choi, Jinhyeok and Park, Jiho and Jeon, Seonghu and Jang, Jinhyuk and Seo, Junyoung and Kwak, Min-Seop and Kim, Jin-Hwa and Kim, Seungryong},
  journal={arXiv preprint arXiv:2512.03045},
  year={2025}
}
```