GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion

arXiv preprint

Yeji Choi¹ Jinhyeok Choi¹ Jaewon Min¹ Minkyung Kwon¹
Jin Hyeon Kim¹ Seungryong Kim¹†

¹KAIST AI

†Corresponding author

TL;DR

We present GeoFace, a geometry-constrained multi-view diffusion framework for consistent face generation from a single input. By jointly generating multi-view RGB images and a shared canonical UV position map through a dual-stream architecture with geometry-guided attention alignment, GeoFace enforces coherent 3D structure across all generated views — significantly outperforming prior methods, especially under large pose variations.

GeoFace generates geometrically consistent multi-view face images from a single reference image (left), jointly producing a shared canonical UV position map that enforces coherent 3D structure across all viewpoints.

Overview

Generating photorealistic multi-view facial images has broad applications in 3D reconstruction, digital avatars, and immersive content creation. However, generating geometrically consistent images from a single input remains challenging: existing diffusion-based methods lack an explicit mechanism to enforce a shared 3D structure across views, often leading to inconsistent geometry under large pose variations.

To address this, GeoFace introduces a unified dual-stream framework that jointly generates multi-view RGB images and 3D face geometry, with the appearance and geometry streams interacting through shared attention layers.

The geometry stream denoises a canonical UV position map (a view-invariant 3D representation in FLAME space) jointly with the appearance stream, providing an explicit shared constraint across all generated views.
A geometry-guided attention alignment loss supervises the cross-attention between appearance and geometry tokens with 3D-consistent correspondences, enabling the appearance stream to correctly reference pose-invariant geometric cues.
The jointly generated geometry serves as an effective initialization prior for downstream 3D Gaussian Splatting reconstruction.
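
To make the canonical UV position map concrete, the following is a minimal sketch of how such a map can be built from a textured face mesh: each texel stores the 3D position of the surface point that maps to it. The function name, the nearest-texel splatting, and the input layout are assumptions made for illustration only; a practical pipeline would rasterize triangles with barycentric interpolation, and this page does not describe GeoFace's own construction.

```python
import numpy as np

def splat_uv_position_map(verts, uvs, size=64):
    """Scatter per-vertex 3D positions into a (size, size, 3) UV image.

    verts: (N, 3) vertex positions in a canonical space (hypothetical input).
    uvs:   (N, 2) per-vertex UV coordinates in [0, 1].
    Returns the position map and a boolean mask of filled texels.
    """
    pos_map = np.zeros((size, size, 3), dtype=np.float32)
    mask = np.zeros((size, size), dtype=bool)
    # Map continuous UVs to the nearest integer texel; v is flipped so that
    # v = 1 lands on the top image row.
    cols = np.clip(np.round(uvs[:, 0] * (size - 1)).astype(int), 0, size - 1)
    rows = np.clip(np.round((1.0 - uvs[:, 1]) * (size - 1)).astype(int), 0, size - 1)
    pos_map[rows, cols] = verts
    mask[rows, cols] = True
    return pos_map, mask
```

Because the map depends only on canonical-space positions and UV coordinates, never on a camera, the same image describes the face for every target view, which is what makes it usable as a shared constraint.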

Contributions

1. A unified dual-stream diffusion framework where appearance and geometry streams interact through shared attention layers, enforcing cross-view geometric consistency as an intrinsic property of the generative process.
2. A geometry-guided attention alignment loss that supervises cross-attention between geometry and appearance streams with 3D-consistent correspondences, encouraging the two streams to mutually constrain each other.
3. A canonical UV position map in FLAME space as a view-invariant geometry representation that is naturally compatible with 2D diffusion architectures without modification.

Method

Overall architecture of GeoFace. Given a reference image and target camera poses, GeoFace extends a multi-view diffusion backbone with a dual-stream architecture. The appearance stream denoises target view latents conditioned on Plücker ray embeddings, while the geometry stream jointly denoises a geometry latent representing the canonical UV position map. Both streams interact through shared 3D attention layers, enabling geometry to act as an explicit cross-view consistency constraint.
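
As an illustration of the Plücker ray conditioning mentioned above, the sketch below computes a per-pixel 6D Plücker embedding from pinhole intrinsics and a camera-to-world pose. The function name, the pixel-center convention, and the channel order (moment first, then direction) are assumptions; conventions differ between codebases and this page does not state which one GeoFace uses.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Per-pixel 6D Plücker ray coordinates (o x d, d).

    K:   (3, 3) pinhole intrinsics.
    c2w: (4, 4) camera-to-world matrix.
    Returns an array of shape (height, width, 6).
    """
    # Pixel centers in image coordinates; u and v both have shape (H, W).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Every ray of one camera shares the same origin (the camera center).
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([moment, dirs], axis=-1)
```

The resulting map has one 6-vector per pixel, so it can be resized to the latent resolution and concatenated with, or added to, the view latents as a dense pose signal.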

Geometry Stream Motivation

Cross-view feature consistency analysis using MEt3R reveals that appearance-only models (without the geometry stream) show significantly higher dissimilarity, particularly at facial boundary regions under large pose variations. The dedicated geometry stream provides explicit structural tokens that anchor appearance generation to a consistent 3D surface.

Geometry-Guided Attention Alignment

Without alignment supervision, cross-attention maps on the UV position map are diffuse and poorly localized. Our geometry-guided attention alignment loss produces sharper attention that correctly concentrates on the corresponding facial region across all viewpoints, enforcing 3D-consistent geometry-appearance correspondence.
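
One plausible form of such an alignment objective is sketched below: the attention that each appearance token places over the geometry (UV) tokens is treated as a distribution, and a cross-entropy term pushes its mass toward the UV token given by a 3D correspondence. This is an assumed formulation for illustration, not the loss defined in the paper; the function name, the single-head attention, and the `target_idx` and `valid` inputs are all hypothetical.

```python
import numpy as np

def attention_alignment_loss(q_app, k_geo, target_idx, valid):
    """Cross-entropy between appearance-to-geometry attention and target texels.

    q_app:      (Na, d) appearance-token queries.
    k_geo:      (Ng, d) geometry-token keys (UV position map tokens).
    target_idx: (Na,) index of the UV token each appearance token should attend to.
    valid:      (Na,) bool, False where no correspondence exists (e.g. background).
    """
    # Scaled dot-product attention logits over the geometry tokens.
    logits = q_app @ k_geo.T / np.sqrt(q_app.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    log_attn = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Log-probability assigned to the corresponding UV token.
    picked = log_attn[np.arange(len(target_idx)), target_idx]
    # Average over tokens that have a valid correspondence.
    return -(picked * valid).sum() / max(valid.sum(), 1)
```

Under this formulation, diffuse attention is penalized (uniform attention over Ng tokens gives a loss of log Ng), while attention concentrated on the corresponding facial region drives the loss toward zero.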

Quantitative Results

Qualitative Results

Downstream 3D Reconstruction

We evaluate the jointly generated geometry as an initialization prior for 3D Gaussian Splatting reconstruction from 24 generated views. Our mesh-based initialization provides dense and uniform Gaussian point placement over the entire face, producing sharper reconstructions during early optimization and consistently lower LPIPS than random and COLMAP-based initialization.
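
A common way to obtain dense, uniform initial points from a mesh is area-weighted surface sampling, sketched below. It is shown only to illustrate what mesh-based initialization can look like; the function name and parameters are assumptions, and this page does not specify how GeoFace converts its generated geometry into Gaussian centers.

```python
import numpy as np

def sample_mesh_surface(verts, faces, n_points, seed=0):
    """Draw points uniformly over a triangle mesh surface.

    verts: (V, 3) vertex positions; faces: (F, 3) integer vertex indices.
    Returns an (n_points, 3) array usable as initial Gaussian centers.
    """
    rng = np.random.default_rng(seed)
    tri = verts[faces]  # (F, 3, 3) triangle corner positions
    # Triangle areas from the cross product of two edges.
    edges_cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(edges_cross, axis=-1)
    # Pick triangles proportionally to area so density is uniform on the surface.
    face_ids = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric weights via the square-root parameterization.
    r1, r2 = rng.random(n_points), rng.random(n_points)
    s = np.sqrt(r1)
    w = np.stack([1.0 - s, s * (1.0 - r2), s * r2], axis=-1)
    return np.einsum("nk,nkd->nd", w, tri[face_ids])
```

Unlike structure-from-motion points, which cluster where features are matched, samples drawn this way cover the whole surface at an even density, including low-texture regions such as cheeks and forehead.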

In-the-Wild Results

GeoFace generalizes robustly beyond controlled capture settings. Across diverse input types — including portraits under challenging lighting, heavily made-up faces, 3D-rendered characters, and stylized illustrations — GeoFace consistently produces geometrically coherent novel views while preserving the distinctive appearance of each input.

Citation

@article{choi2026geoface,
  title={GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion},
  author={Choi, Yeji and Choi, Jinhyeok and Min, Jaewon and Kwon, Minkyung and Kim, Jin Hyeon and Kim, Seungryong},
  journal={arXiv preprint},
  year={2026}
}