GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion

arXiv preprint

Yeji Choi¹, Jinhyeok Choi¹, Jaewon Min¹, Minkyung Kwon¹,
Jin Hyeon Kim¹, Seungryong Kim¹†
¹KAIST AI

†Corresponding author

TL;DR

We present GeoFace, a geometry-constrained multi-view diffusion framework that generates consistent views of a face from a single input image. A dual-stream architecture with geometry-guided attention alignment jointly generates multi-view RGB images and a shared canonical UV position map, which enforces coherent 3D structure across all generated views. GeoFace significantly outperforms prior methods, especially under large pose variations.

GeoFace Teaser

GeoFace generates geometrically consistent multi-view face images from a single reference image (left), jointly producing a shared canonical UV position map that enforces coherent 3D structure across all viewpoints.

Overview

Generating photorealistic multi-view facial images has broad applications in 3D reconstruction, digital avatars, and immersive content creation. However, generating geometrically consistent images from a single input remains challenging: existing diffusion-based methods lack an explicit mechanism to enforce a shared 3D structure across views, often leading to inconsistent geometry under large pose variations.

To address this, we propose GeoFace, a unified dual-stream framework that jointly generates multi-view RGB images and 3D face geometry, with the appearance and geometry streams interacting through shared attention layers.


Contributions

Method

GeoFace Framework Overview

Overall architecture of GeoFace. Given a reference image and target camera poses, GeoFace extends a multi-view diffusion backbone with a dual-stream architecture. The appearance stream denoises target view latents conditioned on Plücker ray embeddings, while the geometry stream jointly denoises a geometry latent representing the canonical UV position map. Both streams interact through shared 3D attention layers, enabling geometry to act as an explicit cross-view consistency constraint.
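As a rough illustration of the Plücker ray conditioning mentioned above, the sketch below computes a per-pixel 6-D ray embedding for a pinhole camera. It is not taken from the GeoFace code: the function name `plucker_rays`, the camera-to-world pose convention, and the `(direction, moment)` channel ordering are all assumptions made for this example (some implementations order the channels as `(moment, direction)` instead).

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_rays(fx, fy, cx, cy, R, t, height, width):
    """Return a height x width grid of 6-D Plucker rays (d, o x d).

    Assumption: R (3x3 nested lists) and t (3-vector) form a
    camera-to-world pose, x_world = R @ x_cam + t, so t is the
    camera center o in world coordinates.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center through the pinhole model.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction into world coordinates and normalize.
            d = tuple(sum(R[i][j] * d_cam[j] for j in range(3))
                      for i in range(3))
            norm = math.sqrt(sum(x * x for x in d))
            d = tuple(x / norm for x in d)
            # The moment o x d encodes where the ray sits in space.
            m = cross(t, d)
            row.append(d + m)
        rays.append(row)
    return rays
```

Because the moment is a cross product with the direction, it is always orthogonal to it, which makes a convenient sanity check when wiring such an embedding into a model.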



Geometry Stream Motivation

Geometry Analysis

Cross-view feature consistency analysis using MEt3R reveals that appearance-only models (without the geometry stream) show significantly higher dissimilarity, particularly at facial boundary regions under large pose variations. Our dedicated geometry stream provides explicit structural tokens that anchor appearance generation to a consistent 3D surface.



Geometry-Guided Attention Alignment

Attention Alignment Visualization

Without alignment supervision, cross-attention maps on the UV position map are diffuse and poorly localized. Our geometry-guided attention alignment loss produces sharper attention that correctly concentrates on the corresponding facial region across all viewpoints, enforcing 3D-consistent geometry-appearance correspondence.
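This page does not give the form of the alignment loss, so the following is only a generic sketch of how an attention-alignment objective of this kind could look. It assumes that each image token has a known target location on the UV map, builds a Gaussian target distribution around that location, and penalizes the attention row with a cross-entropy term. The names `gaussian_target` and `alignment_loss`, the Gaussian target, and the cross-entropy choice are all hypothetical.

```python
import math


def gaussian_target(uv, grid_size, sigma=1.0):
    """Normalized Gaussian over a grid_size x grid_size UV grid.

    uv is the (u, v) target location in cell units; the result is a
    flat, row-major list that sums to 1.
    """
    weights = []
    for r in range(grid_size):
        for c in range(grid_size):
            d2 = (c + 0.5 - uv[0]) ** 2 + (r + 0.5 - uv[1]) ** 2
            weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def alignment_loss(attn, target, eps=1e-8):
    """Cross-entropy between a target distribution and one attention row.

    attn is the softmax-normalized attention of a single image token
    over all UV tokens; lower values mean attention mass sits closer
    to the geometric target.
    """
    return -sum(t * math.log(a + eps) for a, t in zip(attn, target))
```

Under this formulation, a diffuse (near-uniform) attention row incurs a higher loss than one concentrated on the target region, which matches the qualitative behavior described above.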

Quantitative Results

Qualitative Results

Downstream 3D Reconstruction

3DGS Reconstruction

We evaluate the jointly generated geometry as an initialization prior for 3D Gaussian Splatting (3DGS) from 24 generated views. Our mesh-based initialization provides dense and uniform Gaussian point placement over the entire face, producing sharper reconstructions during early optimization and consistently lower LPIPS than random and COLMAP-based initialization.
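One common way to obtain dense, uniform point placement from a mesh is area-weighted surface sampling, sketched below. Whether GeoFace uses this exact procedure is not stated here; the function names and the choice of sampling scheme are assumptions for illustration. Each sampled point could serve as the initial center of one Gaussian.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c)."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    n = (ab[1] * ac[2] - ab[2] * ac[1],
         ab[2] * ac[0] - ab[0] * ac[2],
         ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in n))


def sample_mesh_surface(vertices, faces, n_points, seed=0):
    """Sample n_points uniformly over the surface of a triangle mesh.

    vertices: list of (x, y, z); faces: list of vertex-index triples.
    Faces are picked with probability proportional to their area, then
    a point is drawn uniformly inside the chosen triangle.
    """
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=n_points)
    points = []
    for fi in chosen:
        a, b, c = (vertices[i] for i in faces[fi])
        # Square-root trick gives uniform barycentric coordinates.
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        w0, w1, w2 = 1.0 - s, s * (1.0 - r2), s * r2
        points.append(tuple(w0 * a[i] + w1 * b[i] + w2 * c[i]
                            for i in range(3)))
    return points
```

Weighting by area keeps the point density independent of how finely the mesh is tessellated, in contrast to simply placing one Gaussian per vertex.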

In-the-Wild Results

GeoFace generalizes robustly beyond controlled capture settings. Across diverse input types, including portraits under challenging lighting, heavily made-up faces, 3D-rendered characters, and stylized illustrations, GeoFace consistently produces geometrically coherent novel views while preserving the distinctive appearance of each input.

Ablation Study

Citation

@article{choi2026geoface,
  title={GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion},
  author={Choi, Yeji and Choi, Jinhyeok and Min, Jaewon and Kwon, Minkyung and Kim, Jin Hyeon and Kim, Seungryong},
  journal={arXiv preprint},
  year={2026}
}