Qualitative comparison of images generated without guidance (top) and with our depth-aware guidance (DAG), together with their estimated depths and surface normals. Scene layouts are highlighted on the generated images. DAG generates more geometrically plausible images than the baseline.
Diffusion models have recently shown significant advances in generative modeling, with impressive fidelity and diversity. Their success can often be attributed to sampling guidance techniques, such as classifier and classifier-free guidance, which provide effective mechanisms for trading off fidelity against diversity. However, these methods cannot guide a generated image toward awareness of its geometric configuration, e.g., depth, which hinders their application to downstream tasks, such as scene understanding, that require a certain level of depth awareness. To overcome this limitation, we propose a novel sampling guidance method for diffusion models that uses self-predicted depth information derived from the rich intermediate representations of diffusion models. Concretely, we first present a label-efficient depth estimation framework built on the internal representations of diffusion models. We then incorporate two guidance techniques during the sampling phase: pseudo-labeling and a depth-domain diffusion prior, both of which self-condition the generated image on its estimated depth map. Experiments and comprehensive ablation studies demonstrate the effectiveness of our method in guiding diffusion models toward generating geometrically plausible images.
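To make the sampling-guidance idea concrete, below is a minimal, self-contained sketch of gradient-based guidance in the spirit of classifier guidance: the noise prediction at a denoising step is adjusted by the gradient of a depth-derived loss evaluated on the estimated clean image. All names here (`TinyUNet`, `DepthHead`, `guided_noise`, the placeholder smoothness loss) are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal stand-ins: a toy denoiser and a toy depth head. Not the real models.
class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, t):
        return self.net(x)  # predicts the noise eps from x_t (t ignored here)

class DepthHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 1, 3, padding=1)

    def forward(self, x):
        return self.net(x)  # predicts a 1-channel depth map

def guided_noise(unet, depth_head, x_t, alpha_bar, scale=1.0):
    """One guided prediction: eps' = eps + scale * sqrt(1 - a_bar) * grad."""
    x_t = x_t.detach().requires_grad_(True)
    eps = unet(x_t, None)
    # Posterior-mean estimate of the clean image x0 from (x_t, eps).
    x0_hat = (x_t - (1 - alpha_bar) ** 0.5 * eps) / alpha_bar ** 0.5
    depth = depth_head(x0_hat)
    # Placeholder depth objective: encourage a smooth depth map.
    loss = depth.diff(dim=-1).abs().mean() + depth.diff(dim=-2).abs().mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return (eps + scale * (1 - alpha_bar) ** 0.5 * grad).detach()

x_t = torch.randn(1, 3, 32, 32)
eps = guided_noise(TinyUNet(), DepthHead(), x_t, alpha_bar=0.5)
print(eps.shape)  # torch.Size([1, 3, 32, 32])
```

In a full sampler, `guided_noise` would replace the plain noise prediction at each denoising step, and the placeholder smoothness loss would be swapped for a depth-aware objective such as the ones sketched further below.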
First, we train an asymmetric pixel-wise depth predictor, conditioned on the timestep, on top of a pretrained diffusion model in a label-efficient manner. We then apply two guidance strategies during sampling. First, we extract strong and weak depth maps from the DDPM network with this predictor and apply a depth-consistency guidance between them. Second, feeding the depth map from the strong branch as input, we use a pretrained depth diffusion model to inject its depth-domain prior into the intermediate images, pushing them to be depth-aware. Both strategies are sketched in code below.
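The two strategies can be read as two scalar objectives whose gradients steer sampling, as in the earlier snippet. The sketch below reflects our reading under stated assumptions: pseudo-labeling is modeled as a stop-gradient MSE that pulls the weak branch's depth toward the strong branch's, and the depth prior as a score-distillation-style denoising error from a pretrained depth diffusion model. The function names and the identity stand-in for that model are hypothetical.

```python
import torch
import torch.nn.functional as F

def consistency_loss(depth_strong, depth_weak):
    """Pseudo-labeling: treat the strong branch's depth as a fixed target
    and pull the weak branch's prediction toward it."""
    return F.mse_loss(depth_weak, depth_strong.detach())

def prior_loss(depth_dm, depth, alpha_bar):
    """Depth-domain prior: noise the predicted depth, ask a pretrained depth
    diffusion model to denoise it, and penalize its prediction error
    (a score-distillation-style objective)."""
    noise = torch.randn_like(depth)
    d_t = alpha_bar ** 0.5 * depth + (1 - alpha_bar) ** 0.5 * noise
    return F.mse_loss(depth_dm(d_t), noise)

# Demo with random tensors; an identity function stands in for the real
# pretrained depth diffusion model.
d_strong = torch.randn(1, 1, 32, 32)
d_weak = torch.randn(1, 1, 32, 32)
total = consistency_loss(d_strong, d_weak) + prior_loss(lambda d: d, d_weak, 0.5)
print(total.item())
```

During sampling, the combined loss would be differentiated with respect to the intermediate image, exactly like the placeholder loss in the previous sketch.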
We visualize (top) samples generated without guidance ((a), (c), (e), (g)) and with depth-aware guidance ((b), (d), (f), (h)), together with their corresponding depths (middle) and surface normals (bottom).
We compare results generated by the baseline without (left) and with (right) our guidance, showing the images and their corresponding point-cloud visualizations.
The first row shows unguided samples from DDIM; the second row shows samples guided with our method, DAG.
We show quantitative results on LSUN Bedroom (left) and LSUN Church (right). dFID denotes the FID score computed on estimated depth images. All metrics are computed over 5,000 generated samples. Best results are in bold.
G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, and S. Kim. DAG: Depth-Aware Guidance with Denoising Diffusion Probabilistic Models. arXiv preprint.