A Noise is Worth Diffusion Guidance

1Korea University    2KAIST    3Sookmyung Women's University    4Hugging Face
$\bullet$ Equal Contribution    $\circ$ Co-corresponding Author


Diffusion models often fail to generate high-quality images without guidance methods such as classifier-free guidance (CFG). We propose NoiseRefine, a novel approach that improves image quality in guidance-free generation by learning to map initial noise into a guidance-free noise space.

Abstract

Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to "guidance-free noise", we uncover that low-magnitude, low-frequency components significantly enhance the denoising process, removing the need for guidance. Expanding on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise eliminates the need for guidance.
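The observation that guidance-free noise differs from Gaussian noise mainly in small low-frequency components can be probed with a simple spectral measurement. The sketch below is a toy illustration only: the "refined" noise is synthesized by hand as a Gaussian sample plus a small smooth offset, purely to demonstrate the measurement, and is not the paper's actual refined noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowfreq_energy(noise, cutoff=4):
    """Fraction of spectral energy in the lowest spatial frequencies."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(noise))) ** 2
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    low = spec[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff].sum()
    return low / spec.sum()

gaussian = rng.standard_normal((64, 64))

# Hypothetical "refined" noise (an assumption for illustration): the same
# Gaussian sample plus a small, smooth low-frequency offset, mimicking the
# kind of structure the paper reports in guidance-free noise.
yy, xx = np.meshgrid(np.linspace(0, 2 * np.pi, 64),
                     np.linspace(0, 2 * np.pi, 64))
refined = gaussian + 0.3 * np.sin(xx) * np.cos(yy)

# The refined noise carries a larger share of its energy at low frequencies,
# while remaining close to the original Gaussian sample in magnitude.
```

A measurement like this makes the paper's claim falsifiable: if refined noise truly differs only in low-magnitude, low-frequency components, its low-frequency energy fraction should rise while its overall norm stays near that of the Gaussian sample.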

Architecture

We propose a training methodology to learn a mapping from initial noise to guidance-free noise. Given an initial Gaussian noise $x_T$, the original diffusion model, parameterized by $\theta$, generates an image $x_{0}^{\text{Guide}}$ using guidance. NoiseRefine refines the initial noise $x_T$ to produce $\hat{x}_T = g_\phi(x_T)$, which is then input to the original model to generate an image $\hat{x}_0$ without guidance. By minimizing the distance between the two images, $d(x_{0}^{\text{Guide}}, \hat{x}_0)$, NoiseRefine effectively learns the desired mapping. Note that both NoiseRefine and the original model also receive a prompt $c$ as input, though this is omitted here for simplicity.
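The training loop above can be sketched in a few lines. The snippet below is a toy illustration with linear stand-ins: `denoise` collapses the whole frozen sampling pipeline into one matrix, `guidance_scale` crudely mimics how guidance pushes the output, and `refine` plays the role of $g_\phi$. All of these names and forms are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "diffusion model" stand-in: one linear map (an assumption, not the
# paper's network). guidance_scale crudely mimics CFG's push on the output.
W = 0.1 * rng.standard_normal((8, 8))

def denoise(x_T, guidance_scale=1.0):
    return guidance_scale * (W @ x_T)

def refine(x_T, phi):
    # NoiseRefine stand-in g_phi: a learned residual refinement of the noise.
    return x_T + phi @ x_T

phi = np.zeros((8, 8))  # parameters of the noise-refining model
lr = 0.05
for _ in range(2000):
    x_T = rng.standard_normal(8)
    x0_guide = denoise(x_T, guidance_scale=3.0)  # teacher: guided sample
    x0_hat = denoise(refine(x_T, phi))           # student: guidance-free sample
    err = x0_hat - x0_guide                      # gradient of d = squared L2
    phi -= lr * np.outer(W.T @ err, x_T)         # chain rule through denoise(refine(.))

# Guidance-free generation from refined noise should now track the guided
# output far more closely than generation from the raw Gaussian noise.
x_T = rng.standard_normal(8)
baseline = np.linalg.norm(denoise(x_T) - denoise(x_T, 3.0))
gap = np.linalg.norm(denoise(refine(x_T, phi)) - denoise(x_T, 3.0))
```

Note the key design choice the sketch preserves: the diffusion model $\theta$ stays frozen, and only the noise-refining parameters $\phi$ receive gradients, which is why learning in noise space can converge quickly on relatively little data.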

Qualitative Results


Quantitative Results

Quantitative comparison of human preference scores across diverse datasets. Starting from refined noise (Ours) consistently yields higher human preference scores than starting from Gaussian noise, with scores comparable to those obtained using guidance.
    
  • (Left) Comparison of FID/IS and inference time. Ours (second row) achieves lower inference time with FID/IS scores comparable to those obtained using guidance.
  • (Right) User study results. Starting from refined noise rather than random Gaussian noise yields superior quality outcomes.