A Noise is Worth Diffusion Guidance

1Korea University    2KAIST    3Sookmyung Women's University    4Hugging Face
$\bullet$ Equal Contribution    $\circ$ Co-corresponding Author


Diffusion models often fail to generate high-quality images without guidance methods such as classifier-free guidance (CFG). We propose NoiseRefine, a novel approach that improves image quality in guidance-free generation by learning to map initial noise into a guidance-free noise space.

Abstract

Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to "guidance-free noise", we uncover that low-magnitude, low-frequency components significantly enhance the denoising process, removing the need for guidance. Expanding on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise eliminates the need for guidance.
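The observation that guidance-free noise differs from Gaussian noise mainly in small low-frequency components can be probed with a simple spectral measurement. The sketch below is a toy illustration only: the "refined" noise is synthesized by hand as a Gaussian sample plus a small smooth offset, purely to demonstrate the measurement, and is not the paper's actual refined noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowfreq_energy(noise, cutoff=4):
    """Fraction of spectral energy in the lowest spatial frequencies."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(noise))) ** 2
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    low = spec[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff].sum()
    return low / spec.sum()

gaussian = rng.standard_normal((64, 64))

# Hypothetical "refined" noise (an assumption for illustration): the same
# Gaussian sample plus a small, smooth low-frequency offset, mimicking the
# kind of structure the paper reports in guidance-free noise.
yy, xx = np.meshgrid(np.linspace(0, 2 * np.pi, 64),
                     np.linspace(0, 2 * np.pi, 64))
refined = gaussian + 0.3 * np.sin(xx) * np.cos(yy)

# The refined noise carries a larger share of its energy at low frequencies,
# while remaining close to the original Gaussian sample in magnitude.
```

A measurement like this makes the paper's claim falsifiable: if refined noise truly differs only in low-magnitude, low-frequency components, its low-frequency energy fraction should rise while its overall norm stays near that of the Gaussian sample.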

Architecture

We propose a training methodology to learn a mapping from initial noise to guidance-free noise. Given an initial Gaussian noise $x_T$, the original diffusion model, parameterized by $\theta$, generates an image $x_{0}^{\text{Guide}}$ using guidance. NoiseRefine refines the initial noise $x_T$ to produce $\hat{x}_T = g_\phi(x_T)$, which is then input to the original model to generate an image $\hat{x}_0$ without guidance. By minimizing the distance between the two images, $d(x_{0}^{\text{Guide}}, \hat{x}_0)$, NoiseRefine effectively learns the desired mapping. Note that both NoiseRefine and the original model also receive a prompt $c$ as input, though this is omitted here for simplicity.
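The training loop above can be sketched in a few lines. The snippet below is a toy illustration with linear stand-ins: `denoise` collapses the whole frozen sampling pipeline into one matrix, `guidance_scale` crudely mimics how guidance pushes the output, and `refine` plays the role of $g_\phi$. All of these names and forms are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "diffusion model" stand-in: one linear map (an assumption, not the
# paper's network). guidance_scale crudely mimics CFG's push on the output.
W = 0.1 * rng.standard_normal((8, 8))

def denoise(x_T, guidance_scale=1.0):
    return guidance_scale * (W @ x_T)

def refine(x_T, phi):
    # NoiseRefine stand-in g_phi: a learned residual refinement of the noise.
    return x_T + phi @ x_T

phi = np.zeros((8, 8))  # parameters of the noise-refining model
lr = 0.05
for _ in range(2000):
    x_T = rng.standard_normal(8)
    x0_guide = denoise(x_T, guidance_scale=3.0)  # teacher: guided sample
    x0_hat = denoise(refine(x_T, phi))           # student: guidance-free sample
    err = x0_hat - x0_guide                      # gradient of d = squared L2
    phi -= lr * np.outer(W.T @ err, x_T)         # chain rule through denoise(refine(.))

# Guidance-free generation from refined noise should now track the guided
# output far more closely than generation from the raw Gaussian noise.
x_T = rng.standard_normal(8)
baseline = np.linalg.norm(denoise(x_T) - denoise(x_T, 3.0))
gap = np.linalg.norm(denoise(refine(x_T, phi)) - denoise(x_T, 3.0))
```

Note the key design choice the sketch preserves: the diffusion model $\theta$ stays frozen, and only the noise-refining parameters $\phi$ receive gradients, which is why learning in noise space can converge quickly on relatively little data.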

Qualitative Results


Quantitative Results

Quantitative comparison of human preference scores across diverse datasets. Starting from refined noise (Ours) consistently yields higher human preference scores than starting from Gaussian noise, with scores comparable to those obtained using guidance.
    
  • (Left) Comparison of FID/IS and inference time. Ours (second row) achieves lower inference time with FID/IS scores comparable to those obtained using guidance.
  • (Right) User study results. Starting from refined noise rather than random Gaussian noise yields superior quality outcomes.