Fine-Grained Perturbation Guidance via Attention Head Selection

1KAIST AI    2Korea University   
3Krea AI    4Hugging Face    5University of Washington
$\bullet$ Equal Contribution    $\circ$ Co-corresponding Author

Motivation

Attention perturbation guidance exhibits highly diverse effects depending on the perturbed heads, with some heads showing interpretable characteristics.

Head Combination

Combining perturbed heads results in a composition of their corresponding visual concepts.

HeadHunter

Selecting attention heads aligned with the target objective effectively steers sampling toward user-defined goals.
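One way to read this is as a greedy search over (layer, head) indices: repeatedly add the head whose perturbation most improves a user-defined score. The Python sketch below shows that loop under stated assumptions; `score_fn` is a hypothetical callable (e.g., generate images with the given heads perturbed and return an aesthetic or style-alignment score) and is not part of any released API.

```python
def headhunter_select(candidate_heads, score_fn, k=5):
    """Greedy head selection in the spirit of HeadHunter (illustrative sketch).

    candidate_heads: iterable of (layer, head) index pairs eligible for
        perturbation.
    score_fn: hypothetical callable mapping a list of heads to a scalar
        objective, evaluated by generating samples with those heads perturbed.
    k: number of heads to accumulate into the perturbed set.
    """
    selected = []
    remaining = list(candidate_heads)
    for _ in range(min(k, len(remaining))):
        # Tentatively add each remaining head and keep the single head that
        # best improves the objective when perturbed jointly with the heads
        # selected so far.
        best = max(remaining, key=lambda h: score_fn(selected + [h]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example candidate pool for a DiT with 24 layers of 24 heads each
# (sizes are placeholders, not a specific model's configuration):
# candidates = [(l, h) for l in range(24) for h in range(24)]
```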

Abstract

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
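Concretely, SoftPAG forms, for each selected head, the convex combination $(1-\lambda)A + \lambda I$ of the head's softmax attention map $A$ and the identity $I$. The PyTorch sketch below illustrates the operation on plain scaled-dot-product attention; the function name, argument layout, and head indices are illustrative assumptions, not the paper's released code.

```python
import torch

def softpag_attention(q, k, v, perturbed_heads, lam=0.5):
    """Illustrative SoftPAG-style perturbation (not the official implementation).

    For each head in `perturbed_heads`, the softmax attention map A is
    linearly interpolated toward the identity matrix I:
        A' = (1 - lam) * A + lam * I
    lam = 0 recovers standard attention; lam = 1 replaces A entirely with I.
    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)  # (B, H, N, N)
    eye = torch.eye(attn.shape[-1], device=attn.device, dtype=attn.dtype)
    attn[:, perturbed_heads] = (1 - lam) * attn[:, perturbed_heads] + lam * eye
    return attn @ v

# Toy usage: batch 2, 8 heads, 16 tokens, 64-dim heads; perturb heads 2 and 5.
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))
out = softpag_attention(q, k, v, perturbed_heads=[2, 5], lam=0.5)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```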

Qualitative Results

Quantitative Results

Quantitative results of HeadHunter for style-oriented quality improvement. As more heads are accumulated into the perturbed set, the generated images progressively align better with the target style and exhibit improved visual quality.

Interpolation

Generated images with linear interpolation between the attention map A and the identity matrix I. Increasing the interpolation weight enhances quality up to a point, but eventually causes over-saturation and structural over-simplification.
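For reference, writing $\lambda$ for the interpolation weight varied above, the perturbed map of a selected head is:

```latex
\tilde{A} = (1 - \lambda)\, A + \lambda I, \qquad \lambda \in [0, 1],
```

so $\lambda = 0$ leaves the head untouched and $\lambda = 1$ replaces its attention map entirely with the identity, the strongest (PAG-style) perturbation.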

BibTeX


@article{ahn2025fine,
  title   = {Fine-Grained Perturbation Guidance via Attention Head Selection},
  author  = {Donghoon Ahn and Jiwon Kang and Sanghyun Lee and Minjae Kim and Jaewon Min and Wooseok Jang and Saungwu Lee and Sayak Paul and Susung Hong and Seungryong Kim},
  journal = {arXiv preprint arXiv:2506.10978},
  year    = {2025}
}