TL;DR: Seg4Diff analyzes internal mechanisms of multi-modal diffusion transformers to identify semantic grounding expert layers that naturally yield high-quality zero-shot segmentation masks, then enhances them through lightweight LoRA fine-tuning to improve both segmentation and generation quality.

Overview

Seg4Diff teaser

We introduce Seg4Diff, a systematic framework designed to analyze and enhance the emergent semantic grounding capabilities of multi-modal diffusion transformer (MM-DiT) blocks in text-to-image diffusion transformers (DiTs). Here, semantic grounding expert refers to a specific MM-DiT block responsible for establishing semantic alignment between text and image features.

Analysis

We conduct an in-depth analysis of the joint attention mechanism in MM-DiT models to understand how text and image tokens interact. We characterize the distribution of attention scores to identify where active cross-modal interaction occurs, and complement this with attention feature similarity measures to assess which modality exerts greater influence on the output representations.

Attention Score Analysis

Method diagram

Multi-modal attention mechanism: (a) Conceptual visualization of the attention map. (b–c) Ratios of attention assigned to image vs. text tokens. The dotted line denotes the ratio under uniform attention. Higher cross-modal proportions are observed in I2T and T2I attention.
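As a concrete illustration of this measurement, the sketch below splits a joint attention map into its I2I, I2T, T2I, and T2T blocks and compares the cross-modal attention mass against the uniform-attention reference. The token layout (image tokens first) and the helper name are illustrative assumptions, not the released implementation.

```python
import torch

def cross_modal_attention_ratios(attn: torch.Tensor, n_img: int, n_txt: int):
    """Average attention mass each query group assigns to image vs. text keys.

    attn: (heads, n_img + n_txt, n_img + n_txt) joint attention map, rows sum to 1.
    """
    img_q, txt_q = attn[:, :n_img, :], attn[:, n_img:, :]
    ratios = {
        "I2I": img_q[:, :, :n_img].sum(-1).mean().item(),  # image queries -> image keys
        "I2T": img_q[:, :, n_img:].sum(-1).mean().item(),  # image queries -> text keys
        "T2I": txt_q[:, :, :n_img].sum(-1).mean().item(),  # text queries  -> image keys
        "T2T": txt_q[:, :, n_img:].sum(-1).mean().item(),  # text queries  -> text keys
    }
    ratios["uniform_text_share"] = n_txt / (n_img + n_txt)  # dotted-line reference
    return ratios
```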

Attention Feature Analysis

Feature PCA analysis

PCA visualization of the query, key, and value projections. The PCA results show that some layers exhibit strong positional bias, whereas others show clear semantic groupings.
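A minimal sketch of this kind of visualization, assuming `feats` holds the query, key, or value projection of the image tokens from a single MM-DiT layer; the helper simply maps the top three principal components to RGB for display.

```python
import torch

def pca_rgb(feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Map token features (N_img, dim) to an (h, w, 3) image via their top-3 PCA components."""
    x = feats - feats.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)               # columns of v are principal directions
    proj = x @ v                                       # (N_img, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(h, w, 3)                       # view as an RGB map over the latent grid
```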

Attention normalization analysis

Attention feature norm analysis. The L2 norm of the value projection for image and text tokens reveals that certain layers exhibit significantly stronger value magnitudes for text tokens compared to image tokens.
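The corresponding measurement is straightforward; a minimal sketch, again assuming image tokens come first in the joint sequence:

```python
import torch

def modality_value_norms(values: torch.Tensor, n_img: int):
    """Mean L2 norm of value vectors for image vs. text tokens.

    values: (n_img + n_txt, dim) value projection of one joint-attention layer.
    """
    img_norm = values[:n_img].norm(dim=-1).mean().item()
    txt_norm = values[n_img:].norm(dim=-1).mean().item()
    return img_norm, txt_norm
```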

Essential Role of I2T Attention in Text-to-Image Generation

I2T attention perturbation

Effect of I2T attention perturbation. Blurring the I2T regions of specific attention layers severely disrupts text–image alignment, showing that these layers are crucial for injecting semantic content into images.
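A hedged sketch of this kind of perturbation, assuming the joint attention map is laid out with image tokens first and that the I2T scores are blurred on the latent grid; the blur strength and the row renormalization are illustrative choices.

```python
import torch
import torchvision.transforms.functional as TF

def blur_i2t(attn: torch.Tensor, n_img: int, h: int, w: int, sigma: float = 3.0):
    """Blur the I2T block of a joint attention map on the (h, w) latent grid.

    attn: (heads, n_img + n_txt, n_img + n_txt), rows sum to 1, image tokens first.
    """
    heads = attn.shape[0]
    i2t = attn[:, :n_img, n_img:]                       # (heads, n_img, n_txt)
    maps = i2t.permute(0, 2, 1).reshape(-1, 1, h, w)    # one spatial map per (head, text token)
    k = 2 * int(3 * sigma) + 1                          # odd kernel size covering ~3 sigma
    maps = TF.gaussian_blur(maps, kernel_size=k, sigma=sigma)
    out = attn.clone()
    out[:, :n_img, n_img:] = maps.reshape(heads, -1, h * w).permute(0, 2, 1)
    return out / out.sum(-1, keepdim=True)              # keep rows normalized
```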

Perturbed I2T guidance

Effect of perturbed I2T guidance. By turning this effect into a simple guidance strategy, we achieve images with higher fidelity and stronger alignment to the prompt.
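The guidance itself can be as simple as extrapolating the denoiser output away from a second forward pass whose I2T attention has been blurred, in the spirit of perturbed-attention guidance. The exact formulation and scale used in the paper may differ, so treat this as an assumption-laden sketch.

```python
import torch

def perturbed_i2t_guidance(pred: torch.Tensor, pred_perturbed: torch.Tensor,
                           scale: float = 2.0) -> torch.Tensor:
    """Extrapolate the denoiser output away from the branch whose I2T attention was blurred."""
    return pred + scale * (pred - pred_perturbed)
```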

The analysis reveals that specific layers consistently align textual semantics with contiguous image regions, demonstrating emergent semantic grounding capability. These layers naturally yield high-quality zero-shot segmentation masks without explicit training for segmentation tasks.

Zero-shot OVSS Framework

We propose a zero-shot framework for semantic grounding in multi-modal diffusion transformers. Given an input image and text prompt, we extract I2T attention to generate a zero-shot segmentation mask.

Seg4Diff framework illustration

Open-vocabulary semantic segmentation scheme in our framework. We generate segmentation masks by interpreting the I2T attention scores, where the score map for each text token serves as a direct measure of image-text similarity to produce the final prediction.
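A minimal sketch of the labeling step, assuming the I2T attention of one layer has been averaged over heads into an (n_img, n_txt) matrix and that `class_token_ids` (a hypothetical mapping) records which text tokens correspond to each class name.

```python
import torch

def i2t_to_segmentation(i2t: torch.Tensor, class_token_ids: dict, h: int, w: int) -> torch.Tensor:
    """Per-pixel class prediction from I2T attention scores.

    i2t: (n_img, n_txt) head-averaged I2T attention from the expert layer.
    class_token_ids: {class name: list of text-token indices belonging to that class}.
    """
    classes = list(class_token_ids)
    # One score map per class: average the attention columns of its text tokens.
    scores = torch.stack([i2t[:, class_token_ids[c]].mean(-1) for c in classes], dim=-1)
    return scores.argmax(-1).reshape(h, w)              # (h, w) map of class indices
```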

Layer-wise segmentation performance

Open-vocabulary semantic segmentation performance across layers. Semantic grounding quality varies across MM-DiT layers, peaking in the middle blocks, specifically at the 9th layer. This matches our earlier analysis, where semantic grounding is stronger in certain layers; we refer to these layers as semantic grounding expert layers.

Decomposing Multi-modal Attention

Multi-modal attention decomposition

Deeper analysis of the multi-modal attention mechanism. (a) Multi-granularity behavior of token-level and head-level attention, and (b) emergent semantic grouping on <pad> tokens in the unconditional generation scenario.
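Read out as a segmentation, this grouping amounts to assigning each image token to the <pad> token it attends to most strongly. A hedged sketch, with the tensor layout assumed:

```python
import torch

def pad_token_groups(i2t_pad: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Unsupervised region labels from the attention of image tokens to <pad> tokens.

    i2t_pad: (n_img, n_pad) I2T attention restricted to the <pad> token columns.
    """
    return i2t_pad.argmax(-1).reshape(h, w)             # each pixel joins its dominant <pad> group
```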

Mask Alignment for Segmentation and Generation (MAGNET)

Training scheme

Lightweight fine-tuning pipeline via mask alignment. We introduce a simple yet effective mask alignment for segmentation and generation (MAGNET) strategy that strengthens the I2T attention maps in the semantic grounding expert layer during additional diffusion fine-tuning with a LoRA adapter.
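A hedged sketch of what such a mask-alignment term can look like: a cross-entropy that encourages the expert layer's I2T attention rows to concentrate on the text token of the mask each image token belongs to, added to the usual diffusion loss during LoRA fine-tuning. The precise loss, weighting, and tensor names here are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def mask_alignment_loss(i2t: torch.Tensor, mask_labels: torch.Tensor) -> torch.Tensor:
    """Encourage expert-layer I2T attention rows to concentrate on the matching mask's text token.

    i2t:         (n_img, n_txt) attention rows that already sum to 1.
    mask_labels: (n_img,) long tensor; index of the text token / mask each image token belongs to.
    """
    log_attn = torch.log(i2t.clamp_min(1e-8))
    return F.nll_loss(log_attn, mask_labels)

# During LoRA fine-tuning this term would be added to the usual diffusion objective, e.g.
# total_loss = diffusion_loss + lambda_align * mask_alignment_loss(i2t, mask_labels)
```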

Experiments

We evaluate Seg4Diff on multiple benchmarks to demonstrate the effectiveness of our method on both segmentation and generation.

Seg4Diff for Segmentation

Qualitative results

Qualitative results of Seg4Diff on segmentation tasks.

| Model | Arch. | Train. | VOC20 | Object | PC59 | ADE | City |
|---|---|---|---|---|---|---|---|
| ProxyCLIP | CLIP-H/14 | - | 83.3 | 49.8 | 39.6 | 24.2 | 42.0 |
| CorrCLIP | CLIP-H/14 | - | 91.8 | 52.7 | 47.9 | 28.8 | 49.9 |
| DiffSegmenter | SD1.5 | - | 66.4 | 40.0 | 45.9 | 24.2 | 12.4 |
| iSeg | SD1.5 | - | 82.9 | 57.3 | 39.2 | 24.2 | 24.8 |
| Seg4Diff | SD3 | - | 89.2 | 62.0 | 49.0 | 34.2 | 26.5 |
| Seg4Diff | SD3.5 | - | 86.1 | 57.8 | 43.4 | 30.7 | 23.8 |
| Seg4Diff | Flux.1-dev | - | 83.1 | 50.6 | 38.2 | 23.9 | 17.1 |
| Seg4Diff + MAGNET | SD3 | SA-1B | 89.1 | 62.0 | 49.1 | 34.7 | 25.4 |
| Seg4Diff + MAGNET | SD3 | COCO | 89.8 | 62.9 | 51.2 | 35.2 | 26.0 |

(a) Open-vocabulary semantic segmentation performance. Cross-modal alignment in the I2T attention maps of the semantic grounding expert layer yields competitive results, further enhanced by mask alignment.

| Model | Arch. | Train. | VOC21 | PC59 | Object | Stuff-27 | City | ADE |
|---|---|---|---|---|---|---|---|---|
| ReCO | CLIP-L/14 | - | 25.1 | 19.9 | 15.7 | 26.3 | 19.3 | 11.2 |
| MaskCLIP | CLIP-B/16 | - | 38.8 | 23.6 | 20.6 | 19.6 | 10.0 | 9.8 |
| MaskCut | DINO-B/8 | - | 53.8 | 43.4 | 30.1 | 41.7 | 18.7 | 35.7 |
| DiffSeg | SD1.5 | - | 49.8 | 48.8 | 23.2 | 44.2 | 16.8 | 37.7 |
| DiffCut | SSD-1B | - | 62.0 | 54.1 | 32.0 | 46.1 | 28.4 | 42.4 |
| Seg4Diff | SD3 | - | 54.9 | 52.6 | 38.5 | 49.7 | 24.2 | 44.9 |
| Seg4Diff | SD3.5 | - | 52.3 | 52.9 | 36.8 | 47.1 | 24.2 | 41.5 |
| Seg4Diff + MAGNET | SD3 | SA-1B | 55.1 | 52.8 | 39.0 | 50.8 | 24.2 | 45.0 |
| Seg4Diff + MAGNET | SD3 | COCO | 56.1 | 53.5 | 38.8 | 53.5 | 24.4 | 45.4 |

(b) Unsupervised segmentation performance. Although our framework is not specifically designed for unsupervised semantic segmentation, exploiting the emergent semantic grouping of <pad> tokens in the I2T attention maps achieves competitive results.

Seg4Diff for Image Generation

Attention analysis
Layer analysis
LoRA analysis
Qualitative results

Qualitative results of Seg4Diff across different prompts.

Qualitative results with attention visualization

Qualitative results of Mask Alignment. Mask alignment improves structural coherence and alignment between image and text.

| Method | Training | Pick-a-Pic | COCO | SA-1B | Mean |
|---|---|---|---|---|---|
| Baseline | - | 27.0252 | 26.0638 | 28.3422 | 27.1437 |
| + MAGNET | SA-1B | 27.0547 | 26.2318 | 28.4476 | 27.2447 |
| + MAGNET | COCO | 27.0409 | 26.2319 | 28.5553 | 27.2760 |

(a) CLIPScore on text-to-image generation benchmarks. Mask alignment consistently improves alignment with text prompts across various datasets.

| Method | Training | Color | Shape | Texture | 2D | 3D | Non-spatial | Num. | Comp. |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | 0.7864 | 0.5644 | 0.7200 | 0.2435 | 0.3318 | 0.3124 | 0.5566 | 0.3719 |
| + MAGNET | SA-1B | 0.7836 | 0.5679 | 0.7252 | 0.2330 | 0.3151 | 0.3113 | 0.5460 | 0.3709 |
| + MAGNET | COCO | 0.7919 | 0.5687 | 0.7260 | 0.2301 | 0.3234 | 0.3120 | 0.5584 | 0.3735 |

Color, Shape, and Texture measure attribute binding; 2D, 3D, and Non-spatial measure object relationships.

(b) T2I-Compbench++ performance. Mask alignment enhances attribute binding, object relationships, and compositional understanding compared to the baseline.

Conclusion

In this work, we introduce Seg4Diff, a systematic analysis framework for multi-modal diffusion transformers that identifies semantic grounding expert layers capable of producing high-quality zero-shot segmentation masks. Our comprehensive analysis reveals that semantic alignment is an emergent property of diffusion transformers and can be selectively amplified through lightweight LoRA fine-tuning to improve both dense recognition and generative performance.

Citation

If you use this work or find it helpful, please consider citing:

@misc{kim2025seg4diffunveilingopenvocabularysegmentation,
    title={Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers}, 
    author={Chaehyun Kim and Heeseong Shin and Eunbeen Hong and Heeji Yoon and Anurag Arnab and Paul Hongsuck Seo and Sunghwan Hong and Seungryong Kim},
    year={2025},
    eprint={2509.18096},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.18096}, 
}