AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild

KAIST AI, Korea University, Hyundai Mobis
ICCV 2025

(a) Given an exemplar image representing the user-intended appearance and a target segmentation map guiding the semantic structure, our AM-Adapter generates high-quality images that preserve the exemplar's detailed local appearance while adhering to the structure defined by the segmentation map. We demonstrate the versatility of our method in various applications, including (b) image-to-image translation and (c) segmentation-based image editing.


Abstract

Exemplar-based semantic image synthesis generates images that align with the given semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models, such as ControlNet, are limited in that they rely solely on text prompts to control appearance and cannot take exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground appearance transfer and struggle with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) for exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input; alternatively, it lets users map object details to specific locations in the synthesized image. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation from matching, we adopt stage-wise training: we first train the structure-guidance and generation networks, then train the matching adapter while keeping the others frozen. At inference, we introduce an automated exemplar retrieval method that efficiently selects exemplar image-segmentation pairs. Despite using minimal learnable parameters, AM-Adapter achieves state-of-the-art performance, excelling in both semantic alignment and local appearance fidelity. Extensive ablations validate our design choices.
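The key mechanism can be illustrated with a minimal sketch, assuming a standard scaled-dot-product formulation; this is not the authors' released code, and the function name, the additive bias form, and the weight lambda_seg are illustrative assumptions. In augmented self-attention, the exemplar's keys and values are concatenated to the target's own, and here the cross-image portion of the attention logits is additionally biased by a semantic matching cost derived from the two segmentation maps, steering matches toward same-class regions.

import torch

def augmented_self_attention(q_tgt, k_tgt, v_tgt, k_ex, v_ex, seg_cost, lambda_seg=1.0):
    """Sketch of segmentation-biased augmented self-attention (assumed form).

    q_tgt, k_tgt, v_tgt: (B, N, C) target queries/keys/values.
    k_ex, v_ex:          (B, M, C) exemplar keys/values.
    seg_cost:            (B, N, M) semantic cost between target and exemplar
                         positions (e.g., high where class labels disagree).
    """
    scale = q_tgt.shape[-1] ** -0.5
    # Augmented self-attention: append the exemplar's keys/values to the target's.
    k = torch.cat([k_tgt, k_ex], dim=1)               # (B, N+M, C)
    v = torch.cat([v_tgt, v_ex], dim=1)
    logits = q_tgt @ k.transpose(-2, -1) * scale      # (B, N, N+M)
    # Penalize only the cross-image (exemplar) columns with the semantic cost.
    logits[:, :, k_tgt.shape[1]:] -= lambda_seg * seg_cost
    return logits.softmax(dim=-1) @ v                 # (B, N, C)

# One simple way to build seg_cost from flattened per-position class labels:
# seg_cost = (labels_tgt[:, :, None] != labels_ex[:, None, :]).float()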

Motivation

Motivation image 1

Semantic image synthesis results by (a) ControlNeXt, (b) IP-Adapter + ControlNet, (c) Ctrl-X, and (d) AM-Adapter. Compared to prior methods, which often fail to capture the exemplar's local appearance or struggle to find accurate matches between the exemplar image and the target segmentation, our AM-Adapter precisely matches exemplars to targets in content-rich scenarios, preserving local details while maintaining alignment with the target structure.

Architecture

Method image

Overall architecture of the proposed framework.

Qualitative Results

Qualitative Comparisons on BDD100K and Cityscapes. Compared to previous methods, which fail to transfer local appearance or to find accurate matches, AM-Adapter achieves precise matching in content-rich, complex scenes.

Additional Qualitative Results of AM-Adapter. Visualization of results generated by AM-Adapter across various scenarios.

Qualitative Comparisons with Other Models. Additional qualitative comparisons on the BDD100K dataset with previous methods.

Quantitative Results

Main Quantitative Results. The best-performing results are highlighted in red, while the second-best results are shaded in yellow.

BibTeX

@misc{jin2025appearancematchingadapterexemplarbased,
      title={Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild}, 
      author={Siyoon Jin and Jisu Nam and Jiyoung Kim and Dahyun Chung and Yeong-Seok Kim and Joonhyung Park and Heonjeong Chu and Seungryong Kim},
      year={2025},
      eprint={2412.03150},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.03150}, 
}