Exemplar-based semantic image synthesis generates images aligned with given semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models such as ControlNet are limited because they rely solely on text prompts to control appearance and cannot take exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground object appearance transfer, and they struggle with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) to address exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input and, alternatively, offers controllability by mapping user-defined object details to specific locations in the synthesized images. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation and matching, we adopt stage-wise training: we first train the structure-guidance and generation networks, then train the matching adapter while keeping the others frozen. During inference, we introduce an automated exemplar retrieval method that efficiently selects exemplar image-segmentation pairs. Despite using only a small number of learnable parameters, AM-Adapter achieves state-of-the-art performance, excelling in both semantic alignment and local appearance fidelity. Extensive ablations validate our design choices.
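To make the notion of cross-image matching in augmented self-attention concrete, here is a minimal sketch of the generic mechanism used by tuning-free approaches: queries from the target image attend over keys and values concatenated from both the target and the exemplar. It is an illustration under stated assumptions, not the AM-Adapter implementation. The function names, the toy feature vectors, and the pure-Python formulation are all hypothetical, and the learnable adapter that injects segmentation-map information is not modeled.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def augmented_self_attention(q_tgt, k_tgt, v_tgt, k_ex, v_ex):
    """Target queries attend jointly over target and exemplar tokens.

    Hypothetical helper: each argument is a list of feature vectors
    (one per spatial token). Keys and values of the exemplar are
    appended to those of the target, so the attention weights act as
    an implicit matching between target and exemplar locations.
    """
    d = len(q_tgt[0])
    keys = k_tgt + k_ex      # concatenate along the token axis
    values = v_tgt + v_ex
    outputs, weights = [], []
    for q in q_tgt:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: one target token whose query matches the exemplar key,
# so nearly all attention mass (and hence appearance) comes from the exemplar.
outputs, weights = augmented_self_attention(
    q_tgt=[[10.0, 0.0]],
    k_tgt=[[0.0, 10.0]],
    v_tgt=[[0.0]],
    k_ex=[[10.0, 0.0]],
    v_ex=[[1.0]],
)
```

In this toy case the second attention weight (the exemplar token) dominates, so the output is pulled toward the exemplar value. In a real diffusion model the same pattern would apply per attention head over image feature tokens.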
Overall architecture of the proposed framework.
Qualitative Comparisons on BDD100K and Cityscapes. Whereas previous methods fail to transfer local appearance or to find accurate matches, AM-Adapter achieves precise matching in content-rich, complex scenes.
Additional Qualitative Results of AM-Adapter. Visualization of results generated by AM-Adapter across various scenarios.
Qualitative Comparisons with Other Models. Additional qualitative comparisons with previous methods on the BDD100K dataset.
@misc{jin2025appearancematchingadapterexemplarbased,
title={Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild},
author={Siyoon Jin and Jisu Nam and Jiyoung Kim and Dahyun Chung and Yeong-Seok Kim and Joonhyung Park and Heonjeong Chu and Seungryong Kim},
year={2025},
eprint={2412.03150},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.03150},
}