Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis

1Korea University, 2KAIST, 3Hyundai Mobis
arXiv 2024

(a) Given an exemplar image representing the user-intended appearance and a target segmentation map guiding the semantic structure, our AM-Adapter generates high-quality images that preserve the exemplar's detailed local appearance while adhering to the structure defined by the segmentation map. We demonstrate the versatility of our method in various applications, including (b) image-to-image translation and (c) segmentation-based image editing.


Abstract

Exemplar-based semantic image synthesis aims to generate images aligned with given semantic content while preserving the appearance of an exemplar image. Conventional structure-guidance models, such as ControlNet, are limited in that they cannot directly take exemplar images as input, relying instead solely on text prompts to control appearance. Recent tuning-free approaches address this limitation by transferring local appearance from the exemplar image to the synthesized image through implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, these methods struggle in content-rich scenes with significant geometric deformations, such as driving scenes. In this paper, we propose the Appearance Matching Adapter (AM-Adapter), a learnable framework that enhances cross-image matching within augmented self-attention by incorporating semantic information from segmentation maps. To effectively disentangle the generation and matching processes, we adopt a stage-wise training approach: we first train the structure-guidance and generation networks, then train the AM-Adapter while keeping the other networks frozen. During inference, we introduce an automated exemplar retrieval method to efficiently select exemplar image-segmentation pairs. Despite using a limited number of learnable parameters, our method achieves state-of-the-art performance, excelling in both semantic alignment and local appearance fidelity. Extensive ablation studies further validate our design choices. Code and pre-trained weights will be made publicly available.
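For intuition, the sketch below shows one plausible reading of segmentation-aware augmented self-attention: the target's queries attend over its own keys plus the exemplar's keys, and the cross-image attention logits are biased toward exemplar tokens that carry the same segmentation label. This is a minimal illustration under our own assumptions, not the paper's implementation; all names (`seg_affinity`, `lambda_sem`, etc.) are hypothetical.

```python
# Minimal sketch (not the released code) of augmented self-attention
# with a segmentation-aware bias on the cross-image logits.
import torch
import torch.nn.functional as F

def seg_affinity(seg_tgt, seg_ex):
    """1 where a target token and an exemplar token share a class label.
    seg_tgt: (B, N) integer labels; seg_ex: (B, M); returns (B, N, M)."""
    return (seg_tgt[:, :, None] == seg_ex[:, None, :]).float()

def augmented_self_attention(q_tgt, k_tgt, v_tgt, k_ex, v_ex,
                             seg_tgt, seg_ex, lambda_sem=1.0):
    """Target queries attend over [target keys ; exemplar keys]; exemplar
    logits receive a bonus where segmentation labels agree (assumed form)."""
    d = q_tgt.shape[-1]
    logits_self = q_tgt @ k_tgt.transpose(-1, -2) / d ** 0.5   # (B, N, N)
    logits_cross = q_tgt @ k_ex.transpose(-1, -2) / d ** 0.5   # (B, N, M)
    logits_cross = logits_cross + lambda_sem * seg_affinity(seg_tgt, seg_ex)
    # Softmax over the concatenated key axis, then mix concatenated values.
    attn = F.softmax(torch.cat([logits_self, logits_cross], dim=-1), dim=-1)
    return attn @ torch.cat([v_tgt, v_ex], dim=1)               # (B, N, d)
```

In this reading, setting `lambda_sem = 0` recovers plain augmented self-attention, so the semantic bias acts purely as a matching prior on top of the frozen attention features.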

Motivation

Motivation image 1

Semantic image synthesis results by (a) ControlNeXt, (b) IP-Adapter + ControlNet, (c) Ctrl-X, and (d) AM-Adapter. Compared to prior methods, which often fail to capture the exemplar's local appearance or struggle to find accurate matches between the exemplar image and the target segmentation, our AM-Adapter precisely matches exemplars to targets in content-rich scenarios, preserving local details while maintaining alignment with the target structure.

Architecture

Method image

Overall architecture of the proposed framework
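To illustrate the stage-wise training described in the abstract, a rough sketch of the second stage follows: the structure-guidance and generation networks are frozen, and only the adapter's parameters receive gradients. Module and function names (`am_adapter`, `loss_fn`, ...) are placeholders, not the released API.

```python
# A minimal sketch of the stage-wise recipe under our own assumptions;
# all names here are placeholders, not the authors' released code.
import itertools
import torch

def train_adapter_stage(unet, controlnet, am_adapter, loader, loss_fn,
                        steps=10_000, lr=1e-4):
    """Stage 2: the structure-guidance (controlnet) and generation (unet)
    networks are frozen; only AM-Adapter parameters are optimized."""
    for p in itertools.chain(unet.parameters(), controlnet.parameters()):
        p.requires_grad_(False)          # keep stage-1 weights fixed
    opt = torch.optim.AdamW(am_adapter.parameters(), lr=lr)
    for _, batch in zip(range(steps), itertools.cycle(loader)):
        # loss_fn is assumed to compute the usual denoising objective
        # with the adapter active inside the augmented self-attention.
        loss = loss_fn(unet, controlnet, am_adapter, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Freezing the stage-1 networks keeps generation quality fixed and isolates the adapter's effect on matching, which is consistent with the paper's goal of disentangling the generation and matching processes with few learnable parameters.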

Qualitative Results

Qualitative Comparisons on BDD100K and Cityscapes. Compared to previous methods, which fail to transfer local appearance or to find accurate matches, AM-Adapter achieves precise matching in content-rich, complex scenes.

Additional Qualitative Results of AM-Adapter. Visualization of results generated by AM-Adapter across various scenarios.

Qualitative Comparisons with Other Models. Additional qualitative comparisons with previous methods on the BDD100K dataset.

Quantitative Results

Main Quantitative Results. The best-performing results are highlighted in red, while the second-best are shaded in yellow.