Figure 1: Unlike existing methods that explicitly compute and store a discrete matching field defined at low resolution, we implicitly represent a high-dimensional 4D matching field with deep fully-connected networks, which can be queried at arbitrary resolutions, including the original image resolution.
Existing semantic correspondence pipelines commonly extract high-level semantic features to attain invariance against intra-class variations and background clutter. This architecture, however, inevitably results in a low-resolution matching field that requires an ad-hoc interpolation step as post-processing to convert it into a high-resolution one, which limits the quality of the matching results. To overcome this, inspired by the recent success of implicit neural representations, we present a novel method for semantic correspondence, called Neural Matching Field (NeMF). However, the complexity and high dimensionality of a 4D matching field are the major hindrances. To address them, we propose a cost embedding network consisting of convolution and self-attention layers that processes the coarse cost volume into a cost feature representation, which is used as guidance for establishing a high-precision matching field through the subsequent fully-connected network. Although this helps to better structure the matching field, learning a high-dimensional matching field remains challenging, mainly due to computational complexity: a naïve exhaustive inference would require querying all pixels in the 4D space to infer pixel-wise correspondences. To overcome this, in the training phase we randomly sample matching candidates, and in the inference phase we propose a novel inference approach that iteratively performs PatchMatch-based inference and coordinate optimization at test time. With the proposed method, competitive results are attained on several standard benchmarks for semantic correspondence.
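To make the core idea concrete, the sketch below shows a minimal implicit matching field: a fully-connected network that maps a continuous 4D coordinate, together with a cost feature sampled at that coordinate, to a matching score. This is an illustrative PyTorch sketch, not the authors' implementation; the module name, layer widths, and the shape of the conditioning feature are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitMatchingField(nn.Module):
    """Illustrative MLP f(x_s, y_s, x_t, y_t) -> matching score, conditioned on a cost feature."""
    def __init__(self, cost_feat_dim=128, hidden_dim=256, num_layers=4):
        super().__init__()
        layers, in_dim = [], 4 + cost_feat_dim  # 4D coordinate + sampled cost feature
        for _ in range(num_layers - 1):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True)]
            in_dim = hidden_dim
        layers += [nn.Linear(in_dim, 1)]  # scalar matching score
        self.mlp = nn.Sequential(*layers)

    def forward(self, coords, cost_feat):
        # coords: (B, N, 4) normalized (x_s, y_s, x_t, y_t); cost_feat: (B, N, C)
        return self.mlp(torch.cat([coords, cost_feat], dim=-1)).squeeze(-1)
```

Because the network is queried with continuous coordinates, correspondences can be evaluated at any resolution without interpolating a stored low-resolution field.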
Figure 2: Given a pair of images as input, we first extract features using CNNs and compute an initial, noisy cost volume at low resolution. We feed the noisy cost volume to the proposed encoder, consisting of convolution and Transformer layers, and decode it with deep fully-connected networks that take the encoded cost and coordinates as inputs.
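A hedged sketch of this pipeline is given below, assuming a ResNet-50 backbone, cosine-similarity correlation for the initial cost volume, and a small convolution + Transformer encoder; all module names and hyperparameters are illustrative rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CostEmbeddingPipeline(nn.Module):
    """Illustrative pipeline: CNN features -> 4D cost volume -> conv + Transformer cost encoder."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-3])  # stride-16 feature maps
        self.conv = nn.Conv2d(1, embed_dim, kernel_size=3, padding=1)   # applied per source position
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, src, trg):
        f_s = nn.functional.normalize(self.features(src), dim=1)  # (B, C, h, w)
        f_t = nn.functional.normalize(self.features(trg), dim=1)
        B, C, h, w = f_s.shape
        # Initial (noisy) cost volume: cosine similarity between all source/target positions.
        cost = torch.einsum('bchw,bcij->bhwij', f_s, f_t).reshape(B * h * w, 1, h, w)
        emb = self.conv(cost)                         # (B*h*w, D, h, w)
        emb = emb.flatten(2).transpose(1, 2)          # (B*h*w, h*w, D) tokens
        emb = self.transformer(emb)                   # self-attention over target positions
        return emb.reshape(B, h, w, h * w, -1)        # encoded cost feature
```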
Figure 3: Overview of neural matching field optimization: Given an encoded cost, we randomly sample coordinates from a uniform distribution. The random coordinates and the ground-truth coordinate are then processed together to obtain matching scores, and the cross-entropy loss is computed as the training signal.
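The following sketch illustrates this training objective, assuming the field model sketched earlier: the ground-truth 4D coordinate is scored against uniformly sampled random coordinates, and a cross-entropy loss treats the ground truth as the positive class. The helper `cost_feat_fn`, the number of random samples, and the exact loss formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_field_loss(field, cost_feat_fn, gt_coords, num_random=1024):
    """Cross-entropy over matching scores: ground-truth coordinate vs. random 4D samples.
    field: MLP scoring 4D coordinates; cost_feat_fn: samples the encoded cost at given coordinates.
    gt_coords: (B, 4) normalized ground-truth (x_s, y_s, x_t, y_t)."""
    B = gt_coords.shape[0]
    rand_coords = torch.rand(B, num_random, 4, device=gt_coords.device)  # uniform in [0, 1]^4
    coords = torch.cat([gt_coords.unsqueeze(1), rand_coords], dim=1)     # ground truth at index 0
    scores = field(coords, cost_feat_fn(coords))                         # (B, 1 + num_random)
    target = torch.zeros(B, dtype=torch.long, device=gt_coords.device)   # positive class = index 0
    return F.cross_entropy(scores, target)
```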
Figure 4: Illustration of the proposed PatchMatch and coordinate optimization: With the learned neural matching field, the proposed PatchMatch injects explicit smoothness and reduces the search range. The subsequent optimization strategy searches for the location that maximizes the score of the MLP.
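Below is a simplified, hypothetical version of this test-time procedure for a single source pixel: candidate target coordinates are propagated and locally perturbed PatchMatch-style, the best-scoring candidate is kept, and its coordinate is then refined by gradient ascent on the MLP score. The function `refine_correspondence`, the helper `cost_feat_fn`, and all hyperparameters are illustrative, not the authors' exact algorithm.

```python
import torch

def refine_correspondence(field, cost_feat_fn, src_xy, init_trg, num_iters=3, lr=1e-2, steps=10):
    """Simplified test-time refinement for one source pixel.
    src_xy: (2,) normalized source coordinate; init_trg: (K, 2) candidate target coordinates
    (e.g., the argmax and the current estimates of neighboring pixels)."""
    best = init_trg[0]
    for _ in range(num_iters):
        # PatchMatch-style step: evaluate propagated candidates plus local random perturbations.
        noise = 0.05 * torch.randn(8, 2, device=init_trg.device)
        cand = torch.cat([init_trg, best + noise], dim=0).clamp(0, 1)
        coords = torch.cat([src_xy.expand(cand.shape[0], 2), cand], dim=-1).unsqueeze(0)  # (1, K', 4)
        with torch.no_grad():
            scores = field(coords, cost_feat_fn(coords))[0]
        best = cand[scores.argmax()]
        # Coordinate optimization: gradient ascent on the MLP score w.r.t. the target coordinate
        # (only the coordinate is updated; the network weights stay fixed).
        trg = best.clone().requires_grad_(True)
        opt = torch.optim.Adam([trg], lr=lr)
        for _ in range(steps):
            c = torch.cat([src_xy, trg]).view(1, 1, 4)
            loss = -field(c, cost_feat_fn(c)).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
        best = trg.detach().clamp(0, 1)
    return best
```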
Figure 5: Visualization of flow maps for different numbers of iterations N: (a) source image, (b) target image. As the number of iterations at the inference phase increases along (c), (d), (e), and (f), NeMF with the trained MLP predicts more precise matching fields through PatchMatch-based sampling and coordinate optimization.
Figure 6: Visualization of matching fields: (a) source image, where the keypoint is marked as a green triangle; (b), (c) 2D contour plots of the cost by CATs and NeMF (ours), respectively; and (d), (e) 3D visualizations of the cost by CATs and NeMF, with respect to the keypoint in (a). Note that all visualizations are smoothed by a Gaussian kernel. Compared to CATs, NeMF has a higher peak near the ground truth and makes a more accurate prediction.
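A minimal sketch of how such a 2D cost slice can be visualized, assuming the cost with respect to one source keypoint has already been evaluated on a dense target grid; the Gaussian smoothing mirrors the kernel mentioned in the caption, and the function name and plotting choices are ours.

```python
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

def plot_cost_contour(cost_2d, gt_xy=None, sigma=2.0, out_path='cost_contour.png'):
    """Contour plot of a 2D cost slice (cost w.r.t. one source keypoint), Gaussian-smoothed."""
    smoothed = gaussian_filter(cost_2d, sigma=sigma)
    fig, ax = plt.subplots(figsize=(4, 4))
    cs = ax.contourf(smoothed, levels=20, cmap='viridis')
    fig.colorbar(cs, ax=ax)
    if gt_xy is not None:
        ax.plot(gt_xy[0], gt_xy[1], marker='^', color='lime', markersize=10)  # ground-truth keypoint
    ax.set_xlabel('x (target)')
    ax.set_ylabel('y (target)')
    fig.savefig(out_path, bbox_inches='tight')
```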
Table 1: Quantitative evaluation on standard benchmarks: Higher PCK is better. The best results are in bold, and the second-best results are underlined. All results are taken from the respective papers. Eval. Reso.: evaluation resolution; Flow Reso.: flow resolution.
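For reference, PCK (percentage of correct keypoints) counts a transferred keypoint as correct if it falls within α times a reference size (the image size for α_img, or the object bounding-box size for α_bbox) of the ground truth. A generic sketch, not the benchmarks' official evaluation script:

```python
import numpy as np

def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints.
    pred_kps, gt_kps: (N, 2) arrays of (x, y) coordinates.
    ref_size: max(H, W) of the image (alpha_img) or of the object bounding box (alpha_bbox)."""
    dist = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dist <= alpha * ref_size).mean())
```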
Table 2: Per-class quantitative evaluation on the SPair-71k benchmark.
Figure 7: Qualitative results on PF-PASCAL: keypoint transfer results by (a), (c) CATs and (b), (d) NeMF. Green and red lines denote correct and wrong predictions (α_img = 0.1), respectively. Note that correspondences are estimated at the original resolutions of the images.
Acknowledgements