We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.
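The "differentiable version of the argmax function" mentioned above can be illustrated with a small 1-D sketch. This is not the authors' implementation: the function names, the temperature `beta`, the Gaussian window of width `sigma` centered on the hard argmax, and the assumption of non-negative matching scores are all illustrative choices made here. The sketch compares a plain soft argmax (the expected index under a softmax) with a kernel-windowed variant on a score profile that has two peaks.

```python
import math


def softmax(scores, beta=1.0):
    """Numerically stable softmax with temperature beta."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax(scores, beta=1.0):
    """Expected index under softmax(beta * scores).

    Unlike a hard argmax, this is a smooth function of the scores
    and can return a fractional (sub-pixel) position.
    """
    probs = softmax(scores, beta)
    return sum(i * p for i, p in enumerate(probs))


def kernel_soft_argmax(scores, beta=1.0, sigma=1.0):
    """Soft argmax applied after a Gaussian window around the hard argmax.

    Assumption (illustrative): scores are non-negative, so multiplying
    by the window suppresses peaks that lie far from the strongest one.
    """
    peak = max(range(len(scores)), key=lambda i: scores[i])
    windowed = [
        s * math.exp(-((i - peak) ** 2) / (2.0 * sigma ** 2))
        for i, s in enumerate(scores)
    ]
    return soft_argmax(windowed, beta)


if __name__ == "__main__":
    # Hypothetical matching scores with a strong peak at index 2
    # and a slightly weaker, distant peak at index 8.
    scores = [0.0, 0.1, 1.0, 0.1, 0.0, 0.0, 0.0, 0.1, 0.9, 0.1]
    print("hard argmax       :", max(range(len(scores)), key=lambda i: scores[i]))
    print("soft argmax       :", soft_argmax(scores, beta=20.0))
    print("kernel soft argmax:", kernel_soft_argmax(scores, beta=20.0, sigma=1.5))
```

On this bimodal input the plain soft argmax lands between the two peaks, because the expectation averages over both modes, whereas the windowed variant stays close to the dominant peak. Both remain differentiable with respect to the scores, apart from the discrete choice of the window center.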
|Title of host publication||Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019|
|Publisher||IEEE Computer Society|
|Number of pages||10|
|Publication status||Published - 2019 Jun|
|Event||32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States|
Duration: 2019 Jun 16 → 2019 Jun 20
|Name||Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition|
|Conference||32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019|
|Period||19/6/16 → 19/6/20|
Bibliographical note
Funding Information:
and conv5-3, and estimate correspondences with different argmax operators. They do not involve any training, similar to prior work that uses off-the-shelf CNN features for semantic correspondence. We can see that applying the soft argmax directly to the baseline model degrades performance severely, since it is highly susceptible to multi-modal distributions. The results in the next three rows are obtained with a single adaptation layer on top of conv4-23. This demonstrates that the adaptation layer extracts features better suited to pixel-wise semantic correspondence, significantly boosting the performance of all baseline models. In particular, the kernel soft argmax outperforms the others by a large margin, since it enables training our model end-to-end, including the adaptation layers, at a sub-pixel level and is less susceptible to multi-modal distributions. The last three rows suggest that exploiting deeper-level features is important, and that using all components with the kernel soft argmax performs best in terms of the average PCK.

5. Conclusion

We have presented a CNN model for learning an object-aware semantic flow end-to-end, and introduced the corresponding CNN architecture, dubbed SFNet, with a novel kernel soft argmax layer that outputs differentiable matches at a sub-pixel level. We have proposed to train a model for learning pixel-to-pixel correspondences directly from binary foreground masks, which are widely available and easier to obtain than pixel-level annotations. The ablation studies clearly demonstrate the effectiveness of each component and loss in our model. Finally, we have shown that the proposed method is robust to distracting details and focuses on establishing dense correspondences between prominent objects, outperforming the state of the art on standard benchmarks by a significant margin.

Acknowledgments.
This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1C1B2005584), the Louis Vuitton/ENS chair on artificial intelligence and the NYU/Inria collaboration agreement.
© 2019 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Vision and Pattern Recognition