The goal of unsupervised co-localization is to locate the object in a scene under two assumptions: 1) the dataset consists of only one superclass, e.g., birds, and 2) the dataset contains no human-annotated labels. The most recent method achieves impressive co-localization performance by employing self-supervised representation learning approaches such as predicting rotation. In this paper, we introduce a new contrastive objective applied directly to the attention maps to enhance co-localization performance. Our contrastive loss function exploits rich location information, which induces the model to activate the extent of the object effectively. In addition, we propose a pixel-wise attention pooling that selectively aggregates the feature map according to its magnitudes across channels. Our methods are simple and are shown to be effective by extensive qualitative and quantitative evaluation, achieving state-of-the-art co-localization performance by large margins on four datasets: CUB-200-2011, Stanford Cars, FGVC-Aircraft, and Stanford Dogs. Our code will be publicly available online for the research community.
|Title of host publication||Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||10|
|Publication status||Published - 2021|
|Event||18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada (Duration: 2021 Oct 11 → 2021 Oct 17)|
|Name||Proceedings of the IEEE International Conference on Computer Vision|
|Conference||18th IEEE/CVF International Conference on Computer Vision, ICCV 2021|
|Period||2021 Oct 11 → 2021 Oct 17|
Bibliographical note
Funding Information:
Acknowledgements. This work was supported by the National Research Foundation of Korea grant funded by the Korean government (No. NRF-2019R1A2C2003760) and by the Artificial Intelligence Graduate School Program (YONSEI UNIVERSITY) under Grant 2020-0-01361.
© 2021 IEEE
All Science Journal Classification (ASJC) codes
- Computer Vision and Pattern Recognition