Referring expression object segmentation with caption-aware consistency

Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin, Ming-Hsuan Yang

Research output: Contribution to conference › Paper › peer-review

2 Citations (Scopus)

Abstract

Referring expressions are natural language descriptions that identify a particular object within a scene and are widely used in our daily conversations. In this work, we focus on segmenting the object in an image specified by a referring expression. To this end, we propose an end-to-end trainable comprehension network that consists of the language and visual encoders to extract feature representations from both domains. We introduce the spatial-aware dynamic filters to transfer knowledge from text to image, and effectively capture the spatial information of the specified object. To better communicate between the language and visual modules, we employ a caption generation network that takes features shared across both domains as input, and improves both representations via a consistency that enforces the generated sentence to be similar to the given referring expression. We evaluate the proposed framework on two referring expression datasets and show that our method performs favorably against the state-of-the-art algorithms.
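To make the idea of spatial-aware dynamic filters more concrete, the sketch below shows the general mechanism in numpy: a filter is predicted from the language feature and applied to visual features augmented with coordinate maps, yielding a per-pixel segmentation score. All dimensions, the random linear predictor, and the 1x1 filter shape are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
H, W, C = 8, 8, 16   # visual feature map height, width, channels
D = 32               # language feature dimension

visual = rng.standard_normal((H, W, C))  # stand-in for visual encoder output
lang = rng.standard_normal(D)            # stand-in for language encoder output

# Append normalized coordinate maps so the predicted filter can
# exploit position -- the "spatial-aware" part of the design.
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
visual_sp = np.concatenate([visual, ys[..., None], xs[..., None]], axis=-1)  # (H, W, C+2)

# Dynamic filter: its weights are predicted from the language feature
# by a (randomly initialized, purely illustrative) linear map.
W_pred = 0.1 * rng.standard_normal((C + 2, D))
dyn_filter = W_pred @ lang               # (C+2,) -- acts as a 1x1 conv kernel

# Apply the filter as a per-pixel dot product, then squash with a
# sigmoid to get a soft foreground score at each location.
response = visual_sp @ dyn_filter        # (H, W)
mask = 1.0 / (1.0 + np.exp(-response))

print(mask.shape)
```

A trained model would learn the filter predictor end-to-end (and, per the abstract, additionally regularize the shared features with a caption-generation consistency objective); here everything is random, so only the shapes and the gating mechanism are meaningful.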

Original language: English
Publication status: Published - 2020
Event: 30th British Machine Vision Conference, BMVC 2019 - Cardiff, United Kingdom
Duration: 2019 Sep 9 – 2019 Sep 12

Conference

Conference: 30th British Machine Vision Conference, BMVC 2019
Country/Territory: United Kingdom
City: Cardiff
Period: 19/9/9 – 19/9/12

Bibliographical note

Funding Information:
Acknowledgments. This work was supported in part by Ministry of Science and Technology (MOST) under grants 107-2628-E-001-005-MY3 and 108-2634-F-007-009.

Publisher Copyright:
© 2019. The copyright of this document resides with its authors.

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
