A referring expression is a special kind of verbal expression whose goal is to refer to a particular object in a given scene. Referring expression generation and comprehension are two inverse tasks in this field. Given the critical role that visual attributes play in distinguishing the referred object from other objects, we propose an attribute-guided attention model to address both tasks. In our framework, attributes collected from referring expressions serve as explicit supervision signals for the generation and comprehension modules. The attributes predicted online for a visual object benefit both tasks in two ways. First, attributes can be embedded directly into the generation and comprehension modules as additional visual representations that distinguish the referred object. Second, since attributes have counterparts in both the visual and the textual space, an attribute-guided attention module is proposed as a bridge linking these counterparts in the visual representation and the textual expression. Attention weights learned over both visual features and word embeddings validate our motivation. We evaluate on three standard datasets commonly used in this field: RefCOCO, RefCOCO+, and RefCOCOg. Both quantitative and qualitative results demonstrate the effectiveness of the proposed framework: it improves significantly over baseline methods and compares favorably with the state of the art. Further ablation studies and analysis clearly quantify the contribution of each module, which may provide useful insights to the community.
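To make the bridging idea concrete, below is a minimal, illustrative sketch of how a predicted attribute embedding could guide attention over a set of visual region features (the same scoring scheme could be applied to word embeddings on the textual side). All function names, shapes, and the dot-product scoring are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attribute_guided_attention(attr_emb, keys, values):
    """Weight candidate features by their similarity to an attribute embedding.

    attr_emb : (d,)   predicted attribute embedding (hypothetical)
    keys     : (n, d) one key vector per region (or per word)
    values   : (n, d) the features to be aggregated
    Returns the attention distribution and the attended feature.
    """
    scores = keys @ attr_emb           # similarity of each candidate to the attribute
    weights = softmax(scores)          # normalized attention weights
    attended = weights @ values        # attribute-guided summary feature
    return weights, attended

# Toy usage: 5 candidate regions with 8-dimensional features.
rng = np.random.default_rng(0)
attr = rng.normal(size=8)
region_keys = rng.normal(size=(5, 8))
region_vals = rng.normal(size=(5, 8))
w, ctx = attribute_guided_attention(attr, region_keys, region_vals)
```

Because attributes live in both modalities, the same `attr_emb` can score visual regions and expression words, which is the sense in which the attention module acts as a bridge.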
Bibliographical note
Funding Information:
Manuscript received April 27, 2018; revised March 8, 2019 and December 11, 2019; accepted February 24, 2020. Date of publication March 12, 2020; date of current version March 26, 2020. This work was supported in part by the Major Project for New Generation of AI under Grant 2018AAA0100402, in part by the National Key Research and Development Program of China under Grant 2016YFB1001000, in part by the National Natural Science Foundation of China under Grant 61525306, Grant 61633021, Grant 61721004, Grant 61420106015, Grant 61806194, and Grant U1803261, in part by the Capital Science and Technology Leading Talent Training Project under Grant Z181100006318030, HW2019SOW01, and in part by CAS-AIR. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sos S. Agaian. (Corresponding author: Jingyu Liu.) Jingyu Liu is with the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: firstname.lastname@example.org).
© 1992-2012 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Graphics and Computer-Aided Design