Visual Question Answering (VQA) aims to reason an answer, given a textual question and image pair. VQA methods are required to learn the relationship between image region features. These methods have the limitation of inefficient learning that can produce a performance drop. It is because current intra-relationship methods are trying to learn all the intra-relationships, regardless of their importance. In this paper, a novel self-attention based VQA module named Selective Residual learning (SelRes) is proposed. SelRes processes the residual learning selectively in self-attention networks. It measures the importance of the input vectors by the attention map and limits residual learning, except in the selected regions which related to the correct answer. Selective masking is also proposed, which can ensure that the selection in SelRes is preserved in the multi-stack structure of the VQA network. Our full model achieves new state-of-the-art performances on both from-scratch and fine-tuning models.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence