Visual question answering (VQA) aims reasoning answers given a pair of textual question and image. Previous approaches for VQA use only the highest layer of a Convolutional Neural Network (CNN) for visual representation, which biases on object classification task. These object-categorization oriented features lose low-level semantics (attribute related questions), e.g., color, texture, and the number of instances. Consequently, conventional VQA methods are vulnerable to low-level semantic questions. On the other hand, low-level layer features retain the low-level semantics. Thus, we suggest that the low-level layer features are superior in low-level semantic questions, and justify it through our experiments. Furthermore, we propose a novel VQA model named Hierarchical Feature Network (HFnet), which exploits intermediate CNN layers to derive various semantics for VQA. In the answer reasoning stage, each hierarchical feature is combined with the attention map and multimodal pooled to consider both high and low level semantic questions. Our proposed model outperforms the existing methods. The qualitative experiments also demonstrate that our proposed HFnet is superior in reasoning attention regions.
Bibliographical noteFunding Information:
This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT ( NRF-2017M3C4A7069370 ).
© 2019 Elsevier B.V.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence