Exploiting hierarchical visual features for visual question answering

Jongkwang Hong, Jianlong Fu, Youngjung Uh, Tao Mei, Hyeran Byun

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)


Visual question answering (VQA) aims to reason an answer given a pair consisting of a textual question and an image. Previous approaches to VQA use only the highest layer of a convolutional neural network (CNN) for visual representation, which is biased toward the object classification task. These object-categorization-oriented features lose low-level semantics, e.g., color, texture, and the number of instances, so conventional VQA methods are vulnerable to questions about low-level semantics. Low-level layer features, on the other hand, retain these semantics. We therefore suggest that low-level layer features are superior for low-level semantic questions, and justify this claim through experiments. Furthermore, we propose a novel VQA model named Hierarchical Feature Network (HFnet), which exploits intermediate CNN layers to derive various semantics for VQA. In the answer reasoning stage, each hierarchical feature is combined with an attention map and multimodal-pooled, so that both high- and low-level semantic questions are considered. Our proposed model outperforms existing methods, and qualitative experiments demonstrate that HFnet is superior in reasoning about attention regions.
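The abstract's reasoning stage can be illustrated with a minimal numpy sketch: features from a low-level and a high-level CNN layer are each weighted by a spatial attention map, fused with the question vector, and concatenated into a joint hierarchical representation. The function names, tensor shapes, uniform attention map, and element-wise (Hadamard) fusion below are illustrative assumptions for exposition, not the authors' actual implementation of HFnet.

```python
import numpy as np

def attention_pool(feature_map, attn):
    # feature_map: (C, H, W); attn: (H, W) spatial attention summing to 1.
    # Returns an attention-weighted channel vector of shape (C,).
    return (feature_map * attn[None]).sum(axis=(1, 2))

def multimodal_pool(visual_vec, question_vec):
    # Element-wise (Hadamard) fusion as a stand-in for the paper's
    # multimodal pooling; the real model would use a learned pooling.
    return visual_vec * question_vec

rng = np.random.default_rng(0)
low_feat = rng.standard_normal((64, 14, 14))   # low-level layer (color/texture)
high_feat = rng.standard_normal((64, 14, 14))  # highest layer (object semantics)
q_vec = rng.standard_normal(64)                # encoded question (hypothetical)

attn = np.full((14, 14), 1.0 / (14 * 14))      # uniform attention, for the sketch

# Fuse each hierarchical feature with the question, then concatenate.
fused = [multimodal_pool(attention_pool(f, attn), q_vec)
         for f in (low_feat, high_feat)]
joint = np.concatenate(fused)                  # joint hierarchical representation
print(joint.shape)  # (128,)
```

The key point the sketch conveys is that both low- and high-level layers contribute channels to the final representation, so attribute questions (color, texture, counting) and object questions are both served.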

Original language: English
Pages (from-to): 187-195
Number of pages: 9
Publication status: Published - 2019 Jul 25

Bibliographical note

Funding Information:
This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2017M3C4A7069370).

Publisher Copyright:
© 2019 Elsevier B.V.

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence

