Exploiting hierarchical visual features for visual question answering

Jongkwang Hong, Jianlong Fu, Youngjung Uh, Tao Mei, Hyeran Byun

Research output: Contribution to journal › Article

Abstract

Visual question answering (VQA) aims to reason out an answer given a pair of a textual question and an image. Previous approaches to VQA use only the highest layer of a Convolutional Neural Network (CNN) for the visual representation, which is biased toward the object classification task. These object-categorization-oriented features lose low-level semantics, e.g., color, texture, and the number of instances, so conventional VQA methods are vulnerable to questions about such low-level attributes. Low-level layer features, on the other hand, retain these low-level semantics. We therefore suggest that low-level layer features are superior for low-level semantic questions, and justify this claim through experiments. Furthermore, we propose a novel VQA model named the Hierarchical Feature Network (HFnet), which exploits intermediate CNN layers to derive various semantics for VQA. In the answer reasoning stage, each hierarchical feature is combined with an attention map and multimodally pooled, so that both high-level and low-level semantic questions can be handled. Our proposed model outperforms existing methods, and qualitative experiments demonstrate that HFnet is superior in reasoning about attention regions.
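The abstract only outlines the architecture, so the following is a minimal illustrative sketch of the general idea (feature maps drawn from several CNN depths, question-guided attention over each, and fusion of the attended vectors with the question representation), not the authors' exact HFnet. The ResNet-152 backbone, the choice of stages, the shared 512-d projection, the single attention glimpse per layer, and the element-wise product standing in for the paper's multimodal pooling are all assumptions made for illustration.

# Minimal sketch of hierarchical-feature VQA as described in the abstract.
# NOT the published HFnet: backbone, dimensions, and fusion are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class HierarchicalVQASketch(nn.Module):
    def __init__(self, vocab_size=10000, q_dim=512, n_answers=3000):
        super().__init__()
        cnn = models.resnet152(weights=None)
        # Keep three stages so both low-level (color/texture) and
        # high-level (object) semantics reach the answer module.
        self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu,
                                  cnn.maxpool, cnn.layer1)
        self.stages = nn.ModuleList([cnn.layer2, cnn.layer3, cnn.layer4])
        self.q_embed = nn.Embedding(vocab_size, 300)
        self.q_rnn = nn.GRU(300, q_dim, batch_first=True)
        # Project every stage to a shared dimension so they can be fused.
        self.proj = nn.ModuleList([nn.Conv2d(c, q_dim, 1)
                                   for c in (512, 1024, 2048)])
        # One question-guided attention head per stage.
        self.att = nn.ModuleList([nn.Conv2d(q_dim, 1, 1) for _ in range(3)])
        self.classifier = nn.Linear(q_dim, n_answers)

    def forward(self, image, question):
        _, q = self.q_rnn(self.q_embed(question))   # (1, B, q_dim)
        q = q.squeeze(0)                            # (B, q_dim)
        x = self.stem(image)
        feats = []
        for stage, proj, att in zip(self.stages, self.proj, self.att):
            x = stage(x)
            v = proj(x)                             # (B, q_dim, H, W)
            # Question-guided attention: score each spatial location
            # against the question vector, then normalize over locations.
            scores = att(v * q[:, :, None, None])   # (B, 1, H, W)
            alpha = scores.flatten(2).softmax(-1).view_as(scores)
            feats.append((v * alpha).sum(dim=(2, 3)))  # attended pooling
        # Fuse attended hierarchical features with the question
        # (element-wise product as a stand-in for multimodal pooling).
        fused = sum(feats) * q
        return self.classifier(fused)

if __name__ == "__main__":
    model = HierarchicalVQASketch()
    img = torch.randn(2, 3, 224, 224)
    qst = torch.randint(0, 10000, (2, 12))  # batch of 12-token questions
    print(model(img, qst).shape)            # torch.Size([2, 3000])

Summing the per-stage attended vectors before fusion is one simple choice; concatenation followed by a linear layer would work equally well here, and the paper's actual pooling operator is not specified in the abstract.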

Original language: English
Pages (from-to): 187-195
Number of pages: 9
Journal: Neurocomputing
Volume: 351
DOI: 10.1016/j.neucom.2019.03.035
Publication status: Published - 2019 Jul 25

Fingerprint

Semantics
Neural networks
Network layers
Color
Textures
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence

Cite this

Hong, Jongkwang; Fu, Jianlong; Uh, Youngjung; Mei, Tao; Byun, Hyeran. Exploiting hierarchical visual features for visual question answering. In: Neurocomputing. 2019; Vol. 351, pp. 187-195.
@article{185549bf591c4374a661eff850811ef0,
title = "Exploiting hierarchical visual features for visual question answering",
author = "Jongkwang Hong and Jianlong Fu and Youngjung Uh and Tao Mei and Hyeran Byun",
year = "2019",
month = "7",
day = "25",
doi = "10.1016/j.neucom.2019.03.035",
language = "English",
volume = "351",
pages = "187--195",
journal = "Neurocomputing",
issn = "0925-2312",
publisher = "Elsevier B.V.",

}

Hong, Jongkwang; Fu, Jianlong; Uh, Youngjung; Mei, Tao; Byun, Hyeran. Exploiting hierarchical visual features for visual question answering.

In: Neurocomputing, Vol. 351, 25.07.2019, p. 187-195.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Exploiting hierarchical visual features for visual question answering

AU - Hong, Jongkwang

AU - Fu, Jianlong

AU - Uh, Youngjung

AU - Mei, Tao

AU - Byun, Hyeran

PY - 2019/7/25

Y1 - 2019/7/25

UR - http://www.scopus.com/inward/record.url?scp=85065025601&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85065025601&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2019.03.035

DO - 10.1016/j.neucom.2019.03.035

M3 - Article

AN - SCOPUS:85065025601

VL - 351

SP - 187

EP - 195

JO - Neurocomputing

JF - Neurocomputing

SN - 0925-2312

ER -