Abstract
In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images. Unlike a normal field-of-view image, a 360° image captures the entire visual content around the optical center of a camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two different VQA models on VQA 360°: one conventional model that takes an equirectangular image (with intrinsic distortion) as input, and one dedicated model that first projects a 360° image onto cubemaps and subsequently aggregates the information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the gap between human and machine performance reveals the need for more advanced VQA 360° algorithms. We therefore expect our dataset and studies to serve as the benchmark for future development in this challenging task. Dataset, code, and pre-trained models are available online.
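The cubemap projection mentioned in the abstract can be illustrated with a small sketch. This is a generic equirectangular-to-cubemap resampling, not the authors' implementation: the abstract does not state the paper's axis conventions, face layout, or interpolation scheme, so the function name `equirect_to_cube_face`, the `FACE_DIRECTIONS` table, the axis convention (x right, y up, z forward), and the nearest-neighbor sampling are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical face table (an assumption, not from the paper).
# Each entry maps face-plane coordinates (a, b) in [-1, 1] to a 3D viewing
# direction (x, y, z), with x pointing right, y up, and z forward.
FACE_DIRECTIONS = {
    "front": lambda a, b: (a, -b, np.ones_like(a)),
    "right": lambda a, b: (np.ones_like(a), -b, -a),
    "back": lambda a, b: (-a, -b, -np.ones_like(a)),
    "left": lambda a, b: (-np.ones_like(a), -b, a),
    "up": lambda a, b: (a, np.ones_like(a), b),
    "down": lambda a, b: (a, -np.ones_like(a), -b),
}


def equirect_to_cube_face(equi, face, size):
    """Sample one size x size cube face from an equirectangular image.

    equi is an array of shape (H, W) or (H, W, C). Nearest-neighbor
    sampling is used for brevity; a real pipeline would likely interpolate.
    """
    h, w = equi.shape[:2]

    # Pixel centers of the face, mapped to [-1, 1].
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    a, b = np.meshgrid(t, t)  # a varies along columns, b along rows

    # Viewing direction for every face pixel.
    x, y, z = FACE_DIRECTIONS[face](a, b)

    # Direction -> spherical angles (longitude and latitude).
    lon = np.arctan2(x, z)               # in [-pi, pi]
    lat = np.arctan2(y, np.hypot(x, z))  # in [-pi/2, pi/2]

    # Spherical angles -> equirectangular pixel coordinates.
    u = (lon / (2.0 * np.pi) + 0.5) * w
    v = (0.5 - lat / np.pi) * h

    cols = np.floor(u).astype(int) % w                 # wrap horizontally
    rows = np.clip(np.floor(v).astype(int), 0, h - 1)  # clamp at the poles
    return equi[rows, cols]


if __name__ == "__main__":
    # Dummy 64 x 128 RGB panorama standing in for a real 360° image.
    panorama = np.random.rand(64, 128, 3)
    faces = {name: equirect_to_cube_face(panorama, name, 32)
             for name in FACE_DIRECTIONS}
    print({name: img.shape for name, img in faces.items()})
```

Each face covers a 90° field of view with far less distortion than the equirectangular image near the poles, which is the usual motivation for feeding cubemap faces to a model designed for ordinary perspective images.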
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1596-1605 |
Number of pages | 10 |
ISBN (Electronic) | 9781728165530 |
Publication status | Published - 2020 Mar |
Event | 2020 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2020 - Snowmass Village, United States; Duration: 2020 Mar 1 → 2020 Mar 5 |
Publication series
Name | Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020 |
---|---|
Conference
Conference | 2020 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2020 |
---|---|
Country | United States |
City | Snowmass Village |
Period | 2020 Mar 1 → 2020 Mar 5 |
Bibliographical note
Funding Information: This work is supported in part by NSF CAREER (#1149783) and MOST 108-2634-F-007-006 Joint Research Center for AI Technology and All Vista Healthcare, Taiwan.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Computer Vision and Pattern Recognition