Pano-AVQA: Grounded Audio-Visual Question Answering on 360 Videos

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


360 videos convey a holistic view of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in their ability to evaluate the semantic understanding of audio-visual relationships or the spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360 video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.

Original language: English
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 11
ISBN (Electronic): 9781665428125
Publication status: Published - 2021
Event: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: 2021 Oct 11 to 2021 Oct 17

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499


Conference: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
City: Virtual, Online

Bibliographical note

Funding Information:
Acknowledgement. We thank the anonymous reviewers for their thoughtful suggestions on this work. This work was supported by AIRS Company in Hyundai Motor Company & Kia Corporation through HKMC-SNU AI Consortium Fund, Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-01082, SW StarLab). Gunhee Kim is the corresponding author.

Publisher Copyright:
© 2021 IEEE

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

