We study the problem of non-factoid question answering (QA) on instructional videos. Existing work focuses on either the visual or the textual modality of video content to find answers matching the question. However, neither is flexible enough for our problem setting of non-factoid answers with varying lengths. Motivated by this, we propose a two-stage model: (a) multimodal segmentation of the video into span candidates and (b) length-adaptive ranking of the candidates against the question. First, for segmentation, we propose Segmenter, which generates span candidates of diverse lengths by considering both the textual and visual modalities. Second, for ranking, we propose Ranker, which scores the candidates by dynamically combining two models with complementary strengths on short and long spans, respectively. Experimental results demonstrate that our model achieves state-of-the-art performance.
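The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Segmenter or Ranker: candidate generation here is a plain enumeration of contiguous sentence spans (no textual or visual features), and the length-adaptive combination is a hypothetical logistic gate over span length. The names `Span`, `enumerate_spans`, `length_gate`, and `rank`, along with the `pivot` and `sharpness` parameters, are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int  # index of the first sentence (inclusive)
    end: int    # index of the last sentence (inclusive)

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def enumerate_spans(num_sentences: int, max_len: int) -> List[Span]:
    """Stage (a), simplified: every contiguous span of at most max_len sentences.

    Assumption: the paper's Segmenter uses textual and visual cues;
    this stand-in just enumerates candidates of diverse lengths.
    """
    return [
        Span(s, e)
        for s in range(num_sentences)
        for e in range(s, min(s + max_len, num_sentences))
    ]


def length_gate(length: int, pivot: float = 3.0, sharpness: float = 1.0) -> float:
    """Hypothetical gate in (0, 1) that grows with span length."""
    return 1.0 / (1.0 + math.exp(-sharpness * (length - pivot)))


def rank(
    spans: List[Span],
    short_scorer: Callable[[Span], float],
    long_scorer: Callable[[Span], float],
) -> List[Span]:
    """Stage (b), simplified: mix two scorers with a length-dependent weight."""

    def combined(span: Span) -> float:
        g = length_gate(span.length)
        # Short spans lean on short_scorer, long spans on long_scorer.
        return (1.0 - g) * short_scorer(span) + g * long_scorer(span)

    return sorted(spans, key=combined, reverse=True)
```

In this sketch the two scorers are arbitrary callables supplied by the caller; in practice they would be question-conditioned models, which the abstract describes only as having complementary strengths on short and long spans.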
|Title of host publication||AAAI 2020 - 34th AAAI Conference on Artificial Intelligence|
|Number of pages||8|
|Publication status||Published - 2020|
|Event||34th AAAI Conference on Artificial Intelligence, AAAI 2020 - New York, United States; Duration: 2020 Feb 7 → 2020 Feb 12|
Bibliographical note
Funding Information:
∗This work was partially done during the first author’s internship at MSR Asia and was supported by an MSR Asia grant.
†Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
All Science Journal Classification (ASJC) codes
- Artificial Intelligence