A joint sequence fusion model for video question answering and retrieval

Youngjae Yu, Jongseok Kim, Gunhee Kim

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while prune out misaligned ones in a bottom-up manner. Although the JSFusion is a universal model to be applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model in three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks for the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.

Original languageEnglish
Title of host publicationComputer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
EditorsVittorio Ferrari, Cristian Sminchisescu, Martial Hebert, Yair Weiss
PublisherSpringer Verlag
Pages487-503
Number of pages17
ISBN (Print)9783030012335
DOIs
Publication statusPublished - 2018
Event15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany
Duration: 2018 Sept 82018 Sept 14

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume11211 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other15th European Conference on Computer Vision, ECCV 2018
Country/TerritoryGermany
CityMunich
Period18/9/818/9/14

Bibliographical note

Funding Information:
Acknowledgements.. We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860).

Publisher Copyright:
© Springer Nature Switzerland AG 2018.

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Fingerprint

Dive into the research topics of 'A joint sequence fusion model for video question answering and retrieval'. Together they form a unique fingerprint.

Cite this