Abstract
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of the two sequences into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks, including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
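The idea of a dense pairwise 3D tensor followed by attention-weighted scoring can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the elementwise-product pairing, the sigmoid gate, the weight vector `w`, and the mean pooling that stands in for the convolutional decoder are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import numpy as np


def joint_tensor(video, sentence):
    """Pair every video step with every word.

    video:    (n_v, d) array of per-frame features
    sentence: (n_s, d) array of per-word features
    returns:  (n_v, n_s, d) tensor; cell [i, j] is the elementwise
              product of frame i and word j (an assumed pairing rule)
    """
    return video[:, None, :] * sentence[None, :, :]


def soft_attention(joint, w):
    """Scale each (frame, word) cell by a sigmoid gate in (0, 1).

    w is a hypothetical (d,) weight vector; in a trained model it
    would be learned so that well-matched cells get gates near 1
    and misaligned cells get gates near 0.
    """
    gate = 1.0 / (1.0 + np.exp(-(joint @ w)))  # shape (n_v, n_s)
    return joint * gate[:, :, None]


def similarity(video, sentence, w):
    """Collapse the attended tensor to a single score.

    Mean pooling is a stand-in for the convolutional decoder
    described in the abstract, not a reproduction of it.
    """
    attended = soft_attention(joint_tensor(video, sentence), w)
    return float(attended.sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))     # 4 frames, 8-dim features
    sentence = rng.normal(size=(5, 8))  # 5 words, 8-dim features
    w = rng.normal(size=8)
    print(joint_tensor(video, sentence).shape)
    print(similarity(video, sentence, w))
```

The sketch only shows the data flow: two sequences of lengths `n_v` and `n_s` become an `n_v × n_s × d` tensor, a gate suppresses weakly matched cells, and the result is reduced to one scalar.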
Original language | English |
---|---|
Title of host publication | Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings |
Editors | Vittorio Ferrari, Cristian Sminchisescu, Martial Hebert, Yair Weiss |
Publisher | Springer Verlag |
Pages | 487-503 |
Number of pages | 17 |
ISBN (Print) | 9783030012335 |
Publication status | Published - 2018 |
Event | 15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany; Duration: 2018 Sept 8 → 2018 Sept 14 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 11211 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Other
Other | 15th European Conference on Computer Vision, ECCV 2018 |
---|---|
Country/Territory | Germany |
City | Munich |
Period | 2018 Sept 8 → 2018 Sept 14 |
Bibliographical note
Funding Information: Acknowledgements. We thank Jisung Kim and Antoine Miech for helpful comments about the model. This research was supported by the Brain Research Program by the National Research Foundation of Korea (NRF) (2017M3C7A1047860).
Publisher Copyright:
© Springer Nature Switzerland AG 2018.
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Computer Science (all)