Video Object Segmentation using Kernelized Memory Network with Multiple Kernels

Hongje Seong, Junhyuk Hyun, Euntai Kim

Research output: Contribution to journal › Article › peer-review


Semi-supervised video object segmentation (VOS) aims to predict the segment of a target object in a video, given a ground truth segmentation mask for the target in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising approach for semi-supervised VOS. However, an important point has been overlooked in applying STM to VOS: the solution (STM) is non-local, but the problem (VOS) is predominantly local. To resolve this mismatch between STM and VOS, we propose new VOS networks called kernelized memory network (KMN) and KMN with multiple kernels (KMN<sup><em>M</em></sup>). Our networks conduct not only <em>Query-to-Memory</em> matching but also <em>Memory-to-Query</em> matching. In <em>Memory-to-Query</em> matching, a kernel is employed to reduce the degree of non-localness of the STM. In addition, we present a Hide-and-Seek strategy in pre-training to handle occlusions effectively. The proposed networks surpass the state-of-the-art results on standard benchmarks by a significant margin (+4% in <em>J<sub>M</sub></em> on the DAVIS 2017 test-dev set). The runtimes of our proposed KMN and KMN<sup><em>M</em></sup> on the DAVIS 2016 validation set are 0.12 and 0.13 seconds per frame, respectively, and the two networks have similar computation times to STM. This paper is an extended version of our preliminary work, which was presented at ECCV 2020.
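The interplay of the two matching directions can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation. The abstract does not specify the details, so everything here is an assumption made for illustration: the function name `kernelized_memory_read`, the use of an argmax for Memory-to-Query matching, the Gaussian form of the kernel, and the default `sigma`. The idea shown is that each memory position first selects its best-matching query position, and a kernel centred there down-weights that memory position's contribution to distant query positions during the usual Query-to-Memory softmax read.

```python
import math


def kernelized_memory_read(mem_keys, mem_vals, qry_keys, height, width, sigma=1.0):
    """Toy kernelized memory read (illustrative assumption, not the paper's code).

    mem_keys: list of key vectors, one per memory position.
    mem_vals: list of value vectors, one per memory position.
    qry_keys: list of key vectors for a flattened height x width query grid.
    Returns one read-out value vector per query position.
    """
    dim = len(mem_keys[0])
    scale = math.sqrt(dim)
    n_query = height * width
    n_memory = len(mem_keys)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Correlation corr[p][q] between memory position p and query position q.
    corr = [[dot(mk, qk) for qk in qry_keys] for mk in mem_keys]

    # Memory-to-Query matching: each memory position picks its best query position.
    best_q = [max(range(n_query), key=row.__getitem__) for row in corr]

    readout = []
    for q in range(n_query):
        qy, qx = divmod(q, width)
        # Subtract the per-query maximum before exponentiating, for stability.
        col_max = max(corr[p][q] for p in range(n_memory))
        weights = []
        for p in range(n_memory):
            by, bx = divmod(best_q[p], width)
            # Gaussian kernel centred on the best-matching query position of p.
            kernel = math.exp(-((qy - by) ** 2 + (qx - bx) ** 2) / (2 * sigma ** 2))
            weights.append(math.exp((corr[p][q] - col_max) / scale) * kernel)
        total = sum(weights)
        # Query-to-Memory matching: kernel-weighted softmax over memory positions.
        readout.append([
            sum(w * v[c] for w, v in zip(weights, mem_vals)) / total
            for c in range(len(mem_vals[0]))
        ])
    return readout
```

With a very large `sigma` the kernel is nearly flat and the read reduces to an ordinary non-local softmax over memory; smaller values make the read more local, which is the mismatch the abstract describes.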

Original language: English
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication status: Accepted/In press - 2022

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
  • Computational Theory and Mathematics
  • Artificial Intelligence
  • Applied Mathematics

