We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline has shown superior performance over per-frame methods by leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communication, which limits their practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduce the overhead of passing information between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a means of conveying information as well as summarizing each frame scene. The features of each frame are enriched and correlated with other frames through the exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmark sets and achieve state-of-the-art performance (AP 44.6 on the YouTube-VIS 2019 val set using offline inference) while maintaining a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference for processing a video in real time with only a small delay. The code is available at https://github.com/sukjunhwang/IFC.
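To give a rough sense of why routing inter-frame communication through a few memory tokens can reduce overhead, the sketch below counts attention query-key pairs under two schemes. It is an illustration under stated assumptions, not the released IFC implementation: the two-step layout (per-frame attention over frame tokens plus memory tokens, followed by attention among all memory tokens in the clip), the function names, and the values of `T`, `HW`, and `M` are all hypothetical choices made for this example.

```python
def full_clip_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def memory_token_attention_pairs(
    num_frames: int, tokens_per_frame: int, memory_tokens: int
) -> int:
    """Query-key pairs for an assumed two-step memory-token scheme.

    Step 1: each frame attends only within itself, over its own feature
            tokens plus its own memory tokens.
    Step 2: the memory tokens of all frames attend to each other.
    """
    per_frame = (tokens_per_frame + memory_tokens) ** 2
    across_frames = (num_frames * memory_tokens) ** 2
    return num_frames * per_frame + across_frames


if __name__ == "__main__":
    # Hypothetical clip: 36 frames, 600 feature tokens per frame,
    # 8 memory tokens per frame. None of these values come from the paper.
    T, HW, M = 36, 600, 8
    full = full_clip_attention_pairs(T, HW)
    mem = memory_token_attention_pairs(T, HW, M)
    print(f"full clip attention pairs: {full}")
    print(f"memory-token scheme pairs: {mem}")
    print(f"reduction factor: {full / mem:.1f}x")
```

Under these assumptions the full-clip cost grows quadratically in the number of frames, while the memory-token cost grows linearly in the per-frame term and quadratically only in the much smaller count of memory tokens.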
|Title of host publication||Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021|
|Editors||Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan|
|Publisher||Neural information processing systems foundation|
|Number of pages||12|
|Publication status||Published - 2021|
|Event||35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online|
Duration: 2021 Dec 6 → 2021 Dec 14
|Name||Advances in Neural Information Processing Systems|
|Conference||35th Conference on Neural Information Processing Systems, NeurIPS 2021|
|Period||2021 Dec 6 → 2021 Dec 14|
Bibliographical note
Funding Information:
This research was grant funded by the Artificial Intelligence Graduate School Program of Yonsei University, under Grant 2020-0-01361, Korea Evaluation Institute of Industrial Technology (KEIT) funded by the Ministry of Trade, Industry and Energy (10073129), and also supported by the Advanced Robotics Laboratory, part of the Future Technology Center at LG Electronics.
© 2021 Neural information processing systems foundation. All rights reserved.
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Signal Processing