While single-image object detectors can be naively applied to videos in a frame-by-frame fashion, the predictions are often temporally inconsistent. Moreover, the computation can be redundant, since neighboring frames are inherently similar to each other. In this work we propose to improve video object detection via temporal aggregation. Specifically, a detection model is applied on sparse keyframes to handle new objects, occlusions, and rapid motions. We then use real-time trackers to exploit temporal cues and track the detected objects in the remaining frames, which enhances efficiency and temporal coherence. Object status at the bounding-box level is propagated across frames and updated by our aggregation modules. For keyframe scheduling, we propose adaptive policies using reinforcement learning and simple heuristics. The proposed framework achieves state-of-the-art performance on the ImageNet VID 2015 dataset while running in real time on a CPU. Extensive experiments demonstrate the effectiveness of our training strategies and justify the model designs.
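The detect-on-keyframes, track-in-between loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Track`, `aggregate`, `fixed_interval`, and `run_video` are hypothetical, the score-smoothing rule stands in for the paper's aggregation modules, and the fixed-interval policy stands in for the simplest kind of heuristic scheduler (the reinforcement-learning policy is not shown). The detector and tracker are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Track:
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def aggregate(old: List[Track], new: List[Track],
              momentum: float = 0.5, thresh: float = 0.5) -> List[Track]:
    """Assumed aggregation rule: keep the fresh boxes, but smooth each score
    with the best-overlapping previous track when the overlap is high enough."""
    out = []
    for det in new:
        best = max(old, key=lambda t: iou(t.box, det.box), default=None)
        if best is not None and iou(best.box, det.box) >= thresh:
            score = momentum * best.score + (1 - momentum) * det.score
        else:
            score = det.score
        out.append(Track(det.box, score))
    return out


def fixed_interval(interval: int) -> Callable[[int, List[Track]], bool]:
    """Assumed heuristic scheduler: one keyframe every `interval` frames,
    or immediately when there is nothing left to track."""
    def policy(frames_since_key: int, tracks: List[Track]) -> bool:
        return frames_since_key + 1 >= interval or not tracks
    return policy


def run_video(frames: Sequence, detect, track, is_keyframe):
    """Run the detector on keyframes and the tracker on all other frames.
    Returns the final tracks and a per-frame log of which model ran."""
    tracks: List[Track] = []
    log: List[str] = []
    since_key = 0
    for i, frame in enumerate(frames):
        if i == 0 or is_keyframe(since_key, tracks):
            tracks = aggregate(tracks, detect(frame))
            since_key = 0
            log.append("detect")
        else:
            tracks = [track(frame, t) for t in tracks]
            since_key += 1
            log.append("track")
    return tracks, log
```

With `fixed_interval(3)`, the expensive detector runs on frames 0, 3, 6, and so on, while the cheaper tracker handles the frames in between. An adaptive policy would replace `fixed_interval` with a function that inspects the current tracks (for example, their scores) before deciding.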
|Title of host publication||Computer Vision – ECCV 2020 - 16th European Conference, 2020, Proceedings|
|Editors||Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm|
|Publisher||Springer Science and Business Media Deutschland GmbH|
|Number of pages||18|
|Publication status||Published - 2020|
|Event||16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom (Duration: 2020 Aug 23 → 2020 Aug 28)|
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Conference||16th European Conference on Computer Vision, ECCV 2020|
|Period||2020 Aug 23 → 2020 Aug 28|
Bibliographical note
Funding Information:
This work is supported in part by the NSF CAREER Grant #1149783.
© 2020, Springer Nature Switzerland AG.
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Computer Science (all)