Abstract
Existing methods for video interpolation rely heavily on deep convolutional neural networks and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and models long-range dependencies through self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy that saves memory and also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against state-of-the-art methods, both quantitatively and qualitatively, on a variety of benchmark datasets. The code and models are released at https://github.com/zhshi0816/Video-Frame-Interpolation-Transformer.
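To make the abstract's two core ideas concrete — local window attention extended to the spatial-temporal domain, with space and time handled separately — the sketch below shows one way such an operation could look in PyTorch. It is an illustrative approximation under stated assumptions, not the released VFIT code: the function names (`window_attention`, `sep_sts_attention`), the window size, and the omission of linear projections, multi-head splitting, window shifting, and positional encodings are all assumptions made here for brevity.

```python
import torch

def window_attention(x):
    """Scaled dot-product self-attention within each group of tokens.

    x: (num_groups, tokens, channels). Queries, keys, and values are
    taken to be the tokens themselves (projections omitted for brevity).
    """
    c = x.shape[-1]
    attn = torch.softmax(x @ x.transpose(-2, -1) / c ** 0.5, dim=-1)
    return attn @ x

def sep_sts_attention(x, window=4):
    """Hypothetical separated space-time local attention.

    x: (B, T, H, W, C) features of T input frames. Spatial attention is
    computed inside non-overlapping window x window patches of each frame;
    temporal attention is then computed across the T frames at every
    spatial location. Separating the two axes keeps the attention maps
    small ((window^2)^2 and T^2 entries) instead of (T * window^2)^2 for
    a joint spatio-temporal window, which is where the memory saving
    mentioned in the abstract comes from.
    """
    B, T, H, W, C = x.shape
    # --- spatial local attention within each frame ---
    xs = x.view(B, T, H // window, window, W // window, window, C)
    xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, window * window, C)
    xs = window_attention(xs)
    xs = xs.view(B, T, H // window, W // window, window, window, C)
    xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, C)
    # --- temporal attention across frames at each spatial position ---
    xt = xs.permute(0, 2, 3, 1, 4).reshape(-1, T, C)
    xt = window_attention(xt)
    return xt.view(B, H, W, T, C).permute(0, 3, 1, 2, 4)

frames = torch.randn(1, 4, 32, 32, 64)   # 4 input frames, 32x32, 64 channels
out = sep_sts_attention(frames)
print(out.shape)  # torch.Size([1, 4, 32, 32, 64])
```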
Original language | English
---|---
Title of host publication | Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher | IEEE Computer Society
Pages | 17461-17470
Number of pages | 10
ISBN (Electronic) | 9781665469463
DOIs |
Publication status | Published - 2022
Event | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States. Duration: 19 Jun 2022 → 24 Jun 2022
Publication series
Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
---|---
Volume | 2022-June
ISSN (Print) | 1063-6919
Conference
Conference | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
---|---
Country/Territory | United States
City | New Orleans
Period | 19 Jun 2022 → 24 Jun 2022
Bibliographical note
Funding Information: Similar to most existing kernel-based methods [25, 29, 30, 37], we only perform 2× interpolation with VFIT. However, it can easily be extended to multi-frame interpolation by predicting kernels tied to different time steps, or even to arbitrary-time interpolation by taking time as an extra input, similar to [10]. This will be part of our future work. Acknowledgement: M.-H. Yang is supported in part by NSF CAREER Grant #1149783.
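The arbitrary-time extension mentioned in the note can be pictured with a minimal sketch: condition the network on the target timestamp by appending it as an extra input channel. This is purely an assumed illustration, not the method of [10] or anything in the released code; `time_conditioned_input` is a hypothetical helper.

```python
import torch

def time_conditioned_input(frames: torch.Tensor, t: float) -> torch.Tensor:
    """Hypothetical time conditioning for arbitrary-time interpolation.

    frames: (B, T, C, H, W) stack of input frames; t: target time in (0, 1).
    The timestamp is broadcast to a constant map and appended as an extra
    channel, so a single network can be queried at any intermediate time.
    """
    B, T, C, H, W = frames.shape
    t_map = frames.new_full((B, T, 1, H, W), t)
    return torch.cat([frames, t_map], dim=2)  # (B, T, C + 1, H, W)

x = time_conditioned_input(torch.randn(1, 4, 3, 64, 64), t=0.5)
print(x.shape)  # torch.Size([1, 4, 4, 64, 64])
```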
Publisher Copyright:
© 2022 IEEE.
All Science Journal Classification (ASJC) codes
- Software
- Computer Vision and Pattern Recognition