Video Frame Interpolation Transformer

Zhihao Shi, Xiangyu Xu, Xiaohong Liu, Jun Chen, Ming Hsuan Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)


Existing methods for video interpolation heavily rely on deep convolutional neural networks, and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy to save memory usage, which also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against the state-of-the-art methods both quantitatively and qualitatively on a variety of benchmark datasets. The code and models are released at
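To illustrate the local spatial-temporal attention and the space-time separation strategy described in the abstract, here is a minimal NumPy sketch. The window size `P` and all tensor shapes are hypothetical choices for illustration, not the paper's actual configuration: spatial attention is restricted to non-overlapping P×P windows within each frame, and temporal attention is then applied across frames at each window position, instead of scoring all P·P·T space-time tokens jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis (tokens).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ v

# Toy feature volume: T frames, H x W pixels, C channels (illustrative sizes).
T, H, W, C = 4, 8, 8, 16
x = rng.standard_normal((T, H, W, C))

# Spatial local attention: tokens are the P*P pixels inside each window,
# attended independently per frame (hypothetical window size P = 4).
P = 4
xs = x.reshape(T, H // P, P, W // P, P, C).transpose(0, 1, 3, 2, 4, 5)
xs = xs.reshape(T, (H // P) * (W // P), P * P, C)  # (T, windows, P*P tokens, C)
ys = attention(xs, xs, xs)

# Temporal attention: tokens are the T time steps at each window position.
xt = ys.transpose(1, 2, 0, 3)                      # (windows, P*P, T tokens, C)
yt = attention(xt, xt, xt)

print(yt.shape)  # (4, 16, 4, 16): windows, window tokens, frames, channels
```

The separation is what saves memory: joint 3D local attention would form a (P·P·T)² score matrix per window, while the two separated passes only form (P·P)² and T² score matrices.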

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher: IEEE Computer Society
Number of pages: 10
ISBN (Electronic): 9781665469463
Publication status: Published - 2022
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 2022 Jun 19 - 2022 Jun 24

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919


Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans

Bibliographical note

Funding Information:
Similar to most existing kernel-based methods [25, 29, 30, 37], we only perform 2× interpolation with VFIT. However, it can be easily extended to multi-frame interpolation by predicting kernels tied to different time steps, or even to arbitrary-time interpolation by taking time as an extra input, similar to [10]. This will be part of our future work. Acknowledgement: M.-H. Yang is supported in part by the NSF CAREER Grant #1149783.

Publisher Copyright:
© 2022 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition


