Eidetic 3D LSTM: A model for video prediction and beyond

Yunbo Wang, Lu Jiang, Ming Hsuan Yang, Li Jia Li, Mingsheng Long, Li Fei-Fei

Research output: Contribution to conference › Paper

Abstract

Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes local perceptrons of RNNs motion-aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism as eidetic, since it can effectively recall stored memories across multiple time stamps even after long periods of disturbance. We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. We then show that E3D-LSTM also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. This task aligns well with video prediction in modeling action intentions and tendencies.
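The gate-controlled recall described in the abstract can be sketched in simplified form: the current recall-gate activations attend over a stack of historical memory states, and the attention-weighted combination is what the cell "recalls". This is a minimal NumPy sketch, not the authors' implementation; the names `eidetic_recall` and `r_t` are hypothetical, and flattening the paper's 3D-Conv feature maps into vectors is an assumption made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def eidetic_recall(r_t, c_history):
    """Gate-controlled self-attention over past memory states.

    r_t:       (d,)     recall-gate activations at the current step
    c_history: (tau, d) the last tau memory states, flattened to vectors
    Returns an attention-weighted combination of the historical memories.
    """
    scores = c_history @ r_t    # (tau,) similarity of the gate to each past state
    attn = softmax(scores)      # attention weights over the tau time stamps
    return attn @ c_history     # (d,) recalled memory

rng = np.random.default_rng(0)
d, tau = 8, 5
r_t = rng.standard_normal(d)
c_history = rng.standard_normal((tau, d))
recalled = eidetic_recall(r_t, c_history)
assert recalled.shape == (d,)
```

Because the attention spans all tau stored states rather than only the previous one, a memory written many steps ago can still dominate the recall whenever the current gate resembles it, which is the intuition behind calling the transition "eidetic".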

Original language: English
Publication status: Published - 2019 Jan 1
Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration: 2019 May 6 - 2019 May 9

Conference

Conference: 7th International Conference on Learning Representations, ICLR 2019
Country: United States
City: New Orleans
Period: 19/5/6 - 19/5/9


All Science Journal Classification (ASJC) codes

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Cite this

Wang, Y., Jiang, L., Yang, M. H., Li, L. J., Long, M., & Fei-Fei, L. (2019). Eidetic 3D LSTM: A model for video prediction and beyond. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.
Wang, Yunbo ; Jiang, Lu ; Yang, Ming Hsuan ; Li, Li Jia ; Long, Mingsheng ; Fei-Fei, Li. / Eidetic 3D LSTM : A model for video prediction and beyond. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.
@conference{21d334f752bd4f6a8d84669364a3bc3c,
title = "Eidetic 3D LSTM: A model for video prediction and beyond",
abstract = "Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes local perceptrons of RNNs motion-aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism as eidetic, since it can effectively recall stored memories across multiple time stamps even after long periods of disturbance. We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. We then show that E3D-LSTM also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. This task aligns well with video prediction in modeling action intentions and tendencies.",
author = "Yunbo Wang and Lu Jiang and Yang, {Ming Hsuan} and Li, {Li Jia} and Mingsheng Long and Li Fei-Fei",
year = "2019",
month = jan,
day = "1",
language = "English",
note = "7th International Conference on Learning Representations, ICLR 2019 ; Conference date: 06-05-2019 Through 09-05-2019",

}

Wang, Y, Jiang, L, Yang, MH, Li, LJ, Long, M & Fei-Fei, L 2019, 'Eidetic 3D LSTM: A model for video prediction and beyond', Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States, 19/5/6 - 19/5/9.

Eidetic 3D LSTM : A model for video prediction and beyond. / Wang, Yunbo; Jiang, Lu; Yang, Ming Hsuan; Li, Li Jia; Long, Mingsheng; Fei-Fei, Li.

2019. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.

Research output: Contribution to conference › Paper

TY - CONF

T1 - Eidetic 3D LSTM

T2 - A model for video prediction and beyond

AU - Wang, Yunbo

AU - Jiang, Lu

AU - Yang, Ming Hsuan

AU - Li, Li Jia

AU - Long, Mingsheng

AU - Fei-Fei, Li

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes local perceptrons of RNNs motion-aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism as eidetic, since it can effectively recall stored memories across multiple time stamps even after long periods of disturbance. We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. We then show that E3D-LSTM also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. This task aligns well with video prediction in modeling action intentions and tendencies.

AB - Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes local perceptrons of RNNs motion-aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism as eidetic, since it can effectively recall stored memories across multiple time stamps even after long periods of disturbance. We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. We then show that E3D-LSTM also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. This task aligns well with video prediction in modeling action intentions and tendencies.

UR - http://www.scopus.com/inward/record.url?scp=85071184688&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85071184688&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85071184688

ER -

Wang Y, Jiang L, Yang MH, Li LJ, Long M, Fei-Fei L. Eidetic 3D LSTM: A model for video prediction and beyond. 2019. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.