Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks

Inwoong Lee, Doyoung Kim, Seoungyoon Kang, Sanghoon Lee

Research output: Chapter in Book/Report/Conference proceedingConference contribution

75 Citations (Scopus)

Abstract

This paper addresses the problems of feature representation of skeleton joints and the modeling of temporal dynamics to recognize human actions. Traditional methods generally use relative coordinate systems dependent on some joints, and model only the long-term dependency, while excluding short-term and medium term dependencies. Instead of taking raw skeletons as the input, we transform the skeletons into another coordinate system to obtain the robustness to scale, rotation and translation, and then extract salient motion features from them. Considering that Long Shortterm Memory (LSTM) networks with various time-step sizes can model various attributes well, we propose novel ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition. The proposed network is composed of multiple parts containing short-term, mediumterm and long-term TS-LSTM networks, respectively. In our network, we utilize an average ensemble among multiple parts as a final feature to capture various temporal dependencies. We evaluate the proposed networks and the additional other architectures to verify the effectiveness of the proposed networks, and also compare them with several other methods on five challenging datasets. The experimental results demonstrate that our network models achieve the state-of-the-art performance through various temporal features. Additionally, we analyze a relation between the recognized actions and the multi-term TS-LSTM features by visualizing the softmax features of multiple parts.

Original languageEnglish
Title of host publicationProceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1012-1020
Number of pages9
ISBN (Electronic)9781538610329
DOIs
Publication statusPublished - 2017 Dec 22
Event16th IEEE International Conference on Computer Vision, ICCV 2017 - Venice, Italy
Duration: 2017 Oct 222017 Oct 29

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
Volume2017-October
ISSN (Print)1550-5499

Other

Other16th IEEE International Conference on Computer Vision, ICCV 2017
CountryItaly
CityVenice
Period17/10/2217/10/29

Fingerprint

Data storage equipment
Deep learning

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Cite this

Lee, I., Kim, D., Kang, S., & Lee, S. (2017). Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. In Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017 (pp. 1012-1020). [8237377] (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICCV.2017.115
Lee, Inwoong ; Kim, Doyoung ; Kang, Seoungyoon ; Lee, Sanghoon. / Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 1012-1020 (Proceedings of the IEEE International Conference on Computer Vision).
@inproceedings{043e7388e47f4a159b2590c3fc3eb8d6,
title = "Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks",
abstract = "This paper addresses the problems of feature representation of skeleton joints and the modeling of temporal dynamics to recognize human actions. Traditional methods generally use relative coordinate systems dependent on some joints, and model only the long-term dependency, while excluding short-term and medium term dependencies. Instead of taking raw skeletons as the input, we transform the skeletons into another coordinate system to obtain the robustness to scale, rotation and translation, and then extract salient motion features from them. Considering that Long Shortterm Memory (LSTM) networks with various time-step sizes can model various attributes well, we propose novel ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition. The proposed network is composed of multiple parts containing short-term, mediumterm and long-term TS-LSTM networks, respectively. In our network, we utilize an average ensemble among multiple parts as a final feature to capture various temporal dependencies. We evaluate the proposed networks and the additional other architectures to verify the effectiveness of the proposed networks, and also compare them with several other methods on five challenging datasets. The experimental results demonstrate that our network models achieve the state-of-the-art performance through various temporal features. Additionally, we analyze a relation between the recognized actions and the multi-term TS-LSTM features by visualizing the softmax features of multiple parts.",
author = "Inwoong Lee and Doyoung Kim and Seoungyoon Kang and Sanghoon Lee",
year = "2017",
month = "12",
day = "22",
doi = "10.1109/ICCV.2017.115",
language = "English",
series = "Proceedings of the IEEE International Conference on Computer Vision",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1012--1020",
booktitle = "Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017",
address = "United States",

}

Lee, I, Kim, D, Kang, S & Lee, S 2017, Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. in Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017., 8237377, Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, Institute of Electrical and Electronics Engineers Inc., pp. 1012-1020, 16th IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 17/10/22. https://doi.org/10.1109/ICCV.2017.115

Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. / Lee, Inwoong; Kim, Doyoung; Kang, Seoungyoon; Lee, Sanghoon.

Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 1012-1020 8237377 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks

AU - Lee, Inwoong

AU - Kim, Doyoung

AU - Kang, Seoungyoon

AU - Lee, Sanghoon

PY - 2017/12/22

Y1 - 2017/12/22

N2 - This paper addresses the problems of feature representation of skeleton joints and the modeling of temporal dynamics to recognize human actions. Traditional methods generally use relative coordinate systems dependent on some joints, and model only the long-term dependency, while excluding short-term and medium term dependencies. Instead of taking raw skeletons as the input, we transform the skeletons into another coordinate system to obtain the robustness to scale, rotation and translation, and then extract salient motion features from them. Considering that Long Shortterm Memory (LSTM) networks with various time-step sizes can model various attributes well, we propose novel ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition. The proposed network is composed of multiple parts containing short-term, mediumterm and long-term TS-LSTM networks, respectively. In our network, we utilize an average ensemble among multiple parts as a final feature to capture various temporal dependencies. We evaluate the proposed networks and the additional other architectures to verify the effectiveness of the proposed networks, and also compare them with several other methods on five challenging datasets. The experimental results demonstrate that our network models achieve the state-of-the-art performance through various temporal features. Additionally, we analyze a relation between the recognized actions and the multi-term TS-LSTM features by visualizing the softmax features of multiple parts.

AB - This paper addresses the problems of feature representation of skeleton joints and the modeling of temporal dynamics to recognize human actions. Traditional methods generally use relative coordinate systems dependent on some joints, and model only the long-term dependency, while excluding short-term and medium term dependencies. Instead of taking raw skeletons as the input, we transform the skeletons into another coordinate system to obtain the robustness to scale, rotation and translation, and then extract salient motion features from them. Considering that Long Shortterm Memory (LSTM) networks with various time-step sizes can model various attributes well, we propose novel ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition. The proposed network is composed of multiple parts containing short-term, mediumterm and long-term TS-LSTM networks, respectively. In our network, we utilize an average ensemble among multiple parts as a final feature to capture various temporal dependencies. We evaluate the proposed networks and the additional other architectures to verify the effectiveness of the proposed networks, and also compare them with several other methods on five challenging datasets. The experimental results demonstrate that our network models achieve the state-of-the-art performance through various temporal features. Additionally, we analyze a relation between the recognized actions and the multi-term TS-LSTM features by visualizing the softmax features of multiple parts.

UR - http://www.scopus.com/inward/record.url?scp=85041927537&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041927537&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2017.115

DO - 10.1109/ICCV.2017.115

M3 - Conference contribution

AN - SCOPUS:85041927537

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 1012

EP - 1020

BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Lee I, Kim D, Kang S, Lee S. Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. In Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 1012-1020. 8237377. (Proceedings of the IEEE International Conference on Computer Vision). https://doi.org/10.1109/ICCV.2017.115