Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos

Jungin Park, Jiyoung Lee, Sangryul Jeon, Sunok Kim, Seungryong Kim, Kwanghoon Sohn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Although recognizing human actions and recognizing surrounding scenes address different aspects of video understanding, the two tasks are strongly correlated and can complement each other's information. In this paper, we propose an approach to joint action and scene recognition, formulated as an end-to-end learning framework built on temporal attention and feature fusion. Temporal attention modules applied to a generic feature network extract action and scene features efficiently, and the proposed fusion module then composes them into a single feature vector. Experiments on the CoVieW18 dataset show that our model learns temporal attention with only weak supervision and substantially improves multi-task action and scene classification accuracy.
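
The abstract describes the architecture only at a high level. The following is a minimal sketch, not the authors' implementation, of one way temporal attention over per-frame features and a concatenation-based fusion module could be wired up for the two classification heads. The module names, hidden dimension, and class counts are placeholders assumed for illustration, and the generic feature network is treated as a frozen per-frame feature extractor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    # Scores each frame feature and returns an attention-weighted video-level feature.
    def __init__(self, feat_dim, hidden_dim=256):     # hidden_dim is a placeholder choice
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats):                   # frame_feats: (B, T, D)
        scores = self.scorer(frame_feats)             # (B, T, 1)
        weights = F.softmax(scores, dim=1)            # temporal attention over T frames
        pooled = (weights * frame_feats).sum(dim=1)   # (B, D) attended video feature
        return pooled, weights.squeeze(-1)

class JointActionSceneModel(nn.Module):
    # Two attention branches whose pooled features are fused for multi-task classification.
    def __init__(self, feat_dim, num_actions, num_scenes):
        super().__init__()
        self.action_att = TemporalAttention(feat_dim)
        self.scene_att = TemporalAttention(feat_dim)
        self.fusion = nn.Sequential(                  # simple concatenation-based fusion
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
        )
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.scene_head = nn.Linear(feat_dim, num_scenes)

    def forward(self, frame_feats):                   # (B, T, D) generic per-frame features
        action_feat, action_w = self.action_att(frame_feats)
        scene_feat, scene_w = self.scene_att(frame_feats)
        fused = self.fusion(torch.cat([action_feat, scene_feat], dim=1))
        return self.action_head(fused), self.scene_head(fused), (action_w, scene_w)

# Example: 8 untrimmed videos, 64 frames each, 2048-D per-frame features from a frozen backbone.
# The class counts below are placeholders, not the actual CoVieW18 label sets.
frame_feats = torch.randn(8, 64, 2048)
model = JointActionSceneModel(feat_dim=2048, num_actions=100, num_scenes=30)
action_logits, scene_logits, attention_weights = model(frame_feats)

Because only video-level action and scene labels are assumed (weak supervision), the attention weights in this sketch would be learned indirectly through the two classification losses rather than from frame-level annotations.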

Original language: English
Title of host publication: CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018
Publisher: Association for Computing Machinery, Inc
Pages: 21-26
Number of pages: 6
ISBN (Electronic): 9781450359764
DOIs: https://doi.org/10.1145/3265987.3265989
Publication status: Published - 2018 Oct 15
Event: 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, CoVieW 2018, in conjunction with ACM Multimedia, MM 2018 - Seoul, Korea, Republic of
Duration: 2018 Oct 22 → …

Publication series

Name: CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018

Other

Other: 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, CoVieW 2018, in conjunction with ACM Multimedia, MM 2018
Country: Korea, Republic of
City: Seoul
Period: 18/10/22 → …

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
  • Health Informatics
  • Media Technology

Cite this

Park, J., Lee, J., Jeon, S., Kim, S., Kim, S., & Sohn, K. (2018). Learning to detect, associate, and recognize human actions and surrounding scenes in untrimmed videos. In CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018 (pp. 21-26). (CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018). Association for Computing Machinery, Inc. https://doi.org/10.1145/3265987.3265989