Multi-task joint learning for videos in the wild

Yong Won Hong, Hoseong Kim, Hyeran Byun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly assume that every feature contributes equally to the classification of a video. This obscures the nature of each class, since every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with class-specific weights to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to examine how each class responds to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion but also illustrate, for each class, the correlation between the two features, the RGB and audio signals.
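The class-nature-specific fusion described in the abstract can be illustrated with a minimal sketch: instead of a single global mixing weight shared by all classes, each class gets its own weight on the RGB-stream score, with the audio-stream score taking the complementary weight. The function, score values, and weights below are hypothetical illustrations of the general idea, not the authors' implementation.

```python
def class_specific_fusion(rgb_scores, audio_scores, class_weights):
    """Fuse per-class scores from an RGB stream and an audio stream.

    Each class c uses its own weight class_weights[c] on the RGB score;
    the audio score receives the complement (1 - class_weights[c]).
    Returns the fused score list and the predicted class index.
    """
    fused = [
        w * r + (1.0 - w) * a
        for r, a, w in zip(rgb_scores, audio_scores, class_weights)
    ]
    pred = max(range(len(fused)), key=fused.__getitem__)
    return fused, pred

# Toy example: class 0 is visually driven, class 1 is audio driven.
rgb = [0.9, 0.2]      # hypothetical RGB-stream class scores
audio = [0.1, 0.8]    # hypothetical audio-stream class scores
weights = [0.8, 0.3]  # per-class trust in the RGB stream
fused, pred = class_specific_fusion(rgb, audio, weights)
# fused[0] = 0.8*0.9 + 0.2*0.1 = 0.74; fused[1] = 0.3*0.2 + 0.7*0.8 = 0.62
```

A conventional two-stream baseline corresponds to setting every entry of `class_weights` to the same value; letting the weight vary per class is what the class-nature-specific fusion adds.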

Original language: English
Title of host publication: CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018
Publisher: Association for Computing Machinery, Inc
Pages: 27-30
Number of pages: 4
ISBN (Electronic): 9781450359764
DOIs: https://doi.org/10.1145/3265987.3265988
Publication status: Published - 2018 Oct 15
Event: 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, CoVieW 2018, in conjunction with ACM Multimedia, MM 2018 - Seoul, Korea, Republic of
Duration: 2018 Oct 22 → …

Publication series

Name: CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018

Other

Other: 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, CoVieW 2018, in conjunction with ACM Multimedia, MM 2018
Country: Korea, Republic of
City: Seoul
Period: 18/10/22 → …

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
  • Health Informatics
  • Media Technology

Cite this

Hong, Y. W., Kim, H., & Byun, H. (2018). Multi-task joint learning for videos in the wild. In CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018 (pp. 27-30). (CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018). Association for Computing Machinery, Inc. https://doi.org/10.1145/3265987.3265988
Hong, Yong Won ; Kim, Hoseong ; Byun, Hyeran. / Multi-task joint learning for videos in the wild. CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018. Association for Computing Machinery, Inc, 2018. pp. 27-30 (CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018).
@inproceedings{88036104e135439aa5ec69aec62cfa27,
title = "Multi-task joint learning for videos in the wild",
abstract = "Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly assume that every feature contributes equally to the classification of a video. This obscures the nature of each class, since every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with class-specific weights to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to examine how each class responds to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion but also illustrate, for each class, the correlation between the two features, the RGB and audio signals.",
author = "Hong, {Yong Won} and Hoseong Kim and Hyeran Byun",
year = "2018",
month = "10",
day = "15",
doi = "10.1145/3265987.3265988",
language = "English",
series = "CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018",
publisher = "Association for Computing Machinery, Inc",
pages = "27--30",
booktitle = "CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018",

}

Hong, YW, Kim, H & Byun, H 2018, Multi-task joint learning for videos in the wild. in CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018. CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018, Association for Computing Machinery, Inc, pp. 27-30, 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, CoVieW 2018, in conjunction with ACM Multimedia, MM 2018, Seoul, Korea, Republic of, 18/10/22. https://doi.org/10.1145/3265987.3265988

Multi-task joint learning for videos in the wild. / Hong, Yong Won; Kim, Hoseong; Byun, Hyeran.

CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018. Association for Computing Machinery, Inc, 2018. p. 27-30 (CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Multi-task joint learning for videos in the wild

AU - Hong, Yong Won

AU - Kim, Hoseong

AU - Byun, Hyeran

PY - 2018/10/15

Y1 - 2018/10/15

N2 - Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly assume that every feature contributes equally to the classification of a video. This obscures the nature of each class, since every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with class-specific weights to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to examine how each class responds to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion but also illustrate, for each class, the correlation between the two features, the RGB and audio signals.

AB - Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g. an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly assume that every feature contributes equally to the classification of a video. This obscures the nature of each class, since every class depends on different levels of information from different features. To incorporate the nature of each class, we present class-nature-specific fusion, which combines the features with class-specific weights to obtain the optimal result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining the features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to examine how each class responds to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion but also illustrate, for each class, the correlation between the two features, the RGB and audio signals.

UR - http://www.scopus.com/inward/record.url?scp=85058174479&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058174479&partnerID=8YFLogxK

U2 - 10.1145/3265987.3265988

DO - 10.1145/3265987.3265988

M3 - Conference contribution

AN - SCOPUS:85058174479

T3 - CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018

SP - 27

EP - 30

BT - CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018

PB - Association for Computing Machinery, Inc

ER -

Hong YW, Kim H, Byun H. Multi-task joint learning for videos in the wild. In CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018. Association for Computing Machinery, Inc. 2018. p. 27-30. (CoVieW 2018 - Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, co-located with MM 2018). https://doi.org/10.1145/3265987.3265988