Audio-visual attention networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, Kwanghoon Sohn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in audio-visual video sequences. To learn temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate the temporal attention network using deep neural networks (DNNs). In addition, to learn spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, we formulate a spatiotemporal encoder-decoder network using Convolutional LSTM (ConvLSTM) modules, learned implicitly without any pixel-level annotations. Leveraging this spatiotemporal attention, 3D convolutional neural networks (3D-CNNs) are formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features into an emotion regression model. The experimental results show that our method achieves state-of-the-art results in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
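The evaluation metric named in the abstract, the concordance correlation coefficient (CCC), has a standard closed form: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal pure-Python sketch of this formula (the function name `ccc` is ours, not from the paper):

```python
from statistics import mean

def ccc(x, y):
    # Concordance correlation coefficient (Lin, 1989):
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    # Unlike Pearson correlation, CCC also penalizes shifts in mean
    # and scale between predictions and ground-truth annotations.
    mx, my = mean(x), mean(y)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Perfect agreement yields 1.0; a constant offset lowers the score
# even when the two sequences are perfectly correlated.
print(ccc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))  # 1.0
print(ccc([0.1, 0.5, 0.9], [0.3, 0.7, 1.1]) < 1.0)  # True
```

CCC is the standard metric in the AV+EC / AVEC challenge series for dimensional (arousal/valence) emotion regression, which is why it is reported here rather than plain correlation or MSE.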

Original language: English
Title of host publication: AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018
Publisher: Association for Computing Machinery, Inc
Pages: 27-32
Number of pages: 6
ISBN (Electronic): 9781450359771
DOIs: https://doi.org/10.1145/3264869.3264873
Publication status: Published - 2018 Oct 26
Event: 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018 - Seoul, Korea, Republic of
Duration: 2018 Oct 26 → …

Publication series

Name: AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018

Conference

Conference: 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018
Country: Korea, Republic of
City: Seoul
Period: 18/10/26 → …

All Science Journal Classification (ASJC) codes

  • Software
  • Media Technology
  • Computer Graphics and Computer-Aided Design
  • Computer Vision and Pattern Recognition


Cite this

    Lee, J., Kim, S., Kim, S., & Sohn, K. (2018). Audio-visual attention networks for emotion recognition. In AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018 (pp. 27-32). (AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018). Association for Computing Machinery, Inc. https://doi.org/10.1145/3264869.3264873