Learning to Localize Sound Source in Visual Scenes

Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

Visual events are usually accompanied by sounds in our daily lives. We pose the question: can a machine learn the correspondence between a visual scene and its sound, and localize the sound source only by observing sound and visual scene pairs, as humans do? In this paper, we propose a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. A two-stream network structure, in which each stream handles one modality, is developed together with an attention mechanism for sound source localization. Moreover, although our network is formulated within the unsupervised learning framework, a simple modification extends it to a unified architecture that also supports the supervised and semi-supervised learning settings. In addition, a new sound source dataset is developed for performance evaluation. Our empirical evaluation shows that the unsupervised method can reach false conclusions in some cases. We also show that even with a small amount of supervision, i.e., in the semi-supervised setup, these false conclusions can be corrected effectively.
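
For readers who want a concrete picture of the design described above, the following is a minimal PyTorch sketch of the two-stream attention idea, not the authors' implementation: the module names, layer sizes, input shapes, and the contrastive-style scoring are all illustrative assumptions. Only the overall pattern follows the abstract: a vision stream produces a spatial feature map, a sound stream produces an embedding, and location-wise similarity between the two serves as a localization (attention) map, learnable from image-sound pairs alone.

    # Hypothetical sketch of a two-stream network with attention for sound
    # source localization; all layer sizes and names are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamLocalizer(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            # Vision stream: a tiny conv net standing in for a pretrained CNN.
            self.vision = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, dim, 3, stride=2, padding=1),
            )
            # Sound stream: a tiny 1-D conv net over a waveform stand-in.
            self.sound = nn.Sequential(
                nn.Conv1d(1, 64, 9, stride=4, padding=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim),
            )

        def forward(self, image, audio):
            v = F.normalize(self.vision(image), dim=1)  # (B, D, H, W) spatial features
            s = F.normalize(self.sound(audio), dim=1)   # (B, D) sound embedding
            # Attention map: cosine similarity between the sound embedding
            # and every spatial location of the visual feature map.
            att = torch.einsum('bdhw,bd->bhw', v, s)    # (B, H, W)
            att = att.flatten(1).softmax(dim=1).view(att.shape)
            # Sound-attended visual vector summarizes the sounding regions.
            z = torch.einsum('bdhw,bhw->bd', v, att)    # (B, D)
            score = (z * s).sum(dim=1)                  # correspondence score per pair
            return att, score

    # Usage: paired image/audio batches; an unsupervised loss would push
    # matched pairs' scores above mismatched ones (e.g. a triplet loss).
    model = TwoStreamLocalizer()
    att, score = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 16000))
    print(att.shape, score.shape)  # torch.Size([2, 16, 16]) torch.Size([2])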

Original language: English
Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Publisher: IEEE Computer Society
Pages: 4358-4366
Number of pages: 9
ISBN (Electronic): 9781538664209
DOI: 10.1109/CVPR.2018.00458
Publication status: Published - 14 Dec 2018
Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
Duration: 18 Jun 2018 → 22 Jun 2018

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Country: United States
City: Salt Lake City
Period: 18 Jun 2018 → 22 Jun 2018

Fingerprint

  • Acoustic waves
  • Unsupervised learning
  • Supervised learning

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Cite this

Senocak, A., Oh, T. H., Kim, J., Yang, M. H., & Kweon, I. S. (2018). Learning to Localize Sound Source in Visual Scenes. In Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (pp. 4358-4366). [8578556] (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR.2018.00458