Unsupervised sound localization via iterative contrastive learning

Yan Bo Lin, Hung Yu Tseng, Hsin Ying Lee, Yen Yu Lin, Ming Hsuan Yang

Research output: Contribution to journal › Article › peer-review

Abstract

Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, which makes it difficult to supervise the learning of a machine for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes as pseudo-labels (1) the localization results in images predicted in the previous iteration and (2) the semantic relationships inferred from the audio signals. We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.
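The pseudo-labelling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`localization_map`, `pseudo_mask`, `contrastive_loss`), the cosine-similarity map, the fixed threshold, and the temperature value are all assumptions made for illustration, and the cross-video (inter-frame) relations are omitted. It only shows how a localization map predicted at one iteration could be binarised into a pseudo-label and used in an InfoNCE-style loss at the next.

```python
import numpy as np

def localization_map(visual_feats, audio_emb):
    """Cosine similarity between every spatial location and the audio embedding.

    visual_feats: (H, W, D) array, audio_emb: (D,) array -> (H, W) map in [-1, 1].
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return v @ a

def pseudo_mask(sim_map, threshold=0.5):
    """Binarise a map into a pseudo-label (1 = assumed sounding region).

    The threshold value is an assumption, not taken from the paper.
    """
    return (sim_map > threshold).astype(np.float64)

def contrastive_loss(sim_map, mask, temperature=0.1):
    """InfoNCE-style loss: masked locations act as positives, the rest as negatives."""
    logits = sim_map.ravel() / temperature
    logits = logits - logits.max()          # shift for numerical stability
    exp = np.exp(logits)
    positives = (exp * mask.ravel()).sum()
    return -np.log((positives + 1e-8) / (exp.sum() + 1e-8))

# Toy example: random features, with the audio embedding copied from one location.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
audio = feats[1, 2].copy()

prev_map = localization_map(feats, audio)   # prediction from the "previous iteration"
mask = pseudo_mask(prev_map)                # reused as the pseudo-label
loss = contrastive_loss(prev_map, mask)     # would be minimised in the next iteration
```

In an actual training loop, the loss would be minimised with an automatic-differentiation framework, and the updated features would produce the next iteration's map and pseudo-label.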

Original language: English
Article number: 103602
Journal: Computer Vision and Image Understanding
Volume: 227
Publication status: Published - 2023 Jan

Bibliographical note

Funding Information:
This work was supported in part by the Ministry of Science and Technology under grants 109-2221-E-009-113-MY3, 110-2628-E-A49-008, and 110-2634-F007-015. We are also grateful to the National Center for High-performance Computing for providing computational resources and facilities.

Publisher Copyright:
© 2022 Elsevier Inc.

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
