Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing approaches to learning audio-text connections rely on parallel audio-text data, which is scarce on the web. We propose vip-AnT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a trimodal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2²¹ ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
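The pivot idea in the abstract can be sketched in a few lines: once an audio encoder and a text encoder have both been aligned to the same image embedding space, an audio clip can be classified zero-shot by comparing its embedding against text embeddings of candidate labels. The toy vectors below are illustrative stand-ins for embeddings produced by such trained encoders, which this sketch does not include.

```python
import numpy as np

def l2_normalize(x):
    """Project vectors onto the unit sphere so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy text embeddings for candidate sound labels (in practice, prompts like
# "the sound of a dog barking" run through a text encoder aligned to images).
label_names = ["dog bark", "siren", "sea waves"]
text_embs = l2_normalize([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
])

# Toy embedding for one audio clip, assumed to live in the same shared space
# because the audio encoder was trained against the image modality.
audio_emb = l2_normalize([0.8, 0.2, 0.1])

# Zero-shot classification: cosine similarity against each label's text
# embedding, then argmax. No audio-text pairs were needed to learn the
# space; the image modality connected the two.
scores = text_embs @ audio_emb
predicted = label_names[int(np.argmax(scores))]
print(predicted)  # → "dog bark"
```

The same scoring rule, with text queries against a bank of audio embeddings, gives the retrieval setting evaluated on Clotho.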

Original language: English
Title of host publication: NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics
Subtitle of host publication: Human Language Technologies, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 4492-4507
Number of pages: 16
ISBN (Electronic): 9781955917711
Publication status: Published - 2022
Event: 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022 - Seattle, United States
Duration: 2022 Jul 10 - 2022 Jul 15

Publication series

Name: NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Conference

Conference: 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022
Country/Territory: United States
City: Seattle
Period: 22/7/10 - 22/7/15

Bibliographical note

Funding Information:
We would like to thank the AI2 Mosaic team for discussions, the AI2 Beaker team for computing support, and the anonymous reviewers for their suggestions. Yanpeng would like to thank Ivan Titov for his comments on the draft. The work was partially supported by the European Research Council (ERC Starting Grant BroadSem 678254), the Dutch National Science Foundation (NWO VIDI 639.022.518), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), DARPA SemaFor program, and Google Cloud Compute.

Publisher Copyright:
© 2022 Association for Computational Linguistics.

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Software
