This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronisation. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to the binary classification (matching or non-matching) problem proposed by recent papers; (2) we demonstrate that the performance of this method far exceeds the existing baselines on the synchronisation task; (3) we apply the learnt embeddings to visual speech recognition and show that these self-supervised representations match the performance of representations learnt end-to-end in a fully-supervised manner.
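For illustration, below is a minimal sketch of the multi-way matching objective described in contribution (1): given a video clip embedding, the model must pick the one synchronised audio segment out of N candidates, trained with softmax cross-entropy rather than a binary match/non-match loss. The function name `multiway_matching_loss`, the negative-Euclidean-distance scoring, and the tensor shapes are assumptions made here for the sketch, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def multiway_matching_loss(video_emb, audio_embs, target_idx):
    """N-way matching loss: synchronisation as picking the one
    matching audio segment out of N candidates per video clip.

    video_emb:  (B, D)     embedding of each video clip
    audio_embs: (B, N, D)  N candidate audio-segment embeddings per clip
                           (e.g. temporal shifts of the same track)
    target_idx: (B,)       index of the truly synchronised candidate
    """
    # Negative squared Euclidean distance as the matching score,
    # so closer audio-video pairs receive larger logits.
    diffs = audio_embs - video_emb.unsqueeze(1)   # (B, N, D)
    logits = -diffs.pow(2).sum(dim=-1)            # (B, N)
    # Softmax cross-entropy over the N candidates: a multi-way
    # classification, not a binary matching decision.
    return F.cross_entropy(logits, target_idx)

# Usage with random stand-in embeddings (batch 8, 5 candidates, dim 512):
B, N, D = 8, 5, 512
v = torch.randn(B, D, requires_grad=True)
a = torch.randn(B, N, D, requires_grad=True)
loss = multiway_matching_loss(v, a, torch.randint(0, N, (B,)))
loss.backward()
```

In a real pipeline, `v` and `a` would come from the video and audio encoder networks, and the non-matching candidates would typically be temporally shifted segments of the same audio track, so gradients flow back into both encoders.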
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 5
Publication status: Published - May 2019
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 12 May 2019 → 17 May 2019
Publication series: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Bibliographical note: Publisher Copyright © 2019 IEEE.
All Science Journal Classification (ASJC) codes:
- Signal Processing
- Electrical and Electronic Engineering