Light-Weight Speaker Verification with Global Context Information

Miseul Kim, Zhenyu Piao, Seyun Um, Ran Lee, Jaemin Joh, Seungshin Lee, Hong Goo Kang

Research output: Contribution to journal › Conference article › peer-review


In this paper, we propose a light-weight speaker verification (SV) system that utilizes the characteristics of utterance-level global features. Many recent SV systems employ convolutional neural networks (CNNs) to extract representative speaker features from the given input utterances. However, the receptive field size in their feature extraction process is limited by the localized structure of the convolutional layers. To effectively extract utterance-level global speaker representations, we introduce a novel architecture that combines a CNN with a self-attention network able to exploit the relationship between local and aggregated global features. The global features are continuously updated at every analysis block using a point-wise attentive summation with the local features. We also adopt a densely connected CNN structure (DenseNet) to reliably estimate speaker-related local features with a small number of model parameters. Our proposed model achieves higher speaker verification performance, with an EER of 1.935%, using only 1.2M parameters, a 16% reduction in model size compared to the baseline models.
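The global-feature update described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: it assumes a running utterance-level vector that is refreshed at each analysis block by an attention-weighted summation of that block's frame-level (local) CNN features, with a hypothetical projection matrix `W` used for the attention scoring.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_global_update(local_feats, global_feat, W):
    """One block's point-wise attentive summation (illustrative sketch).

    local_feats : (T, D) frame-level features from a CNN block
    global_feat : (D,)   running utterance-level summary
    W           : (D, D) hypothetical bilinear projection for scoring
    """
    scores = local_feats @ W @ global_feat   # (T,) relevance of each frame
    weights = softmax(scores)                # point-wise attention weights
    summary = weights @ local_feats          # attention-weighted sum, (D,)
    return global_feat + summary             # updated global feature
```

With a zero initial global vector the attention weights are uniform, so the first update reduces to the mean of the local features; later blocks weight frames by their similarity to the evolving utterance-level summary.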

Original language: English
Pages (from-to): 5105-5109
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 2022 Sept 18 - 2022 Sept 22

Bibliographical note

Funding Information:
In this paper, we proposed a novel light-weight automatic speaker recognition model that captures both localized and global context features simultaneously to better model speaker-discriminative characteristics. We adopted a DenseNet-based architecture to extract localized speaker characteristics and utilized a self-attention mechanism to capture long-term utterance-level characteristics. From the energy distribution of the local and global features, we verified the effectiveness of our local and global feature modules. Our model achieved strong performance even with only a small number of parameters, and when implemented with a similar number of parameters as the reference models, it achieved the best recognition performance across all of our experiments, thereby demonstrating the effectiveness of our strategy for extracting and incorporating global utterance-level features. Acknowledgement: This work was supported by Hyundai Motors Co.

Publisher Copyright:
Copyright © 2022 ISCA.

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation


