In this paper, we propose a light-weight speaker verification (SV) system that utilizes the characteristics of utterance-level global features. Many recent SV systems employ convolutional neural networks (CNNs) to extract representative speaker features from the given input utterances. However, the receptive field size in the feature extraction process is inherently limited by the localized structure of the convolutional layers. To effectively extract utterance-level global speaker representations, we introduce a novel architecture combining a CNN with a self-attention network that is able to exploit the relationship between local and aggregated global features. The global features are continuously updated at every analysis block by applying a point-wise attentive summation to the local features. We also adopt a densely connected CNN structure (DenseNet) to reliably estimate speaker-related local features with a small number of model parameters. Our proposed model achieves strong speaker verification performance, with an EER of 1.935%, using only 1.2M parameters, a 16% reduction in model size compared with the baseline models.
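The abstract's "point-wise attentive summation" that refreshes the global feature at every analysis block can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the paper's exact formulation: the attention scores, the residual-style update, and the function names (`update_global`, `softmax`) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def update_global(local, global_feat):
    """Update the utterance-level global feature from local frame features.

    local:       (T, C) frame-level features from the current CNN block
    global_feat: (C,)   current global feature vector

    Hypothetical sketch: each frame is scored against the current global
    feature, and the new global feature is the attention-weighted sum of
    the frames added to the old one.
    """
    scores = local @ global_feat          # (T,) per-frame relevance
    weights = softmax(scores)             # point-wise attention weights
    attended = weights @ local            # (C,) weighted sum over frames
    return global_feat + attended         # residual-style update

# toy example: 10 frames, 4-dim features, 3 analysis blocks
rng = np.random.default_rng(0)
local = rng.normal(size=(10, 4))
g = np.zeros(4)
for _ in range(3):                        # one update per analysis block
    g = update_global(local, g)
print(g.shape)
```

With a zero initial global feature the first update reduces to a uniform average of the frames; later updates increasingly weight frames aligned with the accumulated global representation.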
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 2022 Sept 18 → 2022 Sept 22
Bibliographical note
Funding Information:
In this paper, we proposed a novel light-weight automatic speaker recognition model that captures localized and global context features simultaneously to better model speaker-discriminative characteristics. We adopted a DenseNet-based architecture to extract localized speaker characteristics and utilized a self-attention mechanism to capture long-term utterance-level characteristics. From the energy distribution of the local and global features, we verified the effectiveness of our local and global feature modules. Our model achieved strong performance even with only a small number of parameters, and when implemented with a similar number of parameters as the reference models, it achieved the best recognition performance of all our experiments, demonstrating the effectiveness of our strategy for extracting and incorporating global utterance-level features.
Acknowledgement. This work was supported by Hyundai Motors Co.
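The DenseNet-based local feature extractor mentioned above relies on dense connectivity: every layer receives the concatenation of all preceding feature maps, which is what keeps the parameter count small. The following is a minimal sketch of that connectivity pattern only, using a linear layer plus ReLU as a stand-in for the paper's convolutions; the function name `dense_block` and the growth-rate setup are illustrative assumptions.

```python
import numpy as np

def dense_block(x, weights):
    """DenseNet-style block: each layer consumes the concatenation of the
    input and all previous layer outputs, and its output is appended.

    x:       (T, C0) input features
    weights: list of matrices; the i-th has shape (C0 + i*k, k) for growth rate k
    """
    features = [x]
    for W in weights:
        inp = np.concatenate(features, axis=-1)  # all preceding feature maps
        out = np.maximum(inp @ W, 0.0)           # linear + ReLU stand-in for conv
        features.append(out)
    return np.concatenate(features, axis=-1)     # (T, C0 + len(weights)*k)

# toy example: 8 frames, 4 input channels, growth rate 3, 2 layers
rng = np.random.default_rng(1)
T, C0, k = 8, 4, 3
Ws = [rng.normal(size=(C0 + i * k, k)) for i in range(2)]
y = dense_block(rng.normal(size=(T, C0)), Ws)
print(y.shape)
```

Because each layer only adds `k` new channels while reusing all earlier ones, the block grows feature richness without the parameter cost of widening every layer.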
Copyright © 2022 ISCA.
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Modelling and Simulation