In this paper, we propose an effective way of providing conditional features for a flow-based neural vocoder. Most conventional approaches condition neural vocoders on mel-spectrograms, but the high dimensionality of these features significantly increases the size of the networks. We show that the network size of a flow-based generative model can be reduced by conditioning instead on the acoustic parameters of a sinusoidal speech analysis-and-synthesis framework, such as the voiced/unvoiced flag, fundamental frequency, mel-cepstral coefficients, and energy of each analysis frame. We also find that training becomes much easier if the fundamental frequency is quantized with a small number of bits and fed to the network as an embedded vector representation. Experimental results verify that the performance of the proposed algorithm is comparable to that of flow-based neural vocoders conditioned on mel-spectrograms, while both the information required for the feature representation and the network complexity for generating speech are lower.
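The idea of quantizing the fundamental frequency and feeding it as an embedded vector can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not taken from the paper: the function names `quantize_f0` and `embed_f0`, the 6-bit code size, the 60 to 400 Hz range, the log-scale spacing, the reservation of code 0 for unvoiced frames, and the randomly initialized table (in a real vocoder the table would be a trainable parameter).

```python
import numpy as np


def quantize_f0(f0_hz, vuv, n_bits=6, f0_min=60.0, f0_max=400.0):
    """Map per-frame F0 values (Hz) to integer codes in [0, 2**n_bits - 1].

    Assumed scheme: code 0 marks unvoiced frames; voiced frames are placed
    on a uniform log-frequency grid using codes 1 .. 2**n_bits - 1.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    vuv = np.asarray(vuv, dtype=bool)
    n_voiced_levels = 2 ** n_bits - 1
    # Clip so unvoiced frames (often stored as 0 Hz) do not break the log.
    clipped = np.clip(f0_hz, f0_min, f0_max)
    # Relative position of each frame within the log-frequency range, in [0, 1].
    pos = (np.log(clipped) - np.log(f0_min)) / (np.log(f0_max) - np.log(f0_min))
    codes = 1 + np.round(pos * (n_voiced_levels - 1)).astype(np.int64)
    return np.where(vuv, codes, 0)


def embed_f0(codes, table):
    """Look up one embedding vector per frame (row indexing into the table)."""
    return table[codes]


# Hypothetical usage: four frames, the first one unvoiced.
rng = np.random.default_rng(0)
n_bits, embed_dim = 6, 8
table = 0.1 * rng.standard_normal((2 ** n_bits, embed_dim))  # stand-in for a learned table

codes = quantize_f0([0.0, 60.0, 400.0, 120.0], [0, 1, 1, 1], n_bits=n_bits)
f0_features = embed_f0(codes, table)  # shape: (frames, embed_dim)
```

The resulting per-frame vectors could then be concatenated with the remaining conditioning features (voiced/unvoiced flag, mel-cepstral coefficients, energy) before being passed to the vocoder; how the paper combines them is not stated in the abstract.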
|Title of host publication||Conference Record of the 54th Asilomar Conference on Signals, Systems and Computers, ACSSC 2020|
|Editors||Michael B. Matthews|
|Publisher||IEEE Computer Society|
|Number of pages||5|
|Publication status||Published - 2020 Nov 1|
|Event||54th Asilomar Conference on Signals, Systems and Computers, ACSSC 2020 - Pacific Grove, United States (Duration: 2020 Nov 1 → 2020 Nov 5)|
|Name||Conference Record - Asilomar Conference on Signals, Systems and Computers|
Bibliographical note
Publisher Copyright:
© 2020 IEEE.
All Science Journal Classification (ASJC) codes
- Signal Processing
- Computer Networks and Communications