This paper investigates how reconstruction errors in excitation signals generated by a deep learning-based statistical model affect the perceptual quality of synthesized speech. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme; these components are then learned and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components on the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. In contrast, the absolute spectral distortion of the noise components is not perceptually meaningful; only their spectral envelope is relevant to perceptual quality.
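The abstract refers to two standard signal-processing steps without defining them: LPC inverse filtering to obtain an excitation signal, and spectral distortion measured in dB. The following is a minimal sketch of both, not the authors' implementation. It assumes autocorrelation-method LPC and an RMS log-spectral distortion; the paper's exact definitions may differ, and the ITFTE decomposition itself is not reproduced. The function names `lpc_coefficients`, `inverse_filter`, and `log_spectral_distortion` are hypothetical.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Estimate A(z) = 1 + a1 z^-1 + ... + ap z^-p (autocorrelation method).

    Assumption: the abstract does not say how the LPC filter is estimated.
    """
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))


def inverse_filter(frame, a):
    """Apply the FIR filter A(z) to the frame to obtain the excitation (residual)."""
    return np.convolve(frame, a)[: len(frame)]


def log_spectral_distortion(ref, est, eps=1e-12):
    """RMS difference in dB between the log-magnitude spectra of two signals.

    Assumption: one common definition of spectral distortion; both inputs
    must have the same length.
    """
    ref_db = 20.0 * np.log10(np.abs(np.fft.rfft(ref)) + eps)
    est_db = 20.0 * np.log10(np.abs(np.fft.rfft(est)) + eps)
    return float(np.sqrt(np.mean((ref_db - est_db) ** 2)))
```

In this sketch, `inverse_filter(frame, lpc_coefficients(frame, order))` yields the residual that would then be split into harmonic and noise components, and `log_spectral_distortion` compares a reference excitation with a generated one.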
|Title of host publication||2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||6|
|Publication status||Published - 2018 Jan 24|
|Event||2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan|
Duration: 2017 Dec 16 → 2017 Dec 20
Bibliographical note
Funding Information:
This research was supported by Microsoft Research and the MSIP (The Ministry of Science, ICT and Future Planning), Korea, under the ICT/SW Creative research program supervised by the IITP (Institute for Information & Communications Technology Promotion). The authors would like to thank Feng-Long Xie, Microsoft Research Asia, Beijing, China, for conducting the listening tests.
© 2017 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Vision and Pattern Recognition
- Human-Computer Interaction