Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems

Eunwoo Song, Frank K. Soong, Hong Goo Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

This paper investigates how the perceptual quality of synthesized speech is affected by reconstruction errors in excitation signals generated by a deep learning-based statistical model. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme; these components are then modeled and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components on the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. In contrast, the absolute spectral distortion in the noise components is not perceptually meaningful, and only the spectral envelope is relevant to the perceptual quality.
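As a rough illustration of two quantities the abstract refers to, the sketch below computes an LPC residual (the excitation obtained by inverse filtering) and a log-spectral distortion in dB between two magnitude spectra. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact distortion measure used in the paper, so the RMS log-spectral distance shown here is a commonly used stand-in, and the function names `lpc_inverse_filter` and `log_spectral_distortion_db` are hypothetical.

```python
import math


def lpc_inverse_filter(signal, lpc_coeffs):
    """Return the residual e[n] = s[n] - sum_k a_k * s[n-k].

    lpc_coeffs holds the predictor coefficients a_1..a_p.
    Samples before n = 0 are treated as zero.
    """
    residual = []
    for n, s_n in enumerate(signal):
        prediction = sum(
            a_k * signal[n - k]
            for k, a_k in enumerate(lpc_coeffs, start=1)
            if n - k >= 0
        )
        residual.append(s_n - prediction)
    return residual


def log_spectral_distortion_db(ref_mag, gen_mag, floor=1e-12):
    """RMS difference between two magnitude spectra on a dB scale.

    Assumption: this is a generic log-spectral distance, not necessarily
    the measure used in the paper. `floor` avoids log10(0).
    """
    if len(ref_mag) != len(gen_mag) or not ref_mag:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_sum = 0.0
    for r, g in zip(ref_mag, gen_mag):
        diff_db = 20.0 * math.log10(max(r, floor) / max(g, floor))
        squared_sum += diff_db ** 2
    return math.sqrt(squared_sum / len(ref_mag))


if __name__ == "__main__":
    # A first-order decaying signal is perfectly predicted by a_1 = 0.5,
    # so the residual is non-zero only at the first sample.
    print(lpc_inverse_filter([1.0, 0.5, 0.25, 0.125], [0.5]))
    # Identical spectra give zero distortion.
    print(log_spectral_distortion_db([1.0, 2.0], [1.0, 2.0]))
```

Under this definition, a threshold such as the ∼0.08 dB reported for the harmonic spectrum would be compared against the value returned for a reference/generated pair of harmonic magnitude spectra, averaged over frames.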

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 671-676
Number of pages: 6
ISBN (Electronic): 9781509047888
Publication status: Published - 2018 Jan 24
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 2017 Dec 16 → 2017 Dec 20

Publication series

Name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume: 2018-January

Other

Other: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory: Japan
City: Okinawa
Period: 17/12/16 → 17/12/20

Bibliographical note

Funding Information:
This research was supported by Microsoft Research and the MSIP (The Ministry of Science, ICT and Future Planning), Korea, under the ICT/SW Creative research program supervised by the IITP (Institute for Information & Communications Technology Promotion). The authors would like to thank Feng-Long Xie, Microsoft Research Asia, Beijing, China, for conducting the listening tests.

Publisher Copyright:
© 2017 IEEE.

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
