Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems

Eunwoo Song, Frank K. Soong, Hong Goo Kang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

This paper investigates how reconstruction errors in excitation signals generated by a deep learning-based statistical model affect the perceptual quality of synthesized speech. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme; these components are then modeled and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components on the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is approximately 0.08 dB. In contrast, the absolute spectral distortion in the noise components is not meaningful, and only their spectral envelope is relevant to the perceptual quality.
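As an illustration of the kind of measurement the abstract refers to, the following is a minimal sketch of a root-mean-square log-spectral distortion in dB between a reference and a generated magnitude spectrum. The abstract does not state the exact distortion formula used in the paper, so this common definition is an assumption, and the function name `log_spectral_distortion_db` and the `floor` parameter are hypothetical.

```python
import math


def log_spectral_distortion_db(ref_mag, gen_mag, floor=1e-10):
    """RMS log-spectral distortion in dB between two magnitude spectra.

    Assumption: a common textbook definition, not necessarily the one
    used in the paper. `floor` (hypothetical) guards against log(0).
    """
    if len(ref_mag) != len(gen_mag) or len(ref_mag) == 0:
        raise ValueError("spectra must be non-empty and of equal length")

    total = 0.0
    for r, g in zip(ref_mag, gen_mag):
        # Per-bin difference of the two spectra on a dB scale.
        diff_db = 20.0 * math.log10(max(r, floor) / max(g, floor))
        total += diff_db ** 2

    # Root of the mean squared dB difference over all frequency bins.
    return math.sqrt(total / len(ref_mag))


if __name__ == "__main__":
    reference = [1.0, 0.8, 0.5, 0.2]
    generated = [1.0, 0.79, 0.51, 0.2]
    print(log_spectral_distortion_db(reference, generated))
```

Under this definition, identical spectra give a distortion of 0 dB, and a uniform factor-of-ten magnitude mismatch gives 20 dB, which puts a threshold on the order of 0.08 dB in perspective as a very small deviation.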

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 671-676
Number of pages: 6
ISBN (Electronic): 9781509047888
DOI: https://doi.org/10.1109/ASRU.2017.8269001
Publication status: Published - 2018 Jan 24
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 2017 Dec 16 to 2017 Dec 20

Publication series

Name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume: 2018-January

Other

Other: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country: Japan
City: Okinawa
Period: 17/12/16 to 17/12/20

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction


Cite this

Song, E., Soong, F. K., & Kang, H. G. (2018). Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (pp. 671-676). (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings; Vol. 2018-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8269001