Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems

Eunwoo Song, Frank K. Soong, Hong-Goo Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper investigates how the perceptual quality of synthesized speech is affected by reconstruction errors in excitation signals generated by a deep learning-based statistical model. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme; these components are then modeled and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components on the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. In contrast, the absolute spectral distortion in the noise components is perceptually insignificant; only their spectral envelope is relevant to the perceptual quality.
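As an illustrative sketch only (not the paper's implementation), the two operations the abstract refers to can be written in a few lines of NumPy: extracting the excitation (residual) signal by applying the LPC analysis (inverse) filter, and the RMS log-spectral distortion in dB, the kind of measure to which the ∼0.08 dB threshold applies. Function names here are hypothetical.

```python
import numpy as np

def lpc_inverse_filter(signal, lpc):
    """Apply the LPC analysis (inverse) filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    to a speech frame, yielding the excitation (residual) signal.
    `lpc` holds [1, a1, ..., ap]; e[n] = sum_k lpc[k] * s[n-k]."""
    return np.convolve(signal, lpc, mode="full")[: len(signal)]

def log_spectral_distortion(mag_a, mag_b, eps=1e-12):
    """RMS difference between two magnitude spectra, in dB."""
    diff = 20.0 * np.log10(mag_a + eps) - 20.0 * np.log10(mag_b + eps)
    return float(np.sqrt(np.mean(diff ** 2)))

# Example: a signal synthesized from a unit impulse through 1/A(z)
# should yield that impulse back after inverse filtering.
a = np.array([1.0, -0.5])          # A(z) = 1 - 0.5*z^-1
s = 0.5 ** np.arange(8)            # impulse response of 1/A(z)
e = lpc_inverse_filter(s, a)       # recovered excitation: [1, 0, 0, ...]
```

Comparing the magnitude spectra of a reference and a regenerated excitation with `log_spectral_distortion` gives a single dB figure per frame, which can then be averaged over a test set.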

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 671-676
Number of pages: 6
Volume: 2018-January
ISBN (Electronic): 9781509047888
DOI: 10.1109/ASRU.2017.8269001
Publication status: Published - 2018 Jan 24
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 2017 Dec 16 - 2017 Dec 20

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction

Cite this

Song, E., Soong, F. K., & Kang, H.-G. (2018). Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (Vol. 2018-January, pp. 671-676). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8269001