Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

Eunwoo Song, Frank K. Soong, Hong Goo Kang

Research output: Contribution to journal › Article › peer-review

31 Citations (Scopus)

Abstract

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trained counterpart, both objectively and subjectively.
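The LP bandwidth expansion mentioned in the abstract can be illustrated with the textbook form of the technique, in which each LPC coefficient a[k] is scaled by gamma**k so that every pole of the synthesis filter 1/A(z) moves toward the origin by the factor gamma. The sketch below is a generic illustration and not the authors' implementation: the function names, the expansion factor gamma = 0.98, and the second-order example coefficients are all assumptions, since the abstract gives no numerical settings.

```python
import numpy as np

def bandwidth_expand(a, gamma=0.98):
    """Scale LPC coefficient a[k] by gamma**k (a[0] is assumed to be 1).

    gamma is an illustrative value, not one taken from the paper.
    """
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

def max_pole_radius(a):
    """Largest pole magnitude of the all-pole synthesis filter 1/A(z)."""
    return float(np.max(np.abs(np.roots(a))))

# Hypothetical second-order predictor whose poles sit at radius 0.99,
# i.e. close to the unit circle and therefore close to instability.
a = np.array([1.0, -1.8, 0.9801])
a_exp = bandwidth_expand(a, gamma=0.98)

print(max_pole_radius(a))      # pole radius before expansion
print(max_pole_radius(a_exp))  # pole radius after expansion (scaled by gamma)
```

Because the pole radii shrink by gamma, a filter generated slightly too close to the unit circle is pulled back toward the stable region, which is consistent with the abstract's remark about reducing unstable synthesis filter conditions. The line spectral frequency-sharpening filter is not sketched here because the abstract does not describe its form.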

Original language: English
Pages (from-to): 2152-2161
Number of pages: 10
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 25
Issue number: 11
Publication status: Published - 2017 Nov

Bibliographical note

Funding Information:
Manuscript received March 29, 2017; revised July 12, 2017; accepted August 12, 2017. Date of publication August 29, 2017; date of current version September 20, 2017. This work was supported by Microsoft Research and the MSIP (The Ministry of Science, ICT and Future Planning), Korea, under the ICT/SW Creative research program supervised by the IITP (Institute for Information & Communications Technology Promotion). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sin-Horng Chen. (Corresponding author: Hong Goo Kang.) E. Song was with Microsoft Research Asia, Beijing 100080, China. He is now with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, South Korea, and also with NAVER Corp., Seongnam 13561, Korea (e-mail: sewplay@dsp.yonsei.ac.kr).

Publisher Copyright:
© 2014 IEEE.

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
