Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

Eunwoo Song, Frank K. Soong, Hong Goo Kang

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trained counterpart, both objectively and subjectively.
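As a rough illustration of the LP bandwidth expansion step mentioned in the abstract, the sketch below uses the common textbook form in which each LPC coefficient a_k is scaled by gamma^k, which pulls the poles of the synthesis filter 1/A(z) toward the origin. The abstract does not state the expansion factor or confirm that this exact form was used, so gamma = 0.98, the example coefficients, and the function names are all assumptions made for illustration.

```python
import numpy as np


def bandwidth_expand(lpc, gamma=0.98):
    """Scale a_k by gamma**k (assumed textbook form, not taken from the paper).

    `lpc` holds the coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p,
    starting with the leading 1.
    """
    lpc = np.asarray(lpc, dtype=float)
    return lpc * gamma ** np.arange(len(lpc))


def max_pole_radius(lpc):
    """Largest pole magnitude of the synthesis filter 1/A(z)."""
    return float(np.max(np.abs(np.roots(lpc))))


# Hypothetical 2nd-order example: A(z) = 1 - 1.8 z^-1 + 0.9 z^-2
a = np.array([1.0, -1.8, 0.9])
a_bw = bandwidth_expand(a, gamma=0.98)

print("pole radius before:", max_pole_radius(a))
print("pole radius after: ", max_pole_radius(a_bw))
```

Under this form every pole radius shrinks by exactly the factor gamma, which is why the operation widens formant bandwidths and gives near-unstable synthesis filters some margin inside the unit circle.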

Original language: English
Pages (from-to): 2152-2161
Number of pages: 10
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 25
Issue number: 11
DOIs: 10.1109/TASLP.2017.2746264
Publication status: Published - Nov 2017

Fingerprint

Speech Synthesis
Speech synthesis
Long-Term Memory
Memory Term
Recurrent neural networks
Recurrent Neural Networks
Short-Term Memory
neural network
Excitation
Trajectories
Trajectory
trajectories
synthesis
Modeling
Filter
excitation
Text-to-speech
filters
Feedforward Neural Networks
linguistics

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

Cite this

@article{b0e85b8e520e4e58856ec9868f562592,
title = "Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems",
abstract = "In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trained counterpart, both objectively and subjectively.",
author = "Song, Eunwoo and Soong, {Frank K.} and Kang, {Hong Goo}",
year = "2017",
month = "11",
doi = "10.1109/TASLP.2017.2746264",
language = "English",
volume = "25",
pages = "2152--2161",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE",
number = "11",

}

Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems. / Song, Eunwoo; Soong, Frank K.; Kang, Hong Goo.

In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 25, No. 11, 11.2017, p. 2152-2161.

TY - JOUR

T1 - Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

AU - Song, Eunwoo

AU - Soong, Frank K.

AU - Kang, Hong Goo

PY - 2017/11

Y1 - 2017/11

AB - In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trained counterpart, both objectively and subjectively.

UR - http://www.scopus.com/inward/record.url?scp=85028712802&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028712802&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2017.2746264

DO - 10.1109/TASLP.2017.2746264

M3 - Article

AN - SCOPUS:85028712802

VL - 25

SP - 2152

EP - 2161

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

SN - 2329-9290

IS - 11

ER -