Improved time-frequency trajectory excitation vocoder for DNN-based speech synthesis

Eunwoo Song, Frank K. Soong, Hong Goo Kang

Research output: Contribution to journal › Conference article

4 Citations (Scopus)

Abstract

We investigate an improved time-frequency trajectory excitation (ITFTE) vocoder for deep neural network (DNN)-based statistical parametric speech synthesis (SPSS) systems. The ITFTE vocoder is a linear predictive coding-based vocoder in which a pitch-dependent excitation signal is represented by a periodicity distribution in the time-frequency domain. The proposed method significantly improves the parameterization efficiency of the ITFTE vocoder for the DNN-based SPSS system, even though the dimension of its parameters varies with pitch. By utilizing the orthogonality property of the discrete cosine transform, we not only accurately reconstruct the ITFTE parameters but also improve the perceptual quality of synthesized speech. Objective and subjective test results confirm that the proposed method provides superior synthesized speech compared to the previous system.
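
As a rough illustration of why orthogonality helps with variable-dimension parameters, the sketch below maps a vector whose length depends on the pitch period to a fixed number of discrete cosine transform (DCT) coefficients and back. It is not the authors' implementation: the function names (`dct_matrix`, `to_fixed_dim`, `from_fixed_dim`), the truncate-or-zero-pad rule, and the fixed dimension of 12 are all assumptions made for this example. Because the basis is orthonormal, the inverse transform is simply the transpose, and dropping high-order coefficients leaves a squared error equal to the energy of the dropped coefficients.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n (row k is the k-th basis vector)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def dct(x):
    """Forward transform: project x onto each basis vector."""
    basis = dct_matrix(len(x))
    return [sum(b * v for b, v in zip(row, x)) for row in basis]


def idct(coeffs):
    """Inverse transform: the transpose of the orthonormal basis."""
    n = len(coeffs)
    basis = dct_matrix(n)
    return [sum(basis[k][i] * coeffs[k] for k in range(n)) for i in range(n)]


def to_fixed_dim(x, dim):
    """Hypothetical helper: keep the first `dim` coefficients, zero-padding if x is short."""
    coeffs = dct(x)
    return (coeffs + [0.0] * dim)[:dim]


def from_fixed_dim(coeffs, length):
    """Hypothetical helper: rebuild a vector of the original (pitch-dependent) length."""
    padded = (list(coeffs) + [0.0] * length)[:length]
    return idct(padded)


# Usage: two frames with different pitch-dependent lengths, one fixed dimension.
FIXED_DIM = 12  # assumed value, chosen only for this example
short_frame = [0.9, 0.7, 0.4, 0.2, 0.1, 0.05, 0.0]     # length 7
long_frame = [math.cos(0.3 * i) for i in range(20)]     # length 20

short_rec = from_fixed_dim(to_fixed_dim(short_frame, FIXED_DIM), len(short_frame))
long_rec = from_fixed_dim(to_fixed_dim(long_frame, FIXED_DIM), len(long_frame))
```

In this sketch the short frame is recovered exactly because none of its coefficients are discarded, while the long frame is approximated by its 12 lowest-order components. The original length still has to be known at reconstruction time, for example from the pitch value.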

Original language: English
Pages (from-to): 2253-2257
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 08-12-September-2016
DOIs: 10.21437/Interspeech.2016-230
Publication status: Published - 2016 Jan 1
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: 2016 Sep 8 to 2016 Sep 16

Fingerprint

Speech Synthesis
Excitation
Trajectories
Neural Networks
Discrete cosine transforms
Parameterization
Orthogonality
Periodicity
Frequency Domain
Time Domain
Coding
Deep neural networks
Dependent

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

@article{15644933777349ccbef087514107275b,
title = "Improved time-frequency trajectory excitation vocoder for DNN-based speech synthesis",
abstract = "We investigate an improved time-frequency trajectory excitation (ITFTE) vocoder for deep neural network (DNN)-based statistical parametric speech synthesis (SPSS) systems. The ITFTE is a linear predictive coding-based vocoder, where a pitch-dependent excitation signal is represented by a periodicity distribution in a time-frequency domain. The proposed method significantly improves the parameterization efficiency of ITFTE vocoder for the DNN-based SPSS system, even if its dimension changes due to the inherent nature of pitch variation. By utilizing an orthogonality property of discrete cosine transform, we not only accurately reconstruct the ITFTE parameters but also improve the perceptual quality of synthesized speech. Objective and subjective test results confirm that the proposed method provides superior synthesized speech compared to the previous system.",
author = "Eunwoo Song and Soong, {Frank K.} and Kang, {Hong Goo}",
year = "2016",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2016-230",
language = "English",
volume = "08-12-September-2016",
pages = "2253--2257",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

Improved time-frequency trajectory excitation vocoder for DNN-based speech synthesis. / Song, Eunwoo; Soong, Frank K.; Kang, Hong Goo.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016, 01.01.2016, p. 2253-2257.

TY - JOUR

T1 - Improved time-frequency trajectory excitation vocoder for DNN-based speech synthesis

AU - Song, Eunwoo

AU - Soong, Frank K.

AU - Kang, Hong Goo

PY - 2016/1/1

Y1 - 2016/1/1

AB - We investigate an improved time-frequency trajectory excitation (ITFTE) vocoder for deep neural network (DNN)-based statistical parametric speech synthesis (SPSS) systems. The ITFTE is a linear predictive coding-based vocoder, where a pitch-dependent excitation signal is represented by a periodicity distribution in a time-frequency domain. The proposed method significantly improves the parameterization efficiency of ITFTE vocoder for the DNN-based SPSS system, even if its dimension changes due to the inherent nature of pitch variation. By utilizing an orthogonality property of discrete cosine transform, we not only accurately reconstruct the ITFTE parameters but also improve the perceptual quality of synthesized speech. Objective and subjective test results confirm that the proposed method provides superior synthesized speech compared to the previous system.

UR - http://www.scopus.com/inward/record.url?scp=84994259552&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994259552&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2016-230

DO - 10.21437/Interspeech.2016-230

M3 - Conference article

AN - SCOPUS:84994259552

VL - 08-12-September-2016

SP - 2253

EP - 2257

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -