An effective style token weight control technique for end-to-end emotional speech synthesis

Ohsung Kwon, Inseon Jang, Chung Hyun Ahn, Hong Goo Kang

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

In this letter, we propose a high-quality emotional speech synthesis system that uses an emotional vector space, i.e., the weighted sum of global style tokens (GSTs). Our previous research verified the feasibility of GST-based emotional speech synthesis in an end-to-end text-to-speech synthesis framework. However, selecting appropriate reference audio (RA) signals from which to extract emotion embedding vectors for specific target emotions remains problematic. To ameliorate the selection problem, we propose an effective way of generating emotion embedding vectors by utilizing the trained GSTs. By assuming that the trained GSTs represent an emotional vector space, we first investigate the distribution of all the training samples depending on the type of each emotion. We then regard the centroid of the distribution as an emotion-specific weighting value, which effectively controls the expressiveness of synthesized speech even without the RA guidance that the previous approach required. Finally, we confirm that the proposed controlled weight-based method is superior to the conventional emotion label-based methods in terms of perceptual quality and emotion classification accuracy.
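The centroid idea described in the abstract can be illustrated with a minimal sketch. It assumes that a style token weight vector has already been extracted for every training sample and that the centroid is a plain arithmetic mean per emotion label; the letter itself may define these steps differently. The function names `emotion_centroids` and `style_embedding`, the token values, and the sample weights are all hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict


def emotion_centroids(weights, labels):
    """Average the token weight vectors of all samples sharing a label.

    Assumption: the centroid is the arithmetic mean of the weight vectors.
    """
    grouped = defaultdict(list)
    for weight_vector, label in zip(weights, labels):
        grouped[label].append(weight_vector)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def style_embedding(token_weights, tokens):
    """Weighted sum of token vectors: sum over k of w_k * token_k."""
    dim = len(tokens[0])
    return [
        sum(w * token[d] for w, token in zip(token_weights, tokens))
        for d in range(dim)
    ]


# Toy data (hypothetical): 3 style tokens of dimension 2, 4 training samples.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.3, 0.5, 0.2],
]
labels = ["happy", "happy", "sad", "sad"]

# One centroid weight vector per emotion; no reference audio is needed
# at synthesis time once these are stored.
centroids = emotion_centroids(weights, labels)
happy_embedding = style_embedding(centroids["happy"], tokens)
```

In this toy setting, the stored centroid for an emotion stands in for the weights that would otherwise be computed from a reference audio signal, which is the selection problem the abstract aims to avoid.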

Original language: English
Article number: 8778667
Pages (from-to): 1383-1387
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 26
Issue number: 9
Publication status: Published - 2019 Sep

Bibliographical note

Funding Information:
Manuscript received May 26, 2019; revised July 9, 2019; accepted July 13, 2019. Date of publication July 29, 2019; date of current version August 8, 2019. This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2019-0-00447, Development of emotional expression service to support hearing/visually impaired). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Odette Scharenborg. (Corresponding author: Hong-Goo Kang.) O. Kwon and H.-G. Kang are with the Department of Electrical and Electronics, Yonsei University, Seoul 03722, South Korea (e-mail: osungv@dsp.yonsei.ac.kr; hgkang@yonsei.ac.kr).

Publisher Copyright:
© 2019 IEEE.

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Electrical and Electronic Engineering
  • Applied Mathematics
