Speaker-Adaptive Neural Vocoders for Parametric Speech Synthesis Systems

Eunwoo Song, Jin Seob Kim, Kyungguen Byun, Hong Goo Kang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper proposes speaker-adaptive neural vocoders for parametric text-to-speech (TTS) systems. Recently proposed WaveNet-based neural vocoding systems successfully generate a time sequence of speech signal with an autoregressive framework. However, it remains a challenge to synthesize high-quality speech when the amount of a target speaker's training data is insufficient. To generate more natural speech signals with the constraint of limited training data, we propose a speaker adaptation task with an effective variation of neural vocoding models. In the proposed method, a speaker-independent training method is applied to capture universal attributes embedded in multiple speakers, and the trained model is then optimized to represent the specific characteristics of the target speaker. Experimental results verify that the proposed TTS systems with speaker-adaptive neural vocoders outperform those with traditional source-filter model-based vocoders and those with WaveNet vocoders, trained either speaker-dependently or speaker-independently. In particular, our TTS system achieves 3.80 and 3.77 MOS for the Korean male and Korean female speakers, respectively, even though we use only ten minutes' speech corpus for training the model.

Original language: English
Title of host publication: IEEE 22nd International Workshop on Multimedia Signal Processing, MMSP 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728193205
Publication status: Published - 2020 Sep 21
Event: 22nd IEEE International Workshop on Multimedia Signal Processing, MMSP 2020 - Virtual, Tampere, Finland
Duration: 2020 Sep 21 → 2020 Sep 24

Publication series

Name: IEEE 22nd International Workshop on Multimedia Signal Processing, MMSP 2020

Conference

Conference: 22nd IEEE International Workshop on Multimedia Signal Processing, MMSP 2020
Country: Finland
City: Virtual, Tampere
Period: 20/9/21 → 20/9/24

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
