Excitation-by-SampleRNN Model for Text-to-Speech

Kyungguen Byun, Eunwoo Song, Jinseob Kim, Jae Min Kim, Hong-Goo Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we propose a neural vocoder-based text-to-speech (TTS) system that effectively utilizes a source-filter modeling framework. Although neural vocoder algorithms such as SampleRNN and WaveNet are well known to generate high-quality speech, their generation speed is too slow for real-world applications. By decomposing a speech signal into spectral and excitation components based on a source-filter framework, we train the two components separately, i.e., training the spectral (acoustic) parameters with a long short-term memory model and the excitation component with a SampleRNN-based generative model. Unlike a conventional generative model, which must represent the complicated probability distribution of the entire speech waveform, the proposed approach needs to generate only the glottal movement of the human speech production mechanism. Therefore, it is possible to obtain high-quality speech signals using a small pitch interval-oriented SampleRNN network. The objective and subjective test results confirm the superiority of the proposed system over glottal modeling-based parametric and original SampleRNN-based speech synthesis systems.
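The source-filter decomposition the abstract relies on can be illustrated with classical linear predictive (LPC) analysis: the spectral envelope is modeled as an all-pole filter 1/A(z), and the excitation signal is obtained by inverse filtering the speech frame through A(z). The sketch below (NumPy/SciPy) is not the authors' implementation — their system conditions a SampleRNN on acoustic features predicted by an LSTM — but it shows the basic analysis step that separates the two components; the function names and LPC order are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients a = [1, a1, ..., ap] via Levinson-Durbin."""
    # Autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)                 # prediction-error energy shrinks
    return a

def extract_excitation(frame, order=16):
    """Inverse-filter a frame with its LPC envelope A(z) to get the excitation."""
    a = lpc_coefficients(frame, order)
    # e[n] = x[n] + sum_k a_k x[n-k]: the whitened residual (source) signal
    return a, lfilter(a, [1.0], frame)
```

At synthesis time the process is reversed: the generated excitation is passed through the all-pole filter 1/A(z) reconstructed from the predicted acoustic parameters, which is why the neural model only has to capture the comparatively simple residual signal.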

Original language: English
Title of host publication: 34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728132716
DOI: 10.1109/ITC-CSCC.2019.8793459
Publication status: Published - 2019 Jun 1
Event: 34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019 - Jeju, Korea, Republic of
Duration: 2019 Jun 23 to 2019 Jun 26

Publication series

Name: 34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019

Conference

Conference: 34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019
Country: Korea, Republic of
City: Jeju
Period: 19/6/23 to 19/6/26


All Science Journal Classification (ASJC) codes

  • Information Systems
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture

Cite this

Byun, K., Song, E., Kim, J., Kim, J. M., & Kang, H-G. (2019). Excitation-by-SampleRNN Model for Text-to-Speech. In 34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019 [8793459] (34th International Technical Conference on Circuits/Systems, Computers and Communications, ITC-CSCC 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ITC-CSCC.2019.8793459