Efficient deep neural networks for speech synthesis using bottleneck features

Young Sun Joo, Won Suk Jun, Hong-Goo Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a cascading deep neural network (DNN) structure for speech synthesis systems that consists of a text-to-bottleneck (TTB) model and a bottleneck-to-speech (BTS) model. Unlike the conventional single-network structure, which requires a large database to learn the complicated mapping between linguistic and acoustic features, the proposed structure remains effective even when the available training database is small. The bottleneck feature used in the proposed approach represents the characteristics of the linguistic features together with the corresponding average acoustic features of several speakers; it is therefore more efficient to learn a mapping from bottleneck features to acoustic features than to learn a direct mapping from linguistic features to acoustic features. Experimental results show that the learning capability of the proposed structure is much higher than that of conventional structures, and both objective measures and subjective listening tests confirm its superiority.
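The paper itself publishes no reference implementation; the sketch below only illustrates the cascading TTB → BTS idea described in the abstract, written in PyTorch. The module names, layer sizes, and feature dimensions are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn

# Assumed feature sizes; the paper's actual dimensions are not given here.
LINGUISTIC_DIM = 300   # frame-level linguistic (text-derived) features
BOTTLENECK_DIM = 32    # intermediate bottleneck features
ACOUSTIC_DIM = 60      # acoustic features (e.g. spectral parameters)

class TextToBottleneck(nn.Module):
    # TTB model: maps linguistic features to bottleneck features.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LINGUISTIC_DIM, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, BOTTLENECK_DIM),
        )

    def forward(self, x):
        return self.net(x)

class BottleneckToSpeech(nn.Module):
    # BTS model: maps bottleneck features to acoustic features.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BOTTLENECK_DIM, 512), nn.Tanh(),
            nn.Linear(512, ACOUSTIC_DIM),
        )

    def forward(self, b):
        return self.net(b)

# At synthesis time the two models are cascaded: text -> bottleneck -> speech.
ttb, bts = TextToBottleneck(), BottleneckToSpeech()
linguistic = torch.randn(16, LINGUISTIC_DIM)  # dummy batch of 16 frames
acoustic = bts(ttb(linguistic))
print(acoustic.shape)  # torch.Size([16, 60])

Because the two stages are separate modules, the BTS mapping can be trained against bottleneck targets rather than raw linguistic inputs, which is the property the abstract credits for data efficiency on small training databases.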

Original language: English
Title of host publication: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOIs: 10.1109/APSIPA.2016.7820721
Publication status: Published - 2017 Jan 17
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of
Duration: 2016 Dec 13 - 2016 Dec 16

Other

Other: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country: Korea, Republic of
City: Jeju
Period: 16/12/13 - 16/12/16

Fingerprint

Speech synthesis
Acoustics
Linguistics
Deep neural networks

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Science Applications
  • Information Systems
  • Signal Processing

Cite this

Joo, Y. S., Jun, W. S., & Kang, H.-G. (2017). Efficient deep neural networks for speech synthesis using bottleneck features. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 [7820721]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2016.7820721
@inproceedings{565d1a8227c344fa9d5eab60b11e9368,
title = "Efficient deep neural networks for speech synthesis using bottleneck features",
abstract = "This paper proposes a cascading deep neural network (DNN) structure for speech synthesis system that consists of text-to-bottleneck (TTB) and bottleneck-to-speech (BTS) models. Unlike conventional single structure that requires a large database to find complicated mapping rules between linguistic and acoustic features, the proposed structure is very effective even if the available training database is inadequate. The bottle-neck feature utilized in the proposed approach represents the characteristics of linguistic features and its average acoustic features of several speakers. Therefore, it is more efficient to learn a mapping rule between bottleneck and acoustic features than to learn directly a mapping rule between linguistic and acoustic features. Experimental results show that the learning capability of the proposed structure is much higher than that of the conventional structures. Objective and subjective listening test results also verify the superiority of the proposed structure.",
author = "Joo, {Young Sun} and Jun, {Won Suk} and Hong-Goo Kang",
year = "2017",
month = "1",
day = "17",
doi = "10.1109/APSIPA.2016.7820721",
language = "English",
booktitle = "2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}
