Abstract
In this paper, we propose a lightweight end-to-end text-to-speech model that can generate high-quality speech at high speed. In our proposed model, a feature prediction module and a waveform generation module are combined within a single framework. The feature prediction module, which consists of two independent sub-modules, estimates latent space embeddings for input text and prosodic information, and the waveform generation module generates speech waveforms by conditioning on the estimated latent space embeddings. Unlike conventional approaches that estimate prosodic information using a pre-trained model, our model jointly trains the prosodic embedding network with the speech waveform generation task using an effective domain transfer technique. Experimental results show that our proposed model can generate samples 7 times faster than real-time, and about 1.6 times faster than FastSpeech 2, because it uses only 13.4 million parameters. We confirm that the quality of the generated speech remains high, as evaluated by mean opinion scores.
Original language | English |
---|---|
Title of host publication | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 |
Publisher | International Speech Communication Association |
Pages | 3551-3555 |
Number of pages | 5 |
ISBN (Electronic) | 9781713836902 |
Publication status | Published - 2021 |
Event | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic; Duration: 2021 Aug 30 → 2021 Sept 3 |
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
---|---|
Volume | 5 |
ISSN (Print) | 2308-457X |
ISSN (Electronic) | 1990-9772 |
Conference
Conference | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 |
---|---|
Country/Territory | Czech Republic |
City | Brno |
Period | 21/8/30 → 21/9/3 |
Bibliographical note
Funding Information: This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea.
Publisher Copyright: © 2021 ISCA
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation