In this paper, we propose a method to effectively determine representative style embeddings for each emotion class, improving a global style token-based end-to-end speech synthesis system. The emotional expressiveness of the conventional approach was limited because it used only one style representative per emotion. We overcome this limitation by extracting multiple representatives per emotion with a k-means clustering algorithm. Listening test results show that the proposed method clearly expresses each emotion while distinguishing it from the others.
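The core idea above can be sketched in a few lines: group the style embeddings of each emotion class and run k-means within each group, keeping the resulting centroids as that emotion's representatives. The sketch below is illustrative, not the paper's implementation; the embedding dimension, number of clusters, and all function names are assumptions, and plain Lloyd's k-means stands in for whatever clustering variant the authors used.

```python
import numpy as np

def kmeans(embeddings, k, n_iters=50, seed=0):
    """Plain Lloyd's k-means; returns k centroid vectors."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct random samples.
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each embedding to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned embeddings;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            embeddings[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

def emotion_representatives(style_embeddings, emotion_labels, k=3):
    """Map each emotion class to k representative style embeddings (centroids)."""
    emotion_labels = np.asarray(emotion_labels)
    reps = {}
    for emotion in set(emotion_labels.tolist()):
        members = style_embeddings[emotion_labels == emotion]
        reps[emotion] = kmeans(members, min(k, len(members)))
    return reps

# Toy demo: two emotions with 256-dim "style embeddings" (dimension is illustrative).
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(0, 1, (40, 256)), rng.normal(5, 1, (40, 256))])
labels = ["happy"] * 40 + ["sad"] * 40
reps = emotion_representatives(emb, labels, k=3)
print({e: r.shape for e, r in reps.items()})
```

At synthesis time, any of an emotion's k centroids could then be fed to the style token layer in place of the single per-emotion representative used by the conventional approach.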
Publisher Copyright:
© 2019 Acoustical Society of Korea. All rights reserved.
All Science Journal Classification (ASJC) codes
- Acoustics and Ultrasonics
- Applied Mathematics
- Signal Processing
- Speech and Hearing