A joint learning algorithm for complex-valued t-f masks in deep learning-based single-channel speech enhancement systems

Jinkyu Lee, Hong Goo Kang

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)


This paper presents a joint learning algorithm for complex-valued time-frequency (T-F) masks in single-channel speech enhancement systems. Most speech enhancement algorithms operating in a single-channel microphone environment enhance only the magnitude component in the T-F domain and reuse the noisy input phase without any processing. Consequently, the mismatch between the processed magnitude and the unprocessed phase degrades the sound quality. To address this issue, a learning method that targets a T-F mask defined in the complex domain has recently been proposed. However, because the complex-valued T-F mask has a wide dynamic range and an irregular spectrogram pattern, the learning process is difficult even with a large-scale deep learning network. Moreover, a learning process that targets the T-F mask itself does not directly minimize the distortion in the spectral or time domain. To address these concerns, we focus on three issues: 1) an effective estimation of complex numbers with a wide dynamic range; 2) a learning method that is directly related to speech enhancement performance; and 3) a way to resolve the mismatch between the estimated magnitude and phase spectra. In this study, we propose an objective function for each of these issues and train the network by minimizing them in a joint learning framework. The evaluation results demonstrate that the proposed learning algorithm achieves significant performance improvements in various objective measures and in a subjective preference listening test.
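The abstract does not define the mask or the proposed objective functions, so the following is only a minimal sketch of the general idea of a complex-valued T-F mask. It assumes the usual complex ratio mask (clean spectrum divided by noisy spectrum, element-wise) and contrasts it with a magnitude-only mask that reuses the noisy phase. The random arrays stand in for STFT spectrograms, and the `compress` function with its `k` and `c` values is a hypothetical illustration of bounding a wide dynamic range, not the authors' method.

```python
import numpy as np

def complex_ratio_mask(clean_spec, noisy_spec, eps=1e-8):
    """Complex mask M with M * noisy_spec ~= clean_spec (element-wise)."""
    denom = np.abs(noisy_spec) ** 2 + eps
    return clean_spec * np.conj(noisy_spec) / denom

def magnitude_mask(clean_spec, noisy_spec, eps=1e-8):
    """Real-valued mask that corrects magnitude only; phase is left untouched."""
    return np.abs(clean_spec) / (np.abs(noisy_spec) + eps)

def compress(mask, k=10.0, c=0.1):
    """Bound real and imaginary parts to [-k, k] with tanh (illustrative k, c)."""
    return k * np.tanh(c * mask.real) + 1j * k * np.tanh(c * mask.imag)

# Stand-in spectrograms (frequency bins x frames); a real system would use an STFT.
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noisy = clean + noise

# The complex mask restores both magnitude and phase.
cmask = complex_ratio_mask(clean, noisy)
enhanced_complex = cmask * noisy

# The magnitude mask restores magnitude but keeps the noisy phase.
enhanced_mag_only = magnitude_mask(clean, noisy) * noisy

# Compressed training target with a bounded range.
target = compress(cmask)

err_complex = np.max(np.abs(enhanced_complex - clean))
err_mag_only = np.max(np.abs(enhanced_mag_only - clean))
print(f"max error, complex mask:        {err_complex:.2e}")
print(f"max error, magnitude-only mask: {err_mag_only:.2e}")
```

The magnitude-only result matches the clean magnitudes yet still differs from the clean spectrum because its phase is that of the noisy input, which is the mismatch the abstract refers to.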

Original language: English
Article number: 8691424
Pages (from-to): 1098-1109
Number of pages: 12
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Issue number: 6
Publication status: Published - 2019 Jun

Bibliographical note

Funding Information:
Manuscript received October 4, 2018; revised January 19, 2019 and March 11, 2019; accepted April 3, 2019. Date of publication April 15, 2019; date of current version April 24, 2019. This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (MSIT) (No. NRF-2018-11-0337). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Andy W. H. Khong. (Corresponding author: Hong-Goo Kang.) The authors are with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail: lejk25@yonsei.ac.kr; hgkang@yonsei.ac.kr). Digital Object Identifier 10.1109/TASLP.2019.2910638

Publisher Copyright:
© 2014 IEEE.

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

