This letter presents a phase-sensitive joint learning algorithm for single-channel speech enhancement. Although a deep learning framework that estimates time-frequency (T-F) domain ideal ratio masks demonstrates strong performance, it is limited in that enhancement is performed only in the magnitude domain, while the phase spectra remain unchanged. Recent studies have therefore sought to involve phase spectra in speech enhancement systems. A phase-sensitive mask (PSM) is a T-F mask that implicitly represents phase-related information. However, since the PSM is unbounded, networks are trained to target its truncated values rather than to estimate it directly. To train the PSM effectively, we first approximate it so that it has a bounded dynamic range, under the assumption that speech and noise are uncorrelated. We then propose a joint learning algorithm that trains the approximated value through its parameterized variables in order to minimize the inevitable error caused by the truncation process. Specifically, we design a network that explicitly targets three parameterized variables: 1) speech magnitude spectra; 2) noise magnitude spectra; and 3) the phase difference between clean and noisy spectra. To further improve performance, we also investigate how the dynamic range of the magnitude spectra, controlled by a warping function, affects the final performance of joint learning algorithms. Finally, we examine how the proposed additional constraint, which preserves the sum of the estimated speech and noise power spectra, affects overall system performance. The experimental results show that the proposed learning algorithm outperforms the conventional learning algorithm with the truncated phase-sensitive approximation.
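The difference between the unbounded PSM, its truncated version, and a bounded approximation can be illustrated with a small NumPy sketch. The abstract does not give the formulas, so the sketch assumes the commonly used PSM definition, |S|/|Y| · cos(θ_S − θ_Y), and assumes that the uncorrelatedness assumption is applied as |Y|² ≈ |S|² + |N|²; the function names, the truncation range [0, 1], and the random test spectra are all hypothetical and are not taken from the letter.

```python
import numpy as np


def phase_sensitive_mask(S, Y, eps=1e-8):
    """Exact PSM (assumed definition): |S|/|Y| * cos(theta_S - theta_Y).

    Computed as Re(S * conj(Y)) / |Y|^2, which is algebraically identical.
    The value is unbounded because |Y| can be much smaller than |S|.
    """
    return np.real(S * np.conj(Y)) / (np.abs(Y) ** 2 + eps)


def truncated_psm(S, Y, lo=0.0, hi=1.0):
    """Conventional training target: the PSM clipped to [lo, hi] (range assumed)."""
    return np.clip(phase_sensitive_mask(S, Y), lo, hi)


def approximated_psm(speech_mag, noise_mag, phase_diff, eps=1e-8):
    """Bounded approximation (assumption): replace |Y|^2 by |S|^2 + |N|^2.

    Built from the three variables named in the abstract: speech magnitude,
    noise magnitude, and the clean-to-noisy phase difference.
    """
    ratio = speech_mag / np.sqrt(speech_mag ** 2 + noise_mag ** 2 + eps)
    return ratio * np.cos(phase_diff)


# Hypothetical complex spectra (frequency bins x frames) for illustration only.
rng = np.random.default_rng(0)
shape = (257, 100)
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)  # clean speech
N = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)  # noise
Y = S + N                                                         # noisy mixture

psm = phase_sensitive_mask(S, Y)
psm_trunc = truncated_psm(S, Y)
phase_diff = np.angle(S) - np.angle(Y)
psm_approx = approximated_psm(np.abs(S), np.abs(N), phase_diff)

print("exact PSM range:       ", psm.min(), psm.max())
print("truncated PSM range:   ", psm_trunc.min(), psm_trunc.max())
print("approximated PSM range:", psm_approx.min(), psm_approx.max())
```

Under these assumptions the approximated mask always lies in [−1, 1], because both the magnitude ratio and the cosine are bounded, whereas the exact PSM has to be clipped before it can serve as a bounded training target.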
Bibliographical note
Funding Information:
Manuscript received February 21, 2018; revised May 28, 2018; accepted June 13, 2018. Date of publication June 21, 2018; date of current version July 13, 2018. This work was supported by Google Inc. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hakan Erdogan. (Corresponding author: Hong-Goo Kang.) J. Lee is with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail: email@example.com). J. Skoglund and T. Shabestary are with Google, Inc., Mountain View, CA 94043 USA (e-mail: firstname.lastname@example.org; email@example.com). H.-G. Kang is with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail: hgkang@yonsei.ac.kr). This letter has supplementary downloadable material available at http://ieeexplore.ieee.org. Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2018.2849578
© 1994-2012 IEEE.
All Science Journal Classification (ASJC) codes
- Signal Processing
- Electrical and Electronic Engineering
- Applied Mathematics