Phase-sensitive joint learning algorithms for deep learning-based speech enhancement

Jinkyu Lee, Jan Skoglund, Turaj Shabestary, Hong Goo Kang

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

This letter presents a phase-sensitive joint learning algorithm for single-channel speech enhancement. Although a deep learning framework that estimates time-frequency (T-F) domain ideal ratio masks demonstrates strong performance, it is limited in that the enhancement process is performed only in the magnitude domain, while the phase spectra remain unchanged. Thus, recent studies have sought to incorporate phase spectra into speech enhancement systems. A phase-sensitive mask (PSM) is a T-F mask that implicitly represents phase-related information. However, since the PSM is unbounded, networks are trained to target its truncated values rather than estimating it directly. To train the PSM effectively, we first approximate it to have a bounded dynamic range under the assumption that speech and noise are uncorrelated. We then propose a joint learning algorithm that trains the approximated value through its parameterized variables to minimize the inevitable error caused by the truncation process. Specifically, we design a network that explicitly targets three parameterized variables: 1) speech magnitude spectra; 2) noise magnitude spectra; and 3) the phase difference between the clean and noisy spectra. To further improve performance, we also investigate how the dynamic range of the magnitude spectra, controlled by a warping function, affects the final performance of joint learning algorithms. Finally, we examine how an additional constraint that preserves the sum of the estimated speech and noise power spectra affects overall system performance. Experimental results show that the proposed learning algorithm outperforms the conventional learning algorithm with the truncated phase-sensitive approximation.
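The masking idea described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the letter's exact formulation: the truncation bounds, the particular bounded form, and the helper names (`phase_sensitive_mask`, `bounded_psm_approx`) are assumptions for illustration. The truncated PSM follows the common definition Re(S/Y); the bounded variant uses the uncorrelatedness assumption |Y|² ≈ |S|² + |N|², which ties the mask to the three targets the abstract names (speech magnitude, noise magnitude, phase difference).

```python
import numpy as np

def phase_sensitive_mask(S, Y, lo=0.0, hi=1.0):
    """Truncated phase-sensitive mask (PSM).

    PSM = Re(S / Y) = |S| / |Y| * cos(theta), where theta is the phase
    difference between the clean (S) and noisy (Y) spectra. The raw
    value is unbounded, so it is clipped to [lo, hi] before training.
    """
    theta = np.angle(S) - np.angle(Y)
    psm = np.abs(S) / np.maximum(np.abs(Y), 1e-8) * np.cos(theta)
    return np.clip(psm, lo, hi)

def bounded_psm_approx(S_mag, N_mag, theta):
    """Bounded approximation of the PSM.

    Assuming speech and noise are uncorrelated, |Y|^2 ~= |S|^2 + |N|^2,
    so |S| cos(theta) / sqrt(|S|^2 + |N|^2) lies in [-1, 1] and is
    parameterized by speech magnitude, noise magnitude, and the
    clean-to-noisy phase difference.
    """
    return S_mag * np.cos(theta) / np.sqrt(S_mag**2 + N_mag**2 + 1e-12)
```

In a joint learning setup along these lines, a network would predict the three parameterized variables and the bounded mask would be composed from them, so the loss can be applied without the hard truncation step.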

Original language: English
Pages (from-to): 1276-1280
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 25
Issue number: 8
DOI: 10.1109/LSP.2018.2849578
Publication status: Published - Aug 2018


All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Electrical and Electronic Engineering
  • Applied Mathematics

Cite this

Lee, Jinkyu ; Skoglund, Jan ; Shabestary, Turaj ; Kang, Hong Goo. / Phase-sensitive joint learning algorithms for deep learning-based speech enhancement. In: IEEE Signal Processing Letters. 2018 ; Vol. 25, No. 8. pp. 1276-1280.
@article{6e656aec1e264a2fa8085a67f3ab5660,
title = "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement",
author = "Lee, Jinkyu and Skoglund, Jan and Shabestary, Turaj and Kang, {Hong Goo}",
year = "2018",
month = "8",
doi = "10.1109/LSP.2018.2849578",
language = "English",
volume = "25",
pages = "1276--1280",
journal = "IEEE Signal Processing Letters",
issn = "1070-9908",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "8",

}


TY - JOUR

T1 - Phase-sensitive joint learning algorithms for deep learning-based speech enhancement

AU - Lee, Jinkyu

AU - Skoglund, Jan

AU - Shabestary, Turaj

AU - Kang, Hong Goo

PY - 2018/8

Y1 - 2018/8

N2 - This letter presents a phase-sensitive joint learning algorithm for single-channel speech enhancement. Although a deep learning framework that estimates the time-frequency (T-F) domain ideal ratio masks demonstrates a strong performance, it is limited in the sense that the enhancement process is performed only in the magnitude domain, while the phase spectra remain unchanged. Thus, recent studies have been conducted to involve phase spectra in speech enhancement systems. A phase-sensitive mask (PSM) is a T-F mask that implicitly represents phase-related information. However, since the PSM has an unbounded value, the networks are trained to target its truncated values rather than directly estimating it. To effectively train the PSM, we first approximate it to have a bounded dynamic range under the assumption that speech and noise are uncorrelated. We then propose a joint learning algorithm that trains the approximated value through its parameterized variables in order to minimize the inevitable error caused by the truncation process. Specifically, we design a network that explicitly targets three parameterized variables: 1) speech magnitude spectra; 2) noise magnitude spectra; and 3) phase difference of clean to noisy spectra. To further improve the performance, we also investigate how the dynamic range of magnitude spectra controlled by a warping function affects the final performance in joint learning algorithms. Finally, we examined how the proposed additional constraint that preserves the sum of the estimated speech and noise power spectra affects the overall system performance. The experimental results show that the proposed learning algorithm outperforms the conventional learning algorithm with the truncated phase-sensitive approximation.

UR - http://www.scopus.com/inward/record.url?scp=85048874563&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85048874563&partnerID=8YFLogxK

U2 - 10.1109/LSP.2018.2849578

DO - 10.1109/LSP.2018.2849578

M3 - Article

AN - SCOPUS:85048874563

VL - 25

SP - 1276

EP - 1280

JO - IEEE Signal Processing Letters

JF - IEEE Signal Processing Letters

SN - 1070-9908

IS - 8

ER -