Multi-objective long-short term memory recurrent neural networks for speech enhancement

Saleem, Nasir; Khattak, Muhammad Irfan; Al-Hasan, Mu’ath; Jan, Atif

doi:10.1007/s12652-020-02598-4

Original Research
Published: 16 October 2020

Volume 12, pages 9037–9052 (2021)

Journal of Ambient Intelligence and Humanized Computing

Abstract

Speech-in-noise perception is an important research problem in many real-world multimedia applications. Conventional noise-reduction methods have contributed significantly; however, they rely on a priori information about the noise signals. Deep learning approaches have been developed to enhance speech signals in nonstationary noisy backgrounds, and their benefits are evaluated in terms of perceived speech quality and intelligibility. In this paper, a multi-objective speech enhancement method based on the Long-Short Term Memory (LSTM) recurrent neural network (RNN) is proposed to simultaneously estimate the magnitude and phase spectra of clean speech. During training, the noisy phase spectrum is incorporated as a target, and the unstructured phase spectrum is transformed to its derivative, which has an identical structure to the corresponding magnitude spectrum. Critical Band Importance Functions (CBIFs) are used in the training process to further improve the network performance. The results show that the proposed multi-objective LSTM (MO-LSTM) outperformed the standard magnitude-aware LSTM (MA-LSTM), magnitude-aware DNN (MA-DNN), phase-aware DNN (PA-DNN), magnitude-aware GNN (MA-GNN) and magnitude-aware CNN (MA-CNN). Moreover, the proposed speech enhancement considerably improved speech quality, intelligibility, noise reduction and automatic speech recognition in changing noisy backgrounds, as confirmed by an ANalysis Of VAriance (ANOVA) statistical analysis.
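The two ideas in the abstract that lend themselves to a small illustration are the phase derivative and the importance-weighted training objective. The following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes that the "derivative" is the frame-to-frame phase difference with the expected phase advance removed (one common definition of instantaneous frequency deviation), and that the loss is a weighted sum of squared errors. The function names, the frame length and hop size, the mixing factor `alpha` and the `band_weights` vector are all hypothetical; in particular, `band_weights` is only a placeholder for the CBIFs, whose actual values are not given in the abstract.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def phase_time_derivative(phase, frame_len=512, hop=256):
    """Frame-to-frame phase difference minus the advance expected at each
    bin's centre frequency, wrapped to [-pi, pi). Assumed definition."""
    n_bins = phase.shape[1]
    expected = 2 * np.pi * hop * np.arange(n_bins) / frame_len
    diff = np.diff(phase, axis=0) - expected
    return (diff + np.pi) % (2 * np.pi) - np.pi

def multi_objective_loss(mag_est, mag_ref, dphi_est, dphi_ref,
                         band_weights, alpha=0.5):
    """Weighted squared error on magnitude and on phase derivative.
    band_weights (one value per frequency bin) is a hypothetical stand-in
    for the CBIFs; alpha is a hypothetical mixing factor."""
    mag_err = band_weights * (mag_est - mag_ref) ** 2
    phase_err = band_weights * (dphi_est - dphi_ref) ** 2
    return alpha * mag_err.mean() + (1 - alpha) * phase_err.mean()

# Usage: a stationary tone centred on bin 32 has no frequency deviation there.
n = np.arange(4096)
tone = np.cos(2 * np.pi * 32 * n / 512)
spec = stft(tone)
dphi = phase_time_derivative(np.angle(spec))
print(np.abs(dphi[:, 32]).max())
```

Under these assumptions the raw phase, which wraps and looks unstructured from frame to frame, becomes a quantity that stays near zero wherever a stable spectral component is present, so it can serve as a regression target alongside the magnitude.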

Funding

This study was supported by Abu-Dhabi Department of Education and Knowledge (ADEK) Award for Research Excellence 2019 (Grant AARE19-245).

Author information

Authors and Affiliations

Department of Electrical Engineering, University of Engineering and Technology, Peshawar, 25000, KPK, Pakistan
Nasir Saleem, Muhammad Irfan Khattak & Atif Jan
Department of Electrical Engineering, FET, Gomal University, Dera Ismail Khan, 29050, KPK, Pakistan
Nasir Saleem
College of Engineering, Al Ain University, Al Ain, United Arab Emirates
Mu’ath Al-Hasan

Corresponding author

Correspondence to Nasir Saleem.

About this article

Cite this article

Saleem, N., Khattak, M.I., Al-Hasan, M. et al. Multi-objective long-short term memory recurrent neural networks for speech enhancement. J Ambient Intell Human Comput 12, 9037–9052 (2021). https://doi.org/10.1007/s12652-020-02598-4

Received: 25 July 2020
Accepted: 03 October 2020
Published: 16 October 2020
Issue Date: October 2021
DOI: https://doi.org/10.1007/s12652-020-02598-4