Multi-objective long-short term memory recurrent neural networks for speech enhancement

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Speech-in-noise perception is an important research problem in many real-world multimedia applications. Noise-reduction methods have contributed significantly to this problem; however, they rely on a priori information about the noise signals. Deep learning approaches have been developed to enhance speech signals in nonstationary noisy backgrounds, and their benefits have been evaluated in terms of perceived speech quality and intelligibility. In this paper, a multi-objective speech enhancement method based on the Long-Short Term Memory (LSTM) recurrent neural network (RNN) is proposed to simultaneously estimate the magnitude and phase spectra of clean speech. During training, the noisy phase spectrum is incorporated as a target, and the unstructured phase spectrum is transformed to its derivative, which has a structure identical to that of the corresponding magnitude spectrum. Critical Band Importance Functions (CBIFs) are used in the training process to further improve network performance. The results verify that the proposed multi-objective LSTM (MO-LSTM) outperforms the standard magnitude-aware LSTM (MA-LSTM), magnitude-aware DNN (MA-DNN), phase-aware DNN (PA-DNN), magnitude-aware GNN (MA-GNN) and magnitude-aware CNN (MA-CNN). Moreover, the proposed speech enhancement method considerably improves speech quality, intelligibility, noise reduction and automatic speech recognition in changing noisy backgrounds, as confirmed by an ANalysis Of VAriance (ANOVA) statistical analysis.
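
The phase-derivative idea and the weighted multi-objective training target described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which phase derivative is used, how the two objectives are combined, or how the CBIFs enter the loss. The sketch assumes a frame-to-frame (time) derivative with the expected per-bin phase advance removed, a convex combination controlled by a hypothetical parameter `alpha`, and a per-frequency weight vector `band_weights` standing in for CBIF values (set to ones here, not real CBIF data). All function names are illustrative.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Hann-windowed STFT; returns shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def wrap_phase(p):
    """Wrap angles to the principal interval [-pi, pi)."""
    return (p + np.pi) % (2 * np.pi) - np.pi

def phase_time_derivative(spec, frame_len=256, hop=128):
    """Frame-to-frame phase difference minus the expected bin advance.

    Assumption: this is one possible 'derivative of the phase spectrum';
    the result has shape (frames - 1, bins).
    """
    phase = np.angle(spec)
    k = np.arange(spec.shape[1])
    expected = 2 * np.pi * k * hop / frame_len  # advance of a bin-centre sinusoid
    return wrap_phase(np.diff(phase, axis=0) - expected)

def multi_objective_loss(mag_est, mag_ref, dphi_est, dphi_ref,
                         band_weights, alpha=0.5):
    """Weighted squared error on magnitude and phase derivative.

    band_weights is a hypothetical stand-in for CBIF weights, one value
    per frequency bin; alpha balances the two objectives (assumption).
    """
    mag_err = np.mean(band_weights * (mag_est - mag_ref) ** 2)
    phase_err = np.mean(band_weights * wrap_phase(dphi_est - dphi_ref) ** 2)
    return alpha * mag_err + (1 - alpha) * phase_err

# Usage: a sinusoid centred on bin 16 has a near-zero phase derivative there.
x = np.cos(2 * np.pi * 16 * np.arange(2048) / 256)
spec = stft(x)
dphi = phase_time_derivative(spec)
mag = np.abs(spec)[1:]          # drop first frame to align with dphi
weights = np.ones(spec.shape[1])  # placeholder, not real CBIF values
loss = multi_objective_loss(mag, mag, dphi, dphi, weights)
```

Removing the expected phase advance is what gives the derivative a structure that follows the spectral peaks: the raw phase looks close to random across time-frequency bins, whereas the derivative is near zero wherever a stable sinusoidal component sits on a bin centre.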

Funding

This study was supported by Abu-Dhabi Department of Education and Knowledge (ADEK) Award for Research Excellence 2019 (Grant AARE19-245).

Author information

Corresponding author

Correspondence to Nasir Saleem.

About this article

Cite this article

Saleem, N., Khattak, M.I., Al-Hasan, M. et al. Multi-objective long-short term memory recurrent neural networks for speech enhancement. J Ambient Intell Human Comput 12, 9037–9052 (2021). https://doi.org/10.1007/s12652-020-02598-4

  • DOI: https://doi.org/10.1007/s12652-020-02598-4
