
Deep Learning-Based Empirical and Sub-Space Decomposition for Speech Enhancement


Abstract

This research presents a single-channel speech enhancement approach that combines the adaptive empirical wavelet transform with an improved sub-space decomposition method, followed by a deep learning network. The adaptive empirical wavelet transform is first used to determine the segment boundaries. The spectrogram of the noisy speech is then decomposed into three sub-spaces to recover the low-rank and sparse matrices of the spectrogram under the perturbation of a residual matrix, and residual noise that degrades speech quality is suppressed through low-rank decomposition using nonnegative factorization. Finally, a cross-domain learning framework is developed to capture correlations along the frequency and time axes and to mitigate the drawbacks of operating in the time–frequency domain alone. Experimental results show that the proposed approach outperforms several competing speech enhancement methods and achieves the highest PESQ, Cov, and STOI scores under different noise types and at low SNR values on both datasets. The deep learning model is also deployed on a manually designed hardware architecture to accelerate its execution on an FPGA.
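To make the sub-space step concrete, the sketch below implements the classic low-rank-plus-sparse decomposition of a spectrogram via the inexact augmented Lagrange multiplier method, i.e., the standard robust PCA baseline. It is a minimal illustration under stated assumptions, not the paper's method: the improved three-sub-space variant with a residual matrix and nonnegative factorization is not reproduced here, and the sparsity weight `lam`, step size `mu`, and tolerance are conventional defaults rather than the authors' settings.

```python
import numpy as np

def shrink(X, tau):
    # Soft-thresholding: proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, max_iter=500, tol=1e-7):
    """Decompose M into a low-rank part L and a sparse part S
    (M = L + S) via the inexact augmented Lagrange multiplier method."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # standard sparsity weight
    mu = m * n / (4.0 * np.abs(M).sum())  # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # Lagrange multiplier matrix
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S                     # constraint violation
        Y = Y + mu * R
        if np.linalg.norm(R, 'fro') / norm_M < tol:
            break
    return L, S

# Example: decompose a random "magnitude spectrogram" of 257 bins x 200 frames.
M = np.abs(np.random.randn(257, 200))
L, S = rpca(M)
```

Applied to the magnitude spectrogram of noisy speech, the low-rank component `L` tends to capture quasi-stationary noise while the sparse component `S` retains the speech structure; this is the intuition that the improved decomposition builds on.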


Data Availability

The data, namely the TIMIT corpus, the NOISEX-92 dataset, and the Voice Bank dataset, are publicly available.

Code Availability

The code will be made available on GitHub and upon request.



Funding

Not applicable.

Author information


Contributions

A new empirical mode decomposition followed by an improved sub-space decomposition is proposed for speech enhancement. A learning procedure is then applied to capture the correlations along the frequency and time axes and to mitigate the drawbacks of operating in the time–frequency domain alone. An objective evaluation of the quality and intelligibility of the enhanced speech shows that the proposed approach outperforms the compared methods.
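As a rough illustration of such a learning procedure, the sketch below interleaves two bidirectional GRUs over a spectrogram-shaped feature tensor, one scanning the frequency axis and one scanning the time axis, in the spirit of dual-path recurrent designs. This is a hypothetical stand-in, not the authors' network: the module name `CrossDomainBlock`, the GRU widths, and the residual wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossDomainBlock(nn.Module):
    """Hypothetical sketch: model correlations along both spectrogram axes
    with two bidirectional GRUs, one over frequency and one over time.
    Layer sizes are illustrative, not the authors' configuration."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.freq_rnn = nn.GRU(channels, hidden, batch_first=True,
                               bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.GRU(channels, hidden, batch_first=True,
                               bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        # Scan the frequency axis independently for every time frame.
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        y = self.freq_proj(self.freq_rnn(y)[0])
        y = y.reshape(b, t, f, c)
        # Scan the time axis independently for every frequency bin.
        z = y.permute(0, 2, 1, 3).reshape(b * f, t, c)
        z = self.time_proj(self.time_rnn(z)[0])
        z = z.reshape(b, f, t, c).permute(0, 3, 2, 1)
        # Residual connection keeps the original features accessible.
        return x + z

# Example: a batch of 2 feature maps with 16 channels, 100 frames, 64 bins.
block = CrossDomainBlock(channels=16)
out = block(torch.randn(2, 16, 100, 64))  # output has the input's shape
```

Stacking a few such blocks lets features gathered along one axis inform the scan along the other, which is one common way to realize cross-domain (time and frequency) modeling.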

Corresponding author

Correspondence to Khaoula Mraihi.

Ethics declarations

Conflict of interest

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mraihi, K., Ben Messaoud, M.A. Deep Learning-Based Empirical and Sub-Space Decomposition for Speech Enhancement. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02606-4

