Abstract
This paper presents a single-channel speech enhancement approach that combines an adaptive empirical wavelet transform with an improved sub-space decomposition method, followed by a deep learning network. The adaptive empirical wavelet transform first determines the boundaries of the spectral segments; the spectrogram of the noisy speech is then decomposed into three sub-spaces to recover its low-rank and sparse matrices under the perturbation of a residual matrix. Residual noise that degrades speech quality is suppressed by a low-rank decomposition based on nonnegative matrix factorization. A cross-domain learning framework is then developed to capture the correlations along the frequency and time axes and to avoid the drawbacks of operating purely in the time–frequency domain. Experimental results show that the proposed approach outperforms several competing speech enhancement methods, achieving the highest PESQ, Cov and STOI scores under different noise types and at low SNR values on both datasets. Finally, the proposed deep learning model is accelerated on an FPGA through a manually designed hardware architecture.
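The low-rank/sparse split of the noisy spectrogram described in the abstract can be illustrated with a minimal robust-PCA sketch using the inexact augmented Lagrangian method of Lin et al.; the function names, parameter choices, and stopping rule below are illustrative, not the authors' implementation. Applied to a magnitude spectrogram `X`, `L` captures the quasi-stationary (noise-dominated) component and `S` the sparse speech activity, with `R = X - L - S` playing the role of the residual (perturbation) matrix.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Element-wise soft thresholding (promotes a sparse matrix)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca(X, lam=None, tol=1e-7, max_iter=300, rho=1.5):
    """Split X into L (low rank) + S (sparse) via inexact ALM (Lin et al.)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(X, 2)                          # penalty, grown each sweep
    Y = X / max(np.linalg.norm(X, 2), np.abs(X).max() / lam)  # dual variable init
    S = np.zeros_like(X)
    for _ in range(max_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)      # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)   # sparse update
        R = X - L - S                          # residual (perturbation) matrix
        Y += mu * R                            # dual ascent step
        mu *= rho
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, S
```

On a synthetic low-rank-plus-spikes matrix this recovers both components to high accuracy; for speech, `X` would be the noisy magnitude spectrogram and the sparse part would feed the subsequent enhancement stages.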
Data Availability
The data, namely the TIMIT corpus, the NOISEX-92 dataset and Voice Bank dataset, are publicly available.
Code Availability
Code will be made available on GitHub and upon request.
References
J.P. Amezquita-Sanchez, H. Adeli, A new music-empirical wavelet transform methodology for time–frequency analysis of noisy nonlinear and non-stationary signals. Digit. Signal Process. 45, 55–68 (2015)
H. Avetisyan, J. Holub, Subjective speech quality measurement with and without parallel task: laboratory test results. PLoS ONE 13, e0199787 (2018)
M.A. Ben Messaoud, A. Bouzid, Sparse representations for single channel speech enhancement based on voiced/unvoiced classification. Circuits Syst. Signal Process. 36, 1912–1933 (2017)
S.M. Bhuiyan, R.R. Adhami, J.F. Khan, Fast and adaptive bidimensional empirical mode decomposition using order-statistics filter based envelope estimation. EURASIP J. Adv. Signal Process. 2008(1), 728356 (2008)
S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58, 11–37 (2011)
I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61 (SIAM, 1992)
A. Gabbay, A. Ephrat, T. Halperin, S. Peleg, Seeing through noise: visually driven speaker separation and enhancement, in Computer Vision and Pattern Recognition, arXiv:1708.06767 (2018)
J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, TIMIT acoustic-phonetic continuous speech corpus, in Linguistic Data Consortium, p. 11 (1992)
J. Gilles, Empirical wavelet transform. IEEE Trans. Signal Process. 61(16), 3999–4010 (2013)
J. Gilles, G. Tran, S. Osher, 2D empirical transforms. Wavelets, ridgelets, and curvelets revisited. SIAM J. Imaging Sci. 7(1), 157–186 (2014)
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
L. He, M. Lech, N.C. Maddage, N.B. Allen, Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech. Biomed. Signal Process. Control 6(2), 139–146 (2011)
Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, L. Xie, DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement, in Proc. Interspeech 2020 (2020)
P.S. Huang, S.D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, in Proc. ICASSP 2012 (2012)
N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, H.H. Shih, Q. Zheng, N.C. Yen, C.C. Tung, H.H. Liu, The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, presented at the Proceedings of the Royal Society of London A: mathematical, physical and engineering sciences, vol. 454, pp. 903–995 (1998)
M.T. Islam, C. Shahnaz, W. Zhu, M.O. Ahmad, Speech enhancement based on Student's t modeling of Teager-energy-operated perceptual wavelet packet coefficients and a custom thresholding function. IEEE Trans. Audio Speech Lang. Process. 23, 1800–1811 (2015)
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud, A recurrent variational autoencoder for speech enhancement, in Proc. ICASSP 2020 (2020)
C. Li, J. Shi, W. Zhang, ESPnet-SE: End-To-End speech enhancement and separation toolkit designed for ASR integration, in IEEE Spoken Language Technology Workshop (SLT’21), (2021)
Z. Lin, M. Chen, L. Wu, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, arXiv:1009.5055 (2010)
H. Liu, W. Wang, L. Xue, J. Yang, Z. Wang, C. Hua, Speech enhancement based on discrete wavelet packet transform and Itakura–Saito nonnegative matrix factorisation. Arch. Acoust. 45(4), 565–572 (2020)
P.C. Loizou, Speech enhancement: theory and practice (CRC Press, 2013)
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Y. Ma, Y. Cao, S. Vrudhula, J. Seo, End-to-end scalable FPGA accelerator for deep residual networks, in IEEE International Symposium On Circuits and Systems: ISCAS, (2017)
Y. Ma, Y. Cao, S. Vrudhula, J. Seo, Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks, in ACM International Symposium On Field programmable Gate Arrays: FPGA, (2017)
Y. Ma, N. Suda, Y. Cao, J. Seo, S. Vrudhula, Scalable and modularized RTL compilation of convolutional neural networks onto FPGA, in IEEE International Conference on Field Programmable Logic and Applications: FPL, (2016)
N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2151 (2013)
A. Pandey, D. Wang, TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain, ICASSP (2019)
H. Phan, I.V. McLoughlin, L. Pham, O.Y. Chen, P. Koch, M. De Vos, A. Mertins, Improving GANs for speech enhancement. IEEE Signal Process. Lett. 27, 1700–1704 (2020)
M.F. Sahin, A. Eftekhari, A. Alacaoglu, F. Latorre, V. Cevher, An inexact augmented Lagrangian framework for nonconvex optimization with nonlinear constraints, arXiv preprint (2019)
N. Srinivas, G. Pradhan, P. Kishore-Kumar, A classification-based non-local means adaptive filtering for speech enhancement and its FPGA prototype. Circuits Syst. Signal Process. 39, 2489–2506 (2020)
C. Sun, J. Xie, Y. Leng, Signal subspace speech enhancement approach based on joint low-rank and sparse matrix decomposition. Arch. Acoust. 41, 245–254 (2016)
K. Tan, D. Wang, A convolutional recurrent neural network for real-time speech enhancement (Interspeech, 2018)
K. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific J. Optim. 6, 615–640 (2010)
C. Valentini-Botinhao, X. Wang, S. Takaki, J. Yamagishi, Investigating RNN-based speech enhancement methods for noise-robust text-to-speech, in Proc. ISCA Speech Synthesis Workshop (SSW), pp. 146–152 (2016)
A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12, 247–251 (1993)
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, presented at 31st Conference on Neural Information Processing Systems, pp. 5998–6008 (2017)
D. Wang, Two-speaker voiced/unvoiced decision for monaural speech. Circuits Syst. Signal Process. 39, 4399–4415 (2020)
D. Yin, C. Luo, Z. Xiong, W. Zeng, PHASEN: a phase-and-harmonics-aware speech enhancement network, arXiv:1911.04697 (2019)
H. Yue, F. Li, H. Li, C. Liu, An enhanced empirical wavelet transform for noisy and non-stationary signal processing. Digit. Signal Process. 60, 220–229 (2017)
Z. Zhao, H. Liu, T. Fingscheidt, Convolutional neural networks to enhance coded speech. IEEE/ACM Trans. Audio Speech Lang. Process. 27(4), 663–678 (2019)
Funding
Not applicable.
Author information
Contributions
An adaptive empirical wavelet decomposition followed by an improved sub-space decomposition is proposed for speech enhancement. A learning procedure is then applied to capture the correlations along the frequency and time axes and to avoid the drawbacks of the time–frequency domain. An objective evaluation of the quality and intelligibility of the enhanced speech shows that the proposed approach outperforms the compared methods.
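The boundary-detection step of the empirical wavelet stage can be sketched as follows. This is a simplified version of Gilles' detection rule (retain the largest spectral maxima and place boundaries midway between consecutive maxima), not the adaptive variant used in the paper; the function name and the `n_bands` parameter are illustrative.

```python
import numpy as np

def ewt_boundaries(x, n_bands=4):
    """Rough EWT segment boundaries on [0, pi]: keep the n_bands largest local
    maxima of the magnitude spectrum and put one boundary midway between each
    pair of consecutive retained maxima (simplified Gilles-style rule)."""
    mag = np.abs(np.fft.rfft(x))
    # indices of local maxima of the half-spectrum
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] >= mag[k - 1] and mag[k] > mag[k + 1]]
    # retain the strongest n_bands peaks, then sort them by frequency
    peaks = sorted(sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_bands])
    # boundaries at midpoints between consecutive maxima, mapped to [0, pi]
    mids = [(peaks[i] + peaks[i + 1]) / 2.0 for i in range(len(peaks) - 1)]
    return [np.pi * m / (len(mag) - 1) for m in mids]
```

For a signal with two well-separated tones, this returns a single boundary roughly midway between them; each resulting segment would then define the support of one empirical wavelet filter.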
Ethics declarations
Conflict of interest
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mraihi, K., Ben Messaoud, M.A. Deep Learning-Based Empirical and Sub-Space Decomposition for Speech Enhancement. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02606-4