Abstract
This paper presents a single-channel speech enhancement approach that combines an adaptive empirical wavelet transform with an improved sub-space decomposition method, followed by a deep learning network. The adaptive empirical wavelet transform first determines the boundaries of the spectral segments; the spectrogram of the noisy speech is then decomposed into three sub-spaces to recover its low-rank and sparse matrices under the perturbation of a residual matrix. Residual noise that degrades speech quality is suppressed by a low-rank decomposition based on nonnegative matrix factorization. A cross-domain learning framework is then developed to capture the correlations along the frequency and time axes and to avoid the drawbacks of operating purely in the time–frequency domain. Experimental results show that the proposed approach outperforms several competing speech enhancement methods, achieving the highest PESQ, Cov and STOI scores under different noise types and at low SNR values on both datasets. Finally, the proposed deep learning model is accelerated on an FPGA through a manually designed hardware architecture.
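The low-rank/sparse split of the noisy spectrogram described in the abstract can be illustrated with a minimal robust-PCA sketch using the inexact augmented Lagrangian method of Lin et al.; the function names, parameter choices, and stopping rule below are illustrative, not the authors' implementation. Applied to a magnitude spectrogram `X`, `L` captures the quasi-stationary (noise-dominated) component and `S` the sparse speech activity, with `R = X - L - S` playing the role of the residual (perturbation) matrix.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Element-wise soft thresholding (promotes a sparse matrix)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca(X, lam=None, tol=1e-7, max_iter=300, rho=1.5):
    """Split X into L (low rank) + S (sparse) via inexact ALM (Lin et al.)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(X, 2)                          # penalty, grown each sweep
    Y = X / max(np.linalg.norm(X, 2), np.abs(X).max() / lam)  # dual variable init
    S = np.zeros_like(X)
    for _ in range(max_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)      # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)   # sparse update
        R = X - L - S                          # residual (perturbation) matrix
        Y += mu * R                            # dual ascent step
        mu *= rho
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, S
```

On a synthetic low-rank-plus-spikes matrix this recovers both components to high accuracy; for speech, `X` would be the noisy magnitude spectrogram and the sparse part would feed the subsequent enhancement stages.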
Data Availability
The data, namely the TIMIT corpus, the NOISEX-92 dataset and Voice Bank dataset, are publicly available.
Code Availability
Code will be made available on GitHub and upon request.
References
J.P. Amezquita-Sanchez, H. Adeli, A new music-empirical wavelet transform methodology for time–frequency analysis of noisy nonlinear and non-stationary signals. Digit. Signal Process. 45, 55–68 (2015)
H. Avetisyan, J. Holub, Subjective speech quality measurement with and without parallel task: laboratory test results. PLoS ONE 13, e0199787 (2018)
M.A. Ben Messaoud, A. Bouzid, Sparse representations for single channel speech enhancement based on voiced/unvoiced classification. Circuits Syst. Signal Process. 36, 1912–1933 (2017)
S.M. Bhuiyan, R.R. Adhami, J.F. Khan, Fast and adaptive bidimensional empirical mode decomposition using order-statistics filter based envelope estimation. EURASIP J. Adv. Signal Process. 2008(1), 728356 (2008)
S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58, 11–37 (2011)
I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61 (SIAM, 1992)
A. Gabbay, A. Ephrat, T. Halperin, S. Peleg, Seeing through noise: visually driven speaker separation and enhancement, in Computer Vision and Pattern Recognition, arXiv:1708.06767 (2018)
J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, TIMIT acoustic-phonetic continuous speech corpus, in Linguistic Data Consortium, p. 11 (1992)
J. Gilles, Empirical wavelet transform. IEEE Trans. Signal Process. 61(16), 3999–4010 (2013)
J. Gilles, G. Tran, S. Osher, 2D empirical transforms. Wavelets, ridgelets, and curvelets revisited. SIAM J. Imaging Sci. 7(1), 157–186 (2014)
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
L. He, M. Lech, N.C. Maddage, N.B. Allen, Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech. Biomed. Signal Process. Control 6(2), 139–146 (2011)
Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, L. Xie, DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement, in Proc. Interspeech 2020 (2020)
P.S. Huang, S.D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, in Proc. ICASSP 2012 (2012)
N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, H.H. Shih, Q. Zheng, N.C. Yen, C.C. Tung, H.H. Liu, The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, presented at the Proceedings of the Royal Society of London A: mathematical, physical and engineering sciences, vol. 454, pp. 903–995 (1998)
M.T. Islam, C. Shahnaz, W. Zhu, M.O. Ahmad, Speech enhancement based on Student's t modeling of Teager-energy-operated perceptual wavelet packet coefficients and a custom thresholding function. IEEE Trans. Audio Speech Lang. Process. 23, 1800–1811 (2015)
S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud, A recurrent variational autoencoder for speech enhancement, in Proc. ICASSP 2020 (2020)
C. Li, J. Shi, W. Zhang, ESPnet-SE: End-To-End speech enhancement and separation toolkit designed for ASR integration, in IEEE Spoken Language Technology Workshop (SLT’21), (2021)
Z. Lin, M. Chen, L. Wu, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, arXiv:1009.5055 (2010)
H. Liu, W. Wang, L. Xue, J. Yang, Z. Wang, C. Hua, Speech enhancement based on discrete wavelet packet transform and Itakura–Saito nonnegative matrix factorisation. Arch. Acoust. 45(4), 565–572 (2020)
P.C. Loizou, Speech enhancement: theory and practice (CRC Press, 2013)
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Y. Ma, Y. Cao, S. Vrudhula, J. Seo, End-to-end scalable FPGA accelerator for deep residual networks, in IEEE International Symposium On Circuits and Systems: ISCAS, (2017)
Y. Ma, Y. Cao, S. Vrudhula, J. Seo, Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks, in ACM International Symposium On Field programmable Gate Arrays: FPGA, (2017)
Y. Ma, N. Suda, Y. Cao, J. Seo, S. Vrudhula, Scalable and modularized RTL compilation of convolutional neural networks onto FPGA, in IEEE International Conference on Field Programmable Logic and Applications: FPL, (2016)
N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2151 (2013)
A. Pandey, D. Wang, TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain, ICASSP (2019)
H. Phan, I.V. McLoughlin, L. Pham, O.Y. Chen, P. Koch, M. De Vos, A. Mertins, Improving GANs for speech enhancement. IEEE Signal Process. Lett. 27, 1700–1704 (2020)
M.F. Sahin, A. Eftekhari, A. Alacaoglu, F. Latorre, V. Cevher, An inexact augmented Lagrangian framework for nonconvex optimization with nonlinear constraints, arXiv preprint (2019)
N. Srinivas, G. Pradhan, P. Kishore-Kumar, A classification-based non-local means adaptive filtering for speech enhancement and its FPGA prototype. Circuits Syst. Signal Process. 39, 2489–2506 (2020)
C. Sun, J. Xie, Y. Leng, Signal subspace speech enhancement approach based on joint low-rank and sparse matrix decomposition. Arch. Acoust. 41, 245–254 (2016)
K. Tan, D. Wang, A convolutional recurrent neural network for real-time speech enhancement (Interspeech, 2018)
K. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific J. Optim. 6, 615–640 (2010)
C. Valentini-Botinhao, X. Wang, S. Takaki, J. Yamagishi, Investigating RNN-based speech enhancement methods for noise-robust text-to-speech, in Proc. ISCA Speech Synthesis Workshop (SSW), pp. 146–152 (2016)
A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12, 247–251 (1993)
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, presented at 31st Conference on Neural Information Processing Systems, pp. 5998–6008 (2017)
D. Wang, Two-speaker voiced/unvoiced decision for monaural speech. Circuits Syst. Signal Process. 39, 4399–4415 (2020)
D. Yin, C. Luo, Z. Xiong, W. Zeng, PHASEN: a phase-and-harmonics-aware speech enhancement network, arXiv:1911.04697 (2019)
H. Yue, F. Li, H. Li, C. Liu, An enhanced empirical wavelet transform for noisy and non-stationary signal processing. Digit. Signal Process. 60, 220–229 (2017)
Z. Zhao, H. Liu, T. Fingscheidt, Convolutional neural networks to enhance coded speech. IEEE/ACM Trans. Audio Speech Lang. Process. 27(4), 663–678 (2019)
Funding
Not applicable.
Author information
Contributions
An adaptive empirical wavelet decomposition followed by an improved sub-space decomposition is proposed for speech enhancement. A learning procedure is then applied to capture the correlations along the frequency and time axes and to avoid the drawbacks of the time–frequency domain. An objective evaluation of the quality and intelligibility of the enhanced speech shows that the proposed approach outperforms the compared methods.
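The boundary-detection step of the empirical wavelet stage can be sketched as follows. This is a simplified version of Gilles' detection rule (retain the largest spectral maxima and place boundaries midway between consecutive maxima), not the adaptive variant used in the paper; the function name and the `n_bands` parameter are illustrative.

```python
import numpy as np

def ewt_boundaries(x, n_bands=4):
    """Rough EWT segment boundaries on [0, pi]: keep the n_bands largest local
    maxima of the magnitude spectrum and put one boundary midway between each
    pair of consecutive retained maxima (simplified Gilles-style rule)."""
    mag = np.abs(np.fft.rfft(x))
    # indices of local maxima of the half-spectrum
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] >= mag[k - 1] and mag[k] > mag[k + 1]]
    # retain the strongest n_bands peaks, then sort them by frequency
    peaks = sorted(sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_bands])
    # boundaries at midpoints between consecutive maxima, mapped to [0, pi]
    mids = [(peaks[i] + peaks[i + 1]) / 2.0 for i in range(len(peaks) - 1)]
    return [np.pi * m / (len(mag) - 1) for m in mids]
```

For a signal with two well-separated tones, this returns a single boundary roughly midway between them; each resulting segment would then define the support of one empirical wavelet filter.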
Ethics declarations
Conflict of interest
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mraihi, K., Ben Messaoud, M.A. Deep Learning-Based Empirical and Sub-Space Decomposition for Speech Enhancement. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02606-4