
Research on Speech Enhancement Algorithm by Fusing Improved EMD and GCRN Networks

Published in: Circuits, Systems, and Signal Processing

Abstract

Under low signal-to-noise ratio (SNR) conditions, traditional neural networks extract speech features insufficiently and yield limited enhancement. To address this, this paper combines empirical mode decomposition (EMD), a temporal convolutional network (TCN), and a gated convolutional recurrent network (GCRN) with a feature fusion module (FFM) to build a speech enhancement model, the adaptive mean/median empirical mode decomposition multilayer gated feature-fusion convolutional recurrent network (ME-MGFCRN). The model adopts a split-frequency learning strategy: the TCN learns the low-frequency features, the MGFCRN learns the high-frequency features, and the FFM fuses the two feature sets to enhance speech by feature mapping. Ablation and comparison experiments on the dataset evaluate the enhancement quality with the PESQ, FwSegSNR, and STOI metrics. The results show that the proposed model outperforms the baseline models across different noise environments and SNR conditions; in particular, at the low SNR of −5 dB, FwSegSNR and PESQ improve by more than 0.86 dB and 0.02, respectively, over the other baselines.
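As a hedged illustration of the split-frequency idea described above, the sketch below implements a simplified EMD sifting loop in NumPy: the early intrinsic mode functions (IMFs) carry the fast oscillations (high-frequency band) and the residue carries the slow trend (low-frequency band). This is not the paper's implementation: linear interpolation stands in for the cubic-spline envelopes of standard EMD, the adaptive mean/median preprocessing is omitted, and all names and parameters are illustrative.

```python
import numpy as np

def sift(x, n_sift=8):
    """Extract one intrinsic mode function (IMF) by repeated sifting.

    Simplified sketch: linear interpolation of the extrema stands in
    for the cubic-spline envelopes used in standard EMD.
    """
    h = x.astype(float).copy()
    t = np.arange(len(h))
    for _ in range(n_sift):
        d = np.diff(h)
        # local maxima: slope changes from positive to negative (and vice versa for minima)
        maxima = np.where((np.hstack([0.0, d]) > 0) & (np.hstack([d, 0.0]) < 0))[0]
        minima = np.where((np.hstack([0.0, d]) < 0) & (np.hstack([d, 0.0]) > 0))[0]
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema: h is (close to) monotonic
        upper = np.interp(t, maxima, h[maxima])  # upper envelope
        lower = np.interp(t, minima, h[minima])  # lower envelope
        h -= (upper + lower) / 2.0               # subtract the local mean
    return h

def emd_split(x, n_imfs=3):
    """Decompose x into IMFs plus a residue; return (high, low) bands.

    The first IMFs hold the fastest oscillations (high-frequency band);
    the remaining residue holds the slow trend (low-frequency band).
    """
    residue = x.astype(float).copy()
    imfs = []
    for _ in range(n_imfs):
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf
    high = np.sum(imfs, axis=0)
    return high, residue

# Toy signal: a slow 5 Hz tone plus a fast 80 Hz tone, 1 s at 1 kHz.
fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)
high, low = emd_split(x)
# By construction the two bands reconstruct the input exactly.
assert np.allclose(high + low, x)
```

In the pipeline described in the abstract, the low-frequency band would then be fed to the TCN and the high-frequency band to the MGFCRN, with the FFM fusing their outputs; those network stages are beyond this sketch.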


Availability of Data and Materials

All the data included in this study are available upon request by contacting the corresponding author.


Acknowledgements

This research was supported by the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Foundation of China Youth Fund (No. 11804068), and a research project of the Heilongjiang Province Health Commission (No. 20221111001069).

Author information


Correspondence to Lei Zhang or Shilong Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests or conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lan, C., Chen, H., Zhang, L. et al. Research on Speech Enhancement Algorithm by Fusing Improved EMD and GCRN Networks. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02677-3
