
Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Circuits, Systems, and Signal Processing

Abstract

Speech enhancement is an important technique for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on accurate modelling of the long-range dependencies of noisy speech. Several recent studies have examined ways to enhance speech by capturing long-term contextual information, but the time–frequency (T–F) distribution of speech spectral components, which is also important for speech enhancement, is usually ignored in these studies. Multi-stage learning is an effective way to integrate multiple deep learning modules; its benefit is that the optimization target can be iteratively refined stage by stage. In this paper, speech enhancement is investigated through multi-stage learning, using a structure in which time–frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. To reinject the original information into later stages, a feature fusion block (FB) is inserted at the input of each later stage, reducing the chance that speech information is lost in the early stages. The S-TCN blocks handle temporal sequence modelling, while the TFA module is a simple but effective network block that explicitly exploits position information to generate a 2D attention map characterizing the salient T–F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently outperforms existing baselines on two widely used objective metrics, PESQ and STOI. Our evaluation results also show that the TFA module yields a significant improvement in robustness to noise.
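As a concrete illustration of the two building blocks named above, the following PyTorch sketch shows (i) a two-branch attention module that pools along time and along frequency and combines the two resulting 1D attention vectors into a 2D T–F attention map, and (ii) a "squeezed" causal dilated convolution block stacked with exponentially increasing dilation rates. All module names, layer choices, and hyperparameters below are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeFrequencyAttention(nn.Module):
    """Two-branch T-F attention: 1D attention over frequency and over
    time, combined into a 2D attention map (layer choices here are
    illustrative assumptions, not the paper's exact design)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.freq_att = nn.Sequential(            # attends over frequency bins
            nn.Conv1d(channels, hidden, 1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 1), nn.Sigmoid())
        self.time_att = nn.Sequential(            # attends over time frames
            nn.Conv1d(channels, hidden, 1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        f_att = self.freq_att(x.mean(dim=3))      # pool over time -> (B, C, F)
        t_att = self.time_att(x.mean(dim=2))      # pool over freq -> (B, C, T)
        # Outer product of the two 1D maps yields the 2D T-F attention map.
        return x * (f_att.unsqueeze(3) * t_att.unsqueeze(2))


class SqueezedTCNBlock(nn.Module):
    """One 'squeezed' TCN block: pointwise bottleneck, causal dilated
    depthwise convolution, pointwise expansion, residual connection."""

    def __init__(self, channels: int, bottleneck: int, dilation: int,
                 kernel_size: int = 3):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation   # causal left padding
        self.squeeze = nn.Conv1d(channels, bottleneck, 1)
        self.dconv = nn.Conv1d(bottleneck, bottleneck, kernel_size,
                               dilation=dilation, groups=bottleneck)
        self.expand = nn.Conv1d(bottleneck, channels, 1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.squeeze(x))
        y = F.pad(y, (self.left_pad, 0))               # pad on the left only
        y = self.act(self.dconv(y))
        return x + self.expand(y)


# A stack with exponentially increasing dilation rates: 1, 2, 4, ..., 32.
stack = nn.Sequential(*[SqueezedTCNBlock(64, 16, 2 ** i) for i in range(6)])
out = stack(torch.randn(2, 64, 100))                   # -> (2, 64, 100)
```

With kernel size 3, six blocks with dilations 1, 2, ..., 32 give a causal receptive field of 1 + 2·(1 + 2 + ... + 32) = 127 frames, which is how exponentially increasing dilations capture long-range context at modest computational cost.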


Data availability

The data that support the findings of this study are available in the LibriSpeech and VCTK datasets.
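The abstract reports results in terms of PESQ and STOI. As a point of reference only, scores of this kind can be computed with the open-source `pesq` and `pystoi` Python packages; the snippet below uses synthetic placeholder signals and is an assumption about tooling, not the authors' evaluation pipeline.

```python
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

fs = 16000                                         # 16 kHz sample rate
clean = np.random.randn(3 * fs)                    # placeholder clean reference
enhanced = clean + 0.05 * np.random.randn(3 * fs)  # placeholder enhanced output

pesq_wb = pesq(fs, clean, enhanced, "wb")             # wideband PESQ, -0.5 to 4.5
stoi_val = stoi(clean, enhanced, fs, extended=False)  # STOI, 0 to 1
print(f"PESQ: {pesq_wb:.2f}  STOI: {stoi_val:.3f}")
```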


Funding

No funding.

Author information

Corresponding author

Correspondence to Chaitanya Jannu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jannu, C., Vanambathina, S.D. Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks. Circuits Syst Signal Process 42, 7467–7493 (2023). https://doi.org/10.1007/s00034-023-02455-7

