Abstract
Speech enhancement improves speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech. Several recent studies have enhanced speech by capturing long-term contextual information, but they usually ignore the time–frequency (T–F) distribution of speech spectral components, which is also important for speech enhancement. Multi-stage learning is an effective way to integrate several deep learning modules at once, with the benefit that the optimization target can be updated iteratively, stage by stage. In this paper, speech enhancement is investigated with a multi-stage structure in which time–frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. To reinject the original information into later stages, a feature fusion block (FB) is inserted at the input of each later stage, reducing the risk of speech information being lost in the early stages. The S-TCN blocks perform the temporal sequence modelling. The TFA is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T–F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently outperforms existing baselines on two widely used objective metrics, PESQ and STOI. Our evaluation results also show that the TFA module significantly improves system robustness to noise.
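The idea of forming a 2D attention map from two parallel branches can be illustrated with a minimal numpy sketch. This is not the authors' network (the paper's TFA learns its branches with convolutional layers); the function name `tfa_map` and the use of simple mean pooling followed by a softmax in each branch are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tfa_map(spec):
    """Toy time-frequency attention map for a magnitude spectrogram.

    spec: array of shape (F, T), frequency bins x time frames.
    Two parallel branches pool across the opposite axis:
      - frequency attention: per-bin weights from pooling over time,
      - time-frame attention: per-frame weights from pooling over frequency.
    Their outer product gives a 2D map applied elementwise to spec.
    """
    freq_att = softmax(spec.mean(axis=1))   # (F,) frequency-attention branch
    time_att = softmax(spec.mean(axis=0))   # (T,) time-frame-attention branch
    return np.outer(freq_att, time_att)     # (F, T) 2D attention map
```

Applying `spec * tfa_map(spec)` then reweights the spectrogram so that salient T–F regions are emphasized; because each branch's weights sum to one, the full map also sums to one.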
Data availability
The data that support the findings of this study are available in the LibriSpeech dataset and VCTK dataset.
Funding
No funding.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest regarding this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jannu, C., Vanambathina, S.D. Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks. Circuits Syst Signal Process 42, 7467–7493 (2023). https://doi.org/10.1007/s00034-023-02455-7