Abstract
Speech enhancement improves speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech. Several recent studies have enhanced speech by capturing long-term contextual information, but they usually ignore the time–frequency (T–F) distribution of speech spectral components, which is also important for speech enhancement. Multi-stage learning is an effective way to integrate several deep learning modules at once, with the benefit that the optimization target can be updated iteratively, stage by stage. In this paper, speech enhancement is investigated with a multi-stage structure in which time–frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. To reinject the original information into later stages, a feature fusion block (FB) is inserted at the input of each later stage, reducing the risk of speech information being lost in the early stages. The S-TCN blocks perform the temporal sequence modelling. The TFA is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T–F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently outperforms existing baselines on two widely used objective metrics, PESQ and STOI. Our evaluation results also show that the TFA module significantly improves system robustness to noise.
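The idea of forming a 2D attention map from two parallel branches can be illustrated with a minimal numpy sketch. This is not the authors' network (the paper's TFA learns its branches with convolutional layers); the function name `tfa_map` and the use of simple mean pooling followed by a softmax in each branch are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tfa_map(spec):
    """Toy time-frequency attention map for a magnitude spectrogram.

    spec: array of shape (F, T), frequency bins x time frames.
    Two parallel branches pool across the opposite axis:
      - frequency attention: per-bin weights from pooling over time,
      - time-frame attention: per-frame weights from pooling over frequency.
    Their outer product gives a 2D map applied elementwise to spec.
    """
    freq_att = softmax(spec.mean(axis=1))   # (F,) frequency-attention branch
    time_att = softmax(spec.mean(axis=0))   # (T,) time-frame-attention branch
    return np.outer(freq_att, time_att)     # (F, T) 2D attention map
```

Applying `spec * tfa_map(spec)` then reweights the spectrogram so that salient T–F regions are emphasized; because each branch's weights sum to one, the full map also sums to one.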
Data availability
The data that support the findings of this study are available in the LibriSpeech dataset and VCTK dataset.
Funding
No funding.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest regarding this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jannu, C., Vanambathina, S.D. Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks. Circuits Syst Signal Process 42, 7467–7493 (2023). https://doi.org/10.1007/s00034-023-02455-7