Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement

Circuits, Systems, and Signal Processing

Abstract

In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing the mask loss, but this loss does not restore the pitch harmonics, especially at higher frequencies. We propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of the mask loss and the cepstral pitch-peak loss, and we show that this combination offers the complementary advantages of enhancing both voiced and unvoiced regions. DNN-based methods rely primarily on the network architecture, and hence prediction accuracy improves with increasing architectural complexity; lower-complexity models, however, are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods at much lower computational cost. Our three-layered noncausal SWAN achieves a PESQ score of 2.99 on the Valentini database with only \(10^9\) floating-point operations (FLOPs).
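To make the training objective concrete, the following is a minimal PyTorch sketch of a combined mask loss and cepstral pitch-peak loss of the kind the abstract describes. The helper names (`cepstrum`, `combined_loss`), the quefrency window half-width, and the loss weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the combined objective from the abstract: an SMM mask
# loss plus a cepstral loss in a small quefrency window around the pitch
# peak of voiced frames. All names and hyperparameters here (half_width,
# alpha) are assumptions for illustration, not the paper's exact values.
import torch

def cepstrum(log_mag: torch.Tensor) -> torch.Tensor:
    """Real cepstrum of a log-magnitude half-spectrum, shape (frames, bins)."""
    # Rebuild the conjugate-symmetric full spectrum, then inverse-FFT.
    full = torch.cat([log_mag, log_mag.flip(-1)[..., 1:-1]], dim=-1)
    return torch.fft.ifft(full, dim=-1).real

def combined_loss(pred_mask, ideal_mask, noisy_mag, clean_mag,
                  pitch_q, voiced, half_width=3, alpha=0.5):
    """pred_mask, ideal_mask, noisy_mag, clean_mag: (frames, bins);
    pitch_q: (frames,) long, per-frame pitch-peak quefrency index;
    voiced: (frames,) float voiced/unvoiced flags in {0, 1}."""
    eps = 1e-8

    # Mask loss: MSE between the predicted and ideal spectral magnitude mask.
    mask_loss = torch.mean((pred_mask - ideal_mask) ** 2)

    # Cepstral pitch-peak loss: compare enhanced and clean cepstra in a
    # window centred on the pitch peak, counted only on voiced frames.
    enh_cep = cepstrum(torch.log(pred_mask * noisy_mag + eps))
    cln_cep = cepstrum(torch.log(clean_mag + eps))
    offsets = torch.arange(-half_width, half_width + 1)
    idx = (pitch_q.unsqueeze(-1) + offsets).clamp(0, enh_cep.shape[-1] - 1)
    diff = torch.gather(enh_cep, -1, idx) - torch.gather(cln_cep, -1, idx)
    peak_loss = torch.sum(voiced * diff.pow(2).mean(dim=-1)) / (voiced.sum() + eps)

    return mask_loss + alpha * peak_loss
```

Minimizing the windowed cepstral term pulls the enhanced spectrum's pitch peak toward the clean one, which is what lets the loss behave as an adaptive comb filter over the pitch harmonics on voiced frames while the mask loss handles the remaining (including unvoiced) regions.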


Notes

  1. Speakers p245 (male) and p265 (female) are used for the development dataset.

  2. Audio samples: https://siplab-iith.github.io/SWAN.

  3. Inference codes and pre-trained models: https://github.com/SIPLab-IITH/SWAN-Neural-Comb-Filter.


Author information

Corresponding author

Correspondence to Venkatesh Parvathala.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Parvathala, V., Andhavarapu, S., Pamisetty, G. et al. Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement. Circuits Syst Signal Process 42, 322–343 (2023). https://doi.org/10.1007/s00034-022-02123-2

