Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement

Parvathala, Venkatesh; Andhavarapu, Sivaganesh; Pamisetty, Giridhar; Murty, K. Sri Rama

doi:10.1007/s00034-022-02123-2

Published: 05 August 2022

Volume 42, pages 322–343 (2023)

Circuits, Systems, and Signal Processing

Abstract

In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing a mask loss, but this alone does not restore the pitch harmonics, especially at higher frequencies. We propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of the mask loss and the cepstral pitch-peak loss. We show that this combination offers the complementary advantages of enhancing both voiced and unvoiced regions. DNN-based methods rely primarily on the network architecture, and hence prediction accuracy improves with increasing architectural complexity. Low-complexity models, however, are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods at much lower computational complexity. Our three-layered noncausal SWAN achieves a PESQ of 2.99 on the Valentini database with only \(10^9\) floating-point operations (FLOPs).
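The link between pitch harmonics and the cepstral pitch peak can be illustrated with a small NumPy sketch. It is not the authors' implementation: the FFT-based real cepstrum, the 60–400 Hz pitch search range, the ±2-bin window around the peak, the squared-error form, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def real_cepstrum(frame, n_fft=1024):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    (Assumption: the paper may use a different cepstral transform.)"""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-8)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)

def pitch_peak_quefrency(cepstrum, fs, f_min=60.0, f_max=400.0):
    """Index (in samples) of the largest cepstral value inside a
    plausible pitch-period range (hypothetical 60-400 Hz search band)."""
    lo = int(fs / f_max)
    hi = int(fs / f_min)
    return lo + int(np.argmax(cepstrum[lo:hi + 1]))

def cepstral_peak_loss(clean_frame, enhanced_frame, fs, half_width=2):
    """Mean squared cepstral error in a small window around the clean
    signal's pitch peak (window size is an illustrative assumption)."""
    c_clean = real_cepstrum(clean_frame)
    c_enh = real_cepstrum(enhanced_frame)
    q = pitch_peak_quefrency(c_clean, fs)
    window = slice(q - half_width, q + half_width + 1)
    return float(np.mean((c_clean[window] - c_enh[window]) ** 2))

# Synthetic voiced frame: harmonics of f0 = 200 Hz up to just below Nyquist.
fs = 16000
f0 = 200.0
t = np.arange(512) / fs
voiced = sum(np.cos(2 * np.pi * k * f0 * t) / k for k in range(1, 40))

# Additive white noise stands in for a degraded (unenhanced) frame.
rng = np.random.default_rng(0)
noisy = voiced + 0.5 * rng.standard_normal(len(t))

# The pitch period is fs / f0 = 80 samples, which is where the
# harmonic spacing of the log spectrum shows up in the cepstrum.
q = pitch_peak_quefrency(real_cepstrum(voiced), fs)
print("pitch-peak quefrency (samples):", q)
print("loss, clean vs noisy:", cepstral_peak_loss(voiced, noisy, fs))
```

Because harmonics spaced at the pitch frequency make the log spectrum periodic, they concentrate into a single cepstral peak at the pitch period. Penalizing errors only near that peak therefore targets the harmonic structure, which is the sense in which such a loss behaves like a comb filter on voiced frames.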

Notes

Speakers p245 (male) and p265 (female) are used as the development set.
Audio samples: https://siplab-iith.github.io/SWAN.
Inference code and pre-trained models: https://github.com/SIPLab-IITH/SWAN-Neural-Comb-Filter.

Author information

Authors and Affiliations

Speech Information Processing Lab, Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, 502285, India
Venkatesh Parvathala, Sivaganesh Andhavarapu, Giridhar Pamisetty & K. Sri Rama Murty

Corresponding author

Correspondence to Venkatesh Parvathala.

About this article

Cite this article

Parvathala, V., Andhavarapu, S., Pamisetty, G. et al. Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement. Circuits Syst Signal Process 42, 322–343 (2023). https://doi.org/10.1007/s00034-022-02123-2

Received: 03 October 2021
Revised: 14 July 2022
Accepted: 14 July 2022
Published: 05 August 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s00034-022-02123-2
