Abstract
In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing the mask loss, but doing so does not restore the pitch harmonics, especially at higher frequencies. We propose to restore the pitch harmonics in the spectral domain by minimizing the cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of the mask loss and the cepstral pitch-peak loss. We show that this combination offers the complementary advantages of enhancing both the voiced and unvoiced regions. DNN-based methods rely primarily on the network architecture, and hence prediction accuracy improves with increasing architectural complexity. Low-complexity models, however, are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods at much lower computational complexity. Our three-layer noncausal SWAN achieves a PESQ of 2.99 on the Valentini database with only \(10^9\) floating-point operations (FLOPs).
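The combination of mask loss and cepstral pitch-peak loss described in the abstract can be illustrated with a minimal NumPy sketch for a single voiced frame. This is not the authors' implementation: the cepstrum is assumed here to be the inverse real FFT of the log-magnitude spectrum, the SMM is taken as the ratio of clean to noisy magnitude, and the function names, the pitch search range (`f0_min`, `f0_max`), the window `half_width`, and the weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def real_cepstrum(mag, eps=1e-8):
    """Real cepstrum of a one-sided magnitude spectrum (assumed definition)."""
    return np.fft.irfft(np.log(mag + eps))

def cepstral_pitch_peak_loss(clean_mag, enhanced_mag, fs=16000,
                             f0_min=60.0, f0_max=400.0, half_width=2):
    """Squared cepstral error in a small window around the clean pitch peak.

    Assumes the frame is voiced. Returns (loss, peak_quefrency_in_samples).
    """
    c_clean = real_cepstrum(clean_mag)
    c_enh = real_cepstrum(enhanced_mag)
    # Quefrency search range (in samples) for plausible pitch periods.
    q_lo = int(fs / f0_max)
    q_hi = min(int(fs / f0_min), len(c_clean) // 2)
    # The pitch peak is located on the clean cepstrum.
    peak = q_lo + int(np.argmax(c_clean[q_lo:q_hi + 1]))
    lo, hi = peak - half_width, peak + half_width + 1
    loss = float(np.mean((c_clean[lo:hi] - c_enh[lo:hi]) ** 2))
    return loss, peak

def combined_loss(smm_pred, clean_mag, noisy_mag, alpha=0.5):
    """Mask loss plus weighted cepstral pitch-peak loss (alpha is hypothetical)."""
    smm_true = clean_mag / noisy_mag      # spectral magnitude mask target
    mask_loss = float(np.mean((smm_true - smm_pred) ** 2))
    enhanced_mag = smm_pred * noisy_mag   # apply the predicted mask
    cep_loss, _ = cepstral_pitch_peak_loss(clean_mag, enhanced_mag)
    return mask_loss + alpha * cep_loss

# Toy frame: a log-spectrum with a pure ripple whose period corresponds
# to a pitch period of 80 samples (200 Hz at 16 kHz, 512-point FFT).
k = np.arange(257)
clean = np.exp(np.cos(2 * np.pi * 80 * k / 512))
flat = np.ones(257)                       # spectrum with no harmonic ripple
loss_flat, peak = cepstral_pitch_peak_loss(clean, flat)
```

In the toy frame, the flat spectrum has lost the harmonic ripple, so its cepstrum has no peak at the pitch period and the loss is nonzero. Restricting the error to a few quefrency bins around the peak penalizes only the missing harmonic structure, which is how such a term could complement a mask loss computed over all time-frequency bins.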
Notes
Speakers p245 (male) and p265 (female) are used as the development set.
Audio samples: https://siplab-iith.github.io/SWAN.
Inference code and pre-trained models: https://github.com/SIPLab-IITH/SWAN-Neural-Comb-Filter.
About this article
Cite this article
Parvathala, V., Andhavarapu, S., Pamisetty, G. et al. Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement. Circuits Syst Signal Process 42, 322–343 (2023). https://doi.org/10.1007/s00034-022-02123-2
DOI: https://doi.org/10.1007/s00034-022-02123-2