Abstract
In this paper, we present a new deep-learning approach to speech denoising. The proposed neural network uses an encoder-decoder architecture. It takes as input a noisy signal processed by a windowed Fourier transform and produces a complex mask, the ratio of the clean and distorted audio signals. When this mask is multiplied element-wise with the spectrum of the input signal, the noise component is suppressed. The key component of the approach is the hierarchical structure of the neural network, which allows the input signal to be processed at different scales. We used self-attention layers to capture non-local dependencies in the time-frequency decomposition of the noisy signal, with a spatial-reduction attention modification to reduce time and memory complexity. The scale-invariant signal-to-distortion ratio (SI-SDR) served as the loss function. The network was trained on the DNS Challenge 2020 dataset, which includes samples of clean voice recordings and a representative set of noise classes. Several standard quality metrics (WB-PESQ, STOI, etc.) were used to compare performance with the best existing models. The proposed method showed its effectiveness on all the metrics and can be used to improve the quality of speech recordings.
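The two operations at the heart of the abstract, applying a complex ratio mask element-wise to the noisy spectrum and scoring the result with SI-SDR, can be sketched in a few lines of NumPy. This is an illustrative sketch following the SI-SDR definition of Le Roux et al. (see the reference list), not the authors' implementation; the function names are ours.

```python
import numpy as np

def apply_complex_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise complex multiplication of the predicted mask with the noisy spectrum."""
    return noisy_stft * mask

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (Le Roux et al., 2019)."""
    # Project the estimate onto the reference to find the optimal scaling factor.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference    # scaled clean component
    residual = estimate - target  # everything that is not the target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(residual, residual) + eps))
```

Because of the optimal-scaling step, multiplying the estimate by any nonzero constant leaves SI-SDR unchanged, which is why it is preferred over plain SDR as a training loss.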
Notes
1. Instead of the more accurate term "three-dimensional matrix", the term "tensor" is used in this paper, as is customary in the terminology of deep neural network libraries.
References
Loizou, P.C.: Speech Enhancement: Theory and Practice. CRC Press, Boca Raton, FL, USA (2007)
Tan, K., Wang, D.: A convolutional recurrent neural network for real-time speech enhancement. In: Proc. Interspeech 2018, 2–6 September 2018, Hyderabad, India, pp. 3229–3233 (2018)
Isik, U., Giri, R., Phansalkar, N., Valin, J.-M., Helwani, K., Krishnaswamy, A.: PoCoNet: better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2487–2491 (2020)
Hao, X., Su, X., Horaud, R., Li, X.: FullSubNet: a full-band and sub-band fusion model for real-time single-channel speech enhancement. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 6–11 June 2021, Toronto, Canada, pp. 6633–6637 (2021)
Xu, R., Wu, R., Ishiwaka, Y., Vondrick, C., Zheng, C.: Listening to sounds of silence for speech denoising. In: Conference on Neural Information Processing Systems (NeurIPS 2020), 6–12 December 2020, Vancouver, Canada, pp. 9633–9648 (2020)
Luo, Y., Mesgarani, N.: Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Hu, G., Wang, D.: Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw. 15(5), 1135–1150 (2004)
Xia, S., Li, H., Zhang, X.: Using optimal ratio mask as training target for supervised speech separation. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 12–15 December 2017, Kuala Lumpur, pp. 163–166 (2017)
Liu, Y., Zhang, H., Zhang, X., Yang, L.: Supervised speech enhancement with real spectrum approximation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 12–17 May 2019, Brighton, UK, pp. 5746–5750 (2019)
Williamson, D.S., Wang, Y., Wang, D.: Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 483–492 (2016)
Vaswani, A., et al.: Attention is all you need. In: Conference on Neural Information Processing Systems (NIPS 2017), 4–9 December 2017, Long Beach, CA, USA, 11 p. (2017)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: Proc. of International Conference on Learning Representations (ICLR 2021), 3–7 May 2021, virtual only, 21 p. (2021)
Wang, W., et al.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: IEEE/CVF International Conference on Computer Vision (ICCV), 11–17 October 2021, virtual only, pp. 548–558 (2021)
Cao, H., et al.: Swin-Unet: Unet-like pure transformer for medical image segmentation (2021). Preprint at https://arxiv.org/abs/2105.05537
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R.: SDR – half-baked or well done? In: IEEE International Conference on Acoustics, Speech and Signal Processing, 12–17 May 2019, Brighton, UK, pp. 626–630 (2019)
Reddy, C., Beyrami, E., Dubey, H.: Deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2492–2496 (2020)
International Telecommunication Union: ITU-T P.808 Subjective Evaluation of Speech Quality with a Crowdsourcing Approach (2018)
Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P.: Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 7–11 May 2001, Salt Lake City, UT, USA, pp. 749–752 (2001)
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 15–19 March 2010, Dallas, TX, USA, pp. 4214–4217 (2010)
Nasretdinov, R., Ilyashenko, I., Lependin, A.: Two-stage method of speech denoising by long short-term memory neural network. In: HPCST 2021: High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production. CCIS, vol. 1526, pp. 86–97. Springer, Cham (2022)
Braun, S., Tashev, I.: Data augmentation and loss normalization for deep noise suppression. In: Karpov, A., Potapova, R. (eds.) SPECOM 2020. LNCS (LNAI), vol. 12335, pp. 79–86. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60276-5_8
Hu, Y., et al.: DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2472–2476 (2020)
Acknowledgments
This work was supported by a grant from the Russian Science Foundation, project no. 22-21-00199, https://rscf.ru/en/project/22-21-00199/.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Nasretdinov, R., Ilyashenko, I., Filin, J., Lependin, A. (2023). Hierarchical Encoder-Decoder Neural Network with Self-Attention for Single-Channel Speech Denoising. In: Jordan, V., Tarasov, I., Shurina, E., Filimonov, N., Faerman, V. (eds) High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production. HPCST 2022. Communications in Computer and Information Science, vol 1733. Springer, Cham. https://doi.org/10.1007/978-3-031-23744-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23743-0
Online ISBN: 978-3-031-23744-7