Hierarchical Encoder-Decoder Neural Network with Self-Attention for Single-Channel Speech Denoising

  • Conference paper
High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production (HPCST 2022)

Abstract

In this paper, we present a new approach to effective speech denoising based on deep learning methods. The proposed neural network uses an encoder-decoder architecture. It takes a noisy signal processed by the windowed Fourier transform as input and produces a complex mask, the ratio of the clean and distorted audio spectra. When this mask is multiplied element-wise with the spectrum of the input signal, the noise component is suppressed. The key component of the approach is the hierarchical structure of the neural network model, which allows the input signal to be processed at different scales. We used self-attention layers to capture non-local dependencies in the time-frequency decomposition of the noisy signal, with a spatial-reduction attention modification to lower time and memory complexity. The scale-invariant signal-to-distortion ratio (SI-SDR) was used as the loss function in the developed method. The network was trained on the DNS Challenge 2020 dataset, which includes samples of clean voice recordings and a representative set of various noise classes. Several standard quality metrics (WB-PESQ, STOI, etc.) were used to compare performance with the best existing models. The proposed method proved effective on all the metrics and can be used to improve the quality of speech audio recordings.
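To make the masking pipeline and training objective described above concrete, the following PyTorch sketch shows how a predicted complex ratio mask can be applied to the noisy spectrum and how a negative-SI-SDR loss can be computed. This is a minimal sketch under our own assumptions (the function names `apply_complex_mask` and `si_sdr_loss`, the FFT parameters, and the 16 kHz sampling rate are illustrative), not the authors' implementation.

```python
import torch

def apply_complex_mask(noisy_spec: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Element-wise complex product of the noisy STFT and the predicted
    complex ratio mask; both are complex tensors of shape (batch, freq, time)."""
    return noisy_spec * mask

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR (scale-invariant signal-to-distortion ratio),
    averaged over a batch of waveforms of shape (batch, samples)."""
    # Zero-mean both signals so the measure is invariant to DC offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Optimal scaling of the target: projection of the estimate onto it.
    alpha = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    scaled_target = alpha * target
    noise = estimate - scaled_target
    si_sdr = 10 * torch.log10(scaled_target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

# Toy usage with random tensors standing in for real audio and the network output.
noisy = torch.randn(2, 16000)   # two 1-second clips at an assumed 16 kHz rate
clean = torch.randn(2, 16000)
window = torch.hann_window(512)
spec = torch.stft(noisy, n_fft=512, hop_length=128, window=window, return_complex=True)
mask = torch.complex(torch.randn_like(spec.real), torch.randn_like(spec.real))  # stand-in for the network output
denoised = torch.istft(apply_complex_mask(spec, mask), n_fft=512, hop_length=128,
                       window=window, length=noisy.shape[-1])
loss = si_sdr_loss(denoised, clean)
```

The spatial-reduction attention mentioned in the abstract can likewise be sketched: keys and values are computed on a strided-convolution-downsampled copy of the time-frequency map, so the attention matrix shrinks by roughly the square of the reduction ratio. Again a hypothetical sketch in the spirit of the Pyramid Vision Transformer's attention; the module name, head count, and reduction ratio are our assumptions.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention whose keys/values come from a spatially
    reduced copy of the input, cutting attention cost by ~sr_ratio**2."""
    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape                          # (batch, channels, freq, time)
        q = x.flatten(2).transpose(1, 2)              # queries over all f*t positions
        kv = self.norm(self.sr(x).flatten(2).transpose(1, 2))  # reduced keys/values
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return out.transpose(1, 2).reshape(b, c, f, t)
```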


Notes

  1. Instead of the more accurate term “three-dimensional matrix”, the term “tensor” is used in this paper, as is customary in the terminology of deep neural network libraries.


Acknowledgments

This work was supported by a grant from the Russian Science Foundation, project no. 22-21-00199, https://rscf.ru/en/project/22-21-00199/.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nasretdinov, R., Ilyashenko, I., Filin, J., Lependin, A. (2023). Hierarchical Encoder-Decoder Neural Network with Self-Attention for Single-Channel Speech Denoising. In: Jordan, V., Tarasov, I., Shurina, E., Filimonov, N., Faerman, V. (eds) High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production. HPCST 2022. Communications in Computer and Information Science, vol 1733. Springer, Cham. https://doi.org/10.1007/978-3-031-23744-7_1


  • DOI: https://doi.org/10.1007/978-3-031-23744-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23743-0

  • Online ISBN: 978-3-031-23744-7

  • eBook Packages: Computer Science, Computer Science (R0)
