Abstract
In this paper, we present a new deep-learning approach to speech denoising. The proposed neural network uses an encoder-decoder architecture. It takes as input a noisy signal processed by a windowed Fourier transform and produces a complex mask, the ratio of the clean and distorted audio signals. When this mask is multiplied element-wise with the spectrum of the input signal, the noise component is suppressed. The key component of the approach is the hierarchical structure of the neural network, which allows the input signal to be processed at different scales. We used self-attention layers to capture non-local dependencies in the time-frequency decomposition of the noisy signal, with a spatial-reduction attention modification to reduce time and memory complexity. The scale-invariant signal-to-distortion ratio (SI-SDR) served as the loss function. The network was trained on the DNS Challenge 2020 dataset, which includes samples of clean voice recordings and a representative set of noise classes. Several standard quality metrics (WB-PESQ, STOI, etc.) were used to compare performance with the best existing models. The proposed method showed its effectiveness on all the metrics and can be used to improve the quality of speech recordings.
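The two operations at the heart of the abstract, applying a complex ratio mask element-wise to the noisy spectrum and scoring the result with SI-SDR, can be sketched in a few lines of NumPy. This is an illustrative sketch following the SI-SDR definition of Le Roux et al. (see the reference list), not the authors' implementation; the function names are ours.

```python
import numpy as np

def apply_complex_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise complex multiplication of the predicted mask with the noisy spectrum."""
    return noisy_stft * mask

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (Le Roux et al., 2019)."""
    # Project the estimate onto the reference to find the optimal scaling factor.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference    # scaled clean component
    residual = estimate - target  # everything that is not the target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(residual, residual) + eps))
```

Because of the optimal-scaling step, multiplying the estimate by any nonzero constant leaves SI-SDR unchanged, which is why it is preferred over plain SDR as a training loss.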
Notes
1. Instead of the more accurate term "three-dimensional matrix", the term "tensor" is used in this paper, as is customary in the terminology of deep neural network libraries.
References
Loizou, P.C.: Speech Enhancement: Theory and Practice. CRC Press, Boca Raton, FL, USA (2007)
Tan, K., Wang, D.: A convolutional recurrent neural network for real-time speech enhancement. In: Proc. Interspeech 2018, 2–6 September 2018, Hyderabad, India, pp. 3229–3233 (2018)
Isik, U., Giri, R., Phansalkar, N., Valin, J.-M., Helwani, K., Krishnaswamy, A.: PoCoNet: better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2487–2491 (2020)
Hao, X., Su, X., Horaud, R., Li, X.: FullSubNet: a full-band and sub-band fusion model for real-time single-channel speech enhancement. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 6–11 June 2021, Toronto, Canada, pp. 6633–6637 (2021)
Xu, R., Wu, R., Ishiwaka, Y., Vondrick, C., Zheng, C.: Listening to sounds of silence for speech denoising. In: Conference on Neural Information Processing Systems (NeurIPS 2020), 6–12 December 2020, Vancouver, Canada, pp. 9633–9648 (2020)
Luo, Y., Mesgarani, N.: Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Hu, G., Wang, D.: Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw. 15(5), 1135–1150 (2004)
Xia, S., Li, H., Zhang, X.: Using optimal ratio mask as training target for supervised speech separation. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 12–15 December 2017, Kuala Lumpur, pp. 163–166 (2017)
Liu, Y., Zhang, H., Zhang, X., Yang, L.: Supervised speech enhancement with real spectrum approximation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 12–17 May 2019, Brighton, UK, pp. 5746–5750 (2019)
Williamson, D.S., Wang, Y., Wang, D.: Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 483–492 (2016)
Vaswani, A., et al.: Attention is all you need. In: Conference on Neural Information Processing Systems (NIPS 2017), 4–9 December 2017, Long Beach, CA, USA, 11 p. (2017)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: Proc. of International Conference on Learning Representations (ICLR 2021), 3–7 May 2021, virtual only, 21 p. (2021)
Wang, W., et al.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: IEEE/CVF International Conference on Computer Vision (ICCV), 11–17 October 2021, virtual only, pp. 548–558 (2021)
Cao, H., et al.: Swin-Unet: Unet-like pure transformer for medical image segmentation (2021). Preprint at https://arxiv.org/abs/2105.05537
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R.: SDR – half-baked or well done? In: IEEE International Conference on Acoustics, Speech and Signal Processing, 12–17 May 2019, Brighton, UK, pp. 626–630 (2019)
Reddy, C., Beyrami, E., Dubey, H.: Deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2492–2496 (2020)
International Telecommunication Union: ITU-T P.808 Subjective Evaluation of Speech Quality with a Crowdsourcing Approach (2018)
Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P.: Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 7–11 May 2001, Salt Lake City, UT, USA, pp. 749–752 (2001)
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 15–19 March 2010, Dallas, TX, USA, pp. 4214–4217 (2010)
Nasretdinov, R., Ilyashenko, I., Lependin, A.: Two-stage method of speech denoising by long short-term memory neural network. In: HPCST 2021: High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production. CCIS, vol. 1526, pp. 86–97. Springer, Cham (2022)
Braun, S., Tashev, I.: Data augmentation and loss normalization for deep noise suppression. In: Karpov, A., Potapova, R. (eds.) SPECOM 2020. LNCS (LNAI), vol. 12335, pp. 79–86. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60276-5_8
Hu, Y., et al.: DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. In: Proc. of Interspeech 2020, 25–29 October 2020, Shanghai, China, pp. 2472–2476 (2020)
Acknowledgments
This work was supported by a grant from the Russian Science Foundation, project no. 22-21-00199, https://rscf.ru/en/project/22-21-00199/.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Nasretdinov, R., Ilyashenko, I., Filin, J., Lependin, A. (2023). Hierarchical Encoder-Decoder Neural Network with Self-Attention for Single-Channel Speech Denoising. In: Jordan, V., Tarasov, I., Shurina, E., Filimonov, N., Faerman, V. (eds) High-Performance Computing Systems and Technologies in Scientific Research, Automation of Control and Production. HPCST 2022. Communications in Computer and Information Science, vol 1733. Springer, Cham. https://doi.org/10.1007/978-3-031-23744-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23743-0
Online ISBN: 978-3-031-23744-7