
Speech enhancement using U-nets with wide-context units

Abstract

In this article a new neural network for speech enhancement is proposed, in which single-channel noisy speech is processed in order to improve its intelligibility and quality. It is based on the U-net architecture, i.e. it is composed of two main blocks: an encoder and a decoder. Some of the corresponding layers in the encoder and decoder are connected with skip connections. In most of the encoder-decoder neural networks for speech enhancement known from the literature, the time-frequency resolution of the hidden feature maps is reduced. The main strategy in the presented approach is to maintain the time-frequency resolution of the feature maps at all levels of the network while at the same time having a large receptive field. In order to obtain features that depend on a wide context, we propose neural network units based on recurrent cells or dilated convolutions. The proposed neural network was evaluated using WSJ0 and TIMIT speech data mixed with noises from Noisex, DCASE and field recordings from the Freesound online database. The results showed improvement over baseline networks based on gated dilated convolutions or long short-term memory (LSTM) in terms of the scale-invariant signal-to-distortion ratio (SI-SDR), short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) measures.
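For illustration, a minimal sketch of one possible wide-context unit based on dilated convolutions is given below. This is a PyTorch-style example written for this summary; the class name, channel counts, dilation rates, normalization and residual connection are assumptions and not the exact unit proposed in the paper. The point it illustrates is that stacked dilated convolutions enlarge the receptive field over the time-frequency map while keeping its resolution unchanged.

```python
import torch
import torch.nn as nn

class WideContextConvBlock(nn.Module):
    """Hypothetical wide-context unit: dilated 2-D convolutions over a
    time-frequency feature map. All layer sizes here are illustrative."""
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [
                # padding == dilation keeps the (freq, time) size for a 3x3 kernel,
                # so the time-frequency resolution is never reduced
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ELU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (batch, channels, freq, time)
        return x + self.net(x)   # residual connection, same resolution as the input
```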

Acknowledgments

The work of Szymon Drgas was supported by grant 0211/SBAD/0222.

Author information

Corresponding author

Correspondence to Tomasz Grzywalski.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors contributed equally to this work. The order of the authors is random.

Appendix: Gated recurrent unit

There are many implementations and parameterizations of GRUs, which can affect the performance of the neural network. Therefore, in this appendix we describe the variant of the GRU used in the experiments reported in this paper. The value of the reset gate is calculated as

$$ \textbf{r}_{t}=\sigma_{r}(\textbf{W}_{xr}\textbf{x}_{t}+\textbf{W}_{hr}\textbf{h}_{t-1}+\textbf{b}_{r}) , $$
(10)

where $\sigma_{r}$ denotes the sigmoid function, $\textbf{x}_{t}$ is the input vector to the GRU at time index $t$, and $\textbf{h}_{t-1}$ is the vector representing the previous state. The matrices $\textbf{W}_{xr}$, $\textbf{W}_{hr}$ and the bias $\textbf{b}_{r}$ are the parameters of the reset gate. Similarly, the update gate is given by

$$ \textbf{u}_{t}=\sigma_{u}(\textbf{W}_{xu}\textbf{x}_{t}+\textbf{W}_{hu}\textbf{h}_{t-1}+\textbf{b}_{u}) , $$
(11)

and the candidate state is computed using the formula

$$ \textbf{c}_{t}=\sigma_{c}(\textbf{W}_{xc}\textbf{x}_{t}+\textbf{r}_{t}\odot (\textbf{W}_{hc}\textbf{h}_{t-1})+\textbf{b}_{c}) , $$
(12)

where $\sigma_{c}$ is the hyperbolic tangent function, and $\textbf{W}_{xc}$, $\textbf{W}_{hc}$ and $\textbf{b}_{c}$ are the corresponding weights and bias. Finally, the state $\textbf{h}_{t}$ is computed as

$$ \textbf{h}_{t}=(\textbf{1}-\textbf{u}_{t})\odot \textbf{h}_{t-1} + \textbf{u}_{t} \odot \textbf{c}_{t} . $$
(13)
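As a minimal sketch (the function and variable names are ours, not the authors'), equations (10)-(13) translate directly into a single NumPy step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_xr, W_hr, b_r, W_xu, W_hu, b_u, W_xc, W_hc, b_c):
    """One time step of the GRU variant defined by (10)-(13)."""
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)          # reset gate, (10)
    u_t = sigmoid(W_xu @ x_t + W_hu @ h_prev + b_u)          # update gate, (11)
    c_t = np.tanh(W_xc @ x_t + r_t * (W_hc @ h_prev) + b_c)  # candidate state, (12)
    return (1.0 - u_t) * h_prev + u_t * c_t                  # new state h_t, (13)
```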

The processing performed by the GRU can thus be described as a linear transformation

$$ \left[\begin{array}{cc} \textbf{W}_{xr} & \textbf{W}_{hr} \\ \textbf{W}_{xu} & \textbf{W}_{hu} \\ \textbf{W}_{xc} & \textbf{0} \\ \textbf{0} & \textbf{W}_{hc} \\ \textbf{0} & \textbf{I} \end{array}\right]\left[ \begin{array}{c} \textbf{x}_{t} \\ \textbf{h}_{t-1} \end{array}\right] + \left[ \begin{array}{c} \textbf{b}_{r} \\ \textbf{b}_{u} \\ \textbf{0} \\ \textbf{0} \\ \textbf{0} \end{array}\right] = \left[ \begin{array}{c} \textbf{z}_{r} \\ \textbf{z}_{u} \\ \textbf{z}_{xc} \\ \textbf{z}_{hc} \\ \textbf{h}_{t-1} \end{array}\right] , $$
(14)

which is transformed using the following nonlinear function

$$ f(\textbf{z}) = (\textbf{1}-\sigma(\textbf{z}_{u}))\odot \textbf{h}_{t-1} + \sigma(\textbf{z}_{u})\odot \tanh(\textbf{z}_{xc}+\sigma(\textbf{z}_{r})\odot\textbf{z}_{hc}+\textbf{b}_{c}) . $$
(15)
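Continuing the sketch above (same assumed names), the packed formulation (14)-(15) computes exactly the same state update: one affine map produces $\textbf{z}_{r}$, $\textbf{z}_{u}$, $\textbf{z}_{xc}$, $\textbf{z}_{hc}$ and passes $\textbf{h}_{t-1}$ through, after which the nonlinearity $f$ combines them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_packed(x_t, h_prev, W_xr, W_hr, b_r, W_xu, W_hu, b_u, W_xc, W_hc, b_c):
    """Same state update as gru_step above, written as the affine map (14)
    followed by the element-wise nonlinearity (15)."""
    z_r = W_xr @ x_t + W_hr @ h_prev + b_r   # first row block of (14)
    z_u = W_xu @ x_t + W_hu @ h_prev + b_u   # second row block of (14)
    z_xc = W_xc @ x_t                        # third row block of (14)
    z_hc = W_hc @ h_prev                     # fourth row block of (14)
    # nonlinearity (15); the last row of (14) simply passes h_{t-1} through
    return (1.0 - sigmoid(z_u)) * h_prev \
        + sigmoid(z_u) * np.tanh(z_xc + sigmoid(z_r) * z_hc + b_c)
```

For any parameter values this returns the same vector as gru_step, which can serve as a quick consistency check of (14)-(15).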

Additionally, in all experiments featuring recurrent layers, the initial states of the recurrences were also learned during training. We found this to have a small but consistent positive effect on the network’s performance.
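A minimal PyTorch-style illustration of this choice is given below; it is our sketch, not the authors’ code, and note that torch.nn.GRU uses a slightly different bias arrangement than the variant defined above. The initial hidden state is declared as a trainable parameter instead of a fixed zero vector, so it is updated by backpropagation together with the other weights.

```python
import torch
import torch.nn as nn

class GRUWithLearnedInit(nn.Module):
    """Hypothetical example of a GRU layer with a learned initial state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        # initial state is a parameter, hence learned during training
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):                                   # x: (batch, time, input_size)
        h0 = self.h0.expand(1, x.size(0), -1).contiguous()  # (num_layers, batch, hidden)
        out, _ = self.gru(x, h0)
        return out
```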


About this article

Cite this article

Grzywalski, T., Drgas, S. Speech enhancement using U-nets with wide-context units. Multimed Tools Appl 81, 18617–18639 (2022). https://doi.org/10.1007/s11042-022-12632-6

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-022-12632-6

Keywords

  • Speech enhancement
  • U-nets
  • DNN