
A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement

Published in: Multimedia Tools and Applications

Abstract

The skip connection mechanism has proven to be an effective way to improve speech enhancement networks. By strengthening the flow of information between the encoder and the decoder, it facilitates the restoration of speech features during up-sampling. However, a simple skip connection mechanism that directly links corresponding layers of the encoder and decoder has several issues. First, it aggregates only features of the same scale, ignoring potential relationships between different scales. Second, shallow encoder features contain a great deal of redundant information; studies have shown that coarse skip connections can even harm model performance in some cases. In this work, we propose a novel skip connection mechanism based on a channel-wise Transformer for speech enhancement, comprising two components: multi-scale channel-wise cross fusion and channel-wise cross attention. The proposed mechanism fuses multi-scale speech features from different levels of the encoder and effectively connects the reconstructed features to the decoder. Building on this, we propose a lightweight U-shaped network (UNet) called UCTNet. Experimental results show that UCTNet is comparable to other competitive models on various objective speech quality metrics while using only a small number of parameters.
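As a rough illustration of the idea (a minimal NumPy sketch, not the paper's exact formulation; the function and variable names here are our own), channel-wise cross attention swaps the roles of spatial positions and channels: each decoder channel queries the fused encoder channels, so the attention map has shape (C_dec, C_enc) instead of the usual spatial-by-spatial map.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_cross_attention(dec_feat, enc_feat):
    """Sketch of channel-wise cross attention between a decoder
    feature map (C_dec, T, F) and a fused encoder feature map
    (C_enc, T, F). Channels act as tokens; the flattened
    time-frequency map acts as the token embedding."""
    c_dec, t, f = dec_feat.shape
    c_enc = enc_feat.shape[0]
    q = dec_feat.reshape(c_dec, -1)          # queries: one per decoder channel
    k = enc_feat.reshape(c_enc, -1)          # keys: one per encoder channel
    v = k                                    # values share the keys in this sketch
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (C_dec, C_enc)
    out = attn @ v                           # recombine encoder channels
    return out.reshape(c_dec, t, f)
```

Because attention is computed across channels rather than across time-frequency positions, the cost scales with the (small) channel counts, which is consistent with the lightweight design the abstract describes; the paper's actual module additionally uses learned projections and multi-scale fusion before this step.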



Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Acknowledgments

This work was supported partly by National Natural Science Foundation of China (No. 61861033, 62261040, 62162044), Graduate Innovation Special Foundation of Jiangxi Province (No. YC2022-s731), Natural Science Foundation of Jiangxi Province (No. 20202ACBL202007), GanPo Talent Support Program of Jiangxi (No. 20232BCJ22050), and Natural Science Foundation of Shandong Province (No. ZR2020MF020).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chengli Sun.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jiang, W., Sun, C., Chen, F. et al. A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement. Multimed Tools Appl 83, 34849–34866 (2024). https://doi.org/10.1007/s11042-023-16977-4

