Abstract
The skip connection has proven to be an effective mechanism for improving speech enhancement networks. By strengthening information transfer between the encoder and the decoder, it facilitates the restoration of speech features during up-sampling. However, a simple skip connection that directly links corresponding layers of the encoder and decoder has several issues. First, it aggregates only features of the same scale, ignoring potential relationships between different scales. Second, shallow encoder features contain a great deal of redundant information; studies have shown that such coarse skip connections can even harm model performance in some cases. In this work, we propose a novel skip connection mechanism based on a channel-wise Transformer for speech enhancement, comprising two components: multi-scale channel-wise cross fusion and channel-wise cross attention. The proposed mechanism fuses multi-scale speech features from different encoder levels and effectively connects the reconstructed features to the decoder. Building on this, we propose a lightweight U-shaped network (UNet) called UCTNet. Experimental results show that UCTNet is comparable to other competitive models on various objective speech quality metrics while using only a few parameters.
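To make the channel-wise cross attention idea concrete, the following is a minimal NumPy sketch: attention weights are computed between channels (a decoder channel attends over encoder channels) rather than between time-frequency positions. All shapes, the scaling factor, and the absence of learned projections are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_cross_attention(dec_feat, enc_feat):
    """Channel-wise cross attention sketch.

    dec_feat: (C_d, N) decoder feature, N = flattened time-frequency bins
    enc_feat: (C_e, N) fused multi-scale encoder feature, same spatial size
    Returns a (C_d, N) channel-recalibrated feature to pass to the decoder.
    """
    _, n = dec_feat.shape
    # Channel affinity matrix (C_d, C_e): similarity between each decoder
    # channel and each encoder channel, computed over spatial positions.
    scores = dec_feat @ enc_feat.T / np.sqrt(n)
    weights = softmax(scores, axis=-1)  # each decoder channel attends over encoder channels
    return weights @ enc_feat           # weighted mix of encoder channels

rng = np.random.default_rng(0)
dec = rng.standard_normal((32, 64))  # 32 decoder channels, 64 spatial bins
enc = rng.standard_normal((48, 64))  # 48 fused encoder channels
out = channel_wise_cross_attention(dec, enc)
print(out.shape)  # (32, 64)
```

Because the attention matrix has shape (channels x channels) rather than (positions x positions), its cost is independent of the spatial size, which is consistent with the lightweight design the abstract claims.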
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 61861033, 62261040, 62162044), the Graduate Innovation Special Foundation of Jiangxi Province (No. YC2022-s731), the Natural Science Foundation of Jiangxi Province (No. 20202ACBL202007), the GanPo Talent Support Program of Jiangxi (No. 20232BCJ22050), and the Natural Science Foundation of Shandong Province (No. ZR2020MF020).
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, W., Sun, C., Chen, F. et al. A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement. Multimed Tools Appl 83, 34849–34866 (2024). https://doi.org/10.1007/s11042-023-16977-4