Abstract
The skip connection has proven to be an effective mechanism for improving speech enhancement networks. By strengthening information transfer between the encoder and the decoder, it facilitates the restoration of speech features during up-sampling. However, a simple skip connection that directly links corresponding layers of the encoder and decoder has several issues. First, it aggregates only features of the same scale, ignoring potential relationships between different scales. Second, shallow encoder features contain a great deal of redundant information; studies have shown that such coarse skip connections can even harm model performance in some cases. In this work, we propose a novel skip connection mechanism based on a channel-wise Transformer for speech enhancement, comprising two components: multi-scale channel-wise cross fusion and channel-wise cross attention. The proposed mechanism fuses multi-scale speech features from different encoder levels and effectively connects the reconstructed features to the decoder. Building on this, we propose a lightweight U-shaped network (UNet) called UCTNet. Experimental results show that UCTNet is comparable to other competitive models on various objective speech quality metrics while using only a few parameters.
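To make the channel-wise cross attention idea concrete, the following is a minimal NumPy sketch: attention weights are computed between channels (a decoder channel attends over encoder channels) rather than between time-frequency positions. All shapes, the scaling factor, and the absence of learned projections are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_cross_attention(dec_feat, enc_feat):
    """Channel-wise cross attention sketch.

    dec_feat: (C_d, N) decoder feature, N = flattened time-frequency bins
    enc_feat: (C_e, N) fused multi-scale encoder feature, same spatial size
    Returns a (C_d, N) channel-recalibrated feature to pass to the decoder.
    """
    _, n = dec_feat.shape
    # Channel affinity matrix (C_d, C_e): similarity between each decoder
    # channel and each encoder channel, computed over spatial positions.
    scores = dec_feat @ enc_feat.T / np.sqrt(n)
    weights = softmax(scores, axis=-1)  # each decoder channel attends over encoder channels
    return weights @ enc_feat           # weighted mix of encoder channels

rng = np.random.default_rng(0)
dec = rng.standard_normal((32, 64))  # 32 decoder channels, 64 spatial bins
enc = rng.standard_normal((48, 64))  # 48 fused encoder channels
out = channel_wise_cross_attention(dec, enc)
print(out.shape)  # (32, 64)
```

Because the attention matrix has shape (channels x channels) rather than (positions x positions), its cost is independent of the spatial size, which is consistent with the lightweight design the abstract claims.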
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 61861033, 62261040, 62162044), the Graduate Innovation Special Foundation of Jiangxi Province (No. YC2022-s731), the Natural Science Foundation of Jiangxi Province (No. 20202ACBL202007), the GanPo Talent Support Program of Jiangxi (No. 20232BCJ22050), and the Natural Science Foundation of Shandong Province (No. ZR2020MF020).
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, W., Sun, C., Chen, F. et al. A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement. Multimed Tools Appl 83, 34849–34866 (2024). https://doi.org/10.1007/s11042-023-16977-4