
Effects of Skip Connections in CNN-Based Architectures for Speech Enhancement

Published in: Journal of Signal Processing Systems

Abstract

Eliminating the negative effects of adverse environmental noise has long been an intriguing and challenging task for speech technology. Denoising techniques based on neural networks (NNs) have achieved favorable performance in recent years. In particular, adding skip connections to NNs has been shown to significantly improve the performance of NN-based speech enhancement systems. In most studies, however, skip connections have been added as a trick of the trade, without sufficient quantitative or qualitative analysis of the underlying principle. This paper presents a convolutional neural network (CNN) denoising architecture with skip connections for speech enhancement. In particular, to investigate the inherent mechanism by which NNs with skip connections learn noise properties, CNNs with different skip connection schemes are constructed, and a set of denoising experiments on statistically different noises is presented to evaluate the architectures. Results show that CNNs with skip connections denoise better than the baseline, a basic CNN, for both stationary and nonstationary noises. Moreover, the benefit of adding more sophisticated skip connections is greater for nonstationary noises than for stationary ones, which implies that complex noise properties can be learned by a CNN with more skip connections.
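To make the idea concrete, the following is a minimal, hypothetical sketch (in PyTorch) of a 1-D convolutional encoder-decoder with two common skip connection schemes: a symmetric encoder-to-decoder skip and an input-to-output residual skip. The layer counts, channel widths, and kernel sizes are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class SkipCNN(nn.Module):
    """Toy convolutional denoiser with two skip connection schemes."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # "Same" padding so every feature map keeps the input length.
        self.enc1 = nn.Conv1d(1, channels, kernel_size=5, padding=2)
        self.enc2 = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.dec1 = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.dec2 = nn.Conv1d(channels, 1, kernel_size=5, padding=2)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.act(self.enc1(x))
        e2 = self.act(self.enc2(e1))
        # Symmetric skip: reuse the matching encoder feature map.
        d1 = self.act(self.dec1(e2) + e1)
        # Residual skip: the network learns the correction to the noisy input.
        return self.dec2(d1) + x


model = SkipCNN()
noisy = torch.randn(1, 1, 160)  # (batch, channel, samples)
enhanced = model(noisy)         # same shape as the input
```

Both skips preserve tensor shapes by construction, which is why the additions are valid; the residual skip in particular means the convolutional stack only has to model the noise component rather than reconstruct the clean speech from scratch.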



Acknowledgements

This work is jointly supported by Guangdong Key R&D Project (Grant No. 2018B030338001), NSF of China (Grant No. 61771320) and Shenzhen Science & Innovation Funds (Grant No. JCYJ 20170302145906843).

Author information


Corresponding author

Correspondence to Yupeng Shi.


Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zheng, N., Shi, Y., Rong, W. et al. Effects of Skip Connections in CNN-Based Architectures for Speech Enhancement. J Sign Process Syst 92, 875–884 (2020). https://doi.org/10.1007/s11265-020-01518-1

