Abstract
Deep neural networks have become the prime approach for enhancing speech signals, as they yield better results than traditional methods. This paper describes the transformation of the enhanced speech signal obtained by applying a deep convolutional neural network (Deep CNN), which can model nonlinear relationships, and compares it with Wiener filtering, regarded as the best-performing speech enhancement technique among the traditional methods. Denoising is performed in the frequency domain, and the result is converted back to the time domain to analyze performance metrics for speech quality and speech intelligibility. Speech quality is assessed using the signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ); speech intelligibility is assessed using short-time objective intelligibility (STOI). The denoised speech from both methods was evaluated, and the analysis shows that the SNR of the conventional Wiener filtering method is much higher than that of the Deep CNN. However, the PESQ and STOI of the Deep CNN-enhanced speech outperform those of the Wiener filtering method. Overall, the performance metrics indicate that the Deep CNN achieves better results than the conventional technique.
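As a concrete illustration of the frequency-domain pipeline the abstract describes (filter each short-time spectrum, then reconstruct the time-domain signal and score it), here is a minimal NumPy sketch of Wiener-style filtering using a noise power-spectral-density estimate, together with the SNR metric. The frame length, hop size, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def wiener_denoise(noisy, noise_psd, frame=256, hop=128):
    """Frequency-domain Wiener-style filtering via windowed FFT
    and overlap-add reconstruction.

    noise_psd: per-bin noise power estimate, length frame//2 + 1
    (e.g. averaged from a noise-only recording)."""
    win = np.hanning(frame)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, hop):
        seg = noisy[start:start + frame] * win
        spec = np.fft.rfft(seg)
        psd = np.abs(spec) ** 2
        # Wiener gain: estimated clean power / noisy power per bin
        gain = np.maximum(psd - noise_psd, 0.0) / np.maximum(psd, 1e-12)
        rec = np.fft.irfft(gain * spec, n=frame)
        out[start:start + frame] += rec * win
        norm[start:start + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

def snr_db(clean, estimate):
    """SNR of an enhanced signal against the clean reference, in dB."""
    resid = clean - estimate
    return 10 * np.log10(np.sum(clean ** 2) /
                         np.maximum(np.sum(resid ** 2), 1e-12))
```

With a sinusoid buried in white noise and a noise PSD averaged from a separate noise-only stretch, the filtered output scores a noticeably higher SNR than the noisy input, which is the quality comparison the paper carries out (alongside PESQ and STOI) for both the Wiener filter and the Deep CNN.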
Funding
No funding.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights statement
No humans or animals were involved in this research work.
Data availability statement
The datasets analyzed during the current study are available from the University of Edinburgh Centre for Speech Technology Research (CSTR): https://datashare.is.ed.ac.uk/handle/10283/2791.
Additional information
Communicated by Joy Iong-Zong Chen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hepsiba, D., Justin, J. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN. Soft Comput 26, 13037–13047 (2022). https://doi.org/10.1007/s00500-021-06291-2