Abstract
Automatic speech recognition (ASR) is generally analyzed for two types of word utterances: isolated words and continuous speech. Continuous speech is the most natural way of speaking, but it is difficult for machines (speech recognizers) to recognize and is highly sensitive to environmental variations. Several parameters directly affect ASR performance, such as the size of the dataset/corpus, the type of dataset (isolated, spontaneous, or continuous), and the environmental conditions (noisy or clean). Speech recognizers generally perform well on isolated words in clean environments, but performance degrades in noisy environments, especially for continuous words/sentences, and this remains a challenge. In this paper, a hybrid feature extraction technique is proposed that combines the core blocks of perceptual linear prediction (PLP) and Mel-frequency cepstral coefficients (MFCC) to improve the performance of speech recognizers under such conditions. A voice activity detection (VAD)-based frame-dropping scheme is used only in the training phase of the ASR procedure, obviating its need in actual deployments. The motivation for this scheme is the removal of pauses and distorted segments of speech, which further improves phoneme modeling. The proposed method shows an average performance improvement of 12.88% on standard datasets.
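To make the VAD-based frame-dropping idea concrete, the sketch below shows a minimal energy-threshold variant in pure Python. This is an illustrative assumption, not the paper's exact formula: frames whose short-time energy falls below a fraction of the peak energy (here `threshold_ratio`, a hypothetical parameter) are discarded before the feature vectors are passed to training. Frame length 400 samples and hop 160 samples correspond to the common 25 ms / 10 ms framing at 16 kHz.

```python
import math


def frame_energies(samples, frame_len=400, hop=160):
    """Short-time energy per frame (400-sample frames ~ 25 ms at 16 kHz)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def drop_silent_frames(frame_feats, energies, threshold_ratio=0.1):
    """Keep only feature vectors whose frame energy exceeds a fraction of
    the peak energy. threshold_ratio is an illustrative choice, not a
    value taken from the paper."""
    peak = max(energies) if energies else 0.0
    threshold = threshold_ratio * peak
    return [f for f, e in zip(frame_feats, energies) if e > threshold]


# Toy signal: 0.2 s of a 440 Hz tone followed by 0.2 s of silence at 16 kHz.
samples = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(3200)]
samples += [0.0] * 3200
energies = frame_energies(samples)
feats = list(range(len(energies)))  # stand-in for per-frame MFCC/PLP vectors
kept = drop_silent_frames(feats, energies)
```

Because the dropping happens only on the training side, the runtime recognizer needs no VAD component; the acoustic models simply never see the silent or corrupted frames during estimation.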
Acknowledgements
A special note of thanks is due to the ECE Department, NIT Kurukshetra, Haryana, India, for providing the required infrastructure and research environment.
Cite this article
Kumar, A., Mittal, V. Hindi speech recognition in noisy environment using hybrid technique. Int. j. inf. tecnol. 13, 483–492 (2021). https://doi.org/10.1007/s41870-020-00586-7