Abstract
This paper implements and evaluates a discriminatively trained continuous speech recognition system for Hindi. The automatic speech recognition (ASR) system is trained with two discriminative techniques, maximum mutual information and minimum phone error, using various numbers of Gaussian mixtures. The training data consist of transcribed Hindi speech. The experiments show a significant performance gain over a maximum likelihood-based Hindi speech recognition system. The system also uses efficient recurrent neural network (RNN)-based language modeling, and the results indicate that this enhances the performance of the ASR system. Further, interpolating an n-gram language model (LM) with the RNN language model (RNNLM) yields an additional increase in recognition performance. The proposed system also incorporates speaker adaptation using the maximum likelihood linear regression technique. The paper also gives an overview of the techniques used for discriminative training, along with practical issues involved in their implementation.
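To make the interpolation step concrete, the following is a minimal sketch of linear interpolation between an n-gram LM and an RNNLM. It assumes the common linear form, P(w | h) = lam * P_rnn(w | h) + (1 - lam) * P_ngram(w | h); the abstract does not state the exact interpolation scheme or weight, so the function names, the default weight of 0.5, and the example probabilities are all hypothetical.

```python
import math
from typing import Sequence


def interpolate(p_ngram: float, p_rnn: float, lam: float = 0.5) -> float:
    """Linearly interpolate two word probabilities.

    lam is the weight on the RNNLM probability (hypothetical default 0.5);
    the n-gram probability receives weight 1 - lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_rnn + (1.0 - lam) * p_ngram


def sentence_log_prob(
    ngram_probs: Sequence[float],
    rnn_probs: Sequence[float],
    lam: float = 0.5,
) -> float:
    """Sum the log of the interpolated per-word probabilities of a sentence."""
    if len(ngram_probs) != len(rnn_probs):
        raise ValueError("both models must score the same number of words")
    return sum(
        math.log(interpolate(p_n, p_r, lam))
        for p_n, p_r in zip(ngram_probs, rnn_probs)
    )


if __name__ == "__main__":
    # Made-up per-word probabilities for a three-word hypothesis.
    ngram = [0.20, 0.05, 0.10]
    rnn = [0.40, 0.15, 0.10]
    print(sentence_log_prob(ngram, rnn, lam=0.5))
```

In practice the weight would normally be tuned on held-out data rather than fixed, and the interpolated score would be used when rescoring recognition hypotheses.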
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Dua, M., Aggarwal, R.K. & Biswas, M. Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Comput & Applic 31, 6747–6755 (2019). https://doi.org/10.1007/s00521-018-3499-9
DOI: https://doi.org/10.1007/s00521-018-3499-9