Neural Computing and Applications, Volume 31, Issue 10, pp 6747–6755

Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling

  • Mohit Dua
  • R. K. Aggarwal
  • Mantosh Biswas
Original Article

Abstract

This paper implements and evaluates a discriminatively trained continuous speech recognition system for Hindi. The automatic speech recognition (ASR) system is trained with two discriminative techniques, maximum mutual information and minimum phone error, using various numbers of Gaussian mixtures. The training dataset consists of Hindi speech transcription. The experiments show a significant performance gain over a maximum likelihood-based Hindi speech recognition system. The system uses efficient recurrent neural network (RNN)-based language modeling, and the results indicate that this enhances the performance of the ASR system. Further, interpolating an n-gram language model (LM) with the RNN LM (RNNLM) yields an additional increase in recognition performance. The proposed system also introduces speaker adaptation using the maximum likelihood linear regression technique. Finally, the paper gives an overview of the techniques used for discriminative training, along with the practical issues involved in implementing them.
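
As an illustration of the interpolation idea mentioned in the abstract, the sketch below mixes an n-gram probability with an RNNLM probability for the same word and history. It assumes plain linear interpolation with a single weight, which is a common choice but is not confirmed by the abstract as the authors' exact scheme. The function names, the weight `lam`, and all probability values are hypothetical and chosen only for illustration.

```python
import math


def interpolate(p_ngram: float, p_rnn: float, lam: float = 0.5) -> float:
    """Linearly interpolate two LM probabilities for the same word and history.

    lam is the weight on the RNNLM probability (assumed, not from the paper).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_rnn + (1.0 - lam) * p_ngram


def perplexity(probs) -> float:
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical per-word probabilities from each model for a 3-word utterance.
ngram_probs = [0.10, 0.02, 0.30]
rnn_probs = [0.20, 0.05, 0.10]

# Interpolated per-word probabilities with equal weight on both models.
mixed = [interpolate(pn, pr, lam=0.5) for pn, pr in zip(ngram_probs, rnn_probs)]

print("n-gram perplexity:      ", perplexity(ngram_probs))
print("RNNLM perplexity:       ", perplexity(rnn_probs))
print("interpolated perplexity:", perplexity(mixed))
```

In practice the weight would be tuned on held-out data, for example by picking the value that minimises perplexity on a development set.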

Keywords

Automatic speech recognition; Language modeling; Minimum phone error; Maximum mutual information; Recurrent neural network

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© The Natural Computing Applications Forum 2018

Authors and Affiliations

  1. Department of Computer Engineering, National Institute of Technology, Kurukshetra, Kurukshetra, India
