State of Research of Speech Recognition

Part of the Studies in Computational Intelligence book series (SCI, volume 550)


This chapter gives a brief overview, derived from a detailed survey, of speech recognition work reported by groups around the world over the last two decades. Robustness to language variation is a recent research trend in speech recognition technology: the foremost requirement is a system that can communicate with a human being in any language, as another human being would. Since speech recognition systems first became commercially available, the technology has been dominated by the Hidden Markov Model (HMM) methodology, owing to its capability to model the temporal structure of speech and encode it as a sequence of spectral vectors. Over the last 10 to 15 years, however, following the acceptance of neurocomputing as an alternative to HMM, methodologies based on the artificial neural network (ANN) have started to receive attention for application in speech recognition. This is a worldwide trend, and researchers have reported a few works as part of it. India has vast linguistic variation among its billion-plus population and therefore provides a sound area of research for language-specific speech recognition technology. This review also covers speech recognition work done specifically in certain Indian languages. Most of this work uses HMM technology, although a few Indian researchers have adopted ANN technology.
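To make the HMM idea above concrete, the following is a minimal sketch of the forward algorithm, which scores how likely a sequence of observations is under a given model. It is an illustration under stated assumptions, not a method taken from the chapter: the two-state left-to-right model, its probabilities, and the names `forward_likelihood`, `START`, `TRANS`, and `EMIT` are all hypothetical, and the spectral vectors are assumed to have been vector-quantized into discrete symbols (0 or 1), whereas practical recognizers typically use continuous densities.

```python
from itertools import product


def forward_likelihood(start, trans, emit, observations):
    """Return P(observations | model) for a discrete-output HMM.

    start[i]    : probability of beginning in state i
    trans[i][j] : probability of moving from state i to state j
    emit[j][k]  : probability that state j emits symbol k
    observations: non-empty list of symbol indices
    """
    n_states = len(start)
    # alpha[j] = P(o_1 .. o_t, state at time t = j)
    alpha = [start[j] * emit[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state left-to-right model (all values are illustrative).
START = [1.0, 0.0]                  # always begin in state 0
TRANS = [[0.7, 0.3], [0.0, 1.0]]    # state 1 never returns to state 0
EMIT = [[0.8, 0.2], [0.1, 0.9]]     # state 0 favors symbol 0, state 1 favors symbol 1

if __name__ == "__main__":
    matching = forward_likelihood(START, TRANS, EMIT, [0, 0, 1])
    mismatched = forward_likelihood(START, TRANS, EMIT, [1, 1, 0])
    print("matching sequence  :", matching)
    print("mismatched sequence:", mismatched)
    # Sanity check: likelihoods over all length-3 sequences sum to 1.
    total = sum(
        forward_likelihood(START, TRANS, EMIT, list(seq))
        for seq in product([0, 1], repeat=3)
    )
    print("sum over all length-3 sequences:", total)
```

In an isolated-word recognizer built this way, one such model would be trained per word, and an utterance would be assigned to the word whose model gives the highest likelihood.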


Automatic speech recognition, Hidden Markov model, Artificial neural network, Indian language



Copyright information

© Springer India 2014

Authors and Affiliations

  1. Department of Electronics and Communication Engineering, Gauhati University, Guwahati, India
  2. Department of Electronics and Communication Technology, Gauhati University, Guwahati, India