An Overview of Automatic Speech Recognition

Abstract

For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this chapter we review some of the key advances in several areas of automatic speech recognition. We also briefly discuss the requirements in designing successful real-world applications and address technical challenges that need to be faced in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.

Copyright information

© 1996 Kluwer Academic Publishers

About this chapter

Cite this chapter

Rabiner, L.R., Juang, BH., Lee, CH. (1996). An Overview of Automatic Speech Recognition. In: Lee, CH., Soong, F.K., Paliwal, K.K. (eds) Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science, vol 355. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1367-0_1

  • DOI: https://doi.org/10.1007/978-1-4613-1367-0_1

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4612-8590-8

  • Online ISBN: 978-1-4613-1367-0

  • eBook Packages: Springer Book Archive
