Abstract
For the past two decades, speech recognition has been researched intensively worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small-vocabulary keyword recognition over dial-up telephone lines, to medium-size-vocabulary voice-interactive command and control systems on personal computers, to large-vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this chapter we review some of the key advances in several areas of automatic speech recognition. We also briefly discuss the requirements for designing successful real-world applications and address the technical challenges that must be met to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
Copyright information
© 1996 Kluwer Academic Publishers
About this chapter
Cite this chapter
Rabiner, L.R., Juang, BH., Lee, CH. (1996). An Overview of Automatic Speech Recognition. In: Lee, CH., Soong, F.K., Paliwal, K.K. (eds) Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science, vol 355. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1367-0_1
DOI: https://doi.org/10.1007/978-1-4613-1367-0_1
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8590-8
Online ISBN: 978-1-4613-1367-0
eBook Packages: Springer Book Archive