Abstract
A statistical generative model for the speech process is described that embeds a substantially richer structure than the hidden Markov model (HMM) currently in predominant use for automatic speech recognition. This switching dynamic-system model generalizes and integrates the HMM and the piecewise stationary nonlinear dynamic system (state-space) model. Depending on the level and the nature of the switching in the model design, various key properties of the speech dynamics can be naturally represented in the model. Such properties include the temporal structure of the speech acoustics, the articulatory movements that cause it, and the control of such movements by the multidimensional targets correlated with the phonological (symbolic) units of speech in terms of overlapping articulatory features.
One main challenge of using this multi-level switching dynamic-system model for successful speech recognition is that inference (decoding with confidence measure) on the posterior probabilities of the hidden states is computationally intractable. As a result, optimal parameter learning (training) is also computationally intractable. Several versions of Bayesian networks (BayesNets) have been devised, with detailed dependency implementation specified, to represent the switching dynamic-system model of speech. We discuss the variational technique developed for general Bayesian networks as an efficient approximate algorithm for the decoding and learning problems. Some common operations for estimating the switching times of phonological states are shared between the variational technique and the human auditory function that uses neural transient responses to detect temporal landmarks associated with phonological features. This suggests that variational-style learning may be related to human speech perception under an encoding-decoding theory of speech communication. That theory highlights the critical role of modeling articulatory dynamics for speech recognition and forms a main motivation for the switching dynamic-system model for speech articulation and acoustics described in this chapter.
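As a rough illustration of the kind of generative process the abstract describes, the following sketch samples from a toy one-dimensional switching dynamic system. It is not the chapter's model: the function name `simulate_switching_system`, the first-order target-directed recursion, the `tanh` observation mapping, and every parameter value are assumptions chosen only to show how a discrete switching state can select the target that drives a hidden continuous state, which is then mapped nonlinearly to an observation.

```python
import math
import random


def simulate_switching_system(num_frames, targets, stay_prob=0.9, phi=0.8,
                              state_noise=0.05, obs_noise=0.05, seed=0):
    """Sample from a toy switching dynamic system (all choices are illustrative).

    targets   : one scalar target per discrete regime (a stand-in for the
                multidimensional targets tied to phonological units)
    stay_prob : probability of keeping the current regime at each frame
    phi       : smoothing factor of the target-directed hidden dynamics
    """
    rng = random.Random(seed)
    num_regimes = len(targets)
    regime = 0
    x = targets[regime]  # hidden continuous state starts at the first target
    regimes, hidden, observed = [], [], []
    for _ in range(num_frames):
        # Discrete switching: occasionally jump to a different regime.
        if num_regimes > 1 and rng.random() > stay_prob:
            regime = rng.choice([r for r in range(num_regimes) if r != regime])
        # Hidden dynamics: move smoothly toward the current regime's target.
        x = phi * x + (1.0 - phi) * targets[regime] + rng.gauss(0.0, state_noise)
        # Observation: an assumed static nonlinearity plus noise.
        o = math.tanh(x) + rng.gauss(0.0, obs_noise)
        regimes.append(regime)
        hidden.append(x)
        observed.append(o)
    return regimes, hidden, observed


if __name__ == "__main__":
    regimes, hidden, observed = simulate_switching_system(20, [1.0, -1.0, 0.3])
    for s, x, o in zip(regimes, hidden, observed):
        print(s, round(x, 3), round(o, 3))
```

In this sketch only `observed` would be available to a recognizer. Recovering `regimes` and `hidden` from it is the inference problem that the abstract identifies as intractable in general and that motivates approximate methods such as the variational technique.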
The author thanks David Heckerman, Mari Ostendorf, Ken Stevens, B. Frey, H. Attias, G. Ramsay, J. Ma, L. Lee, Sam Roweis, and J. Bilmes for many useful discussions and for suggestions that improved the presentation of the paper.
© 2004 Springer Science+Business Media New York
Deng, L. (2004). Switching Dynamic System Models for Speech Articulation and Acoustics. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_6
DOI: https://doi.org/10.1007/978-1-4419-9017-4_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4
eBook Packages: Springer Book Archive