Switching Dynamic System Models for Speech Articulation and Acoustics

Deng, Li

doi:10.1007/978-1-4419-9017-4_6

Part of the book series: The IMA Volumes in Mathematics and its Applications (IMA, volume 138)

Abstract

A statistical generative model for the speech process is described that embeds a substantially richer structure than the HMM currently in predominant use for automatic speech recognition. This switching dynamic-system model generalizes and integrates the HMM and the piecewise stationary nonlinear dynamic system (state-space) model. Depending on the level and the nature of the switching in the model design, various key properties of the speech dynamics can be naturally represented in the model. Such properties include the temporal structure of the speech acoustics, its causal articulatory movements, and the control of such movements by the multidimensional targets correlated with the phonological (symbolic) units of speech in terms of overlapping articulatory features.
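
As a rough illustration of what such a generative structure can look like, the sketch below simulates a toy switching dynamic system: a discrete Markov chain of regimes (the HMM layer) selects a target and a time constant for a continuous hidden state, which is then mapped to noisy observations. Everything here is an assumption made for illustration, not the chapter's actual model: the function name `simulate_switching_lds`, the three regimes, the two-dimensional state, the linear target-directed state update, the `tanh` observation map, and all numeric values.

```python
import numpy as np

def simulate_switching_lds(T=200, seed=0):
    """Simulate a toy switching dynamic system (all values are illustrative)."""
    rng = np.random.default_rng(seed)
    # Discrete regime chain (HMM layer): 3 hypothetical phonological units.
    trans = np.array([[0.95, 0.05, 0.00],
                      [0.00, 0.95, 0.05],
                      [0.05, 0.00, 0.95]])
    # Regime-dependent targets for a 2-dimensional hidden state.
    targets = np.array([[1.0, -1.0], [-0.5, 0.8], [0.2, 0.2]])
    # Regime-dependent time constants (closer to 1 means slower movement).
    phi = np.array([0.90, 0.85, 0.80])

    s = np.zeros(T, dtype=int)   # regime sequence, starts in regime 0
    x = np.zeros((T, 2))         # hidden continuous state
    o = np.zeros((T, 2))         # observations
    x_prev = np.zeros(2)
    for t in range(T):
        if t > 0:
            # Sample the next regime from the row of the current regime.
            s[t] = rng.choice(3, p=trans[s[t - 1]])
        k = s[t]
        # Target-directed dynamics: the state relaxes toward targets[k].
        x[t] = (phi[k] * x_prev + (1.0 - phi[k]) * targets[k]
                + 0.01 * rng.standard_normal(2))
        # Static nonlinear observation map (tanh is only a placeholder).
        o[t] = np.tanh(x[t]) + 0.05 * rng.standard_normal(2)
        x_prev = x[t]
    return s, x, o
```

Calling `s, x, o = simulate_switching_lds(T=100)` returns the regime sequence, the hidden trajectory, and the observations. Because the continuous state carries over across regime switches, the trajectory is smooth at segment boundaries, which is one simple way to picture overlapping, coarticulated movements.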

One main challenge in using this multi-level switching dynamic-system model for successful speech recognition is that inference (decoding with a confidence measure) on the posterior probabilities of the hidden states is computationally intractable. Optimal parameter learning (training) is therefore computationally intractable as well. Several versions of Bayesian networks (BayesNets) have been devised, with detailed dependency implementations specified, to represent the switching dynamic-system model of speech. We discuss the variational technique developed for general Bayesian networks as an efficient approximate algorithm for the decoding and learning problems. The variational technique shares some common operations for estimating the switching times of phonological states with human auditory function, which uses neural transient responses to detect temporal landmarks associated with phonological features. This suggests that variational-style learning may be related to human speech perception under an encoding-decoding theory of speech communication, which highlights the critical role of modeling articulatory dynamics for speech recognition and which forms a main motivation for the switching dynamic-system model for speech articulation and acoustics described in this chapter.
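
To make the intractability concrete, here is a generic worked statement in notation assumed for illustration (it is not taken from the chapter): discrete regimes $s_{1:T}$ with $M$ values per frame, continuous hidden states $x_{1:T}$, and observations $o_{1:T}$. Exact inference must in principle account for all $M^T$ regime sequences. A variational method instead chooses a tractable distribution $q$ and maximizes the standard lower bound on the log-likelihood, shown here with a fully factorized $q$ as one common choice:

```latex
\log p(o_{1:T}) \;\ge\;
\mathbb{E}_{q}\!\left[\log p(o_{1:T}, s_{1:T}, x_{1:T})\right]
- \mathbb{E}_{q}\!\left[\log q(s_{1:T}, x_{1:T})\right],
\qquad
q(s_{1:T}, x_{1:T}) = q(s_{1:T})\, q(x_{1:T}).
```

The gap in this bound is the Kullback-Leibler divergence from $q$ to the true posterior $p(s_{1:T}, x_{1:T} \mid o_{1:T})$, so tightening the bound also yields an approximate posterior for decoding. The specific factorization used in the chapter may differ from the one assumed here.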

The author thanks David Heckerman, Mari Ostendorf, Ken Stevens, B. Frey, H. Attias, G. Ramsay, J. Ma, L. Lee, Sam Roweis, and J. Bilmes for many useful discussions and for suggestions that improved the presentation of the paper.

Author information

Authors and Affiliations

Microsoft Research, One Microsoft Way, Redmond, WA, 98052, USA
Li Deng

Editor information

Editors and Affiliations

Dept. of Cognitive and Linguistic Studies, Brown University, Providence, RI, 02912, USA
Mark Johnson
Dept. of ECE and Dept. of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
Sanjeev P. Khudanpur
Dept. of Electrical Engineering, University of Washington, Seattle, WA, 98195, USA
Mari Ostendorf
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Roni Rosenfeld

Cite this paper

Deng, L. (2004). Switching Dynamic System Models for Speech Articulation and Acoustics. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_6

DOI: https://doi.org/10.1007/978-1-4419-9017-4_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-6484-2
Online ISBN: 978-1-4419-9017-4
eBook Packages: Springer Book Archive
