Switching Dynamic System Models for Speech Articulation and Acoustics

  • Conference paper
Mathematical Foundations of Speech and Language Processing

Part of the book series: The IMA Volumes in Mathematics and its Applications (IMA, volume 138)

Abstract

A statistical generative model for the speech process is described that embeds a substantially richer structure than the HMM currently in predominant use for automatic speech recognition. This switching dynamic-system model generalizes and integrates the HMM and the piecewise stationary nonlinear dynamic system (state-space) model. Depending on the level and the nature of the switching in the model design, various key properties of the speech dynamics can be naturally represented in the model. Such properties include the temporal structure of the speech acoustics, its causal articulatory movements, and the control of such movements by the multidimensional targets correlated with the phonological (symbolic) units of speech in terms of overlapping articulatory features.
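As a rough illustration of the generative structure described above, the following is a minimal sketch, not the chapter's actual model. It assumes a discrete Markov chain over phonological states, a first-order target-directed hidden state, and a `tanh` observation map; the function name `simulate_switching_system`, the parameters `phi`, `targets`, and `obs_weights`, and all numeric values are hypothetical choices made for this example only.

```python
import numpy as np


def simulate_switching_system(n_frames, trans, targets, phi, obs_weights,
                              state_noise=0.01, obs_noise=0.01, seed=0):
    """Sample from a toy switching state-space model (illustrative assumptions only).

    s[t]: discrete state following a Markov chain with transition matrix `trans`.
    x[t]: continuous hidden state pulled toward the target of the current s[t].
    y[t]: observation, a nonlinear (tanh) function of x[t] plus noise.
    """
    rng = np.random.default_rng(seed)
    n_states, dim = targets.shape
    obs_dim = obs_weights.shape[0]
    s = np.zeros(n_frames, dtype=int)
    x = np.zeros((n_frames, dim))
    y = np.zeros((n_frames, obs_dim))
    x_prev = targets[0].copy()  # assumption: start at the target of state 0
    for t in range(n_frames):
        # Discrete switching: state 0 at t = 0, then a Markov transition.
        s[t] = 0 if t == 0 else rng.choice(n_states, p=trans[s[t - 1]])
        # Target-directed dynamics: the regime changes whenever s[t] changes.
        x[t] = (phi * x_prev + (1.0 - phi) * targets[s[t]]
                + state_noise * rng.standard_normal(dim))
        # Static nonlinear mapping from hidden state to observation.
        y[t] = np.tanh(obs_weights @ x[t]) + obs_noise * rng.standard_normal(obs_dim)
        x_prev = x[t]
    return s, x, y


# Hypothetical two-state example with a 2-D hidden state and 3-D observations.
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
targets = np.array([[1.0, -1.0],
                    [-0.5, 0.8]])
obs_weights = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.5, 0.5]])
s, x, y = simulate_switching_system(200, trans, targets, phi=0.9,
                                    obs_weights=obs_weights)
```

With the switching removed (a single state), the sketch reduces to an ordinary state-space model; with the continuous dynamics removed (`phi = 0` and no state noise), the observations depend only on the current discrete state, as in an HMM.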

One main challenge in using this multi-level switching dynamic-system model for speech recognition is that inference (decoding with a confidence measure) on the posterior probabilities of the hidden states is computationally intractable, which in turn makes optimal parameter learning (training) intractable. Several versions of Bayesian networks (BayesNets) have been devised, with detailed dependency implementations specified, to represent the switching dynamic-system model of speech. We discuss the variational technique developed for general Bayesian networks as an efficient approximate algorithm for the decoding and learning problems. Some common operations for estimating the switching times of phonological states are shared between the variational technique and the human auditory function that uses neural transient responses to detect temporal landmarks associated with phonological features. This suggests that variational-style learning may be related to human speech perception under an encoding-decoding theory of speech communication, which highlights the critical role of modeling articulatory dynamics for speech recognition and which forms a main motivation for the switching dynamic-system model for speech articulation and acoustics described in this chapter.
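For orientation, the general shape of a variational approximation for such a model can be written as a lower bound on the log-likelihood. This is the generic textbook form, not necessarily the exact formulation used in the chapter. The symbols are assumptions introduced here: $y_{1:T}$ for the acoustic observations, $s_{1:T}$ for the discrete phonological states, $x_{1:T}$ for the continuous hidden states, and $q$ for a factorized approximating distribution.

```latex
\log p(y_{1:T}) \;\ge\;
  \mathbb{E}_{q}\!\left[\log p(y_{1:T}, s_{1:T}, x_{1:T})\right]
  - \mathbb{E}_{q}\!\left[\log q(s_{1:T}, x_{1:T})\right],
\qquad
q(s_{1:T}, x_{1:T}) = q_s(s_{1:T})\, q_x(x_{1:T}).
```

Under this factorization, maximizing the bound alternates between updating $q_s$ (an HMM-like pass over the discrete states) and updating $q_x$ (a smoothing pass over the continuous states), which avoids summing over all exponentially many discrete state sequences.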

The author wishes to thank David Heckerman, Mari Ostendorf, Ken Stevens, B. Frey, H. Attias, G. Ramsay, J. Ma, L. Lee, Sam Roweis, and J. Bilmes for many useful discussions and for suggestions on improving the presentation of the paper.

Copyright information

© 2004 Springer Science+Business Media New York

About this paper

Cite this paper

Deng, L. (2004). Switching Dynamic System Models for Speech Articulation and Acoustics. In: Johnson, M., Khudanpur, S.P., Ostendorf, M., Rosenfeld, R. (eds) Mathematical Foundations of Speech and Language Processing. The IMA Volumes in Mathematics and its Applications, vol 138. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9017-4_6

  • DOI: https://doi.org/10.1007/978-1-4419-9017-4_6

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4612-6484-2

  • Online ISBN: 978-1-4419-9017-4

  • eBook Packages: Springer Book Archive
