Gaussian Mixture Models

Automatic Speech Recognition

Part of the book series: Signals and Communication Technology ((SCT))

Abstract

In this chapter we first introduce the basic concepts of random variables and their associated distributions. These concepts are then applied to Gaussian random variables and mixture-of-Gaussian random variables. Both the scalar and vector-valued cases are discussed, and the probability density functions of these random variables are given with their parameters specified. This introduction leads to the Gaussian mixture model (GMM), which arises when the distribution of mixture-of-Gaussian random variables is used to fit real-world data such as speech features. The GMM, as a statistical model for Fourier-spectrum-based speech features, plays an important role in the acoustic modeling of conventional speech recognition systems. We discuss some key advantages of GMMs in acoustic modeling, among them the ease with which they can be fitted to a wide range of speech features using the expectation-maximization (EM) algorithm. We describe the principle of maximum likelihood and the related EM algorithm for GMM parameter estimation in some detail, as it is still a widely used method in speech recognition. Finally, we discuss a serious weakness of GMMs in acoustic modeling for speech recognition, which motivates the new models and methods that form the bulk of this book.
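As a rough illustration of the maximum-likelihood fitting mentioned in the abstract, the following is a minimal sketch of EM for a scalar (one-dimensional) GMM. It is not taken from the chapter: the function names (`gaussian_pdf`, `em_step`, `fit_gmm`), the synthetic two-component data, the initial means, and the variance floor are all assumptions made for this example, and real speech features would be vector-valued.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a scalar GMM; returns the updated parameters."""
    k = len(weights)
    # E-step: posterior probability (responsibility) of each component per point.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from responsibility-weighted statistics.
    new_w, new_m, new_v = [], [], []
    for c in range(k):
        n_c = sum(r[c] for r in resp)
        mean_c = sum(r[c] * x for r, x in zip(resp, data)) / n_c
        var_c = sum(r[c] * (x - mean_c) ** 2 for r, x in zip(resp, data)) / n_c
        new_w.append(n_c / len(data))
        new_m.append(mean_c)
        # Assumed safeguard: floor the variance so a component cannot collapse.
        new_v.append(max(var_c, var_floor))
    return new_w, new_m, new_v


def fit_gmm(data, init_means, n_iter=50):
    """Run a fixed number of EM iterations from uniform weights and unit variances."""
    k = len(init_means)
    weights, means, variances = [1.0 / k] * k, list(init_means), [1.0] * k
    history = []  # log-likelihood after each iteration
    for _ in range(n_iter):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic data (an assumption for this sketch): two well-separated Gaussians.
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(300)]
        + [rng.gauss(3.0, 1.0) for _ in range(300)])
weights, means, variances, history = fit_gmm(data, init_means=[-1.0, 1.0])
```

The recorded `history` makes the defining property of EM visible: as long as the variance floor is not triggered, the log-likelihood does not decrease from one iteration to the next. A fixed iteration count is used here for simplicity; a convergence threshold on the change in log-likelihood would be a common alternative.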

Notes

  1. Detailed derivations of these formulae, which we omit here, can be found in [1]. Related derivations for similar but more general models can be found in [2, 3, 6, 15, 18].

References

  1. Bilmes, J.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report, TR-97-021, ICSI (1997)

  2. Bilmes, J.: What HMMs can do. IEICE Trans. Inf. Syst. E89-D(3), 869–891 (2006)

  3. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)

  4. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)

  5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. Ser. B 39, 1–38 (1977)

  6. Deng, L.: A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal. Signal Process. 27(1), 65–78 (1992)

  7. Deng, L.: Computational models for speech production. In: Computational Models of Speech Pattern Processing, pp. 199–213. Springer, New York (1999)

  8. Deng, L.: Switching dynamic system models for speech articulation and acoustics. In: Mathematical Foundations of Speech and Language Processing, pp. 115–134. Springer, New York (2003)

  9. Deng, L.: Dynamic Speech Models—Theory, Algorithm, and Applications. Morgan and Claypool, New York (2006)

  10. Deng, L., Acero, A., Plumpe, M., Huang, X.: Large vocabulary speech recognition under adverse acoustic environment. In: Proceedings of International Conference on Spoken Language Processing (ICSLP), pp. 806–809 (2000)

  11. Deng, L., Droppo, J., Acero, A.: Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. IEEE Trans. Speech Audio Process. 11, 568–580 (2003)

  12. Deng, L., Droppo, J., Acero, A.: A Bayesian approach to speech feature enhancement using the dynamic cepstral prior. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-829–I-832 (2002)

  13. Deng, L., Droppo, J., Acero, A.: Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise. IEEE Trans. Speech Audio Process. 12(2), 133–143 (2004)

  14. Deng, L., Kenny, P., Lennig, M., Gupta, V., Seitz, F., Mermelstein, P.: Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition. IEEE Trans. Acoust. Speech Signal Process. 39(7), 1677–1681 (1991)

  15. Deng, L., Mark, J.: Parameter estimation for Markov modulated Poisson processes via the EM algorithm with time discretization. In: Telecommunication Systems (1993)

  16. Deng, L., O’Shaughnessy, D.: Speech Processing—A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc, New York (2003)

  17. Deng, L., Ramsay, G., Sun, D.: Production models as a structural basis for automatic speech recognition. Speech Commun. 33(2–3), 93–111 (1997)

  18. Deng, L., Rathinavelu, C.: A Markov model containing state-conditioned second-order non-stationarity: application to speech recognition. Comput. Speech Lang. 9(1), 63–86 (1995)

  19. Deng, L., Wang, K., Acero, A., Hon, H., Droppo, J., Boulis, C., Wang, Y., Jacoby, D., Mahajan, M., Chelba, C., Huang, X.: Distributed speech processing in MiPad’s multimodal user interface. IEEE Trans. Audio Speech Lang. Process. 20(9), 2409–2419 (2012)

  20. Divenyi, P., Greenberg, S., Meyer, G.: Dynamics of Speech Production and Perception. IOS Press, Washington (2006)

  21. Frey, B., Deng, L., Acero, A., Kristjansson, T.: Algonquin: iterating Laplace’s method to remove multiple types of acoustic distortion for robust speech recognition. In: Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH) (2000)

  22. He, X., Deng, L.: Discriminative Learning for Speech Recognition: Theory and Practice. Morgan and Claypool, New York (2008)

  23. Huang, X., Acero, A., Hon, H.W., et al.: Spoken Language Processing. Prentice Hall, Englewood Cliffs (2001)

  24. Huang, X., Deng, L.: An overview of modern speech recognition. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn. CRC Press, Taylor and Francis Group, Boca Raton, FL (2010). ISBN 978-1420085921

  25. Jiang, H., Li, X.: Discriminative learning in sequential pattern recognition—a unifying review for optimization-oriented speech recognition. IEEE Signal Process. Mag. 27(3), 115–127 (2010)

  26. Jiang, H., Li, X., Liu, C.: Large margin hidden Markov models for speech recognition. IEEE Trans. Audio Speech Lang. Process. 14(5), 1584–1595 (2006)

  27. Juang, B.H., Levinson, S.E., Sondhi, M.M.: Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains. In: IEEE International Symposium on Information Theory, vol. 32(2), pp. 307–309 (1986)

  28. Kenny, P.: Joint factor analysis of speaker and session variability: theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13 (2005)

  29. King, S., Frankel, J., Livescu, K., McDermott, E., Richmond, K., Wester, M.: Speech production knowledge in automatic speech recognition. J. Acoust. Soc. Am. 121, 723–742 (2007)

  30. Lee, L.J., Fieguth, P., Deng, L.: A functional articulatory dynamic model for speech production. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 797–800. Salt Lake City (2001)

  31. Rasmussen, C.E.: The infinite Gaussian mixture model. In: Proceedings of Neural Information Processing Systems (NIPS) (1999)

  32. Reynolds, D., Rose, R.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)

  33. Xiao, L., Deng, L.: A geometric perspective of large-margin training of Gaussian models. IEEE Signal Process. Mag. 27, 118–123 (2010)

  34. Yin, S.C., Rose, R., Kenny, P.: A joint factor analysis approach to progressive model adaptation in text-independent speaker verification. IEEE Trans. Audio Speech Lang. Process. 15(7), 1999–2010 (2007)

Author information

Corresponding author

Correspondence to Dong Yu.

Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Yu, D., Deng, L. (2015). Gaussian Mixture Models. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_2

  • DOI: https://doi.org/10.1007/978-1-4471-5779-3_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5778-6

  • Online ISBN: 978-1-4471-5779-3

  • eBook Packages: Engineering, Engineering (R0)
