Abstract
In this chapter we first introduce the basic concepts of random variables and their associated distributions. These concepts are then applied to Gaussian random variables and mixture-of-Gaussian random variables. Both the scalar and the vector-valued cases are discussed, and the probability density functions of these random variables are given with their parameters specified. This introduction leads to the Gaussian mixture model (GMM), which arises when the distribution of mixture-of-Gaussian random variables is used to fit real-world data such as speech features. The GMM, as a statistical model for Fourier-spectrum-based speech features, plays an important role in the acoustic modeling of conventional speech recognition systems. We discuss some key advantages of GMMs in acoustic modeling, among them the ease with which they can be fitted to a wide range of speech features using the expectation-maximization (EM) algorithm. We describe in some detail the principle of maximum likelihood and the related EM algorithm for estimating the parameters of the GMM, as this is still a widely used method in speech recognition. Finally, we discuss a serious weakness of GMMs in acoustic modeling for speech recognition, which motivates the new models and methods that form the bulk of this book.
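As a rough illustration of the maximum-likelihood fitting mentioned above, the following is a minimal sketch of EM for a scalar (one-dimensional) GMM. It is not the chapter's own implementation. The function names (`gaussian_pdf`, `gmm_log_likelihood`, `em_step`, `fit_gmm`), the variance floor, the initial parameters, and the synthetic two-cluster data are all assumptions made for this example. Real speech features are vector-valued and would need the multivariate form.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under a scalar Gaussian mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration; returns updated (weights, means, variances)."""
    n_comp = len(weights)
    # E-step: posterior probability (responsibility) of each component
    # for each sample, given the current parameters.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from responsibility-weighted statistics.
    new_w, new_m, new_v = [], [], []
    for k in range(n_comp):
        n_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * x for r, x in zip(resp, data)) / n_k
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, data)) / n_k
        new_w.append(n_k / len(data))
        new_m.append(mean_k)
        # The floor (an assumption of this sketch) keeps a variance from
        # collapsing to zero.
        new_v.append(max(var_k, var_floor))
    return new_w, new_m, new_v


def fit_gmm(data, weights, means, variances, n_iter=50):
    """Run a fixed number of EM iterations from the given initial parameters."""
    for _ in range(n_iter):
        weights, means, variances = em_step(data, weights, means, variances)
    return weights, means, variances


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic two-cluster data standing in for real speech features.
    data = ([rng.gauss(-2.0, 0.5) for _ in range(300)]
            + [rng.gauss(3.0, 1.0) for _ in range(300)])
    print(fit_gmm(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

Each iteration should leave the log-likelihood no lower than before, which is the property that makes EM convenient for this model. The sketch omits safeguards a practical system would want, such as log-domain arithmetic to avoid underflow and a convergence test in place of a fixed iteration count.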
Copyright information
© 2015 Springer-Verlag London
About this chapter
Cite this chapter
Yu, D., Deng, L. (2015). Gaussian Mixture Models. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_2
DOI: https://doi.org/10.1007/978-1-4471-5779-3_2
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5778-6
Online ISBN: 978-1-4471-5779-3
eBook Packages: Engineering, Engineering (R0)