International Journal of Speech Technology, Volume 7, Issue 1, pp 55–68

An Acoustic-Phonetic and a Model-Theoretic Analysis of Subspace Distribution Clustering Hidden Markov Models

  • Brian Mak

Abstract

Recently, we proposed a new derivative of conventional continuous density hidden Markov modeling (CDHMM) that we call “subspace distribution clustering hidden Markov modeling” (SDCHMM). SDCHMMs can be created by tying low-dimensional subspace Gaussians in CDHMMs. In the tasks we tried, usually only 32–256 subspace Gaussian prototypes were needed in an SDCHMM-based system to maintain the recognition performance of its original CDHMM-based system, a reduction of Gaussian parameters by one to three orders of magnitude. Consequently, both recognition time and memory were greatly reduced. We have also shown that if the underlying subspace distribution tying structure is known, it may be used to train an SDCHMM-based system from scratch with as little as eight minutes of speech. All these results suggest that there is substantial redundancy in conventional CDHMMs and that the SDCHMM is a more compact model. In this paper, we analyze the tying structure from two perspectives: the acoustic-phonetic perspective, showing that the tying structure seems to capture prominent relationships among phones; and the model-theoretic perspective, showing that SDCHMMs, if properly created from CDHMMs, may be preferred over the latter because they are less complex and have the potential for greater generalization power.
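
To make the parameter reduction concrete, the following is a minimal sketch of how Gaussian parameters might be counted before and after subspace tying. It assumes diagonal-covariance Gaussians (one mean and one variance per dimension) and the same number of prototypes in every subspace; the function names and all the sizes used (39-dimensional features, 13 subspaces, 10,000 Gaussians, 128 prototypes) are hypothetical illustrations, not figures reported in the paper.

```python
from typing import Sequence


def cdhmm_gaussian_params(num_gaussians: int, dim: int) -> int:
    """Mean + variance per dimension for every full-space Gaussian
    (assumes diagonal covariances)."""
    return num_gaussians * 2 * dim


def sdchmm_gaussian_params(num_prototypes: int, subspace_dims: Sequence[int]) -> int:
    """Each subspace keeps only `num_prototypes` tied subspace Gaussians,
    each with a mean and a variance per subspace dimension."""
    return sum(num_prototypes * 2 * d for d in subspace_dims)


def sdchmm_tying_indices(num_gaussians: int, num_subspaces: int) -> int:
    """Every original Gaussian stores one prototype index per subspace.
    These indices are the tying structure, not Gaussian parameters."""
    return num_gaussians * num_subspaces


# Hypothetical sizes chosen only for illustration.
dim = 39
subspace_dims = [3] * 13          # 13 subspaces of 3 dimensions each
num_gaussians = 10_000
num_prototypes = 128

cdhmm = cdhmm_gaussian_params(num_gaussians, dim)             # 780000
sdchmm = sdchmm_gaussian_params(num_prototypes, subspace_dims)  # 9984
indices = sdchmm_tying_indices(num_gaussians, len(subspace_dims))

print(f"CDHMM Gaussian parameters:  {cdhmm}")
print(f"SDCHMM Gaussian parameters: {sdchmm}")
print(f"Reduction factor:           {cdhmm / sdchmm:.1f}x")
print(f"Tying indices to store:     {indices}")
```

Under these assumptions the reduction factor is simply the ratio of the number of original Gaussians to the number of prototypes per subspace, which is why a few hundred prototypes can stand in for thousands of full-space Gaussians. The tying indices still have to be stored, but they are small integers rather than floating-point parameters.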

Keywords: distribution clustering; parameter tying; model complexity; Bayesian information criterion

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Brian Mak (1)
  1. Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong