Statistics and Computing, Volume 10, Issue 1, pp 63–72

Model selection for probabilistic clustering using cross-validated likelihood

  • Padhraic Smyth

Abstract

Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, penalized likelihood, and McLachlan's bootstrap method are applied to two data sets, and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.
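As an illustration of the general idea described in the abstract, the following is a minimal sketch, not the paper's implementation. It assumes one-dimensional synthetic data, a plain EM fit of a Gaussian mixture, and repeated random train/test splits; the split scheme, the 50% test fraction, and the function names (`fit_gmm_1d`, `cv_log_likelihood`) are all illustrative assumptions. Each candidate number of components is scored by its average held-out log-likelihood per point.

```python
import numpy as np

def log_joint(x, weights, means, variances):
    """Return log(w_j) + log N(x_i | mean_j, var_j) as an (n, k) array."""
    return (np.log(weights)
            - 0.5 * np.log(2.0 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances)

def log_mixture_density(x, weights, means, variances):
    """Per-point log density of the mixture, computed with log-sum-exp."""
    lj = log_joint(x, weights, means, variances)
    m = lj.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))

def fit_gmm_1d(x, k, n_iter=100, seed=0):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # initialise at data points
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every point.
        lj = log_joint(x, weights, means, variances)
        log_px = log_mixture_density(x, weights, means, variances)
        resp = np.exp(lj - log_px[:, None])
        # M-step: re-estimate parameters; small constants guard against
        # empty components and collapsing variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

def cv_log_likelihood(x, k, n_splits=20, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood per point over random splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(x)))
    scores = []
    for s in range(n_splits):
        perm = rng.permutation(len(x))
        test, train = x[perm[:n_test]], x[perm[n_test:]]
        params = fit_gmm_1d(train, k, seed=s)       # fit on training part only
        scores.append(log_mixture_density(test, *params).mean())
    return float(np.mean(scores))

# Synthetic example (assumed data): two well-separated Gaussian clusters.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

scores = {k: cv_log_likelihood(x, k) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: held-out log-likelihood per point = {score:.3f}")
print("selected number of components:", best_k)
```

Because every model is scored on data it was not fitted to, adding components beyond what the data support tends not to raise the score. This is what lets held-out likelihood act as a model selection criterion without an explicit complexity penalty.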

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Padhraic Smyth (1, 2)
  1. Information and Computer Science, University of California, Irvine
  2. Jet Propulsion Laboratory 126-347, California Institute of Technology, Pasadena
