Abstract
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
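As a rough illustration of how a single mixture component of this kind can be evaluated, the sketch below computes the log-density of an observation whose Box-Cox transform follows a multivariate t distribution. It makes several assumptions that the abstract does not state: the classical Box-Cox form for strictly positive data (the paper may use a variant), a single transformation parameter shared by all dimensions, and the hypothetical function names `box_cox`, `cholesky`, and `log_density_t_box_cox`. It uses only the standard library to stay self-contained and is not the authors' implementation.

```python
import math


def box_cox(y, lam):
    """Classical Box-Cox transform of a positive scalar y (assumed form)."""
    if y <= 0:
        raise ValueError("the classical Box-Cox transform needs y > 0")
    # lam = 0 is defined as the limiting case log(y)
    return math.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def log_density_t_box_cox(y, mu, sigma, nu, lam):
    """Log-density of y when box_cox(y) ~ multivariate t(mu, sigma, nu).

    y, mu: lists of length p; sigma: p x p list of lists; nu: degrees of
    freedom; lam: Box-Cox parameter shared across dimensions (assumption).
    """
    p = len(y)
    x = [box_cox(v, lam) for v in y]
    low = cholesky(sigma)
    # Forward substitution solves low * z = x - mu, so sum(z^2) is the
    # squared Mahalanobis distance of the transformed point.
    z = []
    for i in range(p):
        r = x[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))
        z.append(r / low[i][i])
    maha = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    log_t = (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * p * math.log(nu * math.pi) - 0.5 * log_det
             - 0.5 * (nu + p) * math.log1p(maha / nu))
    # Change of variables: each dimension contributes y_j ** (lam - 1)
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return log_t + log_jacobian
```

Setting `lam = 1` reduces the transform to a shift, so the component is an ordinary multivariate t; a large `nu` brings it close to the normal component of a standard mixture model. Large Mahalanobis distances are penalized only logarithmically, which is what gives the t component its tolerance for outliers.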
Cite this article
Lo, K., Gottardo, R. Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution. Stat Comput 22, 33–52 (2012). https://doi.org/10.1007/s11222-010-9204-1
DOI: https://doi.org/10.1007/s11222-010-9204-1