Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution

Statistics and Computing

Abstract

Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the assumption of normality, which implies that each component is symmetric, is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the heavier-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, it provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
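To make the idea of a t distribution with the Box-Cox transformation concrete, the following is a minimal univariate sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it uses the standard Box-Cox transform for positive data (the article may use a different variant), it handles a single component rather than a mixture, and the function names `box_cox`, `t_logpdf`, `t_boxcox_logpdf`, and `outlier_weight` are hypothetical. The weight computed at the end is the usual E-step weight from EM for the t distribution, shown here only to suggest how outlier identification and transformation could interact.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox transform for y > 0 (assumed variant)."""
    if y <= 0:
        raise ValueError("y must be positive for the standard Box-Cox transform")
    if abs(lam) < 1e-12:
        return math.log(y)  # limiting case lam -> 0
    return (y ** lam - 1.0) / lam


def t_logpdf(x, mu, sigma, nu):
    """Log density of a univariate t with location mu, scale sigma, nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))


def t_boxcox_logpdf(y, mu, sigma, nu, lam):
    """Log density of y when box_cox(y, lam) is modeled as t.

    The second term is the log Jacobian of the transform, (lam - 1) * log(y).
    """
    return t_logpdf(box_cox(y, lam), mu, sigma, nu) + (lam - 1.0) * math.log(y)


def outlier_weight(y, mu, sigma, nu, lam):
    """Standard t-model E-step weight on the transformed scale.

    Values well below 1 flag observations far from the component center.
    """
    z = (box_cox(y, lam) - mu) / sigma
    return (nu + 1.0) / (nu + z * z)
```

With `lam = 1` the transform is a shift by one and the Jacobian term vanishes, so the model reduces to an ordinary t component; values of `lam` below 1 compress the right tail, which is how skewness can be absorbed by the transformation rather than by extra components.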

Author information

Corresponding author

Correspondence to Raphael Gottardo.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

(PDF 446 KB)

About this article

Cite this article

Lo, K., Gottardo, R. Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution. Stat Comput 22, 33–52 (2012). https://doi.org/10.1007/s11222-010-9204-1

  • DOI: https://doi.org/10.1007/s11222-010-9204-1
