Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution

Lo, Kenneth; Gottardo, Raphael

doi:10.1007/s11222-010-9204-1

Published: 08 October 2010

Volume 22, pages 33–52, (2012)

Statistics and Computing

Kenneth Lo¹ &
Raphael Gottardo²

Abstract

Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
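
As a rough illustration of the idea in the abstract, the sketch below evaluates the log-density of a positive-valued observation whose Box-Cox transform is assumed to follow a multivariate t distribution. It uses the classic Box-Cox (1964) power transform and the standard multivariate t density; the exact parameterization in the article may differ, and the function names (`box_cox`, `log_mvt`, `log_t_box_cox`) are hypothetical, chosen only for this example. It covers a single mixture component and does not implement the EM algorithm.

```python
import math
import numpy as np

def box_cox(y, lam):
    """Classic Box-Cox power transform, applied elementwise to positive y."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)              # limiting case lam -> 0
    return (y ** lam - 1.0) / lam

def log_mvt(x, mu, sigma, nu):
    """Log-density of a p-variate t distribution at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    p = x.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)
    return (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * p * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + p) * math.log1p(maha / nu))

def log_t_box_cox(y, mu, sigma, nu, lam):
    """Log-density of y, assuming box_cox(y, lam) is multivariate t.

    The change of variables adds the log-Jacobian of the transform,
    (lam - 1) * sum(log y_j), to the t log-density.
    """
    y = np.asarray(y, dtype=float)
    log_jacobian = (lam - 1.0) * np.sum(np.log(y))
    return log_mvt(box_cox(y, lam), mu, sigma, nu) + log_jacobian
```

With `lam = 1` the transform is a shift by one and the Jacobian term vanishes, so the expression reduces to an ordinary multivariate t density; other values of `lam` skew the density on the original scale, while small `nu` gives heavier tails.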

Author information

Authors and Affiliations

Department of Microbiology, University of Washington, Seattle, WA, USA
Kenneth Lo
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Raphael Gottardo

Corresponding author

Correspondence to Raphael Gottardo.

About this article

Cite this article

Lo, K., Gottardo, R. Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution. Stat Comput 22, 33–52 (2012). https://doi.org/10.1007/s11222-010-9204-1

Received: 12 August 2009
Accepted: 14 September 2010
Published: 08 October 2010
Issue Date: January 2012
DOI: https://doi.org/10.1007/s11222-010-9204-1
