Model-based clustering with determinant-and-shape constraint

Abstract

Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constraints on the scatter matrices of the components, these maximizations are ill-posed problems. Moreover, without constraints, uninteresting or “spurious” clusters are often detected by the EM and CEM algorithms traditionally used to maximize the likelihood criteria. Imposing an upper bound on the maximal ratio between the determinants of the scatter matrices seems a sensible way to overcome these problems with affine equivariant constraints. Unfortunately, problems still arise unless the elements of the “shape” matrices are also controlled. A new methodology is proposed that controls both the determinants of the scatter matrices and the elements of the shape matrices. Some theoretical justification is given. A fast algorithm is proposed for this doubly constrained maximization. The methodology is also extended to robust model-based clustering problems.
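To make the two kinds of constraint concrete, the following is a minimal sketch for the bivariate case only, not the algorithm proposed in the article. It assumes that the shape matrix of a scatter matrix is the diagonal matrix of its eigenvalues rescaled to unit determinant, so that the ratio between the largest and smallest shape elements equals the ratio between the largest and smallest eigenvalues. The function names and the bounds `c_det` and `c_shape` are hypothetical names introduced for illustration.

```python
import math


def det_and_shape_ratio(sigma):
    """Return (determinant, shape ratio) of a symmetric positive definite 2x2 matrix.

    The shape ratio is max eigenvalue / min eigenvalue. Under the assumed
    decomposition it equals the ratio of the extreme shape-matrix elements,
    because rescaling to unit determinant cancels in the ratio.
    """
    (a, b), (c, d) = sigma
    if abs(b - c) > 1e-12:
        raise ValueError("matrix must be symmetric")
    det = a * d - b * c
    tr = a + d
    if det <= 0 or tr <= 0:
        raise ValueError("matrix must be positive definite")
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam_max = (tr + disc) / 2.0
    lam_min = det / lam_max  # avoids cancellation in (tr - disc) / 2
    return det, lam_max / lam_min


def satisfies_constraints(sigmas, c_det, c_shape):
    """Check an assumed form of the two constraints on a list of scatter matrices.

    Returns (det_ok, shape_ok):
      det_ok   -- max determinant / min determinant <= c_det
      shape_ok -- every matrix has shape ratio <= c_shape
    """
    stats = [det_and_shape_ratio(s) for s in sigmas]
    dets = [det for det, _ in stats]
    det_ok = max(dets) / min(dets) <= c_det
    shape_ok = all(ratio <= c_shape for _, ratio in stats)
    return det_ok, shape_ok


# A round cluster and a needle-like cluster with (almost exactly) equal determinants.
round_cluster = [[1.0, 0.0], [0.0, 1.0]]
needle_cluster = [[1000.0, 0.0], [0.0, 0.001]]

print(satisfies_constraints([round_cluster, needle_cluster], c_det=10.0, c_shape=100.0))
# (True, False)
```

In this example both matrices have determinant close to 1, so a bound on the determinant ratio alone accepts the needle-like component, whose points lie almost on a line. Only the bound on the shape elements rejects it, which illustrates why the abstract argues for controlling both quantities.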

Author information

Corresponding author

Correspondence to Luis Angel García-Escudero.

Additional information

This research is partially supported by the Spanish Ministerio de Economía y Competitividad, Grant MTM2017-86061-C2-1-P, and by the Consejería de Educación de la Junta de Castilla y León and FEDER, Grants VA005P17 and VA002G18. This research benefits from the HPC (High Performance Computing) facility of the University of Parma, Italy. M.R. gratefully acknowledges support from the CRoNoS project, reference CRoNoS COST Action IC1408, and from the University of Parma project “Statistics for fraud detection, with applications to trade data and financial statement”. The authors also thank the editor, the associate editor, and the anonymous referees for their constructive comments.

About this article

Cite this article

García-Escudero, L.A., Mayo-Iscar, A. & Riani, M. Model-based clustering with determinant-and-shape constraint. Stat Comput 30, 1363–1380 (2020). https://doi.org/10.1007/s11222-020-09950-w

Keywords

  • Clustering
  • Constraints
  • Mixture modeling
  • Robustness