Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Journal of Classification

Abstract

A central challenge in cluster analysis is evaluating a clustering result without auxiliary information; internal validity criteria are the common tool for this. For mixtures of linear regressions whose parameters are estimated by maximum likelihood, we propose a three-term decomposition of the total sum of squares as a starting point for defining internal validity criteria. Three types of mixtures of regressions are considered: with fixed covariates, with concomitant variables, and with random covariates. A ternary diagram is suggested to ease joint interpretation of the three terms of the decomposition. Local and overall coefficients of determination are also defined, to judge how well the model fits the data within each group and as a whole, respectively. Artificial data are used to study the behavior of the proposed decomposition, including under violations of the model assumptions. Finally, an application to real data illustrates the use and usefulness of these proposals.
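To make the idea of a three-term decomposition concrete, the following is a minimal sketch, not the authors' implementation. It assumes soft group memberships `z` whose rows sum to one (for example, posterior probabilities from an EM fit) and a weighted least-squares fit with intercept in each group. Under those assumptions the total sum of squares splits exactly into a between-group term, a within-group term explained by the local regressions, and a within-group residual term. The function name and the labels `bss`, `rwss`, `ewss`, `shares`, and `local_r2` are illustrative choices made here; the definitions in the paper may differ in detail.

```python
import numpy as np

def tss_decomposition(X, y, z):
    """Sketch of a three-term split of the total sum of squares.

    X : (n, p) covariates, without an intercept column
    y : (n,) response
    z : (n, G) soft memberships, each row summing to 1 (assumption)
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    n, G = z.shape
    Xd = np.column_stack([np.ones(n), X])  # add the intercept column
    ybar = y.mean()
    tss = np.sum((y - ybar) ** 2)

    bss = rwss = ewss = 0.0
    local_r2 = np.empty(G)
    for g in range(G):
        w = z[:, g]
        n_g = w.sum()
        ybar_g = np.sum(w * y) / n_g  # weighted group mean of y
        # Weighted least squares via the square-root-weight trick.
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
        fitted = Xd @ beta
        bss_g = n_g * (ybar_g - ybar) ** 2           # between groups
        rwss_g = np.sum(w * (fitted - ybar_g) ** 2)  # explained within group
        ewss_g = np.sum(w * (y - fitted) ** 2)       # residual within group
        bss += bss_g
        rwss += rwss_g
        ewss += ewss_g
        # Illustrative group-level fit measure (a hypothetical choice).
        local_r2[g] = rwss_g / (rwss_g + ewss_g)

    # Normalized terms sum to 1, so they can be plotted as one point
    # in a ternary diagram.
    shares = np.array([bss, rwss, ewss]) / tss
    return {"tss": tss, "bss": bss, "rwss": rwss, "ewss": ewss,
            "shares": shares, "local_r2": local_r2}
```

The identity is purely algebraic: the cross terms vanish because the weighted residuals of each fit are orthogonal to the intercept and the covariates. It therefore holds for any memberships that sum to one per observation, not only for maximum likelihood estimates.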


Author information

Corresponding author

Correspondence to Antonio Punzo.

Cite this article

Ingrassia, S., Punzo, A. Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. J Classif 37, 526–547 (2020). https://doi.org/10.1007/s00357-019-09326-4

  • DOI: https://doi.org/10.1007/s00357-019-09326-4
