Abstract
One of the challenges in cluster analysis is evaluating the obtained clustering results without using auxiliary information. A common approach to this end is to use internal validity criteria. For mixtures of linear regressions whose parameters are estimated by maximum likelihood, we propose a three-term decomposition of the total sum of squares as a starting point for defining internal validity criteria. Three types of mixtures of regressions are considered: with fixed covariates, with concomitant variables, and with random covariates. A ternary diagram is also suggested to ease the joint interpretation of the three terms of the proposed decomposition. Furthermore, local and overall coefficients of determination are defined to judge how well the model fits the data group by group and as a whole, respectively. Artificial data are used to investigate the behavior of the proposed decomposition, including under violations of the model assumptions. Finally, an application to real data illustrates the use and the usefulness of these proposals.
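As a rough illustration of how a three-term decomposition of the total sum of squares can arise with soft cluster memberships, the following sketch may help. It is not taken from the article: the function name `tss_decomposition`, the term labels (between-group, regression-within, error-within), the simulated data, and the per-group ratio used as a local coefficient of determination are all assumptions made for illustration. The sketch relies on the fact that, for membership weights whose rows sum to one and group-wise weighted least squares fits that include an intercept, the three terms add up exactly to the total sum of squares.

```python
import numpy as np

def tss_decomposition(X, y, z):
    """Hypothetical helper: split the total sum of squares of y into three
    per-group terms, given soft memberships z (n x G, rows sum to 1)."""
    n, G = z.shape
    Xd = np.column_stack([np.ones(n), X])   # design matrix with intercept
    ybar = y.mean()
    bss, rwss, ewss = np.zeros(G), np.zeros(G), np.zeros(G)
    for g in range(G):
        w = z[:, g]
        sw = np.sqrt(w)
        # Weighted least squares, computed as OLS on sqrt-weight-rescaled data
        beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
        mu = Xd @ beta                       # fitted values from group g's line
        ybar_g = np.sum(w * y) / np.sum(w)   # weighted group mean of y
        bss[g] = np.sum(w) * (ybar_g - ybar) ** 2   # between-group term
        rwss[g] = np.sum(w * (mu - ybar_g) ** 2)    # within-group, explained by regression
        ewss[g] = np.sum(w * (y - mu) ** 2)         # within-group, residual
    return bss, rwss, ewss

# Simulated data (assumption): two groups with different regression lines
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
labels = rng.integers(0, 2, n)
y = np.where(labels == 0, 1 + 2 * x, 20 - x) + rng.normal(0, 1, n)

# Soft memberships (assumption): 0.9 on the true group, 0.1 on the other
z = np.full((n, 2), 0.1)
z[np.arange(n), labels] = 0.9

bss, rwss, ewss = tss_decomposition(x, y, z)
tss = np.sum((y - y.mean()) ** 2)

# Three shares summing to one: the kind of triple a ternary diagram displays
shares = np.array([bss.sum(), rwss.sum(), ewss.sum()]) / tss

# One plausible group-wise coefficient of determination (assumption)
local_r2 = rwss / (rwss + ewss)

print("shares:", shares)
print("local R2:", local_r2)
```

In this sketch the memberships are fixed by hand; in a fitted mixture they would instead be the posterior probabilities of group membership, and the group-wise weighted least squares fits would correspond to the regression estimates at convergence of the estimation algorithm.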
Cite this article
Ingrassia, S., Punzo, A. Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. J Classif 37, 526–547 (2020). https://doi.org/10.1007/s00357-019-09326-4
DOI: https://doi.org/10.1007/s00357-019-09326-4