Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Ingrassia, Salvatore; Punzo, Antonio

doi:10.1007/s00357-019-09326-4

Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Published: 16 July 2019

Volume 37, pages 526–547, (2020)
Cite this article

Journal of Classification Aims and scope Submit manuscript

11 Citations
Explore all metrics

Abstract

One of the challenges in cluster analysis is the evaluation of the obtained clustering results without using auxiliary information. To this end, a common approach is to use internal validity criteria. For mixtures of linear regressions whose parameters are estimated by maximum likelihood, we propose a three-term decomposition of the total sum of squares as a starting point to define some internal validity criteria. In particular, three types of mixtures of regressions are considered: with fixed covariates, with concomitant variables, and with random covariates. A ternary diagram is also suggested for easier joint interpretation of the three terms of the proposed decomposition. Furthermore, local and overall coefficients of determination are respectively defined to judge how well the model fits the data group-by-group but also taken as a whole. Artificial data are considered to find out more about the proposed decomposition, including violations of the model assumptions. Finally, an application to real data illustrates the use and the usefulness of these proposals.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Data clustering: application and trends

Article 27 November 2022

Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?

Article 18 October 2014

On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning

Article Open access 01 July 2018

References

Aitchison, J. (2003). The Statistical Analysis of Compositional Data. Caldwell: Blackburn Press.
MATH Google Scholar
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.
Google Scholar
Bagnato, L., & Punzo, A. (2013). Finite mixtures of unimodal beta and gamma densities and the k-bumps algorithm. Computational Statistics, 28(4), 1571–1597.
MathSciNet MATH Google Scholar
Berta, P., Ingrassia, S., Punzo, A., & Vittadini, G. (2016). Multilevel cluster-weighted models for the evaluation of hospitals. METRON, 74(3), 275–292.
MathSciNet MATH Google Scholar
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4), 561–575.
MathSciNet MATH Google Scholar
Buse, A. (1973). Goodness of fit in generalized least squares estimation. The American Statistician, 27(3), 106–108.
Google Scholar
Cameron, A.C., & Windmeijer, F.A.G. (1996). R-squared measures for count data regression models with applications to health-care utilization. Journal of Business & Economic Statistics, 14(2), 209–220.
Google Scholar
Cameron, A.C., & Windmeijer, F.A.G. (1997). An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77(2), 329–342.
MathSciNet MATH Google Scholar
Cellini, R., & Cuccia, T. (2013). Museum and monument attendance and tourism flow: a time series approach. Applied Economics, 45, 3473–3482.
Google Scholar
Cerdeira, J.O., Martins, M.J., & Silva, P.C. (2012). A combinatorial approach to assess the separability of clusters. Journal of Classification, 29(1), 7–22.
MathSciNet MATH Google Scholar
Chatterjee, S., & Hadi, A.S. (2006). Regression Analysis by Example, volume 607 of Wiley Series in Probability and Statistics. Hoboken: Wiley.
Google Scholar
Dang, U.J., Punzo, A., McNicholas, P.D., Ingrassia, S., & Browne, R.P. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34(1), 4–34.
MathSciNet MATH Google Scholar
Davidson, R., & MacKinnon, J.G. (2004). Econometric Theory and Methods. Oxford: Oxford University Press.
Google Scholar
Dayton, C.M., & Macready, G.B. (1988). Concomitant-variable latent-class models. Journal of the American Statistical Association, 83(401), 173–178.
MathSciNet Google Scholar
de Amorim, R.C. (2016). A survey on feature weighting based k-means algorithms. Journal of Classification, 33(2), 210–242.
MathSciNet MATH Google Scholar
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1), 1–38.
MathSciNet MATH Google Scholar
DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5(2), 249–282.
MathSciNet MATH Google Scholar
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. New York: Springer.
MATH Google Scholar
Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. Annals of the New York Academy of Sciences, 808(1), 18–24.
Google Scholar
Grün, B., & Leisch, F. (2008). FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4), 1–35.
Google Scholar
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3), 107–145.
MATH Google Scholar
Hennig, C. (2000). Identifiablity of models for clusterwise linear regression. Journal of Classification, 17(2), 273–296.
MathSciNet MATH Google Scholar
Hosmer, D.W. (1974). Maximum likelihood estimates of the parameters of a mixture of two regression lines. Communications in Statistics-Theory and Methods, 3(10), 995–1006.
MATH Google Scholar
Huitema, B.E. (2011). The Analysis of Covariance and Alternatives: Statistical Methods for Experiments, Quasi-Experiments, and Single-Case Studies, volume 608 of Wiley Series in Probability and Statistics. New Jersey: Wiley.
MATH Google Scholar
Ingrassia, S., & Punzo, A. (2016). Decision boundaries for mixtures of regressions. Journal of the Korean Statistical Society, 45(2), 295–306.
MathSciNet MATH Google Scholar
Ingrassia, S., Minotti, S., & Vittadini, G. (2012). Local statistical modeling via the cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3), 363–401.
MathSciNet MATH Google Scholar
Ingrassia, S., Minotti, S.C., & Punzo, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics and Data Analysis, 71, 159–182.
MathSciNet MATH Google Scholar
Ingrassia, S., Punzo, A., Vittadini, G., & Minotti, S.C. (2015). The generalized linear mixed cluster-weighted model. Journal of Classification, 32(1), 85–113.
MathSciNet MATH Google Scholar
Karlis, D., & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41(3–4), 577–590.
MathSciNet MATH Google Scholar
Lange, K.L., Little, R.J.A., & Taylor, J.M.G. (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408), 881–896.
MathSciNet Google Scholar
Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.
Google Scholar
Maddala, G.S. (1986). Limited-Dependent and Qualitative Variables in Econometrics. Econometric Society Monographs. Cambridge: Cambridge University Press.
Google Scholar
Mazza, A., & Punzo, A. (2018). Mixtures of multivariate contaminated normal regression models. Statistical Papers. https://doi.org/10.1007/s00362-017-0964-y.
Mazza, A., Punzo, A., & Ingrassia, S. (2018). flexCWM: Flexible cluster-weighted modeling. Journal of Statistical Software, 86(2), 1–30.
Google Scholar
Mazza, A., Battisti, M., Ingrassia, S., & Punzo, A. (2019). Modeling return to education in heterogeneous populations. An application to Italy. In Greselin, I., Deldossi, L., Vichi, M., & Bagnato, L. (Eds.) Advances in Statistical Models for Data Analysis, Studies in Classification, Data Analysis and Knowledge Organization. Switzerland: Springer International Publishing.
McNicholas, P.D. (2016). Model-based clustering. Journal of Classification, 33 (3), 331–373.
MathSciNet MATH Google Scholar
Milligan, G.W., & Cheng, R. (1996). Measuring the influence of individual data points in a cluster analysis. Journal of Classification, 13(2), 315–335.
MATH Google Scholar
Panagiotakis, C. (2015). Point clustering via voting maximization. Journal of Classification, 32(2), 212–240.
MathSciNet MATH Google Scholar
Punzo, A. (2014). Flexible mixture modeling with the polynomial Gaussian cluster-weighted model. Statistical Modelling, 14(3), 257–291.
MathSciNet Google Scholar
Punzo, A., & Ingrassia, S. (2015). Parsimonious generalized linear Gaussian cluster-weighted models. In Morlini, I.s, Minerva, T., & Vichi, M. (Eds.) Advances in Statistical Models for Data Analysis, Studies in Classification, Data Analysis and Knowledge Organization (pp. 201–209). Switzerland: Springer International Publishing.
Punzo, A., & Ingrassia, S. (2016). Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31(3), 989–1013.
MathSciNet MATH Google Scholar
Punzo, A., & McNicholas, P.D. (2017). Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. Journal of Classification, 34 (2), 249–293.
MathSciNet MATH Google Scholar
Punzo, A., Ingrassia, S., & Maruotti, A. (2018). Multivariate generalized hidden Markov regression models with random covariates: physical exercise in an elderly population. Statistics in Medicine, 37(19), 2797–2808.
MathSciNet Google Scholar
Quandt, R.E. (1972). A new approach to estimating switching regressions. Journal of the American Statistical Association, 67(338), 306–310.
MATH Google Scholar
Quandt, R.E., & Ramsey, J.B. (1978). Estimating mixtures of normal distributions and switching regressions. Journal of the American Statistical Association, 73(364), 730–738.
MathSciNet MATH Google Scholar
R Core Team. (2016). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
Google Scholar
Rezaee, M.R., Lelieveldt, B.P.F., & Reiber, J.H.C. (1998). A new cluster validity index for the fuzzy c-mean. Pattern Recognition Letters, 19(3-4), 237–246.
MATH Google Scholar
Rousseeuw, P.J., & Van Zomeren, B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411), 633–639.
Google Scholar
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
MathSciNet MATH Google Scholar
Steinley, D., Hendrickson, G., & Brusco, M.J. (2015). A note on maximizing the agreement between partitions: a stepwise optimal algorithm and some properties. Journal of Classification, 32(1), 114–126.
MathSciNet MATH Google Scholar
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.
MathSciNet MATH Google Scholar
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2015). Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24(4), 623–649.
MathSciNet MATH Google Scholar
Theodoridis, S., & Koutroumbas, K. (2008). Pattern Recognition. London: Academic Press.
MATH Google Scholar
Veall, M.R., & Zimmermann, K.F. (1996). Pseudo-R² measures for some common limited dependent variable models. Journal of Economic Surveys, 10(3), 241–259.
Google Scholar
Wedel, M. (1990). Clusterwise Regression and Market Segmentation: Developments and Applications. Landbouwuniversiteit te Wageningen.
Wedel, M. (2002). Concomitant variables in finite mixture models. Statistica Neerlandica, 56(3), 362–375.
MathSciNet MATH Google Scholar
Wedel, M., & De Sarbo, W. (1995). A mixture likelihood approach for generalized linear models. Journal of Classification, 12(3), 21–55.
MATH Google Scholar
Wedel, M., & Kamakura, W.A. (2000). Market Segmentation: Conceptual and Methodological Foundations, 2nd edn. Boston: Kluwer Academic Publishers.
Google Scholar
Willett, J.B., & Singer, J.D. (1988). Another cautionary note about r²: Its use in weighted least-squares regression analysis. The American Statistician, 42(3), 236–238.
Google Scholar
Windmeijer, F.A.G. (1995). Goodness-of-fit measures in binary choice models. Econometric Reviews, 14(1), 101–116.
MathSciNet MATH Google Scholar
Zarei, S., Mohammadpour, A., Ingrassia, S., & Punzo, A. (2018). On the use of the sub-Gaussian α-stable distribution in the cluster-weighted model. Iranian Journal of Science and Technology, Transactions A: Science. https://doi.org/10.1007/s40995-018-0526-8.

Download references

Author information

Authors and Affiliations

Department of Economics and Business, University of Catania, Corso Italia 55, 95129, Catania, Italy
Salvatore Ingrassia & Antonio Punzo

Authors

Salvatore Ingrassia
View author publications
You can also search for this author in PubMed Google Scholar
Antonio Punzo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Antonio Punzo.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ingrassia, S., Punzo, A. Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. J Classif 37, 526–547 (2020). https://doi.org/10.1007/s00357-019-09326-4

Download citation

Published: 16 July 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00357-019-09326-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Abstract

Access this article

Similar content being viewed by others

Data clustering: application and trends

Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?

On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Abstract

Access this article

Similar content being viewed by others

Data clustering: application and trends

Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?

On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation