Abstract
Normal cluster-weighted models constitute a modern approach to linear regression that simultaneously performs model-based cluster analysis and multivariate linear regression analysis with random quantitative regressors. Robustified models based on the contaminated normal distribution, which can manage the presence of mildly atypical observations, have recently been developed. A more flexible class of contaminated normal linear cluster-weighted models is specified here, in which the researcher is free to use a different vector of regressors for each response. The novel class also includes parsimonious models, where parsimony is attained by imposing suitable constraints on the component-covariance matrices of either the responses or the regressors. Identifiability conditions are illustrated and discussed. An expectation-conditional maximisation algorithm is provided for the maximum likelihood estimation of the model parameters. The effectiveness and usefulness of the proposed models are shown through the analysis of simulated and real datasets.
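As a rough illustration of why the contaminated normal distribution can absorb mildly atypical observations, the sketch below evaluates its density in the usual two-part form: a "typical" normal part with weight alpha and an inflated-covariance part with weight 1 - alpha. It is a minimal sketch under stated assumptions, not the models proposed in the article: the function names and the symbols `alpha` (proportion of typical points) and `eta` (covariance inflation factor, greater than 1) are chosen here for illustration, and no clustering, regression or ECM step is included.

```python
import numpy as np


def mvn_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.size
    diff = x - mean
    # Squared Mahalanobis distance, computed without an explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)))


def contaminated_normal_pdf(x, mean, cov, alpha, eta):
    """alpha * N(mean, cov) + (1 - alpha) * N(mean, eta * cov).

    alpha and eta are illustrative names (assumptions), with
    0 < alpha < 1 and eta > 1.
    """
    if not (0.0 < alpha < 1.0) or eta <= 1.0:
        raise ValueError("need 0 < alpha < 1 and eta > 1")
    cov = np.asarray(cov, dtype=float)
    typical = alpha * mvn_pdf(x, mean, cov)
    atypical = (1.0 - alpha) * mvn_pdf(x, mean, eta * cov)
    return typical + atypical


def prob_typical(x, mean, cov, alpha, eta):
    """Posterior probability that x comes from the non-inflated part."""
    typical = alpha * mvn_pdf(x, mean, cov)
    return typical / contaminated_normal_pdf(x, mean, cov, alpha, eta)
```

Under these assumptions, a point close to the mean receives a high probability of being typical, while a point far in the tails is attributed almost entirely to the inflated part, which is what limits its influence on the estimates.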
Data Availability
The sources of the real-world data supporting the findings of the study reported in Section 4 are the websites of Emilia-Romagna (https://statistica.regione.emilia-romagna.it/turismo) and Veneto (https://www.veneto.eu/web/area-operatori/statistiche) regional governments (tourist arrivals and overnights) and the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it) (visits to state museums, monuments and museum networks).
Acknowledgements
The authors are grateful to three anonymous reviewers for their constructive comments and valuable suggestions.
Ethics declarations
Ethical Approval
The authors have no relevant interests to disclose.
Conflict of Interest
The authors declare no competing interests.
About this article
Cite this article
Perrone, G., Soffritti, G. Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models. J Classif (2024). https://doi.org/10.1007/s00357-023-09458-8
DOI: https://doi.org/10.1007/s00357-023-09458-8
Keywords
- Contaminated normal distribution
- ECM algorithm
- Mixture model
- Model-based cluster analysis
- Parsimonious model
- Seemingly unrelated regression