Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models

Perrone, Gabriele; Soffritti, Gabriele

doi:10.1007/s00357-023-09458-8

Published: 08 January 2024

Journal of Classification (2024)

Abstract

Normal cluster-weighted models constitute a modern approach to linear regression which simultaneously performs model-based cluster analysis and multivariate linear regression analysis with random quantitative regressors. Robustified models based on the contaminated normal distribution, which can manage the presence of mildly atypical observations, have recently been developed. A more flexible class of contaminated normal linear cluster-weighted models is specified here, in which the researcher is free to use a different vector of regressors for each response. The novel class also includes parsimonious models, where parsimony is attained by imposing suitable constraints on the component-covariance matrices of either the responses or the regressors. Identifiability conditions are illustrated and discussed. An expectation-conditional maximisation algorithm is provided for the maximum likelihood estimation of the model parameters. The effectiveness and usefulness of the proposed models are shown through the analysis of simulated and real datasets.
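
As a rough illustration of the contaminated normal distribution mentioned in the abstract, the sketch below evaluates its density in the commonly used two-part form: a "typical" normal with covariance Sigma mixed with an inflated normal with covariance eta * Sigma around the same mean. The parameter names (alpha for the proportion of typical observations, eta > 1 for the inflation factor) and all function names are assumptions made for this example; the article's own notation and its cluster-weighted extension are not visible in this preview.

```python
import numpy as np


def mvn_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) at a single point x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    # Squared Mahalanobis distance, computed without an explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)))


def _check(alpha, eta):
    # Assumed parameter ranges for this sketch.
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if eta <= 1.0:
        raise ValueError("eta must be greater than 1")


def contaminated_normal_pdf(x, mean, cov, alpha, eta):
    """alpha * N(x; mean, cov) + (1 - alpha) * N(x; mean, eta * cov)."""
    _check(alpha, eta)
    typical = mvn_pdf(x, mean, cov)
    atypical = mvn_pdf(x, mean, eta * np.asarray(cov, dtype=float))
    return alpha * typical + (1.0 - alpha) * atypical


def prob_typical(x, mean, cov, alpha, eta):
    """Posterior probability that x comes from the non-inflated part."""
    _check(alpha, eta)
    typical = alpha * mvn_pdf(x, mean, cov)
    atypical = (1.0 - alpha) * mvn_pdf(x, mean, eta * np.asarray(cov, dtype=float))
    return typical / (typical + atypical)
```

Under these assumptions, a point near the mean gets a high value of `prob_typical`, while a point far from it gets a low one, which is how a model of this kind can flag mildly atypical observations without removing them.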

Data Availability

The sources of the real-world data supporting the findings of the study reported in Section 4 are the websites of Emilia-Romagna (https://statistica.regione.emilia-romagna.it/turismo) and Veneto (https://www.veneto.eu/web/area-operatori/statistiche) regional governments (tourist arrivals and overnights) and the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it) (visits to state museums, monuments and museum networks).

Acknowledgements

The authors are grateful to three anonymous reviewers for their constructive comments and valuable suggestions.

Author information

Authors and Affiliations

Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126, Bologna, Italy
Gabriele Perrone & Gabriele Soffritti

Corresponding author

Correspondence to Gabriele Soffritti.

Ethics declarations

Ethical Approval

The authors have no relevant interests to disclose.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 493 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Perrone, G., Soffritti, G. Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models. J Classif (2024). https://doi.org/10.1007/s00357-023-09458-8

Accepted: 05 December 2023
Published: 08 January 2024
DOI: https://doi.org/10.1007/s00357-023-09458-8
