Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models

Journal of Classification

Abstract

Normal cluster-weighted models constitute a modern approach to linear regression that simultaneously performs model-based cluster analysis and multivariate linear regression analysis with random quantitative regressors. Robustified models, based on the contaminated normal distribution, have recently been developed to manage the presence of mildly atypical observations. A more flexible class of contaminated normal linear cluster-weighted models is specified here, in which the researcher is free to use a different vector of regressors for each response. The novel class also includes parsimonious models, where parsimony is attained by imposing suitable constraints on the component-covariance matrices of either the responses or the regressors. Identifiability conditions are illustrated and discussed. An expectation-conditional maximisation algorithm is provided for the maximum likelihood estimation of the model parameters. The effectiveness and usefulness of the proposed models are shown through the analysis of simulated and real datasets.
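As a rough illustration of the contaminated normal distribution mentioned in the abstract, the sketch below evaluates its density in the standard textbook form: a two-part mixture with a common mean, in which a proportion `alpha` of observations follows a normal distribution with covariance `cov` and the remainder follows a normal distribution with the inflated covariance `eta * cov`. The function names and the parameter names `alpha` and `eta` are assumptions made for this example; they are not taken from the article, and the code does not reproduce the authors' cluster-weighted model or their estimation algorithm.

```python
import numpy as np


def mvn_pdf(y, mean, cov):
    """Density of a multivariate normal N(mean, cov) at a single point y."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    d = mean.size
    diff = y - mean
    # Squared Mahalanobis distance, computed without forming the inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    log_det = np.linalg.slogdet(cov)[1]
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)))


def contaminated_normal_pdf(y, mean, cov, alpha, eta):
    """Contaminated normal density (assumed parameterisation):
    alpha * N(mean, cov) + (1 - alpha) * N(mean, eta * cov), with eta > 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if eta <= 1.0:
        raise ValueError("eta must be greater than 1 (covariance inflation)")
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    typical = alpha * mvn_pdf(y, mean, cov)
    atypical = (1.0 - alpha) * mvn_pdf(y, mean, eta * cov)
    return typical + atypical


def prob_typical(y, mean, cov, alpha, eta):
    """Posterior probability that y belongs to the non-inflated (typical) part."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    typical = alpha * mvn_pdf(y, mean, cov)
    return typical / contaminated_normal_pdf(y, mean, cov, alpha, eta)


if __name__ == "__main__":
    # Univariate toy example: a point near the mean versus a point far from it.
    for point in (0.0, 5.0):
        p = prob_typical([point], [0.0], [[1.0]], alpha=0.9, eta=4.0)
        print(f"y = {point}: probability of being typical = {p:.4f}")
```

Under this parameterisation, points far from the mean receive a low probability of being typical, which is the mechanism by which a contaminated normal model can flag mildly atypical observations rather than letting them distort the fit.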

Data Availability

The sources of the real-world data supporting the findings of the study reported in Section 4 are the websites of Emilia-Romagna (https://statistica.regione.emilia-romagna.it/turismo) and Veneto (https://www.veneto.eu/web/area-operatori/statistiche) regional governments (tourist arrivals and overnights) and the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it) (visits to state museums, monuments and museum networks).

Notes

  1. https://statistica.regione.emilia-romagna.it/turismo

  2. https://www.veneto.eu/web/area-operatori/statistiche

  3. http://www.statistica.beniculturali.it

Acknowledgements

The authors are grateful to three anonymous reviewers for their constructive comments and valuable suggestions.

Author information

Corresponding author

Correspondence to Gabriele Soffritti.

Ethics declarations

Ethical Approval

The authors have no relevant interests to disclose.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 493 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Perrone, G., Soffritti, G. Parsimonious Seemingly Unrelated Contaminated Normal Cluster-Weighted Models. J Classif (2024). https://doi.org/10.1007/s00357-023-09458-8

  • DOI: https://doi.org/10.1007/s00357-023-09458-8
