
Multivariate cluster weighted models using skewed distributions

Regular Article. Published in Advances in Data Analysis and Classification.

A Correction to this article was published on 09 December 2021.

Abstract

Much work has been done in the area of the cluster weighted model (CWM), which extends the finite mixture of regression model to include modelling of the covariates. Although many types of distributions have been considered for both the response(s) and covariates, to our knowledge skewed distributions have not yet been considered in this paradigm. Herein, a family of 24 novel CWMs is considered that allows both the responses and covariates to be modelled using one of four skewed distributions (the generalized hyperbolic distribution and three of its skewed special cases, i.e., the skew-t, variance-gamma and normal-inverse Gaussian distributions) or the normal distribution. Parameter estimation is performed using the expectation-maximization algorithm, and both simulated and real data are used for illustration.




References

  • Aas K, Hobæk Haff I (2005) NIG and skew Student's t: two special cases of the generalised hyperbolic distribution. Appl Res Dev Res Rep

  • Andrews JL, McNicholas PD (2011) Extending mixtures of multivariate t-factor analyzers. Stat Comput 21(3):361–373


  • Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions: the \(t\)EIGEN family. Stat Comput 22(5):1021–1029


  • Azzalini A (2020) The R package sn: the skew-normal and related distributions such as the skew-\(t\) (version 1.6-1). Università di Padova, Italia. http://azzalini.stat.unipd.it/SN

  • Baricz Á (2010) Turán type inequalities for some probability density functions. Stud Sci Math Hung 47(2):175–189


  • Berta P, Ingrassia S, Punzo A, Vittadini G (2016) Multilevel cluster-weighted models for the evaluation of hospitals. METRON 74(3):275–292


  • Browne RP, McNicholas PD (2015) A mixture of generalized hyperbolic distributions. Can J Stat 43(2):176–198


  • Chamroukhi F (2017) Skew t mixture of experts. Neurocomputing 266:390–408


  • Chen L, Pourahmadi M, Maadooliat M (2014) Regularized multivariate regression models with skew-t error distributions. J Stat Plan Inference 149:125–139


  • Crawford SL (1994) An application of the Laplace method to finite mixture distributions. J Am Stat Assoc 89(425):259–267


  • Dang UJ, Browne RP, McNicholas PD (2015) Mixtures of multivariate power exponential distributions. Biometrics 71(4):1081–1089


  • Dang UJ, Punzo A, McNicholas PD, Ingrassia S, Browne RP (2017) Multivariate response and parsimony for Gaussian cluster-weighted models. J Classif 34(1):4–34


  • Dang UJ, Gallaugher MP, Browne RP, McNicholas PD (2019) Model-based clustering and classification using mixtures of multivariate skewed power exponential distributions. arXiv preprint arXiv:1907.01938

  • Dayton CM, Macready GB (1988) Concomitant-variable latent-class models. J Am Stat Assoc 83(401):173–178


  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B 39(1):1–38


  • DeSarbo WS, Cron WL (1988) A maximum likelihood methodology for clusterwise linear regression. J Classif 5(2):249–282


  • Di Mari R, Bakk Z, Punzo A (2020) A random-covariate approach for distal outcome prediction with latent class analysis. Struct Equ Model 27(3):351–368


  • Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Commun Stat Theory Methods 46(21):10879–10896


  • Ferreira CS, Lachos VH, Bolfarine H (2015) Inference and diagnostics in skew scale mixtures of normal regression models. J Stat Comput Simul 85(3):517–537


  • Frimpong EY, Gage TB, Stratton H (2008) Identifiability of bivariate mixtures: an application to infant mortality models. PhD thesis, Citeseer

  • Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, New York


  • Frühwirth-Schnatter S, Pyne S (2010) Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics 11(2):317–336


  • Galimberti G, Soffritti G (2020) A note on the consistency of the maximum likelihood estimator under multivariate linear cluster-weighted models. Stat Probab Lett 157:108630


  • Gallaugher MPB, McNicholas PD (2017) A matrix variate skew-t distribution. Stat 6(1):160–170


  • Gallaugher MPB, McNicholas PD (2019) Three skewed matrix variate distributions. Stat Probab Lett 145:103–109


  • Gershenfeld N (1997) Nonlinear inference and cluster-weighted modeling. Ann N Y Acad Sci 808(1):18–24


  • Göncü A, Yang H (2016) Variance-gamma and normal-inverse Gaussian models: goodness-of-fit to Chinese high-frequency index returns. North Am J Econ Finance 36:279–292


  • Hennig C (2000) Identifiability of models for clusterwise linear regression. J Classif 17(2):273–296


  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218


  • Hung W-L, Chang-Chien S-J (2017) Learning-based EM algorithm for normal-inverse Gaussian mixture model with application to extrasolar planets. J Appl Stat 44(6):978–999


  • Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via the cluster-weighted approach with elliptical distributions. J Classif 29(3):363–401


  • Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182


  • Ingrassia S, Punzo A, Vittadini G, Minotti SC (2015) The generalized linear mixed cluster-weighted model. J Classif 32(1):85–113


  • Ingrassia S, Punzo A (2016) Decision boundaries for mixtures of regressions. J Korean Stat Soc 45(2):295–306


  • Jorgensen B (2012) Statistical properties of the generalized inverse Gaussian distribution, vol 9. Springer, New York


  • Karlis D, Santourian A (2009) Model-based clustering with non-elliptically contoured distributions. Stat Comput 19(1):73–83


  • Kim N-H, Browne R (2019) Subspace clustering for the finite mixture of generalized hyperbolic distributions. Adv Data Anal Classif 13(3):641–661


  • Lee S, McLachlan GJ (2014) Finite mixtures of multivariate skew t-distributions: some recent and new results. Stat Comput 24:181–202


  • Lin TI (2009) Maximum likelihood estimation for multivariate skew normal mixture models. J Multivar Anal 100(2):257–265


  • Lin TI (2010) Robust mixture modeling using multivariate skew t distributions. Stat Comput 20(3):343–356


  • Lin TI, McNicholas PD, Ho HJ (2014) Capturing patterns via parsimonious t mixture models. Stat Probab Lett 88:80–87


  • Mazza A, Punzo A, Ingrassia S (2018) flexCWM: a flexible framework for cluster-weighted models. J Stat Softw 86(2):1–30


  • McNeil AJ, Frey R, Embrechts P (2005) Quantitative risk management: concepts, techniques and tools. Princeton University Press, Princeton


  • McNicholas PD (2016a) Mixture model-based classification. Chapman & Hall/CRC Press, Boca Raton

  • McNicholas PD (2016b) Model-based clustering. J Classif 33(3):331–373

  • McNicholas SM, McNicholas PD, Browne RP (2017) A mixture of variance-gamma factor analyzers. In: Ahmed SE (ed) Big and complex data analysis, contributions to statistics. Springer, Cham, pp 369–385


  • Murphy K, Murphy TB (2020a) Gaussian parsimonious clustering models with covariates and a noise component. Adv Data Anal Classif 14:293–325

  • Murphy K, Murphy TB (2020b) MoEClust: Gaussian parsimonious clustering models with covariates and a noise component. R package version 1.3.3. https://cran.r-project.org/package=MoEClust

  • Murray PM, Browne RB, McNicholas PD (2014a) Mixtures of skew-t factor analyzers. Comput Stat Data Anal 77:326–335

  • Murray PM, McNicholas PD, Browne RB (2014b) A mixture of common skew-\(t\) factor analyzers. Stat 3(1):68–82

  • Peel D, McLachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10(4):339–348


  • Počuča N, Jevtić P, McNicholas PD, Miljkovic T (2020) Modeling frequency and severity of claims with the zero-inflated generalized cluster-weighted models. Insur Math Econ

  • Punzo A (2014) Flexible mixture modelling with the polynomial Gaussian cluster-weighted model. Stat Model 14(3):257–291


  • Punzo A, Ingrassia S (2015) Parsimonious generalized linear Gaussian cluster-weighted models. In: Morlini I, Minerva T, Vichi M (eds) Advances in statistical models for data analysis, studies in classification, data analysis and knowledge organization. Springer, Switzerland, pp 201–209


  • Punzo A, Ingrassia S (2016) Clustering bivariate mixed-type data via the cluster-weighted model. Comput Stat 31(3):989–1013


  • Punzo A, Bagnato L (2021) The multivariate tail-inflated normal distribution and its application in finance. J Stat Comput Simul 91(1):1–36


  • Punzo A, Ingrassia S, Maruotti A (2018) Multivariate generalized hidden Markov regression models with random covariates: physical exercise in an elderly population. Stat Med 37(19):2797–2808


  • Punzo A, Ingrassia S, Maruotti A (2021) Multivariate hidden Markov regression models: random covariates and heavy-tailed distributions. Stat Pap 62(3):1519–1555


  • Pyne S, Hu X, Wang K, Rossin E, Lin T-I, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA et al (2009) Automated high-dimensional flow cytometric data analysis. Proc Natl Acad Sci 106(21):8519–8524


  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464


  • Soffritti G, Galimberti G (2011) Multivariate linear regression with non-normal errors: a solution based on mixture models. Stat Comput 21(4):523–536


  • Steane MA, McNicholas PD, Yada R (2012) Model-based classification via mixtures of multivariate t-factor analyzers. Commun Stat Simul Comput 41(4):510–523


  • Subedi S, Punzo A, Ingrassia S, McNicholas PD (2013) Clustering and classification via cluster-weighted factor analyzers. Adv Data Anal Classif 7(1):5–40


  • Subedi S, Punzo A, Ingrassia S, McNicholas PD (2015) Cluster-weighted \(t\)-factor analyzers for robust model-based clustering and dimension reduction. Stat Methods Appl 24(4):623–649


  • Tiedeman DV (1955) On the study of types. In: Sells SB (ed) Symposium on pattern analysis. Air University, U.S.A.F. School of Aviation Medicine, Randolph Field, Texas

  • Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York


  • Tomarchio SD, McNicholas PD, Punzo A (2021) Matrix normal cluster-weighted models. J Classif 38(3)

  • Tortora C, Browne RP, ElSherbiny A, Franczak BC, McNicholas PD (2021) Model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution: MixGHD R package. J Stat Softw 98(3):1–24


  • Vrbik I, McNicholas PD (2012) Analytic calculations for the EM algorithm for multivariate skew-t mixture models. Stat Probab Lett 82(6):1169–1174


  • Vrbik I, McNicholas PD (2014) Parsimonious skew mixture models for model-based clustering and classification. Comput Stat Data Anal 71:196–210


  • Wang K, Ng SK, McLachlan GJ (2009) Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In: Digital image computing: techniques and applications. IEEE, pp 526–531

  • Wolfe JH (1965) A computer program for the maximum likelihood analysis of types. Technical Bulletin 65-15, U.S. Naval Personnel Research Activity

  • Zarei S, Mohammadpour A, Ingrassia S, Punzo A (2019) On the use of the sub-Gaussian \(\alpha \)-stable distribution in the cluster-weighted model. Iran J Sci Technol Trans A Sci 43(3):1059–1069



Author information


Correspondence to Salvatore D. Tomarchio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the error in the line after Equation (5) has been corrected.

Appendix

Technical details on the ST distribution

Following Kim and Browne (2019), it is possible to show that the pdf in (12) can be obtained from the pdf in (8) by forcing \(\lambda \) and \(\omega \) to be convenient functions of \(\nu \), by letting \(\varvec{\Sigma }\) and \(\varvec{\alpha }\) become large in a controlled way, and by letting \(\omega \) become small in a controlled way. Specifically, let

$$\begin{aligned} {\varvec{\theta }}=\left( {\varvec{\mu }},\gamma ^{-1}\varvec{\alpha },\gamma ^{-1}\varvec{\Sigma },-\frac{\nu }{2},\nu \gamma \right) , \end{aligned}$$

where \(\gamma >0\) is a scaling factor. By substituting these parameter values into (8) we obtain

$$\begin{aligned} f_{\text {GH}}(\varvec{v}; {\varvec{\theta }})=&\, \frac{\exp \left[ (\varvec{v}-{\varvec{\mu }})'\varvec{\Sigma }^{-1}\varvec{\alpha }\right] }{(2\pi )^{\frac{d}{2}}| \gamma ^{-1} \varvec{\Sigma }|^{\frac{1}{2}}K_{-\frac{\nu }{2}}(\nu \gamma )} \left[ \frac{\gamma \delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu \gamma }{\gamma ^{-1}\rho (\varvec{\alpha },\varvec{\Sigma })+\nu \gamma }\right] ^{-\frac{\nu +d}{4}}\\&\times K_{-\frac{\nu +d}{2}}\left( \sqrt{\left[ \gamma ^{-1}\rho (\varvec{\alpha },\varvec{\Sigma })+\nu \gamma \right] \left[ \gamma \delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu \gamma \right] }\right) , \end{aligned}$$

which after some manipulation becomes

$$\begin{aligned} f_{\text {GH}}(\varvec{v}; {\varvec{\theta }})=&\, \frac{\gamma ^{-\frac{\nu }{2}}}{K_{-\frac{\nu }{2}}(\nu \gamma )} \frac{\exp \left[ (\varvec{v}-{\varvec{\mu }})'\varvec{\Sigma }^{-1}\varvec{\alpha }\right] }{(2\pi )^{\frac{d}{2}} |\varvec{\Sigma }|^{\frac{1}{2}}} \left[ \frac{\delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu }{\rho (\varvec{\alpha },\varvec{\Sigma })+\nu \gamma ^2}\right] ^{-\frac{\nu +d}{4}}\\&\times K_{-\frac{\nu +d}{2}}\left( \sqrt{\left[ \rho (\varvec{\alpha },\varvec{\Sigma })+\nu \gamma ^2\right] \left[ \delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu \right] }\right) . \end{aligned}$$

Now, letting \(\gamma \rightarrow 0\) and using the asymptotic relation

$$\begin{aligned} K_{\lambda }\left( x\right) \sim \Gamma \left( -\lambda \right) 2^{-\lambda -1}x^{\lambda },\quad \text {for } x\rightarrow 0 \text { and } \lambda < 0, \end{aligned}$$

we obtain

$$\begin{aligned}&\frac{2\left( \frac{\nu }{2}\right) ^{\frac{\nu }{2}}\exp \left[ (\varvec{v}-{\varvec{\mu }})'\varvec{\Sigma }^{-1}\varvec{\alpha }\right] }{(2\pi )^{\frac{d}{2}}| \varvec{\Sigma }|^{\frac{1}{2}}\Gamma (\frac{\nu }{2})} \left( \frac{\delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu }{\rho (\varvec{\alpha },\varvec{\Sigma })}\right) ^{-\frac{\nu +d}{4}} \\&\qquad \qquad \qquad \qquad \times K_{-\frac{\nu +d}{2}}\left( \sqrt{\rho (\varvec{\alpha },\varvec{\Sigma })\left[ \delta (\varvec{v};{\varvec{\mu }},\varvec{\Sigma })+\nu \right] }\right) , \end{aligned}$$

which is the density reported in (12).
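
As a numerical sanity check on this limit (an illustration added here, not part of the original derivation), the following R sketch evaluates the manipulated GH density above at a fixed point for decreasing \(\gamma \) and compares it with the limiting ST density in (12). All function names are ad hoc; besselK() from base R computes \(K_{\lambda }(\cdot )\), and since \(K_{-s}(x)=K_{s}(x)\) positive orders are passed.

```r
## gh_gamma() is the gamma-parameterized GH density after the manipulation
## above; st_limit() is the limiting ST density in (12). Ad hoc names.
gh_gamma <- function(v, mu, Sigma, alpha, nu, gam) {
  d   <- length(v)
  Si  <- solve(Sigma)
  del <- drop(t(v - mu) %*% Si %*% (v - mu))  # delta(v; mu, Sigma)
  rho <- drop(t(alpha) %*% Si %*% alpha)      # rho(alpha, Sigma)
  gam^(-nu / 2) / besselK(nu * gam, nu / 2) *
    exp(drop(t(v - mu) %*% Si %*% alpha)) / ((2 * pi)^(d / 2) * sqrt(det(Sigma))) *
    ((del + nu) / (rho + nu * gam^2))^(-(nu + d) / 4) *
    besselK(sqrt((rho + nu * gam^2) * (del + nu)), (nu + d) / 2)
}

st_limit <- function(v, mu, Sigma, alpha, nu) {
  d   <- length(v)
  Si  <- solve(Sigma)
  del <- drop(t(v - mu) %*% Si %*% (v - mu))
  rho <- drop(t(alpha) %*% Si %*% alpha)
  2 * (nu / 2)^(nu / 2) * exp(drop(t(v - mu) %*% Si %*% alpha)) /
    ((2 * pi)^(d / 2) * sqrt(det(Sigma)) * gamma(nu / 2)) *
    ((del + nu) / rho)^(-(nu + d) / 4) *
    besselK(sqrt(rho * (del + nu)), (nu + d) / 2)
}

v <- c(1, 2); mu <- c(0, 0); Sigma <- diag(2); alpha <- c(0.5, -0.3); nu <- 5
sapply(10^-(1:4), function(g) gh_gamma(v, mu, Sigma, alpha, nu, g))
st_limit(v, mu, Sigma, alpha, nu)  # values above approach this as gamma -> 0
```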

Parameter estimation

Let \(\left( \varvec{x}_{1}',\varvec{y}_{1}'\right) ',\ldots ,\left( \varvec{x}_{N}',\varvec{y}_{N}'\right) '\) be a random sample of N independent observations from (15). In the context of the EM algorithm, the random sample is considered incomplete. Specifically, we have two sources of incompleteness. The first source arises from the fact that, for each observation, we do not know its component membership; to govern this source, we use an indicator vector \(\varvec{z}_i=\left( z_{i1},\ldots ,z_{iG}\right) \), where \(z_{ig}=1\) if observation i is in group g, and \(z_{ig}=0\) otherwise. The second source arises if \(f\left( \varvec{y}|\varvec{x};{\varvec{\theta }}_{\varvec{Y}|g}\right) \) or \(f\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right) \) are skewed; to govern this source, we need the latent variables \(W_{\varvec{Y}|g}\) and \(W_{\varvec{X}|g}\) introduced in (17).

Based on these sources of incompleteness, we can write the complete-data log-likelihood as

$$\begin{aligned} l({\varvec{\vartheta }})=l_1(\varvec{\pi })+l_2({\varvec{\theta }}_{\varvec{X}})+l_3({\varvec{\theta }}_{\varvec{Y}}), \end{aligned}$$
(22)

where \(\varvec{\pi }=(\pi _1,\ldots ,\pi _G)'\), and

$$\begin{aligned} l_1(\varvec{\pi })=\sum _{g=1}^G\sum _{i=1}^Nz_{ig}\log \left( \pi _g\right) . \end{aligned}$$

If \(\varvec{X}\) in component g, \(g=1,\ldots ,G\), follows one of the four skewed distributions,

$$\begin{aligned} l_2({\varvec{\theta }}_{\varvec{X}}) =&\sum _{g=1}^G\sum _{i=1}^N z_{ig} \log \left[ h(w_{ig\varvec{X}};{\varvec{\phi }}_{W_{\varvec{X}}|g})\right] + C_{\varvec{X}} \\&\quad -\frac{1}{2}\sum _{g=1}^G\sum _{i=1}^N z_{ig}\big [\log (|\varvec{\Sigma }_{\varvec{X}|g}|)+w_{ig{\varvec{X}}}\varvec{\alpha }_{\varvec{X}|g}'\varvec{\Sigma }_{\varvec{X}|g}^{-1}\varvec{\alpha }_{\varvec{X}|g} \\&\quad + \frac{1}{w_{ig\varvec{X}}}(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g})'\varvec{\Sigma }_{\varvec{X}|g}^{-1}(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g}) \\&\quad -(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g})'\varvec{\Sigma }_{\varvec{X}|g}^{-1}\varvec{\alpha }_{\varvec{X}|g}-\varvec{\alpha }_{\varvec{X}|g}'\varvec{\Sigma }_{\varvec{X}|g}^{-1}(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g})\big ], \end{aligned}$$

where \(h(w_{ig\varvec{X}};{\varvec{\phi }}_{W_{\varvec{X}}|g})\) is the appropriate pdf for \(W_{ig\varvec{X}}\) discussed in Sect. 2, with parameters notated as \({\varvec{\phi }}_{W_{\varvec{X}}|g}\), while \(C_{\varvec{X}}\) is constant with respect to the parameters. On the other hand, if \(\varvec{X}\) in component g, \(g=1,\ldots ,G\), is normally distributed then

$$\begin{aligned} l_2({\varvec{\theta }}_{\varvec{X}}) = C_{\varvec{X}}-\frac{1}{2}\sum _{g=1}^G\sum _{i=1}^Nz_{ig}[\log (|\varvec{\Sigma }_{\varvec{X}|g}|)+(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g})'\varvec{\Sigma }_{\varvec{X}|g}^{-1}(\varvec{x}_i-{\varvec{\mu }}_{\varvec{X}|g})]. \end{aligned}$$

Similarly, if \(\varvec{Y}|\varvec{x}\) in component g, \(g=1,\ldots ,G\), is distributed according to one of the four skewed distributions,

$$\begin{aligned} l_3({\varvec{\theta }}_{\varvec{Y}}) =&\sum _{g=1}^G\sum _{i=1}^N z_{ig} \log \left[ h(w_{ig\varvec{Y}};{\varvec{\phi }}_{W_{\varvec{Y}}|g})\right] +C_{\varvec{Y}}\\&\quad -\frac{1}{2}\sum _{g=1}^G\sum _{i=1}^N z_{ig}\big [\log (|\varvec{\Sigma }_{\varvec{Y}|g}|)+w_{ig\varvec{Y}}\varvec{\alpha }_{\varvec{Y}|g}'\varvec{\Sigma }_{\varvec{Y}|g}^{-1}\varvec{\alpha }_{\varvec{Y}|g} \\&\quad + \frac{1}{w_{ig\varvec{Y}}}(\varvec{y}_i-{\varvec{B}_g'\varvec{x}_i^*})'\varvec{\Sigma }_{\varvec{Y}|g}^{-1}(\varvec{y}_i-{\varvec{B}_g'\varvec{x}_i^*})\\&\quad -(\varvec{y}_i-{\varvec{B}_g'\varvec{x}_i^*})'\varvec{\Sigma }_{\varvec{Y}|g}^{-1}\varvec{\alpha }_{\varvec{Y}|g}-\varvec{\alpha }_{\varvec{Y}|g}'\varvec{\Sigma }_{\varvec{Y}|g}^{-1}(\varvec{y}_i-{\varvec{B}_g'\varvec{x}_i^*})\big ], \end{aligned}$$

where \(h(w_{ig\varvec{Y}};{\varvec{\phi }}_{W_{\varvec{Y}}|g})\) is the appropriate pdf for \(W_{ig\varvec{Y}}\) discussed in Sect. 2, with parameters notationally compacted as \({\varvec{\phi }}_{W_{\varvec{Y}}|g}\), while \(C_{\varvec{Y}}\) is constant with respect to the parameters. Conversely, if \(\varvec{Y}|\varvec{x}\) in component g, \(g=1,\ldots ,G\), is normally distributed,

$$\begin{aligned} l_3({\varvec{\theta }}_{\varvec{Y}})=C_{\varvec{Y}}-\frac{1}{2}\sum _{g=1}^G\sum _{i=1}^Nz_{ig}[\log (|\varvec{\Sigma }_{\varvec{Y}|g}|)+(\varvec{y}_i-\varvec{B}_g'\varvec{x}_i^*)'\varvec{\Sigma }_{\varvec{Y}|g}^{-1}(\varvec{y}_i-\varvec{B}_g'\varvec{x}_i^*)]. \end{aligned}$$

After initialization, the EM algorithm proceeds iterating the following two steps until convergence.

E-Step. The E-step requires the calculation of the conditional expectation of (22). Thus, we first need to calculate

$$\begin{aligned} \hat{z}_{ig}=\frac{\hat{\pi }_g f\left( \varvec{y}_i|\varvec{x}_i;\hat{{\varvec{\theta }}}_{\varvec{Y}|g}\right) f\left( \varvec{x}_i;\hat{{\varvec{\theta }}}_{\varvec{X}|g}\right) }{\displaystyle \sum _{h=1}^G\hat{\pi }_h f\left( \varvec{y}_i|\varvec{x}_i;\hat{{\varvec{\theta }}}_{\varvec{Y}|h}\right) f\left( \varvec{x}_i;\hat{{\varvec{\theta }}}_{\varvec{X}|h}\right) }, \end{aligned}$$

which corresponds to the posterior probability that the unlabeled observation \(\left( \varvec{X}_{i}',\varvec{Y}_{i}'\right) '\) belongs to the gth component of the CWM. In addition, if the distribution of \(\varvec{X}\) in component g, \(g=1,\ldots ,G\), is skewed, the following values need to be updated:

$$\begin{aligned} \begin{aligned} \hat{l}_{ig\varvec{X}}&{:}{=}\, \mathbb {E}[W_{ig\varvec{X}}|z_{ig}=1,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{X}}|g}],\\ \hat{m}_{ig\varvec{X}}&{:}{=}\, \mathbb {E}[1/W_{ig\varvec{X}}|z_{ig}=1,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{X}}|g}],\\ \hat{n}_{ig\varvec{X}}&{:}{=}\, \mathbb {E}[\log (W_{ig\varvec{X}})|z_{ig}=1,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{X}}|g}].\\ \end{aligned} \end{aligned}$$

If the distribution of \(\varvec{Y}|\varvec{x}\) in component g, \(g=1,\ldots ,G\), is skewed, then the following values are also updated:

$$\begin{aligned} \begin{aligned} \hat{l}_{ig\varvec{Y}}&{:}{=}\, \mathbb {E}[W_{ig\varvec{Y}}|z_{ig}=1,\varvec{y}_i,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{Y}}|g}],\\ \hat{m}_{ig\varvec{Y}}&{:}{=}\, \mathbb {E}[1/W_{ig\varvec{Y}}|z_{ig}=1,\varvec{y}_i,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{Y}}|g}],\\ \hat{n}_{ig\varvec{Y}}&{:}{=}\, \mathbb {E}[\log (W_{ig\varvec{Y}})|z_{ig}=1,\varvec{y}_i,\varvec{x}_i,\hat{{\varvec{\phi }}}_{W_{\varvec{Y}}|g}].\\ \end{aligned} \end{aligned}$$

These updates depend on which of the skewed distributions is considered. However, as shown in Sect. 2.2, the conditional latent variables are all GIG distributed. Therefore, all of the required expectations can be calculated using (2)–(4).
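
For concreteness, under the GIG parameterization with density proportional to \(w^{\lambda -1}\exp \{-(aw+b/w)/2\}\) (an assumption of this illustration; the paper's (2)–(4) are not reproduced here), the moments take the form \(\mathbb {E}[W^s]=(b/a)^{s/2}K_{\lambda +s}(\sqrt{ab})/K_{\lambda }(\sqrt{ab})\), and \(\mathbb {E}[\log W]\) follows by differentiating at \(s=0\). A minimal R sketch:

```r
## Conditional GIG moments used in the E-step: hat{l} = E[W], hat{m} = E[1/W],
## hat{n} = E[log W], for density prop. to w^(lambda-1) exp(-(a w + b/w)/2).
bK <- function(x, s) besselK(x, abs(s))  # K is symmetric in its order
gig_moments <- function(lambda, a, b, h = 1e-5) {
  sab <- sqrt(a * b)
  l <- sqrt(b / a) * bK(sab, lambda + 1) / bK(sab, lambda)  # E[W]
  m <- sqrt(a / b) * bK(sab, lambda - 1) / bK(sab, lambda)  # E[1/W]
  # E[log W] = d/ds log E[W^s] at s = 0, via a central finite difference
  n <- 0.5 * log(b / a) +
       (log(bK(sab, lambda + h)) - log(bK(sab, lambda - h))) / (2 * h)
  c(l = l, m = m, n = n)
}
gig_moments(lambda = -3, a = 2.5, b = 4)  # one (i, g) triplet of E-step quantities
```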

M-Step. The M-step involves the maximization of the conditional expectation of the complete-data log-likelihood, allowing all parameters to be updated. Specifically, the update for \(\pi _g\) is

$$\begin{aligned} \hat{\pi }_g = \frac{1}{N}\sum _{i=1}^N\hat{z}_{ig}. \end{aligned}$$

The parameters related to the distribution of \(\varvec{X}\) in component g, \(g=1,\ldots ,G\), are updated as follows. For skewed distributions, we have the following updates for \({\varvec{\mu }}_{\varvec{X}|g}\) and \(\varvec{\alpha }_{\varvec{X}|g}\):

$$\begin{aligned} \hat{{\varvec{\mu }}}_{\varvec{X}|g}=\frac{\displaystyle \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i\left( \overline{l}_{\varvec{X}|g} \hat{m}_{ig\varvec{X}}-1\right) }{\displaystyle \sum _{i=1}^N\hat{z}_{ig}\overline{l}_{\varvec{X}|g} \hat{m}_{ig\varvec{X}}-T_g}\quad \text {and}\quad \hat{{\varvec{\alpha }}}_{\varvec{X}|g}=\frac{\displaystyle \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i\left( \overline{m}_{\varvec{X}|g}-\hat{m}_{ig\varvec{X}}\right) }{\displaystyle \sum _{i=1}^N\hat{z}_{ig}\overline{l}_{\varvec{X}|g} \hat{m}_{ig\varvec{X}}-T_g}, \end{aligned}$$

where \(T_g=\sum _{i=1}^N\hat{z}_{ig}\), \(\overline{l}_{\varvec{X}|g}=(1/T_g)\sum _{i=1}^N\hat{z}_{ig}\hat{l}_{ig\varvec{X}}\) and \(\overline{m}_{\varvec{X}|g}=(1/T_g)\sum _{i=1}^N\hat{z}_{ig}\hat{m}_{ig\varvec{X}}\). The update for \(\varvec{\Sigma }_{\varvec{X}|g}\) is

$$\begin{aligned} \hat{\varvec{\Sigma }}_{\varvec{X}|g} =&\frac{1}{T_g}\sum _{i=1}^N\hat{z}_{ig}\big [\hat{m}_{ig\varvec{X}}(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})'\\&-(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})\hat{\varvec{\alpha }}_{\varvec{X}|g}'-\hat{\varvec{\alpha }}_{\varvec{X}|g}(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})' \\&+ \hat{l}_{ig\varvec{X}}\hat{\varvec{\alpha }}_{\varvec{X}|g}\hat{\varvec{\alpha }}_{\varvec{X}|g}'\big ]. \end{aligned}$$
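
For illustration, a compact R sketch of these three updates for a single component g (the function name is ad hoc; inputs z, l and m are the current \(\hat{z}_{ig}\), \(\hat{l}_{ig\varvec{X}}\) and \(\hat{m}_{ig\varvec{X}}\)):

```r
## One M-step pass over (mu, alpha, Sigma) for the covariate block of
## component g. x: N x d matrix; z, l, m: length-N vectors.
## A sketch only; no safeguards for degenerate components.
mstep_x <- function(x, z, l, m) {
  Tg   <- sum(z)
  lbar <- sum(z * l) / Tg
  mbar <- sum(z * m) / Tg
  den  <- lbar * sum(z * m) - Tg               # common denominator
  mu    <- colSums(z * x * (lbar * m - 1)) / den
  alpha <- colSums(z * x * (mbar - m)) / den
  xc    <- sweep(x, 2, mu)                     # rows are x_i - mu
  cross <- colSums(z * xc)                     # sum_i z_i (x_i - mu)
  Sigma <- crossprod(xc, z * m * xc) / Tg -    # weighted scatter term
           (cross %*% t(alpha) + alpha %*% t(cross)) / Tg +
           lbar * alpha %*% t(alpha)
  list(mu = mu, alpha = alpha, Sigma = Sigma)
}
```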

Instead, for the normal distribution, we have

$$\begin{aligned} \hat{{\varvec{\mu }}}_{\varvec{X}|g}=\frac{1}{T_g}\sum _{g=1}^G\hat{z}_{ig}\varvec{x}_i, \qquad \hat{\varvec{\Sigma }}_{\varvec{X}|g}=\frac{1}{T_g}\sum _{g=1}^G\hat{z}_{ig}(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})(\varvec{x}_i-\hat{{\varvec{\mu }}}_{\varvec{X}|g})'. \end{aligned}$$

The parameters related to the distribution of \(\varvec{Y}|\varvec{x}\) in component g, \(g=1,\ldots ,G\), are updated as follows. For skewed distributions the updates for \(\varvec{B}_g\) and \(\varvec{\alpha }_{\varvec{Y}|g}\) are

$$\begin{aligned} \hat{\varvec{B}}_g=\varvec{P}_g^{-1}\varvec{R}_g \quad \text {and}\quad \hat{\varvec{\alpha }}_{\varvec{Y}|g}=\frac{1}{T_g\overline{l}_{\varvec{Y}|g}}\left( \sum _{i=1}^N\hat{z}_{ig}\varvec{y}_i-\varvec{R}_g'\varvec{P}_g^{-1}\sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i^*\right) , \end{aligned}$$

where

$$\begin{aligned} \varvec{P}_g=\sum _{i=1}^N\hat{z}_{ig}\hat{m}_{ig\varvec{Y}}\varvec{x}_i^*{\varvec{x}_i^*}'-\frac{1}{T_g\overline{l}_{\varvec{Y}|g}}\left( \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i^*\right) \left( \sum _{i=1}^N\hat{z}_{ig}{\varvec{x}_i^*}'\right) \end{aligned}$$

and

$$\begin{aligned} \varvec{R}_g=\sum _{i=1}^N\hat{z}_{ig}\hat{m}_{ig\varvec{Y}}\varvec{x}_i^*{\varvec{y}_i}'-\frac{1}{T_g\overline{l}_{\varvec{Y}|g}}\left( \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i^*\right) \left( \sum _{i=1}^N\hat{z}_{ig}{\varvec{y}_i}'\right) , \end{aligned}$$

with \(\overline{l}_{\varvec{Y}|g}=(1/T_g)\sum _{i=1}^N\hat{z}_{ig}\hat{l}_{ig\varvec{Y}}\). The update for \(\varvec{\Sigma }_{\varvec{Y}|g}\) is

$$\begin{aligned} \hat{\varvec{\Sigma }}_{\varvec{Y}|g}= & {} \frac{1}{T_g}\sum _{i=1}^N\hat{z}_{ig}\Big [\hat{m}_{ig\varvec{Y}}\left( \varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*\right) \left( \varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*\right) '\\&-\left( \varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*\right) \hat{\varvec{\alpha }}_{\varvec{Y}|g}'-\hat{\varvec{\alpha }}_{\varvec{Y}|g}\left( \varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*\right) ' + \hat{l}_{ig\varvec{Y}}\hat{\varvec{\alpha }}_{\varvec{Y}|g}\hat{\varvec{\alpha }}_{\varvec{Y}|g}'\Big ]. \end{aligned}$$
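
A corresponding R sketch of the regression-block updates (again illustrative only; xs stacks the rows \({\varvec{x}_i^*}'=(1,\varvec{x}_i')\) and y stacks the rows \(\varvec{y}_i'\)):

```r
## Updates of B_g and alpha_{Y|g} via P_g and R_g as defined above.
## xs: N x (d_X + 1) matrix of x_i^*; y: N x d_Y matrix; z, l, m as in the E-step.
mstep_y_reg <- function(xs, y, z, l, m) {
  Tg   <- sum(z)
  lbar <- sum(z * l) / Tg
  sx   <- colSums(z * xs)                      # sum_i z_i x_i^*
  sy   <- colSums(z * y)                       # sum_i z_i y_i
  P    <- crossprod(xs, z * m * xs) - (sx %*% t(sx)) / (Tg * lbar)
  R    <- crossprod(xs, z * m * y)  - (sx %*% t(sy)) / (Tg * lbar)
  B    <- solve(P, R)                          # hat{B}_g = P_g^{-1} R_g
  alpha <- drop(sy - t(R) %*% solve(P, sx)) / (Tg * lbar)
  list(B = B, alpha = alpha)
}
```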

Conversely, in the case of a multivariate normal distribution, the updates for \(\varvec{B}_g\) and \(\varvec{\Sigma }_{\varvec{Y}|g}\) are

$$\begin{aligned}&\hat{\varvec{B}}_g=\left( \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i^*{\varvec{x}_i^*}'\right) ^{-1}\left( \sum _{i=1}^N\hat{z}_{ig}\varvec{x}_i^*{\varvec{y}_i}'\right) \quad \text {and} \\&\hat{\varvec{\Sigma }}_{\varvec{Y}|g}=\frac{1}{T_g}\sum _{i=1}^N\hat{z}_{ig}(\varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*)(\varvec{y}_i-\hat{\varvec{B}}_g'\varvec{x}_i^*)'. \end{aligned}$$

Finally, if either \(\varvec{X}\) or \(\varvec{Y}|\varvec{x}\) in component g, \(g=1,\ldots ,G\), follows one of the skewed distributions, then the additional tailedness parameter and, in the case of the GH distribution, the index parameter must also be updated. The updates for each distribution are now given.

1.1 Skew-t distribution

In the case of the ST distribution, we need to update the degrees of freedom \(\nu _g\). This update cannot be obtained in closed form, and thus needs to be performed numerically. For the covariates, the update for \(\nu _{\varvec{X}|g}\) is obtained by solving the equation

$$\begin{aligned} \log \left( \frac{\nu _{\varvec{X}|g}}{2}\right) +1-\varphi \left( \frac{\nu _{\varvec{X}|g}}{2}\right) -\frac{1}{T_g}\sum _{i=1}^N\hat{z}_{ig}(\hat{m}_{ig\varvec{X}}+\hat{n}_{ig\varvec{X}})=0, \end{aligned}$$
(23)

where \(\varphi (\cdot )\) denotes the digamma function. When the responses are considered, the update for \(\nu _{\varvec{Y}|g}\) is obtained via (23), after the replacement of \(\nu _{\varvec{X}|g}\), \(\hat{m}_{ig\varvec{X}}\) and \(\hat{n}_{ig\varvec{X}}\) with \(\nu _{\varvec{Y}|g}\), \(\hat{m}_{ig\varvec{Y}}\) and \(\hat{n}_{ig\varvec{Y}}\), respectively.
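
For example, (23) can be solved with a standard one-dimensional root finder; a minimal R sketch (the search interval is an ad hoc choice and may need widening):

```r
## Numerical update of the ST degrees of freedom by solving (23).
## z, m, n: length-N vectors of hat{z}_{ig}, hat{m}_{igX}, hat{n}_{igX}.
update_nu <- function(z, m, n, lower = 2, upper = 200) {
  Tg  <- sum(z)
  lhs <- function(nu) log(nu / 2) + 1 - digamma(nu / 2) - sum(z * (m + n)) / Tg
  uniroot(lhs, c(lower, upper))$root  # requires a sign change on the interval
}
```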

1.2 Generalized hyperbolic distribution

For the GH distribution, we update \(\lambda _g\) and \(\omega _g\). These updates are derived from Browne and McNicholas (2015), and rely on the log-convexity of \(K_{s}(t)\) in both s and t (Baricz 2010). For notational purposes in this section, the superscript “prev” is used to distinguish the previous update from the current one. The resulting updates, when \(\varvec{X}\) is considered, are

$$\begin{aligned} \hat{\lambda }_{\varvec{X}|g}&=\overline{n}_{\varvec{X}|g}\hat{\lambda }_{\varvec{X}|g}^{\text {prev}}\left[ \left. \frac{\partial }{\partial s}\log (K_{s}(\hat{\omega }_{\varvec{X}|g}^{\text {prev}}))\right| _{s=\hat{\lambda }_{\varvec{X}|g}^{\text {prev}}}\right] ^{-1}, \end{aligned}$$
(24)
$$\begin{aligned} \hat{\omega }_{\varvec{X}|g}&=\hat{\omega }_{\varvec{X}|g}^{\text {prev}}-\left[ \left. \frac{\partial }{\partial s}q(\hat{\lambda }_{\varvec{X}|g},s)\right| _{s=\hat{\omega }_{\varvec{X}|g}^{\text {prev}}}\right] \left[ \left. \frac{\partial ^2}{\partial s^2}q(\hat{\lambda }_{\varvec{X}|g},s)\right| _{s=\hat{\omega }_{\varvec{X}|g}^{\text {prev}}}\right] ^{-1}, \end{aligned}$$
(25)

where the derivative in (24) is calculated numerically,

$$\begin{aligned} q(\lambda _{\varvec{X}|g},\omega _{\varvec{X}|g})=\sum _{i=1}^N\hat{z}_{ig}\left[ \log (K_{\lambda _{\varvec{X}|g}}(\omega _{\varvec{X}|g}))-\lambda _{\varvec{X}|g}\overline{n}_{\varvec{X}|g}-\frac{1}{2}\omega _{\varvec{X}|g}\left( \overline{l}_{\varvec{X}|g}+\overline{m}_{\varvec{X}|g}\right) \right] , \end{aligned}$$

and \(\overline{n}_{\varvec{X}|g}=({1}/{T_g})\sum _{i=1}^N\hat{z}_{ig}\hat{n}_{ig\varvec{X}}\). When \(\varvec{Y}\) is considered, \(\lambda _{\varvec{X}|g}\), \(\omega _{\varvec{X}|g}\), \(\overline{l}_{\varvec{X}|g}\), \(\overline{m}_{\varvec{X}|g}\), and \(\overline{n}_{\varvec{X}|g}\) are replaced with \(\lambda _{\varvec{Y}|g}\), \(\omega _{\varvec{Y}|g}\), \(\overline{l}_{\varvec{Y}|g}\), \(\overline{m}_{\varvec{Y}|g}\), and \(\overline{n}_{\varvec{Y}|g}\), respectively, where \(\overline{m}_{\varvec{Y}|g}=(1/T_g)\sum _{i=1}^N\hat{z}_{ig}\hat{m}_{ig\varvec{Y}}\) and \(\overline{n}_{\varvec{Y}|g}=({1}/{T_g})\sum _{i=1}^N\hat{z}_{ig}\hat{n}_{ig\varvec{Y}}\).
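
A sketch of one pass of (24)–(25) in R, using a central finite difference for the order derivative of \(\log K_s(\cdot )\) and the identity \(\partial K_{\lambda }(w)/\partial w=-\tfrac{1}{2}[K_{\lambda -1}(w)+K_{\lambda +1}(w)]\) for the \(\omega \) derivatives (the step size and function names are assumptions of this illustration):

```r
## One update of (lambda, omega) for a single component; lbar, mbar, nbar are
## the weighted means defined above. T_g cancels in the Newton ratio of (25).
bK <- function(x, s) besselK(x, abs(s))  # K is symmetric in its order
update_gh <- function(lambda_prev, omega_prev, lbar, mbar, nbar, h = 1e-5) {
  dlogK_ds <- function(s, w)               # d/ds log K_s(w), numerically
    (log(bK(w, s + h)) - log(bK(w, s - h))) / (2 * h)
  lambda <- nbar * lambda_prev / dlogK_ds(lambda_prev, omega_prev)   # (24)
  # q'(omega) = d/dw log K_lambda(w) - (lbar + mbar)/2, per the definition of q
  q1 <- function(w) -0.5 * (bK(w, lambda - 1) + bK(w, lambda + 1)) / bK(w, lambda) -
                    0.5 * (lbar + mbar)
  q2 <- (q1(omega_prev + h) - q1(omega_prev - h)) / (2 * h)          # q''(omega)
  omega <- omega_prev - q1(omega_prev) / q2                          # (25)
  list(lambda = lambda, omega = omega)
}
```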

1.3 Variance-gamma distribution

For the VG distribution, the update for \(\psi _g\) cannot be obtained in closed form. When \(\varvec{X}\) is considered, this update is obtained by solving the equation

$$\begin{aligned} \log \psi _{\varvec{X}|g}+1-\varphi (\psi _{\varvec{X}|g})+\overline{n}_{\varvec{X}|g}-\overline{l}_{\varvec{X}|g}=0. \end{aligned}$$
(26)

Clearly, when \(\varvec{Y}\) is considered, \(\psi _{\varvec{X}|g}\), \(\overline{n}_{\varvec{X}|g}\) and \(\overline{l}_{\varvec{X}|g}\) are replaced with \(\psi _{\varvec{Y}|g}\), \(\overline{n}_{\varvec{Y}|g}\) and \(\overline{l}_{\varvec{Y}|g}\), respectively.
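
As in the ST case, this is a one-dimensional root-finding problem; a minimal R sketch with an ad hoc search interval:

```r
## Numerical update of the VG tailedness parameter by solving (26).
update_psi <- function(lbar, nbar, lower = 1e-3, upper = 100) {
  lhs <- function(psi) log(psi) + 1 - digamma(psi) + nbar - lbar
  uniroot(lhs, c(lower, upper))$root  # requires a sign change on the interval
}
```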

1.4 Normal inverse Gaussian distribution

The NIG distribution is the only one with a closed-form update for its tailedness parameter. Specifically, when we consider the covariates, the update of \(\kappa _{\varvec{X}|g}\) is

$$\begin{aligned} \hat{\kappa }_{\varvec{X}|g}=\frac{1}{\overline{l}_{\varvec{X}|g}}. \end{aligned}$$

If the responses are considered, we replace \(\kappa _{\varvec{X}|g}\) and \(\overline{l}_{\varvec{X}|g}\) with \(\kappa _{\varvec{Y}|g}\) and \(\overline{l}_{\varvec{Y}|g}\), respectively.

1.5 Initialization of the algorithm

To initialize the EM algorithm, we followed the approach discussed in Dang et al. (2017). Specifically, the \(z_{ig}\) are initialized in two different ways: 10 times using a random soft initialization and once with a k-means (hard) initialization. Therefore, for each G, the algorithms are run 11 times until convergence, and the solution producing the highest log-likelihood value is chosen. Note that, for the k-means initialization, the initial \(z_{ig}\) are selected from the best of 10 k-means runs with random starting values, implemented using the kmeans() function in R (R Core Team 2019).
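
A minimal R sketch of this initialization scheme (the function name is ad hoc; data stacks the observations \((\varvec{x}_i',\varvec{y}_i')'\) row-wise):

```r
## Build the 11 initial membership matrices: one hard start from the best of
## 10 k-means runs, plus 10 random soft starts. EM is then run from each
## start and the solution with the highest log-likelihood is retained.
init_z <- function(data, G, n_soft = 10) {
  km   <- kmeans(data, centers = G, nstart = 10)       # best of 10 k-means runs
  hard <- model.matrix(~ factor(km$cluster) - 1)       # N x G indicator matrix
  soft <- lapply(seq_len(n_soft), function(r) {
    z <- matrix(runif(nrow(data) * G), ncol = G)
    z / rowSums(z)                                     # rows sum to one
  })
  c(list(hard), soft)
}
```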

Proof of Theorem 3.1

Proof

Suppose that

$$\begin{aligned} p_{\text {GH-GH}}\left( \varvec{x},\varvec{y}; \varvec{\vartheta }\right) =p_{\text {GH-GH}}\left( \varvec{x},\varvec{y}; \widetilde{\varvec{\vartheta }}\right) . \end{aligned}$$
(27)

Integrating out \(\varvec{y}\) from each side of (27) yields an equality on the marginal distribution of \(\varvec{X}\), i.e.,

$$\begin{aligned} \sum _{g=1}^G\pi _gf_{\text {GH}}\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right)&= \sum _{j=1}^{\widetilde{G}}\widetilde{\pi }_jf_{\text {GH}}\left( \varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{X}|j}\right) \nonumber \\ p_{\text {GH}}\left( \varvec{x}; \varvec{\pi },{\varvec{\theta }}_{\varvec{X}}\right)&= p_{\text {GH}}\left( \varvec{x}; \widetilde{\varvec{\pi }},\widetilde{{\varvec{\theta }}}_{\varvec{X}}\right) , \end{aligned}$$
(28)

where \({\varvec{\theta }}_{\varvec{X}}=\left\{ {\varvec{\theta }}_{\varvec{X}|g}; \, g=1,\ldots ,G\right\} \), \(\widetilde{{\varvec{\theta }}}_{\varvec{X}}=\left\{ \widetilde{{\varvec{\theta }}}_{\varvec{X}|j}; \, j=1,\ldots ,\widetilde{G}\right\} \), \(\varvec{\pi }=\{\pi _g; \, g=1,\ldots ,G \}\) and \(\widetilde{\varvec{\pi }}=\{\widetilde{\pi }_j; \, j=1,\ldots ,\widetilde{G}\}\). Dividing the left-hand (right-hand) side of (27) by the left-hand (right-hand) side of (28) leads to

$$\begin{aligned} \sum _{g=1}^{G} \frac{\pi _gf_{\text {GH}}\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right) }{p_{\text {GH}}\left( \varvec{x}; \varvec{\pi },{\varvec{\theta }}_{\varvec{X}}\right) } f_{\text {GH}}\left( \varvec{y}|\varvec{x};{\varvec{\theta }}_{\varvec{Y}|g}\right)&= \sum _{j=1}^{\widetilde{G}} \frac{\widetilde{\pi }_jf_{\text {GH}}\left( \varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{X}|j}\right) }{p_{\text {GH}}\left( \varvec{x}; \widetilde{\varvec{\pi }}, \widetilde{{\varvec{\theta }}}_{\varvec{X}}\right) } f_{\text {GH}}\left( \varvec{y}|\varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{Y}|j}\right) \nonumber \\ p_{\text {GH}}\left( \varvec{y}|\varvec{x};\varvec{\vartheta }\right)&= p_{\text {GH}}\left( \varvec{y}|\varvec{x};\widetilde{\varvec{\vartheta }}\right) . \end{aligned}$$
(29)

For each fixed value of \(\varvec{x}\), \(p_{\text {GH}}\left( \varvec{y}|\varvec{x};\varvec{\vartheta }\right) \) and \(p_{\text {GH}}\left( \varvec{y}|\varvec{x};\widetilde{\varvec{\vartheta }}\right) \) are mixtures of \(d_{\varvec{Y}}\)-variate GH distributions for \(\varvec{Y}\) (see Browne and McNicholas 2015).

Now, recall from Sect. 3.1 that the location parameter \(\varvec{\mu }_{\varvec{Y}|g}\) of the \(d_{\varvec{Y}}\)-variate GH distribution of \(\varvec{Y}\) in the gth mixture component is related to the covariates through the regression coefficients \(\varvec{B}_g\), via \(\varvec{\mu }_{\varvec{Y}|g}=\varvec{B}'_g \varvec{x}^*\), \(g=1, \ldots , G\). Define the set of all covariate points \(\varvec{x}\) for which different regression coefficients \(\varvec{B}_g\) can be distinguished by different values of \(\varvec{B}'_g \varvec{x}^*\), i.e.

$$\begin{aligned} \mathcal {X} := \Bigl \{ \varvec{x}\in IR^{d_{\varvec{X}}}:&\forall g,s \in \{1, \ldots , G \} \text { and } j,t \in \{1, \ldots , \widetilde{G}\},\Bigr . \nonumber \\&\Bigl . \varvec{B}'_g \varvec{x}^*=\varvec{B}'_s\varvec{x}^* \ \Rightarrow \ \varvec{B}_g=\varvec{B}_s, \Bigr . \nonumber \\&\Bigl . \varvec{B}'_g \varvec{x}^*=\widetilde{\varvec{B}}'_j\varvec{x}^* \ \Rightarrow \ \varvec{B}_g=\widetilde{\varvec{B}}_j , \Bigr . \nonumber \\&\Bigl . \widetilde{\varvec{B}}'_j\varvec{x}^*=\widetilde{\varvec{B}}'_t\varvec{x}^* \ \Rightarrow \ \widetilde{\varvec{B}}_j=\widetilde{\varvec{B}}_t\Bigr \} . \end{aligned}$$

Note that \(\mathcal {X}\) is the complement of a finite union of hyperplanes in \(IR^{d_{\varvec{X}}}\). Therefore,

$$\begin{aligned} \int _{\mathcal {X}}p_{\text {GH}}\left( \varvec{x}; \varvec{\pi },{\varvec{\theta }}_{\varvec{X}}\right) d\varvec{x}=1. \end{aligned}$$

For \(\varvec{x}\in \mathcal {X}\), all \(\left\{ \varvec{B}'_g \varvec{x}^*,\varvec{\Sigma }_{\varvec{Y}|g},\varvec{\alpha }_{\varvec{Y}|g},\lambda _{\varvec{Y}|g},\omega _{\varvec{Y}|g}\right\} \), \(g=1,\ldots ,G\), are pairwise distinct because all \(\left\{ \varvec{B}_g,\varvec{\Sigma }_{\varvec{Y}|g},\varvec{\alpha }_{\varvec{Y}|g},\lambda _{\varvec{Y}|g},\omega _{\varvec{Y}|g}\right\} \), \(g=1,\ldots ,G\), are pairwise distinct by the hypothesis of the theorem. As mentioned above, for each fixed value of \(\varvec{x}\), \(p_{\text {GH}}\left( \varvec{y}|\varvec{x};\varvec{\vartheta }\right) \) is a mixture of \(d_{\varvec{Y}}\)-variate GH distributions, which, being identifiable (Browne and McNicholas 2015), implies that \(G=\widetilde{G}\) and that, for each \(g\in \left\{ 1,\ldots ,G\right\} \), there exists a \(j\in \left\{ 1,\ldots ,G\right\} \) such that

$$\begin{aligned} \varvec{B}_g=\widetilde{\varvec{B}}_j, \quad \varvec{\Sigma }_{\varvec{Y}|g}=\widetilde{\varvec{\Sigma }}_{\varvec{Y}|j}, \quad \varvec{\alpha }_{\varvec{Y}|g}=\widetilde{\varvec{\alpha }}_{\varvec{Y}|j}, \quad \lambda _{\varvec{Y}|g}=\widetilde{\lambda }_{\varvec{Y}|j}, \quad \omega _{\varvec{Y}|g}=\widetilde{\omega }_{\varvec{Y}|j} \end{aligned}$$

and

$$\begin{aligned} \frac{\pi _gf_{\text {GH}}\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right) }{p_{\text {GH}}\left( \varvec{x}; \varvec{\pi },{\varvec{\theta }}_{\varvec{X}}\right) } = \frac{\widetilde{\pi }_jf_{\text {GH}}\left( \varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{X}|j}\right) }{p_{\text {GH}}\left( \varvec{x}; \widetilde{\varvec{\pi }},\widetilde{{\varvec{\theta }}}_{\varvec{X}}\right) }. \end{aligned}$$
(30)

Now, based on (28), the equality in (30) simplifies to

$$\begin{aligned} \pi _gf_{\text {GH}}\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right) =\widetilde{\pi }_jf_{\text {GH}}\left( \varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{X}|j}\right) ,\quad \forall \ \varvec{x}\in \mathcal {X}. \end{aligned}$$
(31)

Integrating (31) over \(\varvec{x}\in \mathcal {X}\) yields \(\pi _g=\widetilde{\pi }_j\). Therefore, the condition in (31) further simplifies to

$$\begin{aligned} f_{\text {GH}}\left( \varvec{x};{\varvec{\theta }}_{\varvec{X}|g}\right) = f_{\text {GH}}\left( \varvec{x};\widetilde{{\varvec{\theta }}}_{\varvec{X}|j}\right) ,\quad \forall \ \varvec{x}\in \mathcal {X}. \end{aligned}$$

The equalities \(\varvec{\mu }_{\varvec{X}|g}=\widetilde{\varvec{\mu }}_{\varvec{X}|j}\), \(\varvec{\Sigma }_{\varvec{X}|g}=\widetilde{\varvec{\Sigma }}_{\varvec{X}|j}\), \(\varvec{\alpha }_{\varvec{X}|g}=\widetilde{\varvec{\alpha }}_{\varvec{X}|j}\), \(\lambda _{\varvec{X}|g}=\widetilde{\lambda }_{\varvec{X}|j}\), and \(\omega _{\varvec{X}|g}=\widetilde{\omega }_{\varvec{X}|j}\) then follow from the identifiability of the \(d_{\varvec{X}}\)-variate GH distribution, and this completes the proof. \(\square \)


About this article


Cite this article

Gallaugher, M.P.B., Tomarchio, S.D., McNicholas, P.D. et al. Multivariate cluster weighted models using skewed distributions. Adv Data Anal Classif 16, 93–124 (2022). https://doi.org/10.1007/s11634-021-00480-5

