Parsimony and parameter estimation for mixtures of multivariate leptokurtic-normal distributions

Mixtures of multivariate leptokurtic-normal distributions have recently been introduced in the clustering literature, within the family of mixtures of elliptical heavy-tailed distributions. They have the advantage of having parameters directly related to moments of practical interest. We derive two estimation procedures for these mixtures: the first is based on a majorization-minimization (MM) algorithm, while the second is based on a fixed point (FP) approximation. Moreover, we introduce parsimonious forms of the considered mixtures and use the illustrated estimation procedures to fit them. We use simulated and real data sets to investigate various aspects of the proposed models and algorithms. Supplementary Information: the online version contains supplementary material available at 10.1007/s11634-023-00558-2.

A Summary of notation and function definitions

• The function defined in Equation (2) of the paper adjusts the normal density, making it leptokurtic.
• The function u(r), defined in Equation (4) of the paper, is the log-density as a function of the squared Mahalanobis distance. Its first derivative is used in parameter estimation.
• The function, defined in Equation (5) of the paper, is used to find a bound on the Hessian of the log-density.
• The function is used to update β_j.
• The two matrices are used in the MM updates.
• The matrix function G_H(A, B) provides the solution to a continuous time algebraic Riccati equation using the Schur vectors of the associated Hamiltonian; following Laub (1979), we take G_H(A, B) = V_21 V_11^{-1}, where V_11 and V_21 are blocks of the orthogonal matrix of Schur vectors

V = [ V_11  V_12
      V_21  V_22 ]

that puts the associated Hamiltonian into quasi-upper-triangular form

V' H V = [ S_11  S_12
           0     S_22 ],

with the blocks arranged so that the real parts of the spectrum of S_11 are negative and the real parts of the spectrum of S_22 are positive.
• The function h (ψ, a, b) represents Algorithm 1 that will be described in Section B.7.
• |U|_ll denotes the lth diagonal element of the matrix U.
• As for the diag(·) operator: if the input is a matrix, it extracts the diagonal elements and puts them into a vector; if the input is a vector, it creates a diagonal matrix with the vector values along the main diagonal.
• G_svd[F] gives the maximizer of Tr(FΓ) over orthogonal matrices Γ, namely G_svd[F] = QP', where F has the singular value decomposition F = PDQ' (a short numerical sketch is given after this list).
• The function is used in the FP algorithm.
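To make the G_svd operator concrete, the following is a minimal numerical sketch (our illustration, not code from the paper), assuming the orthogonal Procrustes solution G_svd[F] = QP' stated above:

```python
import numpy as np

def G_svd(F):
    """Return the orthogonal Gamma maximizing Tr(F @ Gamma), via F = P D Q'."""
    P, D, Qt = np.linalg.svd(F)      # Qt is Q'
    return Qt.T @ P.T                # Gamma = Q P'

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 4))
Gamma = G_svd(F)
P, D, Qt = np.linalg.svd(F)
# Sanity checks: Gamma is orthogonal and Tr(F Gamma) equals the sum of singular values.
assert np.allclose(Gamma.T @ Gamma, np.eye(4))
assert np.isclose(np.trace(F @ Gamma), D.sum())
```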
B MM algorithm for updating µ_j and Σ_j

The minorizer in terms of the covariance parameters, with updated µ_j, given in (B.2), is the starting point for all the eigen-decomposed models.

B.1 VVV - Σ_j
To update Σ_j, we let Ξ_j = Σ_j^{-1/2}. Then, taking the derivative of (B.2) yields a system equivalent to a continuous time algebraic Riccati equation (for a review, see Wonham, 1968). From Laub (1979, Theorem 5), if I_d and B_j are symmetric and positive definite, then there exists a unique symmetric positive definite solution; in addition, Laub (1979) provides the solution using the Schur vectors of the associated Hamiltonian, which we represent with the matrix function G_H defined in Section A: V_11 and V_21 form an orthogonal transformation that puts the associated Hamiltonian into quasi-upper-triangular form, with the real parts of the spectrum of S_11 negative and those of S_22 positive. Calculating the Schur vectors, sorted by the eigenvalues, is implemented in LAPACK (Anderson et al., 1999); the update for Σ_j then follows.

B.2 EEE - Σ

To update Σ, we let Ξ = Σ^{-1/2}. Then, taking the derivative of (B.2) with respect to Ξ yields another continuous time algebraic Riccati equation, whose solution is again obtained via G_H; we then reparameterize to obtain Σ^{(q+1)}. A numerical sketch of the Schur-vector solution used in both updates is given below.
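As a concrete check of the Schur-vector construction behind G_H, the following minimal sketch (our illustration, not the authors' code, assuming a standard CARE of the form A'X + XA − XGX + Q = 0) compares Laub's method against SciPy's built-in solver:

```python
import numpy as np
from scipy.linalg import schur, solve_continuous_are

def care_schur(A, G, Q):
    """Solve A'X + XA - XGX + Q = 0 via Schur vectors of the Hamiltonian."""
    n = A.shape[0]
    H = np.block([[A, -G], [-Q, -A.T]])   # associated Hamiltonian
    # Ordered real Schur form: eigenvalues with negative real part come first,
    # so the first n columns of V span the stable invariant subspace.
    S, V, _ = schur(H, sort="lhp")
    V11, V21 = V[:n, :n], V[n:, :n]
    return V21 @ np.linalg.inv(V11)       # X = V21 V11^{-1}

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
Q, R = np.eye(n), np.eye(n)
X = care_schur(A, B @ np.linalg.solve(R, B.T), Q)
assert np.allclose(X, solve_continuous_are(A, B, Q, R), atol=1e-8)
```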

B.3 VII - Σ_j = λ_j I_d
Letting ξ_j = λ_j^{-1/2} and taking the derivative with respect to ξ_j yields a quadratic equation. An equivalent quadratic equation is 1 + az − bz² = 0, which has roots z = (a ± √(a² + 4b)) / (2b). Because b > 0, the first root, which we define as v(a, b), is positive and the second one is negative.
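As a quick numerical check of v(a, b) (a minimal sketch of our own, using the quadratic formula above):

```python
import math

def v(a, b):
    """Positive root of 1 + a*z - b*z**2 = 0; requires b > 0."""
    return (a + math.sqrt(a * a + 4.0 * b)) / (2.0 * b)

z = v(2.0, 3.0)                                  # here v(2, 3) = 1
assert abs(1.0 + 2.0 * z - 3.0 * z * z) < 1e-12  # z solves the quadratic
assert z > 0                                     # and is the positive root
```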
So the update for ξ_j can be defined via v(a, b).

B.4 EII - Σ = λ I_d

By letting ξ = λ^{-1/2} and taking the derivative with respect to ξ, we obtain the update.

B.5 VVI - Σ_j = λ_j diag(ψ_j)

Let ψ_j be a d-dimensional vector. By letting ξ_jl = λ_j ψ_jl and taking the derivative with respect to ξ_jl, we obtain the following updates for each l = 1, . . ., d and j = 1, . . ., k, where |U|_ll denotes the lth diagonal element of the matrix U.
Then, we can reparameterize to obtain updates for λ_j and ψ_j.
B.6 EEI - Σ_j = λΨ = λ diag(ψ)

Let ψ be a d-dimensional vector. By letting ξ_l = λψ_l and taking the derivative with respect to each ξ_l, we obtain a quadratic equation. The surrogate function in (B.2) becomes a function of Ψ and λ. Conditional on Ψ, an update for λ is obtained. Then, conditional on λ_j = λ_j^{(q+1)}, an update for ψ can be obtained by a reparameterization that yields a linear constraint. That is, we let η_l = −(1/2) log ψ_l; the constraint is now linear, and we can use the elimination of variables (Nocedal and Wright, 2006, Section 15.3) to form an unconstrained problem. We then perform Newton-Raphson on this unconstrained problem and reparameterize to update ψ.
Algorithm 1 defines the function that performs this procedure:

Algorithm 1 (the function h(ψ, a, b)):
1. Calculate the gradient and Hessian.
2. Perform Newton-Raphson.
3. Obtain ψ by setting ψ_l = e^{−2η_l} for l = 2, . . ., d and ψ_1 = e^{−2 Σ_{l=2}^{d} η_l}.
4. Return ψ.

Note that one can use the Sherman-Morrison formula to find the inverse of the Hessian; a short numerical sketch of this step is given below. We define the function h to represent Algorithm 1, so that the update can be written in terms of h, and applying Algorithm 1 yields the corresponding update.
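The following is a minimal sketch (our illustration; the diagonal-plus-rank-one Hessian structure is an assumption suggested by the Sherman-Morrison remark above) of how each Newton step can avoid a full matrix inversion:

```python
import numpy as np

def sherman_morrison_solve(d, u, v, g):
    """Solve (diag(d) + u v') x = g using the Sherman-Morrison formula."""
    Dg, Du = g / d, u / d                       # diag(d)^{-1} g and diag(d)^{-1} u
    return Dg - Du * (v @ Dg) / (1.0 + v @ Du)  # rank-one correction

rng = np.random.default_rng(2)
d_ = rng.uniform(1.0, 2.0, 5)
u, v_, g = (rng.standard_normal(5) for _ in range(3))
x = sherman_morrison_solve(d_, u, v_, g)
assert np.allclose((np.diag(d_) + np.outer(u, v_)) @ x, g)
```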
By conditioning on Γ_j and letting ξ_l = λψ_l, we can obtain the following update and reparameterize it to update λ and ψ. An iterative procedure can be obtained using an accelerated line search (Browne and McNicholas, 2014b; Absil et al., 2008), but here we follow Browne and McNicholas (2014a) and Kiers (2002) and construct an MM algorithm. Then, to update Γ_j, we view (B.2) as a function of Γ_j, with the matrix function F defined so that α is the smallest eigenvalue of Λ. According to Cliff (1996), the maximizer of Tr(FΓ) is given by G_svd[F] = QP', where F has the singular value decomposition F = PDQ'. Applying (B.8) and (B.9), the update can be written accordingly. By conditioning on Γ and letting ξ_jl = λ_j ψ_jl, we can obtain the following update and reparameterize it to update λ_j and ψ_jl. If we let Λ_j = λ_j diag(ψ_j), then (B.2), as a function of Γ, takes the same form; utilizing (B.8) and (B.9), we have the following update. Conditional on Ψ_j and Γ_j, an update for ξ is obtained; then we obtain λ^{(q+1)} = (ξ^{(q+1)})^{-2} and utilize Algorithm 1 to get the following update.
Finally, we construct Λ and obtain the update. The solutions of each individual quadratic equation are obtained analogously to (B.5) and (B.6). Conditional on Ψ, an update for ξ is obtained; then we obtain λ^{(q+1)} and utilize Algorithm 1 to get the following update. Finally, we construct Λ^{(q+1)} and obtain the update.

C Fixed point iteration with weighted average
For a single component, with (µ, β) fixed, the update for Σ is based on the following quantity; note that κ_i can be written in the following form.
So, the update can be viewed as a fixed point iteration. To take the derivative, we apply vectorization; taking the derivative with respect to vec Σ, we get the Jacobian. The function κ_β(r) r² has stationary points equal to the three roots of the following polynomial; if r_1, r_2, and r_3 are the roots, then the polynomial factors accordingly. A natural fix is a relaxed Picard fixed point iteration (a generic sketch is given below). If a < 1/K, the fixed point iteration will converge. However, a = 1/K seems too strict a bound, and simulations show it can be higher. We further compare the weights by examining the log-likelihood sequences produced when using a weight equal to 1/K or given by (C.17), while checking each sequence for log-likelihood fluctuations.
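To illustrate the relaxed Picard iteration (a generic sketch with a toy map g; the MLN-specific update is not reproduced here), the damped update is simply a weighted average of the current point and the fixed point map:

```python
import numpy as np

def relaxed_picard(g, x0, a=0.5, tol=1e-4, max_iter=1000):
    """Iterate x <- (1 - a) * x + a * g(x) until successive iterates agree."""
    x = x0
    for _ in range(max_iter):
        x_new = (1.0 - a) * x + a * g(x)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Toy example: the map g(x) = cos(x) has a unique fixed point near 0.739.
x_star = relaxed_picard(np.cos, 0.0, a=0.7)
assert abs(x_star - np.cos(x_star)) < 1e-3
```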
We generate 100 observations while varying d to be 10, 15, 20, and 25, using the weight from (C.16) or instead from (C.17). Table C.1 shows the average number of iterations until convergence while estimating either Σ or a diagonal Σ, based on 100 replications for each d. The convergence tolerance was set to 10^{-4}. Both weights yielded monotonic log-likelihood sequences, and using (C.17) requires fewer iterations.

E Simulation study: Investigating some aspects of the MLN mixture

An expanded form for the parameter setups given in Equations (16)-(18) of the paper is given by

E.1 Asymptotic properties of the ML estimators
Here, we present the additional figures from the simulation study described within Section 8.1 of the manuscript.

E.2 Choosing the number of components
Here, we present the additional tables from the simulation study described within Section 8.2 of the manuscript.

F Simulation study: Computational time
To quantify the viability of the estimation procedure on large data, we partition the models into two groups: the first is obtained by fixing the rotation matrices to the identity matrix, and the second leaves the rotation matrices to be estimated from the data. The two groups refer to the third letter in the four-letter notation: the first group has "I", while the other has either "E" or "V". We will denote these two groups by I and non-I.
Figure C.3 shows a = 1/K as a function of d.

Figure C.4: The proportion of runs with fluctuations while varying the weight a in (C.15) and the number of variables d, with the sample size equal to n = 100. The left panel involves estimating a general Σ and the middle panel involves estimating a diagonal Σ.

Figures C.4 and C.5 indicate that the weight does not need to be as strict as, or equal to, 1/K (the value of 1/K is shown in Figure C.3). Browne (2022) suggests using the weight given by (C.17).

Figure C.5: The proportion of runs with fluctuations while varying the weight a in (C.15) and the number of variables d, with the sample size equal to n = 100. The left panel involves estimating a general Σ and the middle panel involves estimating a diagonal Σ.

Figure C.6: Two illustrations, with and without log-likelihood fluctuations, using different weights. The left panel has d = 10 and the right panel has d = 16.

Figures E.7, E.8, and E.9 show the log(Average MSE) when varying the data generating and fitted models for k = 2, 3, and 4, respectively. Each column is related to the data generating model and each row to the fitted model. Each cell contains the log(Average MSE) based on 100 replications when d ∈ {2, 8, 16}; the sample size varies by color within the set {100, 500, 1000}.

Figure E.7: The log(Average MSE) when varying the data generating and fitted models. In each subplot, columns refer to the data generating model and rows to the fitted model. Each point on the line represents the log(Average MSE), based on 100 replications, when k = 2, d ∈ {2, 8, 16}, and n ∈ {100, 500, 1000}.

Figure E.8: The log(Average MSE) when varying the data generating and fitted models. In each subplot, columns refer to the data generating model and rows to the fitted model. Each point on the line represents the log(Average MSE), based on 100 replications, when k = 3, d ∈ {2, 8, 16}, and n ∈ {100, 500, 1000}.

Figure E.9: The log(Average MSE) when varying the data generating and fitted models. In each subplot, columns refer to the data generating model and rows to the fitted model. Each point on the line represents the log(Average MSE), based on 100 replications, when k = 4, d ∈ {2, 8, 16}, and n ∈ {100, 500, 1000}.

Table C.1: Average number of iterations until convergence with a tolerance equal to 10^{-4}, based on 100 replications, with sample size equal to 100.

D Simulation study: comparing MM and FP algorithms

Table D.2: Average computational time and number of iterations until convergence (with the maximum number of iterations fixed to 1000) for each estimation algorithm over 100 replications when d = 16. The columns with header l_max − l show the average difference between the log-likelihood at convergence and the maximum log-likelihood for that data set. The output is displayed in pairs (FP, MM), giving the results from the FP and MM algorithms.

Table E.6 is the analogue, for d = 16, of Table 8 in the paper, which refers to the case d = 2; it gives the number of components selected by the BIC when fixing the model structure and varying the number of components. Table E.7 is the analogue, for d = 16, of Table 9 in the paper, which refers to the case d = 2; it gives the number of times each pair (model, k) was selected by the BIC for each scenario. The search space was k ∈ {1, . . ., 8} and all the possible model structures.
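A runnable sketch of this kind of selection loop over pairs (model, k) is given below (our illustration: scikit-learn's Gaussian mixture stands in for the MLN mixture, which sklearn does not implement; sklearn's bic() returns −2 log L + ν log n, so smaller is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])

# Search over pairs (model structure, k) and keep the pair with the smallest BIC.
candidates = [
    (GaussianMixture(n_components=k, covariance_type=c, random_state=0).fit(X).bic(X), c, k)
    for k in range(1, 9)
    for c in ("full", "tied", "diag", "spherical")
]
bic, structure, k = min(candidates)
print(f"selected structure {structure!r} with k = {k} (BIC = {bic:.1f})")
```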

Table E.6: Number of times each value of k is selected by the BIC when fixing the model structure and varying the number of components. Each row refers to 100 data sets generated from an MLN mixture with θ_k(δ, λ, ρ, d = 16).