Variable selection for skew-normal mixture of joint location and scale models

Although many papers address variable selection based on the mean model in finite mixtures of regression models, little work has been done on selecting significant explanatory variables when modelling the variance parameter. In this paper, we propose and study a novel class of models, skew-normal mixtures of joint location and scale models, to analyze heteroscedastic skew-normal data coming from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies, and a real-data example illustrates their use.


§1 Introduction
Homoscedasticity of the scale parameter is a common assumption in many regression models. However, this may not be appropriate in some situations, such as when the data are heteroscedastic. One suitable approach is to model the variance parameter. Particularly in econometrics and in industrial quality-improvement experiments, modelling the variance to identify the sources of variability in the observations is of direct interest in its own right, for example in Taguchi-type experiments for robust design [1]. In addition, efficient estimation of the mean parameters in regression depends on correct modelling of the variance, and the loss of efficiency from using constant-variance models when the variance is varying may be substantial. Thus, modelling the variance can be as important as modelling the mean. Joint mean and variance models have received a lot of attention in recent years. Park [2] proposed a log-linear model for the variance parameter and described a two-stage process for estimating the parameters of the Gaussian model. Harvey [3] discussed maximum-likelihood (ML) estimation of the mean and variance effects and the subsequent likelihood ratio test under general conditions. Aitkin [4] provided ML estimation for a joint mean and variance model and applied it to the commonly cited Minitab tree data. Wu et al. [5] proposed a hybrid strategy in which variable selection is employed to reduce the dimension of the explanatory variables in joint mean and variance models, and a Box-Cox transformation is made to remedy the distribution of the response. Wu et al. [6] investigated simultaneous variable selection in joint location and scale models of the skew-normal distribution. Wu et al. [7] proposed a unified penalized likelihood method to simultaneously select significant variables and estimate unknown parameters in a joint location, scale and skewness model with a skew-t-normal (StN) distribution when outliers and asymmetric outcomes are present.
Mixtures of regression models are well known as switching regression models in the econometrics literature, and were first introduced by Goldfeld and Quandt [8]. They have been applied in various fields including biology, medicine, economics, agriculture and animal science; see [9][10][11][12]. Mixtures of regression models can easily be applied to analyze data sets in which two or more subpopulations are mixed together. Owing to this modelling flexibility, they have received intensive attention over the past years, from both practical and theoretical perspectives. In particular, on variable selection, Khalili and Chen [13] proposed a new regularization method for variable selection in finite mixtures of regression models. Khalili [14] investigated new estimation and feature selection methods in mixture-of-experts models. Khalili and Lin [15] developed a new regularization method for finite mixtures of regression models with a diverging number of parameters. Khalili [16] gave an overview of feature selection methods in finite mixtures of regression models. For other variable selection methods for mixtures of regression models, we refer to Du et al. [17], Ormoz and Eskandari [18], Lee et al. [19], Khalili et al. [20], and Tang and Karunamun [21].
The existing studies on mixtures of regression models mainly rely on the normality assumption for the response variables. This assumption may be inappropriate in some applications. For a data set containing a group or groups of observations with asymmetric behavior, the use of normal components may be unsuitable and inferences can be misleading. In particular, the normal mixture model tends to overfit when additional components are included to capture the skewness. We therefore introduce the skew-normal distribution to overcome this potential weakness of normal mixtures. Liu and Lin [22] first developed a skew-normal mixture of regression models, but they only considered the mixture of location regression models using the univariate skew-normal distribution.
In the standard formulation of all the above-mentioned work within the framework of mixtures of regression models, it is assumed that the variance of each component is constant across observations. However, in many practical situations this assumption may not hold. A more general formulation allows the variance to vary across observations, which has received little attention in the literature. When the variance really does vary, inference carried out under the assumption of a fixed constant variance may be highly inaccurate. The common strategy in this situation is to augment the model by defining another regression structure for the variance parameter in a homogeneous population (see Aitkin [4]; Cepeda and Gamerman [23]; Taylor and Verbyla [24]; Zhang and Wang [25]; Wu and Li [26]; Zhao and Zhang [27]; Wu et al. [7]).
Similarly to the modelling of the variance parameter in a homogeneous population, we apply the idea of joint mean and variance models to mixtures of regression models and propose and study a novel class of models, the skew-normal mixture of joint location and scale models (SNMJLSM), to analyze heteroscedastic skew-normal data coming from a heterogeneous population. The problem of variable selection for the proposed model is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies, and a real-data example illustrates their use.
The outline of the article is as follows. In Section 2, we propose the skew-normal mixture of joint location and scale models (SNMJLSM). In Section 3, we present variable selection for the proposed models via a penalized likelihood-based method: Section 3.1 introduces the penalized likelihood, Section 3.2 studies the asymptotic properties of the proposed variable selection procedure, Section 3.3 presents the EM algorithm and the numerical solution for the estimators, and Section 3.4 discusses the choice of the tuning parameters. In Section 4, we carry out simulation studies to assess the finite-sample performance of the method. In Section 5, a real data set on the air quality index (AQI) is analyzed to demonstrate the proposed methods. Some concluding remarks are given in Section 6. Technical proofs are collected in the Appendix.

§2 SN mixture of joint location and scale models

Skew-normal distribution
In this section, we introduce the skew-normal distribution, as developed by Azzalini [28]. The random variable Y is said to have a skew-normal distribution with location parameter µ, scale parameter σ and skewness parameter λ, denoted by Y ∼ SN(µ, σ^2, λ), if it has the probability density function

f(y; µ, σ^2, λ) = (2/σ) ϕ((y − µ)/σ) Φ(λ(y − µ)/σ),  (1)

where ϕ(·) and Φ(·) are the density function and the cumulative distribution function of the standard normal distribution, respectively. If λ = 0, then the density of Y in (1) reduces to that of N(µ, σ^2).
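A quick numerical sanity check of the density in (1) is possible with `scipy.stats.skewnorm`, whose shape parameter plays the role of λ. This is an illustrative sketch, not part of the proposed methodology:

```python
import numpy as np
from scipy.stats import norm, skewnorm

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal density SN(mu, sigma^2, lam) as in Eq. (1)."""
    z = (y - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)

# Cross-check against scipy's skewnorm (shape a = lam, loc = mu, scale = sigma).
y = np.linspace(-4, 8, 101)
manual = sn_pdf(y, mu=1.0, sigma=2.0, lam=3.0)
scipy_pdf = skewnorm.pdf(y, a=3.0, loc=1.0, scale=2.0)
print(np.max(np.abs(manual - scipy_pdf)))  # agrees to numerical precision

# With lam = 0 the density reduces to the ordinary normal density.
print(np.allclose(sn_pdf(y, 0.0, 1.0, 0.0), norm.pdf(y)))  # True
```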

SN mixture of joint location and scale models
Let y_1, y_2, ..., y_n be a random sample of size n from an m-component skew-normal mixture model with unknown mixing probabilities π_1, π_2, ..., π_m. The probability density function of the random variable Y has the m-component mixture form

f(y; Ψ) = ∑_{j=1}^{m} π_j f_SN(y; µ_j, σ_j^2, λ_j),  (2)

where f_SN(·; µ_j, σ_j^2, λ_j) is the skew-normal density in (1) and the π_j are the mixing probabilities with ∑_{j=1}^{m} π_j = 1. The quantities π_j, µ_j, σ_j^2, λ_j are unknown, and the parameter vector of the model is Ψ = (π_1, ..., π_m, µ_1, ..., µ_m, σ_1^2, ..., σ_m^2, λ_1, ..., λ_m)^T. In this paper, we assume that the number of components m is fixed and known. In some practical applications m may be unknown and need to be estimated along with the mixing probabilities and the other parameters, but to simplify matters we only consider the case where m is known and estimate only the unknown parameter vector Ψ.
Consider the following skew-normal mixture of joint location and scale models (SNMJLSM):

y_i ∼ ∑_{j=1}^{m} π_j SN(µ_ij, σ_ij^2, λ_j),  µ_ij = x_i^T β_j,  log σ_ij^2 = h_i^T γ_j,  i = 1, 2, ..., n.  (3)

In model (3), y_i is an independent response variable, and x_i = (x_i1, x_i2, ..., x_ip)^T and h_i = (h_i1, h_i2, ..., h_iq)^T are explanatory variables. In the jth subpopulation, β_j = (β_j1, β_j2, ..., β_jp)^T is the unknown parameter vector of the location model, γ_j = (γ_j1, γ_j2, ..., γ_jq)^T is the unknown parameter vector of the scale model, and λ_j is the skewness parameter. The covariate vectors x_i and h_i may be identical, completely different, or partly overlapping; that is, the location model and the scale model may incorporate different covariates, or some of the same covariates, and may depend on common covariates in different ways. In this paper, we aim to remove the unnecessary explanatory variables from model (3).
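The location and scale structure of model (3) can be sketched in a few lines. The log link for the scale and all names below are illustrative assumptions consistent with the text, not the authors' code:

```python
import numpy as np

# Hypothetical design: n observations, p location covariates, q scale covariates,
# m = 2 components. The log link log(sigma_ij^2) = h_i^T gamma_j is assumed,
# as is standard in joint location-scale modelling.
rng = np.random.default_rng(0)
n, p, q, m = 5, 3, 2, 2
X = rng.normal(size=(n, p))          # location covariates x_i
H = rng.normal(size=(n, q))          # scale covariates h_i
beta = rng.normal(size=(m, p))       # beta_j, one row per component
gamma = rng.normal(size=(m, q))      # gamma_j, one row per component

mu = X @ beta.T                      # mu[i, j]     = x_i^T beta_j
sigma2 = np.exp(H @ gamma.T)         # sigma2[i, j] = exp(h_i^T gamma_j)

print(mu.shape, sigma2.shape)        # (5, 2) (5, 2)
print(np.all(sigma2 > 0))            # True: the log link guarantees positive scales
```

The log link is what lets the scale coefficients γ_j be penalized like ordinary regression coefficients while keeping every σ_ij^2 positive.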

Identifiability
In finite mixture models of any class of distributions, it is important to consider the identifiability property, because procedures for parameter estimation can be ill defined without it. Otiniano et al. [29] presented a proof of the identifiability of the classes of all finite mixtures of skew-normal and skew-t distributions. The identifiability of some mixture models has been investigated by several authors, among them Teicher [30] and Atienza et al. [31].
In this paper, we assume that the SNMJLSM under study is identifiable. Consider an SNMJLSM as in (3). For a given design matrix, the SNMJLSM is said to be identifiable if, for any two parameter settings (m, Ψ) and (m*, Ψ*), the equality

f(y_i; Ψ) = f(y_i; Ψ*)

for each i = 1, 2, ..., n and all possible values of y implies m = m* and Ψ = Ψ*. Identifiability means that no two sets of different parameter values yield the same density functions. Following Hennig [32] and Wang et al. [33], we interpret Ψ = Ψ* in the above definition up to a permutation of the component labels.

§3 Estimation and variable selection method

Penalized likelihood
Many traditional variable selection criteria can be considered as penalized likelihoods which balance modeling biases and estimation variances [34]. They can enhance model interpretability through a parsimonious representation and also improve estimation accuracy by efficiently identifying the significant variables.
Suppose that we have a random sample (y_i, x_i, h_i), i = 1, 2, ..., n, from model (3). Let l_n(Ψ) denote the log-likelihood function conditional on the x_i and h_i. Then the log-likelihood function of Ψ is given by

l_n(Ψ) = ∑_{i=1}^{n} log { ∑_{j=1}^{m} π_j f_SN(y_i; x_i^T β_j, exp(h_i^T γ_j), λ_j) }.  (4)

Similarly to Khalili and Chen [13], the penalized log-likelihood function is defined as

L_n(Ψ) = l_n(Ψ) − p_n(Ψ),  (5)

where l_n(Ψ) is the log-likelihood function in (4) and the penalty function p_n(Ψ) is

p_n(Ψ) = ∑_{j=1}^{m} ∑_{t=1}^{p} p_n(β_jt; τ_j) + ∑_{j=1}^{m} ∑_{t=1}^{q} p_n(γ_jt; τ_j).  (6)

Here p_n(·; τ_j) is a nonnegative function indexed by tuning parameters τ_j ≥ 0. The first part of p_n(Ψ) penalizes the regression coefficients β_jt of the location model, and the second part penalizes the coefficients γ_jt of the scale model. General conditions on the proper choice of the penalty functions in (6) are given in the next subsection. If some of the coefficients β_jt or γ_jt (or both) are small in model (3), then, in fitting the model to the data through maximization of L_n(Ψ), the hope is that the penalty function p_n(Ψ) will force the estimated values of those coefficients to zero. The method thus performs variable selection automatically, which makes it computationally much more efficient than all-subset selection methods.
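As an illustration of the objective being maximized, a minimal sketch of the penalized log-likelihood with a LASSO-type penalty; all parameter values are arbitrary, not fitted:

```python
import numpy as np
from scipy.stats import skewnorm

# Sketch: mixture log-likelihood l_n(Psi) minus a LASSO-type penalty on the
# location coefficients beta and scale coefficients gamma. A log link for the
# scale is assumed; values below are illustrative only.

def penalized_loglik(y, X, H, pi, beta, gamma, lam, tau):
    mu = X @ beta.T                                   # mu_ij = x_i^T beta_j
    sigma = np.sqrt(np.exp(H @ gamma.T))              # scales via the log link
    comp = pi * skewnorm.pdf(y[:, None], a=lam, loc=mu, scale=sigma)
    loglik = np.log(comp.sum(axis=1)).sum()           # mixture log-likelihood
    penalty = tau * (np.abs(beta).sum() + np.abs(gamma).sum())
    return loglik - penalty                           # penalized log-likelihood

rng = np.random.default_rng(2)
n, p, q, m = 8, 2, 2, 2
y = rng.normal(size=n)
X, H = rng.normal(size=(n, p)), rng.normal(size=(n, q))
args = (np.array([0.5, 0.5]), rng.normal(size=(m, p)),
        rng.normal(size=(m, q)), np.array([1.0, -1.0]))
# A larger tau can only decrease the penalized log-likelihood at fixed parameters.
print(penalized_loglik(y, X, H, *args, tau=0.0)
      > penalized_loglik(y, X, H, *args, tau=1.0))   # True
```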
In general, we should choose appropriate penalty functions to suit the needs of the application, under the guidance of statistical theory. The following three penalty functions have been investigated in the literature in a number of contexts and will be used here to illustrate the theory:

LASSO: p_n(θ; τ) = τ|θ|.
SCAD: defined through its derivative p'_n(θ; τ) = τ{ I(θ ≤ τ) + (aτ − θ)_+ / ((a − 1)τ) I(θ > τ) } for θ > 0 and some a > 2; following the convention in Fan and Li [34], we set a = 3.7 in our work.
HARD: p_n(θ; τ) = τ^2 − (|θ| − τ)^2 I(|θ| < τ).

The LASSO penalty [35] has the good property of enabling easy numerical computation. The SCAD penalty gives good performance in selecting important variables without creating excessive biases. HARD [36] should work much like SCAD, although less smoothly [13].
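The three penalties can be sketched as follows; the SCAD closed form is obtained by integrating its derivative with a = 3.7 as in Fan and Li [34], and all values are illustrative:

```python
import numpy as np

def lasso(theta, tau):
    return tau * np.abs(theta)

def scad(theta, tau, a=3.7):
    # Closed form of SCAD: linear near zero, quadratic blend, then constant.
    theta = np.abs(theta)
    return np.where(
        theta <= tau,
        tau * theta,
        np.where(
            theta <= a * tau,
            -(theta**2 - 2 * a * tau * theta + tau**2) / (2 * (a - 1)),
            (a + 1) * tau**2 / 2,
        ),
    )

def hard(theta, tau):
    theta = np.abs(theta)
    return tau**2 - (theta - tau) ** 2 * (theta < tau)

tau = 1.0
print(scad(10.0, tau))  # constant (a + 1) tau^2 / 2 = 2.35 for large theta
print(hard(10.0, tau))  # constant tau^2 = 1.0 for large theta
```

Unlike LASSO, both SCAD and HARD are flat for large |θ|, which is what keeps large coefficients essentially unbiased while still shrinking small ones to zero.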

Asymptotic properties
Without loss of generality, partition the coefficient vectors as β_j = (β_1j^T, β_2j^T)^T and γ_j = (γ_1j^T, γ_2j^T)^T, where β_2j and γ_2j contain all the parameters corresponding to zero effects in the true model. The vector of parameters in the true model is Ψ_0, and its components are denoted with a superscript 0, such as β_jt^0. Let p'_n(Ψ; τ_j) and p''_n(Ψ; τ_j) be the first and second derivatives of the function p_n(Ψ; τ_j) with respect to Ψ, for tuning parameters τ_j that depend on the sample size n. Using these quantities, the function p_n(Ψ; τ_j) is required to satisfy the following conditions: C_0: For all n and τ_j, p_n(0; τ_j) = 0, and p_n(Ψ; τ_j) is symmetric and nonnegative. In addition, it is nondecreasing and twice differentiable for all Ψ in (0, ∞) with at most a few exceptions.
Conditions C_0 to C_2 guarantee √n-consistency of the estimators of the true nonzero regression coefficients, and also the consistency of the method in variable selection.
Theorem 1. Let v_i = (y_i, x_i, h_i), i = 1, 2, ..., n, be a random sample from a density function f(v, Ψ) that satisfies the regularity conditions R_1 to R_5 in the Appendix. Assume that the function p_n(Ψ; τ_j) satisfies the regularity conditions C_0 and C_1, and also suppose τ_j/√n → 0 as n → ∞. Then there exists a √n-consistent local maximizer Ψ̂_n of the regularized log-likelihood function L_n(Ψ). This rate is achievable by a proper choice of the tuning parameters τ_j.

The EM algorithm
In the context of finite mixture models, because of the existence of the latent variables z_ij, the classical MLE is not directly applicable. The expectation-maximization (EM) algorithm of Dempster et al. [37] provides a convenient approach to parameter estimation. However, by condition C_0, the function p_n(Ψ; τ_j) is not differentiable at Ψ = 0, so the Newton-Raphson algorithm cannot be used directly. We follow the suggestion of Fan and Li [34] and approximate p_n(Ψ; τ_j) in a neighbourhood of a nonzero Ψ_0 by the local quadratic function

p_n(Ψ; τ_j) ≈ p_n(Ψ_0; τ_j) + (1/2) { p'_n(|Ψ_0|; τ_j) / |Ψ_0| } (Ψ^2 − Ψ_0^2),

which is more suitable for our application than a simple Taylor expansion. Let Ψ^(k) be the parameter value after the kth iteration.
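A quick numerical check of the local quadratic approximation, illustrated with the LASSO penalty for concreteness (all values are illustrative):

```python
# Local quadratic approximation (LQA) of Fan and Li near a nonzero theta0:
#   p(theta) ~ p(theta0) + 0.5 * p'(|theta0|) / |theta0| * (theta^2 - theta0^2).
# Here p(theta) = tau * |theta| is the LASSO penalty, so p'(theta) = tau.

def lqa(theta, theta0, tau):
    return tau * abs(theta0) + 0.5 * (tau / abs(theta0)) * (theta**2 - theta0**2)

tau, theta0 = 1.0, 2.0
for theta in (1.9, 2.0, 2.1):
    exact, approx = tau * abs(theta), lqa(theta, theta0, tau)
    print(theta, round(exact, 4), round(approx, 4))
# At theta = theta0 the approximation is exact; nearby, the error is second order.
```

The approximation is smooth in θ, which is what makes Newton-Raphson-style updates inside the EM iterations possible despite the kink of the penalty at zero.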
We replace p_n(Ψ) in the penalized log-likelihood function in (5) by the local quadratic approximation evaluated at Ψ^(k). The revised EM algorithm is as follows. Let the complete-data log-likelihood function be

l_c(Ψ) = ∑_{i=1}^{n} ∑_{j=1}^{m} z_ij { log π_j + log f_SN(y_i; x_i^T β_j, exp(h_i^T γ_j), λ_j) },

where the conditional expectation of the missing labels z_ij at the kth iteration is

ω_ij^(k) = E(z_ij | y_i; Ψ^(k)) = π_j^(k) f_SN(y_i; x_i^T β_j^(k), exp(h_i^T γ_j^(k)), λ_j^(k)) / ∑_{l=1}^{m} π_l^(k) f_SN(y_i; x_i^T β_l^(k), exp(h_i^T γ_l^(k)), λ_l^(k)).
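The computation of the conditional expectations of the missing labels (the E-step) can be sketched for a two-component model as follows; all parameter values are illustrative, and `scipy.stats.skewnorm` supplies the component densities:

```python
import numpy as np
from scipy.stats import skewnorm

def e_step(y, X, H, pi, beta, gamma, lam):
    """Responsibilities w[i, j] = P(z_ij = 1 | y_i) under current parameters."""
    mu = X @ beta.T                          # locations mu_ij = x_i^T beta_j
    sigma = np.sqrt(np.exp(H @ gamma.T))     # scales via the assumed log link
    # Weighted component densities pi_j * f_SN(y_i; mu_ij, sigma_ij^2, lam_j)
    dens = pi * skewnorm.pdf(y[:, None], a=lam, loc=mu, scale=sigma)
    return dens / dens.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, p, q, m = 6, 2, 2, 2
X, H = rng.normal(size=(n, p)), rng.normal(size=(n, q))
y = rng.normal(size=n)
w = e_step(y, X, H,
           pi=np.array([0.4, 0.6]),
           beta=rng.normal(size=(m, p)),
           gamma=rng.normal(size=(m, q)),
           lam=np.array([2.0, -1.0]))
print(np.allclose(w.sum(axis=1), 1.0))  # responsibilities sum to one per row
```

The M-step would then maximize the quadratically penalized complete-data log-likelihood with these responsibilities as weights; that step is omitted here.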

Choosing the tuning parameters
When using the methods proposed in this paper, one needs to choose the tuning parameters τ_j. Our current theory provides only some guidance on the order of τ_j needed to ensure the consistency of the method in variable selection. Fan and Li [34] and Khalili and Chen [13] showed that the tuning parameter selected by generalized cross-validation (GCV) leads to a nonignorable overfitting effect in the final selected model. They instead used the BIC for choosing the tuning parameter, and the method was shown to be consistent in selecting the true sparse model. We also propose using a BIC-type approach.
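A BIC-type choice can be sketched as a grid search over candidate values of τ. The function `fit_penalized` below is a hypothetical stand-in for the penalized EM fit, not an implementation of it:

```python
import numpy as np

# BIC(tau) = -2 * loglik(tau) + df(tau) * log(n), where df(tau) counts the
# nonzero estimated coefficients; the tau minimizing BIC is selected.

def select_tau(taus, fit_penalized, n):
    best_tau, best_bic = None, np.inf
    for tau in taus:
        loglik, coef = fit_penalized(tau)    # fitted log-likelihood, coefficients
        df = np.count_nonzero(coef)          # model degrees of freedom
        bic = -2.0 * loglik + df * np.log(n)
        if bic < best_bic:
            best_tau, best_bic = tau, bic
    return best_tau

# Toy illustration with a fake fitting function: a larger tau zeroes out a small
# coefficient at a mild likelihood cost, so BIC favors the sparser model.
def fake_fit(tau):
    coef = np.array([2.0, 1.5, 0.0 if tau >= 0.5 else 0.1])
    loglik = -100.0 - tau
    return loglik, coef

print(select_tau([0.1, 0.5, 1.0], fake_fit, n=200))  # 0.5
```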
§4 Simulation studies

For the sake of comparison, we carry out simulations with the three penalties described above. The performance of the proposed method for variable selection is investigated via these simulations.
The simulation results are reported using the following two quantities:
C: the average number of zero regression coefficients that are correctly estimated as zero.
IC: the average number of nonzero regression coefficients that are incorrectly estimated as zero.
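For concreteness, the two summaries can be computed as follows on toy estimates; the coefficient values are illustrative:

```python
import numpy as np

# C counts true zeros estimated as zero; IC counts true nonzeros incorrectly
# estimated as zero; both are averaged over simulation replications.

def c_ic(true_coefs, est_coefs, tol=1e-8):
    true_coefs = np.asarray(true_coefs, dtype=float)
    est_coefs = np.asarray(est_coefs, dtype=float)   # shape (replications, n_coef)
    est_zero = np.abs(est_coefs) < tol
    c = est_zero[:, np.abs(true_coefs) < tol].sum(axis=1).mean()
    ic = est_zero[:, np.abs(true_coefs) >= tol].sum(axis=1).mean()
    return float(c), float(ic)

true_b = [1.0, 0.0, 0.0, 2.0]
est = [[0.9, 0.0, 0.1, 1.8],   # one true zero found, no nonzero lost
       [1.1, 0.0, 0.0, 0.0]]   # both true zeros found, one nonzero lost
print(c_ic(true_b, est))       # (1.5, 0.5)
```

Ideally C equals the true number of zero coefficients and IC equals zero.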
(2) For a given sample size n, the performances of the three variable selection procedures are similar in terms of model complexity. The performances of the SCAD and HARD procedures are similar in terms of model error, and both are better than that of LASSO in this respect. However, the proposed method does not perform well for small sample sizes.
(3) For a given penalty function and sample size n, the performance of variable selection for components 1 and 2 in the location model is significantly better than that in the scale model in terms of model error and model complexity. This may be because the estimation of the scale model parameters is biased.
(4) For a given sample size n with mixing proportions π_1 = π_2 = 0.5, as expected, the performances of the three variable selection procedures in the two subpopulations are similar in terms of model error and model complexity. However, in the case π_1 = 0.35, π_2 = 0.65, the performance of the three variable selection procedures in the second subpopulation is significantly better than in the first subpopulation in terms of model error and model complexity.

§5 Application to real data

In this section, we illustrate the proposed variable selection procedure using air quality index (AQI) data. We collected the daily average AQI values of Hangzhou city and Zhengzhou city in China from May 1, 2015 to March 31, 2017, totaling 670 observations. This AQI data set consists of the response variable Y (AQI) and seven predictors: X1, fine particulate matter (PM2.5); X2, inhalable particulate matter (PM10); X3, sulfur dioxide (SO2); X4, carbon monoxide (CO); X5, nitrogen dioxide (NO2); X6, ozone (O3); and X7, AQI day ranking. We are interested in establishing the relationship between Y (AQI) and the important predictors. Figures 1 and 2 indicate that the AQI data of Hangzhou city and of Zhengzhou city each follow approximately a skew-normal distribution. The AQI data of Hangzhou are concentrated in the range 50-100, while those of Zhengzhou are mainly concentrated in 100-200. Thus, there are some differences in the air quality indices of the two cities. It may not be appropriate to describe them with the same model, so the data are analyzed using a finite mixture model.
According to the above analysis, Y, the AQI of Hangzhou city and of Zhengzhou city, follows approximately a skew-normal distribution in each city. We can therefore consider variable selection for the AQI data under the two-component SNMJLSM, that is, model (3) with m = 2, where both the location and the scale models include the seven predictors X1 to X7. We apply the variable selection procedures based on SCAD, LASSO and HARD proposed in Section 3 to this model. The results are displayed in Table 4, where H and Z denote Hangzhou and Zhengzhou, respectively. From Table 4, we can see that in this data example: (1) the estimation and variable selection procedures based on the SCAD, LASSO and HARD penalties perform very similarly in terms of the selected variables.
(2) In the location model, β_3 is zero for both cities. This indicates that X3 (SO2) has no significant impact on the location of Y (AQI). For Hangzhou, X2 (PM10) and X6 (O3) are unimportant variables for the location of Y (AQI); for Zhengzhou, X5 (NO2) is an unimportant variable. There may be differences between northern and southern air pollutants: winter heating in the north requires burning a large amount of coal, which may explain why the variable selection results for Hangzhou and Zhengzhou differ.
(3) In the scale model, γ_2, γ_3 and γ_4 are zero. This indicates that X2 (PM10), X3 (SO2) and X4 (CO) have little influence on the scale parameter of Y (AQI) for both Hangzhou and Zhengzhou. For Zhengzhou, moreover, X5 (NO2) and X7 (AQI day ranking) have no significant impact on the scale parameter of Y (AQI).
(4) As for the skewness parameters, the estimate for the Y (AQI) of Hangzhou is λ_1 = 5.6845 and that of Zhengzhou is λ_2 = 1.1668. This indicates that the AQI data of Hangzhou and of Zhengzhou are both positively skewed.

§6 Conclusions

In this paper, our chief interest is a variable selection procedure based on the skew-normal distribution for mixtures of joint location and scale models. Building on the traditional selection of mean-model variables in finite mixtures of regression models, we further consider variable selection in the variance. The simulation studies show that the procedure is consistent in selecting the most parsimonious mixture of joint location and scale models. The proposed method is applied to a real data set, and the results show that the proposed variable selection procedure can be used in practical situations.
In addition, we have only considered variable selection in mixtures of joint location and scale models based on the skew-normal distribution. A natural future extension of this work is to consider generalizations of the skew-normal distribution (e.g., the skew-t-normal and skew-t distributions), which may be more suitable in different contexts. Furthermore, an interesting future direction is to extend the proposed model to the mixture-of-experts framework [38].