Sparse principal component regression via singular value decomposition approach

Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only the explanatory variables, the principal components carry no information about the response variable. To address this problem, we present a one-stage procedure for PCR based on a singular value decomposition approach. Our approach combines two loss functions, a regression loss and a PCA loss from the singular value decomposition, with sparse regularization. The proposed method enables us to obtain principal component loadings that include information about both the explanatory variables and the response variable. An estimation algorithm is developed by using the alternating direction method of multipliers. We conduct numerical studies to show the effectiveness of the proposed method.


Introduction
Principal component regression (PCR), invented by Massy (1965) and Jolliffe (1982), is widely used in various fields of research, including chemometrics, bioinformatics, and psychology, and has been extensively studied by many researchers (Frank and Friedman, 1993; Hartnett et al., 1998; Jolliffe, 2002). PCR is a two-stage procedure: PCA is first performed on the explanatory variables, followed by regression in which the explanatory variables are the selected principal components. However, owing to this two-stage procedure, the principal components carry no information about the response variable. This causes low prediction accuracy for PCR if the response variable is related to principal components with small eigenvalues.
To address this problem, a one-stage procedure for PCR was proposed by Kawano et al. (2015). The one-stage procedure is developed by combining a regression squared loss function with the sparse PCA (SPCA) loss function of Zou et al. (2006). The estimates of the regression parameter and the loading matrix in PCA are obtained as the minimizer of the combination of the two loss functions with sparse regularization. Owing to the sparse regularization, sparse estimates of the parameters are obtained. Kawano et al. (2015) called this one-stage procedure sparse principal component regression (SPCR). Kawano et al. (2018) extended SPCR to the framework of generalized linear models. It is, however, doubtful whether the PCA loss function of Zou et al. (2006) is the best choice for SPCR, because there exist various formulations of PCA. This paper proposes a novel formulation for SPCR. As the PCA loss for SPCR, we adopt a loss function based on the singular value decomposition approach (Shen and Huang, 2008).
Using this basic loss function, a combination of the PCA loss and the regression squared loss, together with sparse regularization, we derive an alternative formulation of SPCR. We call the proposed method sparse principal component regression based on the singular value decomposition approach (SPCRsvd). An estimation algorithm for SPCRsvd is developed by using an alternating direction method of multipliers (Boyd et al., 2011) and a linearized alternating direction method of multipliers (Wang and Yuan, 2012; Li et al., 2014).
The rest of this paper is organized as follows. In Section 2, we review SPCA by Zou et al. (2006) and Shen and Huang (2008), and SPCR by Kawano et al. (2015). We present SPCRsvd in Section 3. Section 4 derives two computational algorithms for SPCRsvd and discusses the selection of the tuning parameters included in SPCRsvd. Monte Carlo simulations and real data analyses are presented in Section 5. Conclusions are given in Section 6. Supplementary materials can be found at https://github.com/ShuichiKawano/spcr-svd/blob/master/suppl spcr-svd.pdf.

Preliminaries
Sparse principal component analysis
PCA finds a loading matrix that induces a low-dimensional structure in data. To make the principal component loading matrix easier to interpret, SPCA has been proposed, and many researchers have studied various formulations of SPCA (Zou et al., 2006; d'Aspremont et al., 2007; Shen and Huang, 2008; Witten et al., 2009; Vu et al., 2013; Bresler et al., 2018; Chen et al., 2019; Erichson et al., 2019). For an overview of SPCA, we refer the reader to Zou and Xue (2018) and references therein. In this subsection, we review the two formulations of SPCA by Zou et al. (2006) and Shen and Huang (2008).

Let X = (x_1, ..., x_n)^T denote an n × p data matrix, where n and p are the number of observations and the number of variables, respectively. Without loss of generality, we assume that the columns of X are centered. Zou et al. (2006) proposed SPCA as the minimization problem
$$
\min_{A,B} \sum_{i=1}^n \| x_i - AB^\top x_i \|_2^2 + \lambda \sum_{j=1}^k \| \beta_j \|_2^2 + \sum_{j=1}^k \lambda_{1,j} \| \beta_j \|_1 \quad \text{subject to} \quad A^\top A = I_k, \qquad (1)
$$
where A and B = (β_1, ..., β_k) are p × k principal component (PC) loading matrices, k denotes the number of principal components, I_k is the k × k identity matrix, λ, λ_{1,1}, ..., λ_{1,k} are non-negative regularization parameters, and ‖·‖_q is the L_q norm for an arbitrary finite vector. The SPCA formulation can be regarded as a least squares approach.
The first term performs PCA in a least squares sense. The second and third terms impose sparse regularization, similar to the elastic net penalty (Zou and Hastie, 2005). These terms enable us to set some elements of the estimate of B to zero. If λ = 0, the regularization terms reduce to the adaptive lasso penalty (Zou, 2006).
A simple calculation leads to
$$
\min_{A,B} \| X - XBA^\top \|_F^2 + \lambda \sum_{j=1}^k \| \beta_j \|_2^2 + \sum_{j=1}^k \lambda_{1,j} \| \beta_j \|_1 \quad \text{subject to} \quad A^\top A = I_k. \qquad (2)
$$
This minimization problem is easy to optimize alternately with respect to A and B. For a fixed A, the SPCA problem (2) becomes a simple elastic net problem, so the estimate of B can be obtained by the least angle regression algorithm (Efron et al., 2004) or the coordinate descent algorithm (Friedman et al., 2007; Wu and Lange, 2008). For a fixed B, the estimate of A is obtained by solving a reduced rank Procrustes rotation problem (Zou et al., 2006). By alternating these two steps, we obtain the final estimates Â and B̂ of A and B, respectively. Note that only B̂ is used as the principal component loading matrix in Zou et al. (2006).
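To make the alternating scheme concrete, the following Python sketch (ours, not the authors' code) mirrors the two steps above. It assumes a user-supplied routine elastic_net_solve(X, y) that returns elastic net coefficients for regressing y on X; the initialization by ordinary PC loadings and the fixed iteration count are simplifications made for the example.

```python
import numpy as np

def spca_zou(X, k, elastic_net_solve, n_iter=50):
    # Alternating scheme for the SPCA criterion (2):
    #   B-step: k elastic net regressions of X a_j on X (a_j = j-th column of A);
    #   A-step: reduced rank Procrustes rotation from the SVD of X^T X B.
    A = np.linalg.svd(X, full_matrices=False)[2][:k].T   # p x k ordinary loadings as start
    for _ in range(n_iter):
        B = np.column_stack([elastic_net_solve(X, X @ A[:, j]) for j in range(k)])
        U, _, Wt = np.linalg.svd(X.T @ X @ B, full_matrices=False)
        A = U @ Wt                                        # orthonormal columns
    return A, B    # B holds the sparse PC loadings
```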
On the other hand, Shen and Huang (2008) proposed another formulation of SPCA, which can be regarded as a singular value decomposition (SVD) approach. Consider a low rank approximation of the data matrix X by the SVD in the form
$$
X \approx UDV^\top,
$$
where U = (u_1, ..., u_r) is an n × r matrix with orthonormal columns, V = (v_1, ..., v_r) is a p × r matrix with orthonormal columns, D = diag(d_1, ..., d_r) is an r × r diagonal matrix, and r < min(n, p). The singular values are assumed to be ordered such that d_1 ≥ ··· ≥ d_r ≥ 0. By the connection between PCA and the SVD, Shen and Huang (2008) obtained sparse PC loadings by estimating V with sparse regularization.

To achieve sparseness of V, Shen and Huang (2008) adopted a rank-one approximation procedure. First, the first PC loading vector ṽ_1 is obtained by solving the minimization problem
$$
\min_{\tilde{u}_1, \tilde{v}_1} \| X - \tilde{u}_1 \tilde{v}_1^\top \|_F^2 + \lambda P(\tilde{v}_1). \qquad (4)
$$
Here ũ_1 and ṽ_1 are rescaled vectors such that ũ_1 ṽ_1^⊤ = d_1 u_1 v_1^⊤, P(·) is a penalty function that induces sparsity of ṽ_1, and ‖·‖_F is the Frobenius norm defined by ‖A‖_F = \sqrt{\mathrm{tr}(A^\top A)} for an arbitrary matrix A. As the penalty function, Shen and Huang (2008) used the lasso penalty (Tibshirani, 1996), the hard-thresholding penalty (Donoho and Johnstone, 1994), and the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001). The rank-one approximation problem (4) is easy to solve; see Algorithm 1 of Shen and Huang (2008). The remaining PC loading vectors are obtained by rank-one approximations of the corresponding residual matrices. For example, to derive the second PC loading vector ṽ_2, we solve the minimization problem
$$
\min_{\tilde{u}_2, \tilde{v}_2} \| X_1 - \tilde{u}_2 \tilde{v}_2^\top \|_F^2 + \lambda P(\tilde{v}_2),
$$
where X_1 = X − ũ_1 ṽ_1^⊤ is the residual matrix after the first rank-one approximation. The regularization parameter λ is selected by cross-validation.
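As an illustration of the rank-one step with the lasso penalty, the following Python sketch (our own, not code from Shen and Huang (2008)) alternates a soft-thresholded update of the right vector, which carries the scale, with a unit-norm update of the left vector; the initialization and stopping rule are simple choices made for the example.

```python
import numpy as np

def soft_threshold(a, t):
    # Element-wise soft-thresholding S(a, t) = sign(a) * max(|a| - t, 0).
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def rank_one_sparse_svd(X, lam, n_iter=200, tol=1e-6):
    # Alternating updates for min ||X - u v^T||_F^2 + lam * ||v||_1:
    # u is kept at unit norm and v (carrying the scale) is soft-thresholded.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], d[0] * Vt[0, :]                  # start from leading singular pair
    for _ in range(n_iter):
        v_new = soft_threshold(X.T @ u, lam / 2.0)   # sparse right vector
        Xv = X @ v_new
        norm = np.linalg.norm(Xv)
        u_new = Xv / norm if norm > 0 else u         # renormalized left vector
        if np.linalg.norm(v_new - v) < tol:
            u, v = u_new, v_new
            break
        u, v = u_new, v_new
    return u, v    # the sparse loading direction is v (up to rescaling)
```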

Sparse principal component regression
For a one-dimensional continuous response variable Y and a p-dimensional explanatory variable x, suppose that we obtain a dataset {(y_i, x_i); i = 1, ..., n}. We assume that the response variable is explained by variables constructed by PCA of X. Ordinary PCR is a regression model whose explanatory variables are a few PC scores corresponding to large eigenvalues. Note that the PC scores are constructed in advance by PCA. This two-stage procedure may therefore fail to predict the response if the response variable is related to PCs corresponding to small eigenvalues.
To attain a one-stage procedure for PCR, Kawano et al. (2015) proposed SPCR, which is formulated as the minimization, subject to A^⊤A = I_k, of
$$
\sum_{i=1}^n \left( y_i - \gamma_0 - \gamma^\top B^\top x_i \right)^2 + w \sum_{i=1}^n \| x_i - AB^\top x_i \|_2^2
$$
plus elastic-net-type (lasso and ridge) regularization terms on B and a lasso penalty on γ, where γ_0 is an intercept, γ = (γ_1, ..., γ_k)^⊤ is a k-dimensional coefficient vector, w (≥ 0) is a tuning parameter that balances the regression loss and the PCA loss, and the regularization terms are governed by non-negative regularization parameters.

SVD-based sparse principal component regression
SPCR consists of two basic loss functions: the regression squared loss function and the PCA loss function of Zou et al. (2006). However, it is unclear whether this PCA loss is the best choice for SPCR. To investigate this issue, we propose another formulation of SPCR using the SVD approach of Shen and Huang (2008).
We consider the following minimization problem:
$$
\min_{\beta_0, \beta, Z, V} \left\{ \frac{1}{n} \| y - \beta_0 1_n - XV\beta \|_2^2 + \frac{w}{n} \| X - ZV^\top \|_F^2 + \lambda_V \| V \|_1 + \lambda_\beta \| \beta \|_1 \right\} \quad \text{subject to} \quad V^\top V = I_k, \qquad (6)
$$
where β_0 is an intercept, k is the number of PCs, β is a k-dimensional coefficient vector, Z is an n × k matrix of PCs, V is a p × k PC loading matrix, ‖V‖_1 denotes the element-wise L_1 norm of V, y = (y_1, ..., y_n)^⊤, and 1_n is an n-dimensional vector whose elements are all one. In addition, w (≥ 0) is a tuning parameter and λ_V, λ_β are non-negative regularization parameters.
The first term is the squared loss between the response and the PCs XV. The second term is the PCA loss function of the SVD approach by Shen and Huang (2008). Although this term looks different from the first term in Formula (4), the two are essentially equivalent: our approach estimates the k PCs simultaneously, while Shen and Huang (2008) estimate them sequentially. The third and fourth terms are lasso penalties that induce zero estimates in V and β, respectively. The tuning parameter w controls the weight of the second term. A smaller value of w is used when the aim is better prediction accuracy, while a larger value of w is used when the aim is to recover the PC loadings more exactly. The minimization problem (6) enables us to perform regression analysis and PCA simultaneously. We call this procedure SPCRsvd. In Section 5, we confirm through numerical studies that SPCRsvd is competitive with or better than SPCR.
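For concreteness, here is a small Python function evaluating the criterion (6) as written above (a sketch under that reading of the formula; in particular, the 1/n scaling of the two loss terms is part of that assumption):

```python
import numpy as np

def spcrsvd_objective(y, X, beta0, beta, Z, V, w, lam_V, lam_beta):
    # Regression loss on the PCs XV, SVD-type PCA loss, and the two lasso terms.
    n = X.shape[0]
    reg_loss = np.sum((y - beta0 - X @ V @ beta) ** 2) / n
    pca_loss = w * np.linalg.norm(X - Z @ V.T, "fro") ** 2 / n
    penalty = lam_V * np.abs(V).sum() + lam_beta * np.abs(beta).sum()
    return reg_loss + pca_loss + penalty
```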
We make two remarks here. First, it is possible to use Z in the first term of (6) instead of XV, since Z also represents the PCs. However, the formulation with Z in place of XV did not perform well in our numerical studies, and we therefore adopt the formulation with XV. Second, SPCR imposes a ridge penalty on the PC loadings, but SPCRsvd does not. The ridge penalty originates from the SPCA formulation of Zou et al. (2006); because SPCRsvd is not based on that formulation, we do not add a ridge penalty to Formula (6). It is possible to add a ridge penalty or to replace the lasso penalty with other penalties that induce sparsity, e.g., the adaptive lasso penalty, the SCAD penalty, and the minimax concave penalty (Zhang, 2010), but the aim of this paper is to establish the basic procedure of Formula (6).

Computational algorithm
To obtain the estimates of the parameters β, Z, V in Formula (6), we employ an alternating direction method of multipliers (ADMM) and a linearized alternating direction method of multipliers (LADMM), both of which are used in various models with sparse regularization. The minimization problem (6) is first rewritten as an equivalent constrained problem (7), whose scaled augmented Lagrangian involves dual variables Λ_1, Λ_2, λ_3 and penalty parameters ρ_1, ρ_2, ρ_3 (> 0). This leads to the following ADMM algorithm.

Step 1 Set the values of the tuning parameter w, the regularization parameters λ_V, λ_β, and the penalty parameters ρ_1, ρ_2, ρ_3.
Step 4 Update V_1, where ⊗ represents the Kronecker product.

Step 5 Update V by V^(m+1) = PQ^⊤, where P and Q are the matrices given by the SVD arising in the corresponding reduced rank Procrustes rotation problem (see Appendix A).

Step 6 Update V_0 element-wise by the soft-thresholding operator S(·, ·), defined by S(z, λ) = sign(z) max(|z| − λ, 0); the (i, j)-th elements of the matrices Λ_ℓ (ℓ = 1, 2) enter this update.

Step 7 Update Z by Z^(m+1) = XV^(m+1).
The derivation of the updates is given in Appendix A.
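As a small illustration of Steps 5 and 7, the following Python sketch (ours, not the authors' implementation) computes the orthogonality-constrained update from an SVD; the matrix M to be decomposed depends on the augmented Lagrangian terms and is treated here as an input.

```python
import numpy as np

def procrustes_update(M):
    # Solve min_V ||M - V||_F^2 subject to V^T V = I_k: if M = P D Q^T is the
    # (thin) SVD, the minimizer is V = P Q^T (reduced rank Procrustes rotation).
    P, _, Qt = np.linalg.svd(M, full_matrices=False)
    return P @ Qt

def update_Z(X, V):
    # Step 7: with V^T V = I_k, the PC score matrix is updated as Z = X V.
    return X @ V
```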
To apply LADMM to the minimization problem (6), we consider another constrained reformulation, whose augmented Lagrangian involves dual variables Λ, λ and penalty parameters ρ_1, ρ_2 (> 0). The updates of the LADMM algorithm are almost the same as those of the ADMM algorithm. We summarize the updates and their derivation in Appendix B.

Determination of tuning parameters
We have six tuning parameters: w, λ_V, λ_β, ρ_1, ρ_2, ρ_3. The penalty parameters are fixed as ρ_1 = ρ_2 = ρ_3 = 1, following Boyd et al. (2011). The tuning parameter w is set according to the purpose of the analysis: a small value is assigned to w if the user considers the regression loss more important than the PCA loss. This idea follows Kawano et al. (2015, 2018).
The two regularization parameters λ_V, λ_β are objectively selected by K-fold cross-validation. When the original dataset is divided into K parts (y^(1), X^(1)), ..., (y^(K), X^(K)), the criterion for the K-fold cross-validation in ADMM is given by
$$
\mathrm{CV} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k} \left\| y^{(k)} - \hat{\beta}_0^{(-k)} 1_{n_k} - X^{(k)} \hat{V}_1^{(-k)} \hat{\beta}^{(-k)} \right\|_2^2,
$$
where n_k is the sample size of the k-th part and \hat{\beta}_0^{(-k)}, \hat{V}_1^{(-k)}, \hat{\beta}^{(-k)} are the estimates of β_0, V_1, β, respectively, computed with the data excluding the k-th part. We omit the CV criterion for LADMM, since it is obtained by simply replacing the corresponding estimates. In our numerical studies, we set K = 5.
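A schematic of this selection step in Python (our own sketch; fit_spcrsvd is a hypothetical function standing in for the ADMM fit and is assumed to return the estimates of β_0, V_1, and β):

```python
import numpy as np
from itertools import product

def select_lambdas(y, X, w, lam_V_grid, lam_beta_grid, fit_spcrsvd, K=5, seed=0):
    # Grid search over (lambda_V, lambda_beta) by K-fold cross-validation, scoring
    # each pair by the held-out squared prediction error of the first term of (6).
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % K
    best_pair, best_cv = None, np.inf
    for lam_V, lam_beta in product(lam_V_grid, lam_beta_grid):
        cv = 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            beta0, V1, beta = fit_spcrsvd(y[tr], X[tr], w, lam_V, lam_beta)
            resid = y[te] - beta0 - X[te] @ V1 @ beta
            cv += np.mean(resid ** 2) / K
        if cv < best_cv:
            best_pair, best_cv = (lam_V, lam_beta), cv
    return best_pair
```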
5 Numerical study

Monte Carlo simulations
We conducted Monte Carlo simulations to investigate the effectiveness of SPCRsvd. The simulations consist of five cases, which are the same as those in Kawano et al. (2015); the details of the settings can be found in Kawano et al. (2015). The sample size was set to n = 50, 200, and the standard deviation was set to σ = 1, 2.
SPCRsvd was fitted to the simulated data with one or five components (k = 1, 5). We set the value of the tuning parameter w to 0.1. We considered the two algorithms in Section 4.1: ADMM for SPCRsvd (SPCRsvd-ADMM) and LADMM for SPCRsvd (SPCRsvd-LADMM). SPCRsvd was compared with SPCR, PCR, sparse partial least squares (SPLS) by Chun and Keleş (2010), and partial least squares (PLS) by Wold (1975). SPCR was computed with the package spcr, SPLS with spls, and PLS and PCR with pls; all of these packages are implemented in R (R Core Team, 2020). The values of the tuning parameters w and ξ in SPCR were set to 0.1 and 0.01, respectively, and the regularization parameters were selected by five-fold cross-validation. The values of the tuning parameters in SPLS, PLS, and PCR were selected by 10-fold cross-validation. Performance was evaluated in terms of MSE = E[(y − ŷ)^2]. The simulation was repeated 100 times, and the MSE was estimated from 1,000 random samples.

Tables 1 to 5 summarize the means and standard deviations of the MSEs. The results for σ = 1 and σ = 2 showed similar tendencies. PCR and PLS were the worst in almost all cases. We first discuss the results among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR. SPCRsvd-LADMM and SPCRsvd-ADMM were competitive with SPCR; in particular, they provided smaller MSEs than SPCR in almost all cases when k = 1. Note that SPCR produced large standard deviations in some cases, which means that SPCR sometimes yields very large MSEs and can therefore be unstable. Compared with SPLS, SPCRsvd-LADMM and SPCRsvd-ADMM were slightly inferior in many cases when k = 5; however, SPLS produced very large MSEs in many cases when k = 1. From this experiment, we observed that SPCRsvd-LADMM and SPCRsvd-ADMM provided relatively stable and smaller MSEs than the other methods.
The true positive rate (TPR) and the true negative rate (TNR) were also computed for SPCRsvd-LADMM, SPCRsvd-ADMM, SPCR, and SPLS. TPR and TNR are, respectively, defined by
$$
\mathrm{TPR} = \frac{1}{100} \sum_{k=1}^{100} \frac{\left| \{ j : \hat{\zeta}_j^{(k)} \neq 0 \ \text{and} \ \zeta_j^* \neq 0 \} \right|}{\left| \{ j : \zeta_j^* \neq 0 \} \right|}, \qquad
\mathrm{TNR} = \frac{1}{100} \sum_{k=1}^{100} \frac{\left| \{ j : \hat{\zeta}_j^{(k)} = 0 \ \text{and} \ \zeta_j^* = 0 \} \right|}{\left| \{ j : \zeta_j^* = 0 \} \right|},
$$
where ζ_j^* is the true j-th coefficient, \hat{ζ}_j^{(k)} is the estimated j-th coefficient for the k-th simulation, and |{∗}| is the number of elements included in a set {∗}. Table 6 reports the means and standard deviations of TPR and TNR. Most methods provided high TPRs, whereas SPCR sometimes did not. SPLS provided the highest TNRs in all situations. These tendencies were essentially unchanged across all cases. The results for Cases 2 to 5 are shown in the supplementary material.
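A small helper reflecting these definitions (our own sketch; true_coef and the array of per-run estimates are hypothetical inputs):

```python
import numpy as np

def tpr_tnr(true_coef, est_coefs):
    # true_coef: length-p vector of true coefficients; est_coefs: (n_runs, p) array
    # of estimated coefficients, one row per simulation run.
    nz, z = true_coef != 0, true_coef == 0
    tpr = np.mean([np.count_nonzero(b[nz]) / nz.sum() for b in est_coefs])
    tnr = np.mean([np.sum(b[z] == 0) / z.sum() for b in est_coefs])
    return tpr, tnr
```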

Real data analyses
We applied SPCRsvd to real datasets. We used eight real datasets: housing, communities, concrete, diabetes, parkinsons, triazines, winequality-red, and winequality-white, which are available from the UCI database (http://archive.ics.uci.edu/ml/index.html). The sample sizes and the numbers of covariates are listed in Table 7. If the sample size was larger than 1,100, we randomly extracted 1,100 observations from the dataset. For each dataset, we randomly selected 100 observations as training data and used the remaining observations as test data to estimate the MSE. We standardized the covariates for each dataset. We ran the two algorithms SPCRsvd-LADMM and SPCRsvd-ADMM. The procedure was repeated 50 times.
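As a sketch of this evaluation protocol in Python (our own illustration; fit_and_predict is a hypothetical stand-in for any of the compared methods, and standardizing within each split is one reading of the standardization step):

```python
import numpy as np

def repeated_split_mse(y, X, fit_and_predict, n_train=100, n_rep=50, seed=0):
    # Repeat: draw 100 training observations, standardize covariates, fit a method,
    # and record the test MSE on the remaining observations.
    rng = np.random.default_rng(seed)
    n, mses = X.shape[0], []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
        sd[sd == 0] = 1.0                      # guard against constant covariates
        Xtr, Xte = (X[tr] - mu) / sd, (X[te] - mu) / sd
        y_hat = fit_and_predict(y[tr], Xtr, Xte)
        mses.append(np.mean((y[te] - y_hat) ** 2))
    return np.mean(mses), np.std(mses)
```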
We compared SPCRsvd with the four methods used in Section 5.1. The number of principal components was set to k = 1. The value of the tuning parameter w in SPCRsvd was set to 0.01, and λ_V and λ_β were selected by five-fold cross-validation. The tuning parameters in the other methods were selected in a similar manner to Section 5.1.

Table 8 shows the means and standard deviations of the MSEs. PLS and PCR were competitive with each other but did not provide the smallest MSEs for any dataset. SPCR was slightly better than PLS and PCR. SPCRsvd-LADMM and SPCRsvd-ADMM provided smaller MSEs than the other methods in many cases. Although SPLS provided smaller MSEs than the other methods for some datasets, it had the worst MSEs for others. From these results, we conclude that SPCRsvd-LADMM and SPCRsvd-ADMM give smaller and more stable MSEs than the other methods, which is consistent with the results in Section 5.1.

Conclusions
We presented SPCRsvd, a one-stage procedure for PCR whose loss function combines a regression loss with a PCA loss based on the SVD. To estimate the parameters in SPCRsvd, we developed computational algorithms using ADMM and LADMM. Our one-stage method was competitive with or better than competing approaches; in particular, SPCRsvd produced more stable MSEs than SPCR.
A major limitation of SPCRsvd is its computational cost, which causes some problems. For example, we observe from Table 6 that SPCRsvd provides relatively low TPR and TNR. To address this issue, the adaptive lasso could be applied to the regularization terms in SPCRsvd. However, owing to the computational cost, it may be difficult to perform SPCRsvd with the adaptive lasso, because the adaptive lasso generally requires more computation time than the lasso. In addition, SPCRsvd cannot handle binary explanatory variables. To perform PCA for binary data, Lee et al. (2010) introduced logistic PCA with sparse regularization; it would be interesting to extend SPCRsvd in the direction of Lee et al. (2010). We leave these topics as future research.

Appendix A

Update of V.
With the equality constraint V^⊤V = I_k, the update of V is obtained as the solution of a reduced rank Procrustes rotation problem, following Zou et al. (2006).
Update of V_0.
V_0 is updated as the minimizer of the corresponding terms of the augmented Lagrangian, Formula (A.1). By a simple calculation of the first two terms of the right-hand side, Formula (A.1) can be rewritten in a form whose minimizer is given element-wise by the soft-thresholding operator. Thus, we obtain the update of V_0.

Update of Z.
$Z := \arg\min_{Z} \frac{w}{n} \| X - ZV^\top \|_F^2$.
We have the solution Z = XV from the first order optimality condition.
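For completeness, the first-order condition can be written out explicitly; the last step uses the constraint V^⊤V = I_k stated above:
$$
\frac{\partial}{\partial Z}\,\frac{w}{n}\bigl\|X - ZV^{\top}\bigr\|_F^{2}
  = -\frac{2w}{n}\,\bigl(X - ZV^{\top}\bigr)V = 0
  \;\Longrightarrow\; ZV^{\top}V = XV
  \;\Longrightarrow\; Z = XV .
$$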

Update of β.
β is updated as the minimizer of the corresponding terms of the augmented Lagrangian; the first-order optimality condition then gives the update of β.
β_0 is updated as the minimizer of the corresponding terms; it is clear that this update is simply obtained by the element-wise soft-thresholding operator.
Appendix B

Next, we describe only the update of V_0, because the derivations of the other updates are the same as in Appendix A. As in Appendix A, we omit the iteration index m. The objective for V_0 is calculated in the same manner as before, which leads to the update of V_0 in Formula (B.1).


Table 1 :
Mean (standard deviation) values of the MSE for Case 1. The bold values correspond to the smallest means among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR.

Table 2 :
Mean (standard deviation) values of the MSE for Case 2. The bold values correspond to the smallest means among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR.

Table 3 :
Mean (standard deviation) values of the MSE for Case 3. The bold values correspond to the smallest means among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR.

Table 4 :
Mean (standard deviation) values of the MSE for Case 4. The bold values correspond to the smallest means among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR.

Table 5 :
Mean (standard deviation) values of the MSE for Case 5. The bold values correspond to the smallest means among SPCRsvd-LADMM, SPCRsvd-ADMM, and SPCR.

Table 6 :
Mean (standard deviation) values of TPR and TNR for Case 1. The bold values correspond to the largest means.

Table 7 :
Sample size and the numbers of covariates in real datasets.

Table 8 :
Mean (standard deviation) values of the MSE for real datasets. The bold values correspond to the smallest means.
