1 Introduction

Principal component regression (PCR), invented by Jolliffe (1982) and Massy (1965), is widely used in various fields of research, including chemometrics, bioinformatics, and psychology, and has been extensively studied (Chang and Yang 2012; Dicker et al. 2017; Febrero-Bande et al. 2017; Frank and Friedman 1993; Hartnett et al. 1998; Reiss and Ogden 2007; Rosipal et al. 2001; Wang and Abbott 2008). PCR is a two-stage procedure: one first performs principal component analysis (PCA) (Jolliffe 2002; Pearson 1901), and then performs regression in which the explanatory variables are the selected principal components. However, the principal components have no information on the response variable. Because of this, the prediction accuracy of the PCR could be low, if the response variable is related to principal components having small eigenvalues.

To address this problem, a one-stage procedure for PCR was proposed in Kawano et al. (2015). This one-stage procedure was developed by combining a regression squared loss function with the sparse PCA (SPCA) loss function in Zou et al. (2006). The estimate of the regression parameter and loading matrix in the PCA is obtained as the minimizer of the combination of two loss functions with sparse regularization. By virtue of sparse regularization, sparse estimates of the parameters can be obtained. Kawano et al. (2015) referred to the one-stage procedure as sparse principal component regression (SPCR). Kawano et al. (2018) also extended SPCR within the framework of generalized linear models. However, it is unclear whether the PCA loss function in Zou et al. (2006) is the best choice for building SPCR, as there exist several formulae for PCA.

This paper proposes a novel formulation for SPCR. As a PCA loss for SPCR, we adopt a loss function based on a singular value decomposition approach (Shen and Huang 2008). Using the basic loss function, a combination of the PCA loss and the regression squared loss, with sparse regularization, we derive an alternative formulation for SPCR. We call the proposed method as sparse principal component regression based on a singular value decomposition approach (SPCRsvd). An estimation algorithm of SPCRsvd is developed using an alternating direction method of multipliers (Boyd et al. 2011) and a linearized alternating direction method of multipliers (Li et al. 2014; Wang and Yuan 2012). We show the effectiveness of SPCRsvd through numerical studies. Specifically, the performance of SPCRsvd is shown to be competitive with or better than that of SPCR.

As an alternative approach, partial least squares (PLS) (Frank and Friedman 1993; Wold 1975) is a widely used statistical method that regresses a response variable on composite variables built by combining a response variable and explanatory variables. In Chun and Keleş (2010), sparse partial least squares (SPLS) was proposed, which enables the removal of irrelevant explanatory variables when constructing the composite variables. PLS and SPLS are similar to SPCR and SPCRsvd in terms of using new explanatory variables with information relating the response variable to the original explanatory variables. Herein, these methods are compared using simulated data and real data.

The remainder of the paper is organized as follows. In Sect. 2, we review SPCA in Zou et al. (2006) and Shen and Huang (2008), and SPCR in Kawano et al. (2015). We present SPCRsvd in Sect. 3. Section 4 derives two computational algorithms for SPCRsvd and discusses the selection of tuning parameters. Monte Carlo simulations and real data analyses are presented in Sect. 5. Conclusions are given in Sect. 6.

2 Preliminaries

2.1 Sparse principal component analysis

PCA finds a loading matrix that induces a low-dimensional structure in the data. As an easy way to interpret the principal component loading matrix, SPCA has been proposed. To date, several formulae for SPCA have been proposed (Bresler et al. 2018; Chen et al. 2020; d’Aspremont et al. 2007; Erichson et al. 2020; Shen and Huang 2008; Vu et al. 2013; Witten et al. 2009; Zou et al. 2006). For an overview of SPCA, we refer the reader to Zou and Xue (2018) and the references therein. In this subsection, we review the two formulae for SPCA in Zou et al. (2006) and Shen and Huang (2008).

Let \(X=({\varvec{x}}_1, \ldots , {\varvec{x}}_n)^\top \) denote an \(n \times p\) data matrix, where n and p are the number of observations and the number of variables, respectively. Without loss of generality, we assume that the columns of the matrix X are centered. In Zou et al. (2006), SPCA was proposed as

$$\begin{aligned}&\min _{A, B} \left\{ \sum _{i=1}^n \Vert {\varvec{x}}_i - A B^\top {\varvec{x}}_i \Vert ^2_2 + \lambda \sum _{j=1}^k \Vert {\varvec{\beta }}_j \Vert ^2_2 + \sum _{j=1}^k \lambda _{1,j} \Vert {\varvec{\beta }}_j \Vert _1 \right\} \nonumber \\&\quad \ \mathrm{subject \ to} \ \ A^\top A = I_{k}, \end{aligned}$$
(1)

where \(A=({\varvec{\alpha }}_1,\ldots ,{\varvec{\alpha }}_k)\) and \(B=({\varvec{\beta }}_1,\ldots ,{\varvec{\beta }}_k)\) are \(p \times k\) principal component (PC) loading matrices, k denotes the number of principal components, \(I_k\) is the \(k \times k\) identity matrix, \(\lambda ,\lambda _{1,1},\ldots ,\lambda _{1,k}\) are non-negative regularization parameters, and \(\Vert \cdot \Vert _q\) is the \(L_q\) norm for an arbitrary finite vectors. This SPCA formulation can be regarded as a least squares approach. The first term represents performing PCA by least squares. The second and third terms represent sparse regularization similar to elastic net regularization (Zou and Hastie 2005). These terms enable us to set some of the estimates of B to zero. If \(\lambda =0\), then the regularization terms reduce to the adaptive lasso (Zou 2006).

A simple calculation gives

$$\begin{aligned} \min _{A, B} \sum _{j=1}^k \left\{ \Vert X {\varvec{\alpha }}_j - X {\varvec{\beta }}_j \Vert ^2_2 + \lambda \Vert {\varvec{\beta }}_j \Vert ^2_2 + \lambda _{1,j} \Vert {\varvec{\beta }}_j \Vert _1 \right\} \ \ \mathrm{subject \ to} \ \ A^\top A = I_{k}. \end{aligned}$$
(2)

Optimizing the parameters A and B for this minimization problem is straightforward. Given a fixed A, the SPCA problem (2) turns out to be a simple elastic net problem. Thus, the estimate of B can be obtained by the least angle regression algorithm (Efron et al. 2004) or the coordinate descent algorithm (Friedman et al. 2007; Wu and Lange 2008). Given a fixed B, an estimate of A can be obtained by solving the reduced rank Procrustes rotation problem (Zou et al. 2006). By alternating procedures, we can obtain the final estimates \({\hat{A}}\) and \({\hat{B}}\) of A and B, respectively. Note that only \({\hat{B}}\) is used as the principal component loading matrix.

Alternately, Shen and Huang (2008) proposed another formulation of SPCA, which can be regarded as a singular value decomposition (SVD) approach. Consider a low-rank approximation of the data matrix X obtained by SVD in the form

$$\begin{aligned} UDV^\top = \sum _{k=1}^r d_k {\varvec{u}}_k {\varvec{v}}_k^\top , \end{aligned}$$
(3)

where \(U=({\varvec{u}}_1,\ldots ,{\varvec{u}}_r)\) is an \(n \times r\) matrix with \(U^\top U = I_r\), \(V=({\varvec{v}}_1,\ldots ,{\varvec{v}}_r)\) is an \(r \times r\) orthogonal matrix, \(D = \mathrm{diag} (d_1,\ldots ,d_r)\), and \(r < \min (n,p)\). The singular values are assumed to be ordered such that \(d_r \ge \cdots \ge d_p \ge 0\). Using the connection between PCA and SVD, Shen and Huang (2008) obtained the sparse PC loading by estimating V with sparse regularization.

To achieve sparseness of V, Shen and Huang (2008) adopted the rank-one approximation procedure. First, the first PC loading vector \(\tilde{\varvec{v}}_1\) is obtained by solving the minimization problem

$$\begin{aligned} \min _{\tilde{\varvec{u}}_1, \tilde{\varvec{v}}_1} \left\{ \Vert X - \tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top \Vert _F^2 + \lambda P (\tilde{\varvec{v}}_1) \right\} \ \ \mathrm{subject \ to} \ \ \Vert \tilde{\varvec{u}}_1 \Vert _2=1. \end{aligned}$$
(4)

Here \(\tilde{\varvec{u}}_1, \tilde{\varvec{v}}_1\) are defined as rescaled vectors such that \(\tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top = d_1 {\varvec{u}}_1 {\varvec{v}}_1^\top \), \(P(\cdot )\) is a penalty function that induces the sparsity of \(\tilde{\varvec{v}}_1\), and \(\Vert \cdot \Vert _F\) is the Frobenius norm defined by \(\Vert A \Vert _F = \sqrt{ \mathrm{tr}(A^\top A) }\) for an arbitrary matrix A. As the penalty function, Shen and Huang (2008) used the lasso penalty (Tibshirani 1996), the hard-thresholding penalty (Donoho and Johnstone 1994), or the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001). The rank-one approximation problem is easy to solve (4); see Algorithm 1 in Shen and Huang (2008). The remaining PC loading vectors are obtained by performing rank-one approximations of the corresponding residual matrices. For example, to derive the second PC loading vector \(\tilde{\varvec{v}}_2\), we solve the minimization problem

$$\begin{aligned} \min _{\tilde{\varvec{u}}_2, \tilde{\varvec{v}}_2} \left\{ \Vert X^\dagger - \tilde{\varvec{u}}_2 \tilde{\varvec{v}}_2^\top \Vert _F^2 + \lambda P (\tilde{\varvec{v}}_2) \right\} \ \ \ \mathrm{subject \ to} \ \ \ \Vert \tilde{\varvec{u}}_2 \Vert _2=1, \end{aligned}$$

where \(X^\dagger = X - \tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top \). The regularization parameter \(\lambda \) is selected by cross-validation.

2.2 Sparse principal component regression

For a one-dimensional continuous response variable Y and a p-dimensional explanatory variable \(\varvec{x}\), suppose we have obtained a dataset \(\{ (y_i, {\varvec{x}}_i) ; i=1,\ldots ,n \}\). We assume that the response variable is explained by variables composed by PCA of \(X = ({\varvec{x}}_1,\ldots , {\varvec{x}}_n)^\top \). Traditional PCR uses a regression model with a few PC scores corresponding to large eigenvalues. Note that these PC scores are derived by PCA prior to the regression. This two-stage procedure might then fail to predict the response if the response variable is related to PCs corresponding to small eigenvalues.

To attain a one-stage procedure for PCR, the SPCR proposed in Kawano et al. (2015) was formulated as the following minimization problem:

$$\begin{aligned}&\min _{A, B, \gamma _0, {\varvec{\gamma }}} \Bigg \{ \sum _{i=1}^n \left( y_i - \gamma _0 - {\varvec{\gamma }}^\top B^\top {\varvec{x}}_i \right) ^2 + w \sum _{i=1}^n \Vert {\varvec{x}}_i - A B^\top {\varvec{x}}_i \Vert ^2_2 \nonumber \\& + \lambda _{\beta } \xi \sum _{j=1}^k \Vert {\varvec{\beta }}_j \Vert ^2_2 + \lambda _{\beta } (1-\xi ) \sum _{j=1}^k \Vert {\varvec{\beta }}_{j} \Vert _1 + \lambda _{\gamma } \Vert {\varvec{\gamma }} \Vert _1\Bigg \} \\&\mathrm{subject \ to} \ \ \ A^\top A = I_{k}, \nonumber \end{aligned}$$
(5)

where \(\gamma _0\) is an intercept, \({\varvec{\gamma }} = (\gamma _1,\ldots ,\gamma _k)^\top \) comprises coefficients for regression, \( \lambda _{\beta }\) and \(\lambda _{\gamma }\) are non-negative regularization parameters, w is a positive tuning parameter, and \(\xi \) in [0, 1] is a tuning parameter. The first term in Formula (5) is the regression squared loss function including the PCs \(B^\top {\varvec{x}}\) as explanatory variables, while the second term is the PCA loss function used in SPCA in Zou et al. (2006). Sparse regularization in SPCR has two roles: sparseness and identifiability of parameters. For the identifiability by sparse regularization, we refer the reader to Choi et al. (2010), Jennrich (2006), Kawano et al. (2015). Kawano et al. (2018) also extended SPCR from the viewpoint of generalized linear models, which can deal with binary, count, and multi-categorical data for the response variable.

3 SVD-based sparse principal component regression

SPCR uses two basic loss functions: the regression squared loss function and the PCA loss function in Zou et al. (2006). However, it is unclear whether the PCA loss is the best choice for building SPCR. To investigate this issue, we propose another formulation for SPCR using the SVD approach in Shen and Huang (2008).

We consider the following minimization problem:

$$\begin{aligned}&\min _{\beta _0,{\varvec{\beta }}, Z, V} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }\Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$
(6)

where \(\beta _0\) is an intercept, k is the number of PCs, \({\varvec{\beta }}\) is a k-dimensional coefficient vector, Z is an \(n \times k\) matrix of PCs, V is a \(p \times k\) PC loading matrix, and \(\varvec{1}_n\) is an n-dimensional vector of ones. In addition, w is a positive tuning parameter and \(\lambda _V,\lambda _{\varvec{\beta }}\) are non-negative regularization parameters.

The first term is the regression squared loss function relating the response and the PCs XV. The second term is the PCA loss function in the SVD approach in Shen and Huang (2008). Although the formula is seemingly different from the first term in Formula (4), they are essentially equivalent: we estimate the k PCs simultaneously, while Shen and Huang (2008) estimates them sequentially. The third and fourth terms constitute the lasso penalty that induces zero estimates of the parameters V and \(\varvec{\beta }\), respectively. The tuning parameter w controls the degree of the second term. A smaller value for w is used when our aim is to obtain better prediction accuracies, while a larger value for w is used when we want to obtain exact expressions of the PC loadings. The minimization problem (6) allows us to perform regression analysis and PCA simultaneously. We call this method SPCRsvd. In Sect. 5, we will observe that SPCRsvd is competitive with or better than SPCR through numerical studies.

We remark on two points here. First, it is possible to use Z in the first term of (6) instead of XV, since Z is also the PCs. However, the formulation with Z instead of XV did not perform well in numerical studies, so we adopt the formulation with XV here. Second, SPCR imposes a ridge penalty for the PC loading but SPCRsvd does not. The ridge penalty basically comes from SPCA in Zou et al. (2006). Because SPCRsvd is not based on SPCA in Zou et al. (2006), a ridge penalty does not appear in Formula (6). It is possible to add a ridge penalty and replace the lasso penalty with other penalties that induce sparsity, e.g., the adaptive lasso penalty, the SCAD penalty, or minimax concave penalty (Zhang 2010), but the our aim of this paper is to establish the basic procedure of Formula (6).

4 Implementation

4.1 Computational algorithm

To obtain the estimates of the parameters \({\varvec{\beta }}, Z, V\) in Formula (6), we employ the alternating direction method of multipliers (ADMM) and the linearized alternating direction method of multipliers (LADMM). ADMM and LADMM have recently been used in various models with sparse regularization; see, for example, Boyd et al. (2011); Danaher et al. (2014); Li et al. (2014); Ma and Huang (2017); Price et al. (2019); Tan et al. (2014); Wang et al. (2018); Yan and Bien (2020) and Ye and Xie (2011).

To solve the minimization problem (6) by using ADMM, we rewrite the problem as

$$\begin{aligned}&\min _{\beta _0, {\varvec{\beta }}, {\varvec{\beta }}_0, Z, V, V_0, V_1} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X \right. \nonumber \\&\left. - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \quad V=V_0=V_1, \quad \varvec{\beta }= \varvec{\beta }_0. \end{aligned}$$
(7)

The scaled augmented Lagrangian for the problem (7) is then given by

$$\begin{aligned}&\frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \\&+ \frac{\rho _1}{2} \Vert V - V_0 + \Lambda _1 \Vert _F^2 +\frac{\rho _2}{2} \Vert V_1 - V_0 + \Lambda _2 \Vert _F^2 + \frac{\rho _3}{2} \Vert \varvec{\beta }- {\varvec{\beta }}_0 + {\varvec{\lambda }}_3 \Vert _2^2 \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$

where \(\Lambda _1, \Lambda _2, {\varvec{\lambda }}_3\) are dual variables and \(\rho _1, \rho _2, \rho _3 \ (>0)\) are penalty parameters. This gives rise to the following ADMM algorithm:

  1. Step 1

    Set the values of the tuning parameter w, the regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\), and the penalty parameters \(\rho _1, \rho _2, \rho _3\).

  2. Step 2

    Initialize all the parameters as \(\beta _0^{(0)}, {\varvec{\beta }}^{(0)}, {\varvec{\beta }}_0^{(0)}, Z^{(0)}, V^{(0)}, V_0^{(0)}, V_1^{(0)},\Lambda _1^{(0)}, \) \(\Lambda _2^{(0)}, {\varvec{\lambda }}_3^{(0)}\).

  3. Step 3

    For \(m=0,1,2,\ldots \), repeat from Steps 4 to 11 until convergence.

  4. Step 4

    Update \(V_1\) as follows:

    $$\begin{aligned} \mathrm{vec} (V_1^{(m+1)})&= \left( \frac{1}{n} {\varvec{\beta }}^{(m)} {\varvec{\beta }}^{(m)\top } \otimes X^\top X + \frac{\rho _2}{2} I_k \otimes I_p \right) ^{-1} \\&\mathrm{vec} \bigg \{ \frac{1}{n} X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) {\varvec{\beta }}^{(m)\top } \\& + \frac{\rho _2}{2} (V_0^{(m)} - \Lambda _2^{(m)}) \bigg \}, \end{aligned}$$

    where \(\otimes \) represents the Kronecker product.

  5. Step 5

    Update V as follows:

    $$\begin{aligned} V^{(m+1)}=PQ^\top , \end{aligned}$$

    where P and Q are the matrices given by the SVD

    $$\begin{aligned} \displaystyle {\frac{w}{n} X^\top Z^{(m)} + \frac{\rho _1}{2} \left( V_0^{(m)} - \Lambda _1^{(m)} \right) = P \varOmega Q^\top }. \end{aligned}$$
  6. Step 6

    Update \(V_0\) as follows:

    $$\begin{aligned}&v_{0ij}^{(m+1)} = {{\mathcal {S}}} \left( \frac{\rho _1(v_{ij}^{(m+1)} + \lambda _{1ij}^{(m)}) +\rho _2(v_{ij}^{(m+1)} + \lambda _{2ij}^{(m)})}{\rho _1 + \rho _2}, \frac{\lambda _V}{\rho _1 + \rho _2} \right) , \\&i=1,\ldots ,p, \ j=1,\ldots ,k, \end{aligned}$$

    where \(v_{0ij}^{(m)}=(V_0^{(m)})_{ij}\), \(v_{ij}^{(m)}=(V^{(m)})_{ij}\), \(\lambda _{\ell ij} \ (\ell =1,2)\) is the (ij)-th element of the matrix \(\Lambda _\ell \ (\ell =1,2)\), and \({{\mathcal {S}}} (\cdot ,\cdot ) \) is the soft-thresholding operator defined by \({{\mathcal {S}}}(x,\lambda )=\mathrm{sign}(x)(|x|-\lambda )_+\).

  7. Step 7

    Update Z by \(Z^{(m+1)}=X V^{(m+1)}\).

  8. Step 8

    Update \({\varvec{\beta }}\) as follows:

    $$\begin{aligned} \varvec{\beta }^{(m+1)}&= \left( \frac{1}{n} V_1^{(m+1)\top } X^\top X V_1^{(m+1)} + \frac{\rho _3}{2} I_k \right) ^{-1} \bigg \{ \frac{1}{n} V_1^{(m+1)T} X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) \\&\quad + \frac{\rho _3}{2} ({\varvec{\beta }}_0^{(m)} - {\varvec{\lambda }}^{(m)}_3) \bigg \}. \end{aligned}$$
  9. Step 9

    Update \({\varvec{\beta }}_0\) as follows:

    $$\begin{aligned} \beta _{0j}^{(m+1)} = {{\mathcal {S}}} \left( \beta _j^{(m+1)} + \lambda _{3j}^{(m)}, \frac{\lambda _\beta }{\rho _3} \right) , \quad j=1,\ldots ,k, \end{aligned}$$

    where \(\lambda _{3j}^{(m)}\) and \(\beta _j^{(m)}\) are the j-th elements of the vectors \({\varvec{\lambda }}_3^{(m)}\) and \({\varvec{\beta }}^{(m)}\), respectively.

  10. Step 10

    Update \(\beta _0\) as follows:

    $$\begin{aligned} \beta _{0}^{(m+1)} = \frac{1}{n} {\varvec{1}}_n^\top ( {\varvec{y}} - X V_1^{(m+1)} {\varvec{\beta }}^{(m+1)} ). \end{aligned}$$
  11. Step 11

    Update \({\Lambda }_1,{\Lambda }_2,{\varvec{\lambda }}_3\) as follows:

    $$\begin{aligned} \Lambda _1^{(m+1)}&= \Lambda _1^{(m)} + V^{(m+1)} - V_0^{(m+1)},\\ \Lambda _2^{(m+1)}&= \Lambda _2^{(m)} + V_1^{(m+1)} - V_0^{(m+1)},\\ {\varvec{\lambda }}_3^{(m+1)}&= {\varvec{\lambda }}_3^{(m)} + {\varvec{\beta }}^{(m+1)} - {\varvec{\beta }}^{(m+1)}_0. \end{aligned}$$

The derivations of the updates are given in “Appendix A”.

To apply LADMM to the minimization problem (6), we consider the following problem:

$$\begin{aligned}&\min _{\beta _0, {\varvec{\beta }}, {\varvec{\beta }}_0, Z, V, V_0} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_0 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \quad V=V_0, \quad \varvec{\beta }= \varvec{\beta }_0. \end{aligned}$$
(8)

The augmented Lagrangian for this problem is given by

$$\begin{aligned}&\frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_0 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \\&+ \frac{\rho _1}{2} \Vert V_0 - V + \Lambda \Vert _F^2 + \frac{\rho _2}{2} \Vert \varvec{\beta }- {\varvec{\beta }}_0 + {\varvec{\lambda }} \Vert _2^2 \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$

where \(\Lambda , {\varvec{\lambda }}\) are dual variables and \(\rho _1, \rho _2 \ (>0)\) are penalty parameters.

The updates of the LADMM algorithm are almost the same as those of the ADMM algorithm. We summarize the updates and the derivations in “Appendix B”.

Here we remark on the main differences between ADMM and LADMM. LADMM has two penalty parameters (\(\rho _1,\rho _2\)), while ADMM has three penalty parameters (\(\rho _1,\rho _2,\rho _3\)). This means that the total number of tuning parameters in LADMM is only one less than that in ADMM. This is an advantage of LADMM regardless of whether the user tunes the penalty parameters subjectively or objectively. On the other hand, approximation by Taylor expansion is used in LADMM. If this approximation is inappropriate, LADMM may fail to estimate parameters. In terms of running times, ADMM seems to be faster than LADMM, based on several numerical studies. These results will be presented in Sect. 6 when discussing the limitations of the current study.

4.2 Determination of tuning parameters

We have the six tuning parameters: \(w, \lambda _V, \lambda _{\varvec{\beta }}, \rho _1, \rho _2, \rho _3\). The penalty parameters \(\rho _1, \rho _2, \rho _3\) are fixed as \(\rho _1=\rho _2=\rho _3=1\) in accordance with Boyd et al. (2011). The tuning parameter w is set according to the purpose of the analysis. A small value is allocated to w when the user considers the regression loss to be more important than the PCA loss. This idea follows Kawano et al. (2015, 2018).

The two regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\) are objectively selected by K-fold cross-validation. For the original dataset divided into the K datasets \(({\varvec{y}}^{(1)}, X^{(1)}), \ldots , ({\varvec{y}}^{(K)}, X^{(K)})\), the criterion for the K-fold cross-validation in ADMM is given by

$$\begin{aligned} \mathrm{CV} = \frac{1}{K} \sum _{k=1}^K \frac{1}{n} \left\| {\varvec{y}}^{(k)} - {\hat{\beta }}_0^{(-k)} {\varvec{1}}_{(k)} - X^{(k)} {\hat{V}}_1^{(-k)} \hat{\varvec{\beta }}^{(-k)} \right\| _2^2, \end{aligned}$$
(9)

where \({\hat{\beta }}_0^{(-k)}, {\hat{V}}_1^{(-k)}, \hat{\varvec{\beta }}^{(-k)}\) are the estimates of \({\beta }_0, {V}_1, {\varvec{\beta }}\), respectively, computed with the data excluding the k-th dataset. We omit the CV criterion for LADMM, since we only replace \({\hat{V}}_1^{(-k)}\) in (9) with \({\hat{V}}_0^{(-k)}\).

We choose the values of the regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\) from the minimizers of CV in (9).

5 Numerical study

5.1 Monte Carlo simulations

We conducted Monte Carlo simulations to investigate the effectiveness of SPCRsvd. The simulations had six cases, which were the same as those in Kawano et al. (2015) except for Case 6. These six cases are given as follows.

  1. Case 1

    The 10-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{10})\) follows a multivariate normal distribution having a zero mean vector and variance-covariance matrix \(\varSigma \). The response was obtained by

    $$\begin{aligned} {y_i = \zeta _1 {\varvec{e}}_1^\top {\varvec{x}}_i + \zeta _2 {\varvec{e}}_2^\top {\varvec{x}}_i + \varepsilon _i}, \quad i=1,\ldots ,n, \end{aligned}$$

    where \({\varvec{e}}_1 = (1,\underbrace{0,\ldots ,0}_{9})^\top \), \({\varvec{e}}_2 = (0,1,\underbrace{0,\ldots ,0}_{8})^\top \), and \(\varepsilon _i\) are independently distributed as a normal distribution with mean zero and variance \(\sigma ^2\). We used \(\zeta _1=2, \zeta _2=1, \varSigma =I_{10}\). Then we note that \({\varvec{e}}_1\) and \({\varvec{e}}_2\) are eigenvectors of \(\varSigma \).

  2. Case 2

    This case is the same as Case 1 except with \(\zeta _1=8, \zeta _2=1, \varSigma =\mathrm{diag} (1,3^2,\underbrace{1,\ldots ,1}_{8})\). Then \({\varvec{e}}_2\) becomes the first eigenvector. In addition, \(\mathrm {Cov}(y, x_1)=8\) and \(\mathrm {Cov}(y, x_2)=9\). For more details of this setting, we refer to p. 196 in Kawano et al. (2015).

  3. Case 3

    The 20-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{20})\) has multivariate normal distribution \(N_{20}({\varvec{0}}, \varSigma )\). The response was obtained as

    $$\begin{aligned} y_i = 4 {\varvec{\zeta }}^\top {\varvec{x}}_i + \varepsilon _i, \quad i=1,\ldots ,n, \end{aligned}$$

    where \(\varepsilon _i\) are independently distributed as \(N(0,\sigma ^2)\). We used \({\varvec{\zeta }}=({\varvec{\nu }}, \underbrace{0,\ldots ,0}_{11})^\top \) and \(\varSigma =\mathrm{block \ diag}(\varSigma _1, I_{11})\), where \({\varvec{\nu }}=(-1,0,1,1,0,-1,-1,0,1)\) and \(\left( \varSigma _1 \right) _{ij}=0.9^{|i-j|} \ (i,j,=1,\ldots ,9)\). Note that \({\varvec{\nu }}\) is a sparse approximation of the fourth eigenvector of \(\varSigma _1\). This case deals with the situation where the response is associated with the fourth principal component.

  4. Case 4

    The 30-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{30})\) has multivariate normal distribution \(N_{30}({\varvec{0}}, \varSigma )\). The response was obtained as

    $$\begin{aligned} y_i = 4 {\varvec{\zeta }}_1^\top {\varvec{x}}_i + 4 {\varvec{\zeta }}_2^\top {\varvec{x}}_i + \varepsilon _i, \quad i=1,\ldots ,n, \end{aligned}$$

    where \(\varepsilon _i\) are independently distributed as \(N(0,\sigma ^2)\). We used \({\varvec{\zeta }}_1=({\varvec{\nu }}_1, \underbrace{0,\ldots ,0}_{21})^\top , \) \({\varvec{\zeta }}_2=(\underbrace{0,\ldots ,0}_{9},{\varvec{\nu }}_2, \underbrace{0,\ldots ,0}_{15})^\top , \varSigma =\mathrm{block \ diag}(\varSigma _1, \varSigma _2, I_{15})\). Here \({\varvec{\nu }}_1=(-1,0, 1,1,0,-1,-1,0,1), {\varvec{\nu }}_2=(\underbrace{1,\ldots ,1}_{6})\), and \(\left( \varSigma _2 \right) _{ij}=0.9^{|i-j|} \ (i,j,=1,\ldots ,6)\). Note that \({\varvec{\nu }}_1\) is a sparse approximation of the third eigenvector of \(\varSigma _1\) and \({\varvec{\nu }}_2\) is the first eigenvector of \(\varSigma _2\). This case deals with the situation where the response is associated with the third principal component from \(\varSigma _1\) and the first principal component from \(\varSigma _2\).

  5. Case 5

    This case is the same as Case 4 except with \(\varvec{\nu }_2=(1,0,-1,-1,0,1)\). Note that \({\varvec{\nu }}_2\) is a sparse approximation of the third eigenvector of \(\varSigma _2\). This case deals with the situation where the response is associated with the third principal components from \(\varSigma _1\) and \(\varSigma _2\).

  6. Case 6

    This case is the same as Case 2 except with \(\varvec{x} =(x_1,\ldots ,x_{100})\). This is a high-dimensional case of Case 2.

The sample size was set to \(n=50, 200\). The standard deviation was set to \(\sigma =1,2\). We considered the two algorithms given in Sect. 4.1: ADMM for SPCRsvd (SPCRsvd-ADMM) and LADMM for SPCRsvd (SPCRsvd-LADMM). SPCRsvd was fitted to the simulated data with one or five components \((k=1,5)\) except for Case 6 and one or two components \((k=1,2)\) for Case 6. We set the value of the tuning parameter w to 0.1 and employed five-fold cross-validation for selecting the regularization parameters \(\lambda _V\), \(\lambda _{\varvec{\beta }}\). We used a two-dimensional grid and evaluated the CV in (9) on the grid, as illustrated in Fig. 1. The cross-validation surface was obtained by SPCRsvd-ADMM with \(k=1\) and was estimated by data generated from Case 1 with \(n=50\), \(\sigma =1\). The minimum is achieved for the combination of the first candidate of \(\lambda _V\) and the seventh candidate of \(\lambda _{\varvec{\beta }}\).

Fig. 1
figure 1

Cross-validation surface in SPCRsvd-ADMM estimated by data generated from Case 1

SPCRsvd was compared with SPCR, PCR, SPLS, and PLS. SPCR was computed by the package spcr, SPLS by spls, and PLS and PCR by pls. These packages are included in the software R (R Core Team 2020). We used the default settings of the packages when determining the values of tuning parameters in SPCR, PCR, SPLS, and PLS. The values of the tuning parameters w and \(\xi \) in SPCR were set to 0.1 and 0.01, respectively, and then the regularization parameters were selected by five-fold cross-validation. The value of the regularization parameter in SPLS was selected by 10-fold cross-validation. The number of components in SPLS, PLS, and PCR was also selected by 10-fold cross-validation from ranges from one to five when SPCRsvd-ADMM, SPCRsvd-LADMM, and SPCR employ \(k=5\) and from one to two when \(k=2\). The performance was evaluated in terms of \(\mathrm{MSE}=E[ (y-{\hat{y}})^2 ]\). The simulation was conducted 100 times. MSE was estimated from 1,000 random samples.

We summarize the means and standard deviations of MSEs in Tables 1, 2, 3, 4, 5 and 6. The results for \(\sigma =1,2\) had similar tendencies. PCR and PLS were worst in almost all cases, so we will focus on comparing the other methods. SPCRsvd-LADMM and SPCRsvd-ADMM were competitive with SPCR. In particular, SPCRsvd-LADMM and SPCRsvd-ADMM provided smaller MSEs than SPCR in almost all cases when \(k=1\). Compared to SPLS, SPCRsvd-LADMM and SPCRsvd-ADMM were slightly inferior in many cases when \(k=5\). However, SPLS produced so large values of MSEs in many cases when \(k=1\).

Table 1 Mean (standard deviation) of MSE for Case 1
Table 2 Mean (standard deviation) of MSE for Case 2
Table 3 Mean (standard deviation) of MSE for Case 3
Table 4 Mean (standard deviation) of MSE for Case 4
Table 5 Mean (standard deviation) of MSE for Case 5
Table 6 Mean (standard deviation) of MSE for Case 6
Table 7 Mean (standard deviation) of TPR, TNR, and MCC for Case 1

The true positive rate (TPR), the true negative rate (TNR), and the Matthews correlation coefficient (MCC) (Matthews 1975) were also computed for SPCRsvd-LADMM, SPCRsvd-ADMM, SPCR, and SPLS. TPR and TNR are respectively defined by

$$\begin{aligned} \mathrm {TPR}&= \frac{\mathrm {TP}}{ \left| \left\{ j:\zeta ^{*}_{j}\ne 0 \right\} \right| } =\frac{1}{100} \sum _{k=1}^{100} \frac{\left| \left\{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}\ne 0 \right\} \right| }{\left| \left\{ j:\zeta ^{*}_{j}\ne 0 \right\} \right| }, \\\mathrm {TNR}&=\frac{\mathrm {TN}}{ \left| \left\{ j:\zeta ^{*}_{j} = 0 \right\} \right| } = \frac{1}{100} \sum _{k=1}^{100} \frac{\left| \left\{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}=0 \right\} \right| }{\left| \left\{ j:\zeta ^{*}_{j}= 0 \right\} \right| }, \end{aligned}$$

where \(\mathrm {TP}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}\ne 0 \}|/100\), \(\mathrm {TN}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}= 0 \}|/100\), \({\zeta }^{*}_{j}\) is the true j-th coefficient, \({\hat{\zeta }}^{(k)}_{j}\) is the estimated j-th coefficient for the k-th simulation, and \(|\{*\}|\) is the number of elements included in set \(\{*\}\). MCC is defined by

$$\begin{aligned} \text {MCC} = \frac{ \text {TP} \times \text {TN} - \text {FP} \times \text {FN} }{ \sqrt{ (\text {TP}+\text {FP}) (\text {TP}+\text {FN}) (\text {TN}+\text {FP}) (\text {TN}+\text {FN}) } }, \end{aligned}$$

where \(\mathrm {FP}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}= 0 \}|/100\) and \(\mathrm {FN}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}\ne 0 \}|/100\).

Table 7 represents the means and standard deviations of TPR, TNR, and MCC for Case 1. Many methods provided higher TPR, whereas SPCR sometimes did not. SPLS provided the highest TNR and MCC among the methods in all situations. For all cases, these tendencies for TPR and TNR were essentially unchanged, while SPCRsvd-ADMM sometimes provided the highest ratios of MCC. The results from Cases 2 to 6 are shown in the supplementary material.

We also investigated the sensitivity of the tuning parameter w and the penalty parameters. Table 8 shows MSEs for SPCRsvd with \(w=1,0.5,0.01\). Note that we could not compute MSEs for \(w=0.01\) in Case 6. Table 9 shows MSEs for SPCRsvd with \(\rho =1.5, 0.5\), where \(\rho =\rho _1=\rho _2\) for SPCRsvd-LADMM and \(\rho =\rho _1=\rho _2=\rho _3\) for SPCRsvd-ADMM. Note that the number of iterations of the simulation was 10 times and we set \(n=50\), \(\sigma =1\), and \(k=1\) in both settings. From the results, we observe that varying w has little influence on MSEs in SPCRsvd-ADMM, whereas it has a small influence in SPCRsvd-LADMM. For the penalty parameters, we observe that varying \(\rho \) has a small influence on MSEs (in particular, Case 6 seems to be affected by \(\rho \).). However, we note that the influences do not essentially change the conclusions derived from Tables 1, 2, 3, 4, 5 and 6 in almost all cases. This means that MSEs of SPCRsvd may be relatively insensitive to w and \(\rho \).

Table 8 Mean (standard deviation) of MSE for \(w=1,0.5,0.01\)
Table 9 Mean (standard deviation) of MSE for \(\rho =1.5,0.5\), where we set \(\rho =\rho _1=\rho _2\) for SPCRsvd-LADMM and \(\rho =\rho _1=\rho _2=\rho _3\) for SPCRsvd-ADMM

5.2 Real data analyses

Table 10 Sample size and number of covariates in real datasets

We applied SPCRsvd to real datasets. Specifically, we applied it to eight real datasets (housing, communities, concrete, diabetes, parkinsons, triazines, winequality-red, and winequality-white), which are available from the UCI database (http://archive.ics.uci.edu/ml/index.html). The sample sizes and the numbers of covariates are listed in Table 10. If the sample size was larger than 1100, we randomly extracted 1,100 observations from the dataset. For each dataset, we randomly selected 100 observations as training data and used the remaining as test data to estimate MSEs. We standardized the covariates for each dataset. We applied two algorithms: SPCRsvd-LADMM and SPCRsvd-ADMM. The procedure was repeated 50 times.

Table 11 Mean (standard deviation) of MSE for real datasets

We compared SPCRsvd with the four methods used in Sect. 5.1. The number of components was set as \(k=1\). The value of the tuning parameter w in SPCRsvd was set to 0.01, and then \(\lambda _V\) and \(\lambda _{\varvec{\beta }}\) were selected by five-fold cross-validation. The tuning parameters in the other methods were selected in similar manners to in Sect. 5.1.

Table 11 lists the means and standard deviations of MSEs. PLS and PCR were competitive but did not provide the smallest MSEs for any dataset. SPCR was slightly better than PLS and PCR. SPCRsvd-LADMM and SPCRsvd-ADMM provided smaller MSEs than the other methods in many cases. Although SPLS sometimes provided smaller MSEs than other methods, SPLS also had the worst MSEs in some cases. From the result, we may conclude that SPCRsvd-LADMM and SPCRsvd-ADMM are superior to the other methods in terms of giving smaller MSEs, which is consistent with the results in Sect. 5.1.

6 Conclusions

In this paper, we proposed SPCRsvd, a one-stage procedure for PCR with a loss function for regression loss and PCA loss of SVD. To obtain the estimates of the parameters in SPCRsvd, we developed two computational algorithms based on ADMM and LADMM. From our numerical studies, we observed that our one-stage method is competitive with or better than competing approaches.

A major limitation of SPCRsvd is the computational cost. Figure 2 shows common logarithm of the run-times for the simulation presented in Sect. 5.1. Note that the number of iterations of the simulation was 10 times and we set \(n=50\), \(\sigma =1\), and \(k=1\). In these results, we observe that SPCRsvd-ADMM was faster than SPCRsvd-LADMM, and that the SPCRsvd-based methods required more computation time than the other four methods in almost cases. This high computational cost causes some problems. For example, SPCRsvd provides relatively low TNR, based on Table 7. To address this issue, one could apply the adaptive lasso to the regularization term in SPCRsvd. However, owing to the computational cost, it may be difficult to perform SPCRsvd with the adaptive lasso because the adaptive lasso generally requires more computation time than lasso.

Fig. 2
figure 2

Common logarithm of run-times (seconds) for the simulation in Sect. 5.1

SPCRsvd cannot handle binary data for the explanatory variables. To perform PCA for binary data, Lee et al. (2010) introduced the logistic PCA with sparse regularization. It would be interesting to extend SPCRsvd in the context of the method in Lee et al. (2010). We leave them as future research.