In this section, we first formulate the research problem and then present the design of the proposed algorithm, together with a brief analysis.
Problem definition for \(\ell _2\)-penalized PCA model selection
We formulate the model selection problem as selecting the empirically best \(\ell _2\)-penalized PCA for the given dataset with respect to a performance evaluator. We use the following notation:
- \(\lambda \in \Lambda \subseteq {\mathbb {R}}^+\)—the tuning parameter and the set of candidate tuning parameters (a subset of the positive reals);
- \({\mathbf {X}}_{\text {train}} \in {\mathbb {R}}^{n_{\text {train}} \times d}\) and \({\mathbf {X}}_{\text {val}} \in {\mathbb {R}}^{n_{\text {val}} \times d}\)—the training data matrix and the validation data matrix, with \(n_{\text {train}}\) and \(n_{\text {val}}\) samples respectively;
- \({\mathbf {Y}}^j\)—a given \(n \times 1\) vector referring to the estimate of the \(j^{th}\) principal subspace;
- \(\hat{\beta }^{j}(\lambda )\)—the \(j^{th}\) projection vector, i.e., the loading vector of the \(j^{th}\) PC, given by the Ridge solution of Eq. (4), \(\hat{\beta }^{j}(\lambda ) = \left( {\mathbf {X}}^{\top } {\mathbf {X}}+ n\lambda \mathbf {I}\right) ^{-1} {\mathbf {X}}^{\top } {\mathbf {Y}}^{j}\);
- \(\hat{\varvec{\beta }}(\lambda ) = [\hat{\beta }^{1}(\lambda ), \hat{\beta }^{2}(\lambda ), \cdots , \hat{\beta }^{d^{\prime }}(\lambda )] \in {\mathbb {R}}^{d \times d^{\prime }}\)—the projection matrix of \(\ell _2\)-penalized PCA with tuning parameter \(\lambda \), where each column \(\hat{\beta }^{j}(\lambda )\) contains the loadings of the \(j^{th}\) principal component;
- \({\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda )\) and \({\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda )\)—the dimension-reduced training data matrix and the dimension-reduced validation data matrix, respectively;
- \(\mathtt {model(\cdot )}:{\mathbb {R}}^{n_{\mathrm {train}}\times d^{\prime }}\rightarrow \mathcal {H}\)—the target learner for performance tuning, which outputs a model \(h\in \mathcal {H}\) trained on the dimension-reduced training data \({\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda )\);
- \(\mathtt {evaluator(\cdot )}:{\mathbb {R}}^{n_{\mathrm {val}}\times d^{\prime }}\times \mathcal {H}\rightarrow {\mathbb {R}}\)—the evaluator, which outputs the reward (a real scalar) of the input model \(h\) based on the dimension-reduced validation data \({\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda )\).
Then the model selection problem can be defined as follows.
$$\begin{aligned} \begin{aligned}&\underset{\lambda \in \Lambda }{\text {maximize} }&\ \mathtt {evaluator}\left( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda ), \ h(\lambda ) \right) \ , \\&\text { subject to }&\ h(\lambda ) = \mathtt {model}\left( {\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda ) \right) . \end{aligned} \end{aligned}$$
(10)
where
$$\begin{aligned}&\bar{\beta }^j(\lambda )=\mathop {\arg \min }_{\beta \in {\mathbb {R}}^d} \left\{ \frac{1}{2n} \left\| {\mathbf {Y}}^j - {\mathbf {X}}\beta \right\| _2^2 + \frac{\lambda }{2} \Vert \beta \Vert _{2}^2\right\} ,\ \text { for } j=1, \ldots , d^{\prime }, \nonumber \\&\quad \hat{\beta }^{j}(\lambda )=\frac{\bar{\beta }^j(\lambda )}{\Vert \bar{\beta }^j(\lambda )\Vert _2} . \end{aligned}$$
(11)
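For concreteness, the Ridge solution and normalization in Eq. (11) can be computed in closed form as in the following minimal NumPy sketch; the helper name `ridge_loading` and its arguments are ours for illustration, not part of Algorithm 1.

```python
import numpy as np

def ridge_loading(X, Y_j, lam):
    """Closed-form Ridge solution of Eq. (11) for one principal direction.

    X   : (n, d) data matrix
    Y_j : (n,)   estimate of the j-th principal subspace
    lam : float  tuning parameter lambda
    """
    n, d = X.shape
    # beta_bar = (X^T X + n*lambda*I)^{-1} X^T Y^j  -- an O(d^3) linear solve
    beta_bar = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y_j)
    # normalize to unit l2-norm to obtain beta_hat
    return beta_bar / np.linalg.norm(beta_bar)
```

The linear solve dominates the cost and is the source of the \(O(d^3)\) term discussed below.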
Note that \(\mathtt {model}(\cdot )\) can be an arbitrary target learner for the learning task and \(\mathtt {evaluator}(\cdot )\) can be any evaluation function over validation metrics. To make this concrete, consider a classification problem: the target learner \(\mathtt {model}(\cdot )\) can be a support vector machine (SVM) or a random forest (RF), and the evaluation function \(\mathtt {evaluator}(\cdot )\) can be the classification error. To solve the above problem for arbitrary learning tasks \(\mathtt {model}(\cdot )\) under various validation metrics \(\mathtt {evaluator}(\cdot )\), at least two technical challenges need to be addressed.
1. Complexity - For any given and fixed \(\lambda \), the time complexity of solving the \(\ell _2\)-penalized PCA (for dimension reduction to \(d^{\prime }\)) via Ridge regression is \(O(d^{\prime }\cdot d^3)\), as it requires solving the Ridge regression (to obtain the Ridge estimator) \(d^{\prime }\) times to get the loadings of the top-\(d^{\prime }\) principal components, and computing the Ridge estimator in one round costs \(O(d^3)\).
2. Size of \(\Lambda \) - The quality of model selection relies on evaluating models over a wide range of \(\lambda \), so the overall complexity of solving the problem is \(O(|\Lambda |\cdot d^{\prime }\cdot d^3)\), as illustrated by the sketch after this list. We therefore need a well-sampled set of tuning parameters \(\Lambda \) that balances the cost and the quality of model selection.
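To make the cost explicit, a naive grid search that treats each \(\lambda \) independently could be sketched as follows. Here `ridge_loading` is the helper sketched above, while `quasi_ps`, `model`, and `evaluator` are hypothetical stand-ins for the principal-subspace estimate \({\mathbf {Y}}^j\), the target learner, and the validation metric; labels are included because we use the classification example.

```python
import numpy as np

def naive_model_selection(X_train, X_val, Y_train, Y_val, lambdas, d_prime,
                          model, evaluator):
    """Baseline: one O(d^3) Ridge solve per (lambda, component) pair."""
    best_reward, best_B = -np.inf, None
    for lam in lambdas:                               # |Lambda| candidates
        B = np.column_stack([                         # d' Ridge solves
            ridge_loading(X_train, quasi_ps(X_train, j), lam)
            for j in range(d_prime)])
        h = model(X_train @ B, Y_train)               # fit target learner
        reward = evaluator(X_val @ B, Y_val, h)       # validation reward
        if reward > best_reward:
            best_reward, best_B = reward, B
    return best_B
```

Every candidate \(\lambda \) triggers \(d^{\prime }\) fresh \(O(d^3)\) Ridge solves, which is exactly the \(O(|\Lambda |\cdot d^{\prime }\cdot d^3)\) cost that the proposed algorithm avoids.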
Model selection for \(\ell _2\)-penalized PCA over approximated gradient flow
In this section, we present the design of the AgFlow algorithm (Algorithm 1) for obtaining the whole path of the loadings of each principal component for \(\ell _2\)-penalized PCA. Consider the \(j^{th}\) principal component. Let \({\mathbf {Y}}^j\) be the \(j^{th}\) principal subspace, which can be approximated by the Quasi-Principal Subspace Estimation Algorithm (QuasiPS) through the call \(\mathtt {QuasiPS({\mathbf {X}},j)}\) (Algorithm 2). The path of \(\ell _2\)-penalized PCA should be the solution path of the Ridge regression in Eq. (4) with \(\lambda \) varying from \(0\rightarrow \infty \).
With the implicit regularization of Stochastic Gradient Descent (SGD) for Ordinary Least Squares (OLS) (Ali et al. 2020), the solution path is equivalent to the optimization path of the following OLS estimator using SGD with zero initialization, such that
$$\begin{aligned} \mathop {\min }_{\beta \in {\mathbb {R}}^{d}}\ \frac{1}{2n} \Vert {\mathbf {Y}}^j - {\mathbf {X}}\beta \Vert _2^2\ . \end{aligned}$$
(12)
More specifically, with a constant step size \(\eta > 0\), an initialization \(\beta ^j_0 = \mathbf {0}_d\), and mini-batch size m, every SGD iteration updates the estimation as follows,
$$\begin{aligned} \beta ^{j}_{k}&= \beta ^{j}_{k-1} + \frac{\eta }{m} \cdot \sum _{i \in I_{k}} ({\mathbf {Y}}_i^j - {\mathbf {X}}_i^{\top }\beta ^{j}_{k-1} ) \mathbf {x}_i \ , \end{aligned}$$
(13)
for \(k=1,2,\ldots , K\), and thus the solutions on the (stochastic) gradient flow path for the \(j^{th}\) principal component can be obtained. According to (Ali et al. 2020), the relationship between the explicit regularization \(\lambda \) and the implicit regularization effect introduced by SGD is \(\lambda \propto \frac{1}{k \sqrt{\eta }}\). Thus, with the total number of iteration steps K large enough, the proposed algorithm can complete the path of penalized PCA over a full range of \(\lambda \), but at a much lower computational cost.
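A minimal sketch of the recursion in Eq. (13), assuming NumPy and our own function name `sgd_path`: it stores the whole iterate path \(\{\beta ^j_k\}\), where the \(k^{th}\) iterate plays the role of a Ridge solution with \(\lambda \propto 1/(k\sqrt{\eta })\).

```python
import numpy as np

def sgd_path(X, Y_j, eta=0.01, m=32, K=1000, seed=0):
    """Run Eq. (13): SGD on the OLS problem (12) from a zero start,
    returning the whole iterate path; iterate k mimics Ridge with
    lambda proportional to 1 / (k * sqrt(eta))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)                               # zero initialization
    path = []
    for k in range(1, K + 1):
        idx = rng.choice(n, size=m, replace=False)   # mini-batch I_k
        resid = Y_j[idx] - X[idx] @ beta             # Y_i^j - x_i^T beta
        beta = beta + (eta / m) * X[idx].T @ resid   # Eq. (13) update
        path.append(beta.copy())
    return path
```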
In the model selection problem for \(\ell _2\)-penalized PCA based on the Ridge estimator, we need to select the optimal \({\hat{\beta }}(\lambda ^{*})\) corresponding to the optimal \(\lambda ^{*}\). Here we address the same model selection problem with an alternative algorithm that uses AgFlow instead of the matrix inverse in the Ridge estimator; accordingly, we need to select the optimal \({\hat{\beta }}^{*}\) corresponding to the optimal \(k^{*}\) with \(k^{*} \propto \frac{1}{\lambda ^{*}\sqrt{\eta }}\). To obtain the optimal \(\lambda ^{*}\), k-fold cross-validation is usually applied over a search grid of \(\lambda \) values. Analogously, to obtain the optimal \(k^{*}\), the proposed AgFlow algorithm first computes the iterated projection vectors \({\hat{\beta }}_{k}\) on the given training data, where \({\hat{\beta }}_{k}\) corresponds to some \(\hat{\beta }(\lambda )\) with \(k \propto \frac{1}{\lambda \sqrt{\eta }}\), and then selects the optimal \(\beta ^{*}\) based on the performance on the validation data.
Finally, Algorithm 1 outputs the best projection matrix \(\hat{\varvec{\beta }}_{k^*} \in {\mathbb {R}}^{d \times d^{\prime }}\), which maximizes the evaluator \(\mathtt {Eval}_{k} = \mathtt {evaluator} \left( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}_{k}, h(k) \right) \) over \(k=1, \ldots , K\), where the index \(k \propto \frac{1}{\lambda \sqrt{\eta }}\) and each column of the projection matrix \(\hat{\varvec{\beta }}_{k}\) is a normalized projection vector with \(\Vert \hat{\beta }^{j}_{k}\Vert _2=1\). Note that, as discussed in the preliminaries in Sect. 2, when the sample covariance matrix \(1/n{\mathbf {X}}^\top {\mathbf {X}}\) is non-singular (when \(n\gg d\)), there is no need to place any penalty here, i.e., \(\lambda \rightarrow 0\) and \(k \rightarrow \infty \), as the normalization would remove the effect of the \(\ell _2\)-regularization (Zou et al. 2006) by the Karush-Kuhn-Tucker conditions. However, when \(d\gg n\), the sample covariance matrix \(1/n{\mathbf {X}}^\top {\mathbf {X}}\) becomes singular, and the Ridge-like estimator starts to shrink the covariance matrix as in Eq. (3), i.e., \(\lambda \ne 0\) and k is some finite integer rather than \(\infty \), making the sample covariance matrix invertible and the results penalized in a covariance-regularization fashion (Witten et al. 2009). Even though the normalization rescales the vectors onto the \(\ell _2\)-ball, the regularization effect still remains.
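The selection step of Algorithm 1 can then be sketched as follows, again with hypothetical names: `paths` collects the per-component iterate paths (e.g., from `sgd_path` above), each iterate is normalized column-wise into \(\hat{\varvec{\beta }}_{k}\), and the k maximizing the validation reward is kept.

```python
import numpy as np

def select_best_k(paths, X_train, X_val, Y_train, Y_val, model, evaluator):
    """paths[j][k] is the k-th SGD iterate for the j-th component;
    pick the k with the best validation reward."""
    K = len(paths[0])
    best_reward, best_B = -np.inf, None
    for k in range(K):
        # normalized projection matrix beta_hat_k, one column per PC
        B = np.column_stack([p[k] / np.linalg.norm(p[k]) for p in paths])
        h = model(X_train @ B, Y_train)          # h(k) on reduced training data
        reward = evaluator(X_val @ B, Y_val, h)  # Eval_k on reduced validation data
        if reward > best_reward:
            best_reward, best_B = reward, B
    return best_B
```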
Near-optimal initialization for quasi-principal subspace
The goal of the QuasiPS algorithm is to approximate the principal subspace of PCA for a given data matrix at extremely low cost; AgFlow then fine-tunes this rough quasi-principal projection estimate (i.e., the loadings) and obtains the complete path of the \(\ell _2\)-penalized PCA accordingly. While there are various low-complexity algorithms in this area, such as (Hardt and Price 2014; Shamir et al. 2015; De Sa et al. 2015; Balsubramani et al. 2013), we derive the Quasi-Principal Subspace (QuasiPS) estimator (in Algorithm 2) using the stochastic algorithms proposed in (Shamir et al. 2015). More specifically, Algorithm 2 first pursues a rough estimate of the \(j^{th}\) principal component projection (denoted as \(\tilde{w}_L\) after L iterations) using the stochastic approximation of (Shamir et al. 2015), then obtains the quasi-principal subspace \({\mathbf {Y}}^j\) through the projection \({\mathbf {Y}}^j={\mathbf {X}}\tilde{w}_L\).
Note that \(\tilde{w}_L\) is not a precise estimate of the loadings of its principal component (compared to our algorithm and (Oja and Karhunen 1985), etc.); however, it provides a close solution at extremely low cost. In this way, we consider \({\mathbf {Y}}^j={\mathbf {X}}\tilde{w}_L\) a reasonable estimate of the principal subspace. With a random unit initialization \(\tilde{w}_{0}\), \(\tilde{w}_{L}\) converges to the true principal projection \(\mathbf {v}^*\) at a fast rate under mild conditions when \(\tilde{w}_{0}^\top \mathbf {v}^* \ge \frac{1}{\sqrt{2}}\) (Shamir et al. 2015). Thus, our setting should be non-trivial.
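As a rough illustration only, the following sketch replaces the exact stochastic approximation of (Shamir et al. 2015) used in Algorithm 2 with a simple mini-batch power-iteration stand-in; it is not the authors' QuasiPS implementation, but it conveys how a low-accuracy direction \(\tilde{w}_L\) yields \({\mathbf {Y}}^j={\mathbf {X}}\tilde{w}_L\).

```python
import numpy as np

def quasi_ps(X, j, L=10, m=64, seed=0):
    """Rough estimate of the j-th principal subspace Y^j = X @ w_tilde.

    Simplified mini-batch power-iteration stand-in for the stochastic
    approximation of Shamir et al. (2015); only a low-accuracy
    direction is needed here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # (for j > 0 one would deflate the top-j directions first; omitted here)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                        # random unit initialization
    for _ in range(L):
        idx = rng.choice(n, size=min(m, n), replace=False)
        w = X[idx].T @ (X[idx] @ w) / len(idx)    # stochastic covariance step
        w /= np.linalg.norm(w)
    return X @ w                                  # quasi-principal subspace Y^j
```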
Algorithm analysis
In this section, we analyze the proposed algorithm from the perspectives of statistical performance and computational complexity.
Statistical performance
The AgFlow algorithm consists of two steps: Quasi-PS initialization and solution path retrieval. As the goal of our research is fast model selection over the complete solution path of \(\ell _2\)-penalized PCA with varying penalties, the performance analysis of the proposed algorithm can be decomposed into two parts.
Approximation of Quasi-PS to the true principal subspace. In Algorithm 2, the \(\mathtt {QuasiPS}\) algorithm first obtains a Quasi-PS projection \(\tilde{w}_L\) using L epochs of low-complexity stochastic approximation, and then projects the samples to obtain the \(j^{th}\) principal subspace \({\mathbf {Y}}^{j}={\mathbf {X}}\tilde{w}_L\).
Lemma 2
Under some mild conditions as in Shamir et al. (2015) and given the true principal projection \(w^*\), with probability at least \(1-{\mathrm {log}}_{2}(1/\varepsilon )\delta \), the distance between \(w^*\) and \(\tilde{w}_L\) satisfies
$$\begin{aligned} \Vert \tilde{w}_L-w^*\Vert _2^2\le 2-2\sqrt{1-\varepsilon } , \end{aligned}$$
(14)
provided that \(L={\mathrm {log}}(1/\varepsilon )/{\mathrm {log}}(2/\delta )\).
This follows directly from Theorem 1 in Shamir et al. (2015). When \(\varepsilon \rightarrow 0\), \(\Vert \tilde{w}_L-w^*\Vert _2^2\rightarrow 0\) and the error bound becomes tight. Suppose the samples in \({\mathbf {X}}\) are i.i.d. realizations of the random variable X, and let \({\mathbb {E}}XX^\top =\Sigma ^*\) denote the true covariance.
Theorem 1
Under some mild conditions as in Shamir et al. (2015), the distance between the Quasi-PS and the true principal subspace satisfies
$$\begin{aligned} \begin{aligned} \underset{{\mathbf {x}}\sim {X}}{{\mathbb {E}}}\Vert \mathbf {x}\tilde{w}_L-\mathbf {x}w^*\Vert _2^2&= (\tilde{w}_L-w^*)^\top \Sigma ^* (\tilde{w}_L-w^*)\\&\le \lambda _{\mathrm {max}}(\Sigma ^*)\Vert \tilde{w}_L-w^*\Vert _2^2\\&= (2-2\sqrt{1-\varepsilon })\cdot \lambda _{\mathrm {max}}(\Sigma ^*)\ , \end{aligned} \end{aligned}$$
(15)
where \(\lambda _{\mathrm {max}}(\cdot )\) refers to the largest eigenvalue of a matrix.
When the largest eigenvalue \(\lambda _{\mathrm {max}}(\Sigma ^*)\) is regarded as a constant, Quasi-PS is expected to achieve an exponential convergence rate for principal subspace approximation for every sample. Thus, the statistical performance of Quasi-PS can be guaranteed.
Approximation of approximated stochastic gradient flow to the solution path of ridge. In (Ali et al. 2019, 2020), the authors have demonstrated that when the learning rate \(\eta \rightarrow 0\), the discrete-time SGD and GD algorithms would diffuse to two continuous-time dynamics over (stochastic) gradient flows, i.e., \(\hat{\beta }_{\mathrm {sgf}}(t)\) and \(\hat{\beta }_{\mathrm {gf}}(t)\) over continuous time \(t>0\). According to Theorem 1 in Ali et al. (2019), the statistical risk between Ridge and continuous-time gradient flow is bounded by
$$\begin{aligned} {\mathrm {Risk}}(\hat{\beta }_{\mathrm {gf}}(t),\beta ^*)\le 1.6862 \cdot {\mathrm {Risk}}({\widehat{\beta }}_{\text {ridge}}(1/t), \beta ^*) \end{aligned}$$
(16)
where \({\mathrm {Risk}}(\beta _1,\beta _2)={\mathbb {E}}\Vert \beta _1-\beta _2\Vert ^2_2\), \(\beta ^*\) refers to the true estimator, and \(\lambda =1/t\) for Ridge. The stochastic gradient flow enjoys faster convergence but incurs a slightly larger statistical risk, such that
$$\begin{aligned} \begin{aligned} {\mathrm {Risk}}(\hat{\beta }_{\mathrm {sgf}}(t),\beta ^*)\le {\mathrm {Risk}}(\hat{\beta }_{\mathrm {gf}}(t),\beta ^*)+{o}\left( \frac{n}{m}\right) \ , \end{aligned} \end{aligned}$$
(17)
where m refers to the batch size and \({o}\left( \frac{n}{m}\right) \) is an error term caused by the stochastic gradient noise. Under mild conditions, with the discretization \(\mathrm {d} t=\sqrt{\eta }\), the \(k^{th}\) iteration of SGD for Ordinary Least Squares, denoted as \(\beta _k\), tightly approximates \(\hat{\beta }_{\mathrm {sgf}}(t)\) within \({o}(\sqrt{\eta })\) at \(t=k\sqrt{\eta }\). In this way, given the learning rate \(\eta \) and the total number of iterations K, the implicit Ridge-like AgFlow screens the \(\ell _2\)-penalized PCA with \(\lambda \) varying in the range
$$\begin{aligned} \frac{1}{K\sqrt{\eta }}\le \lambda \le \frac{1}{\sqrt{\eta }}\ , \end{aligned}$$
(18)
with bounded error in both statistics and approximation.
In this way, we conclude that under mild conditions, QuasiPS well approximates the true principal subspace (\(d^{\prime }\ll n\)), while AgFlow retrieves a tight approximation of the Ridge solution path.
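For instance, the implicit penalty range of Eq. (18) can be read off directly from the chosen step size and iteration budget; a tiny helper with illustrative numbers only:

```python
import math

def implicit_lambda_range(eta, K):
    """Range of l2 penalties implicitly screened by AgFlow, cf. Eq. (18)."""
    return 1.0 / (K * math.sqrt(eta)), 1.0 / math.sqrt(eta)

# e.g. eta = 1e-4 and K = 10_000 screen lambda from 1e-2 up to 1e+2
print(implicit_lambda_range(1e-4, 10_000))   # (0.01, 100.0)
```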
Computational complexity
The proposed algorithm consists of two steps: the initialization of the quasi-principal subspace and the path retrieval. To obtain a fine estimate of the Quasi-PS and achieve the error bound in Eq. (14), one should run Shamir's algorithm (Shamir et al. 2015) with
$$\begin{aligned} O\left( ({\mathrm {rank}}(\Sigma ^*)/{\mathrm {eigengap}}(\Sigma ^*))^2{\mathrm {log}}(1/\varepsilon )\right) \end{aligned}$$
iterations, where \({\mathrm {rank}}(\cdot )\) refers to the matrix rank, \({\mathrm {eigengap}}(\cdot )\) refers to the gap between the first and second eigenvalues, and \(\varepsilon \) has been defined in Lemma 2 referring to the error of principal subspace estimation.
Furthermore, to obtain the loadings corresponding to the \(j^{th}\) principal subspace, AgFlow runs K iterations of SGD for OLS to obtain K models of \(\ell _2\)-penalized PCA, where each iteration costs \(O(m\cdot d^2)\) with batch size m, giving a total of \(O(K\cdot m \cdot d^2)\) for K models, and \(O(d^{\prime } \cdot K\cdot m \cdot d^2)\) for the reduced dimension \(d^{\prime }\). Moreover, we also propose to run \(\texttt {AgFlow}\) with the full batch size \(m=n\) using gradient descent per iteration, which only costs \(O(d^2)\) per iteration with lazy evaluation of \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}\), giving a total of \(O(K\cdot d^2)\) for K models and \(O(d^{\prime } \cdot K \cdot d^2)\) for the reduced dimension \(d^{\prime }\).
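A sketch of the full-batch variant with lazy evaluation (our own function name, assuming NumPy): once \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}^j\) are formed, each gradient step reduces to a \(d\times d\) matrix-vector product, i.e., \(O(d^2)\).

```python
import numpy as np

def gd_path_lazy(X, Y_j, eta=0.01, K=1000):
    """Full-batch gradient descent on Eq. (12) with lazy evaluation:
    X^T X and X^T Y^j are formed once, so each iteration costs O(d^2)."""
    n, d = X.shape
    XtX = X.T @ X / n            # cached Gram matrix (computed once)
    XtY = X.T @ Y_j / n          # cached cross term (computed once)
    beta, path = np.zeros(d), []
    for _ in range(K):
        beta = beta + eta * (XtY - XtX @ beta)   # O(d^2) gradient step
        path.append(beta.copy())
    return path
```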
To further improve AgFlow without incurring higher-order complexity, we carry out the experiments by running a mini-batch AgFlow and a full-batch AgFlow (i.e., \(m=n\)) with lazy evaluation of \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}\) in parallel for model selection.