Abstract
Principal component analysis (PCA) has been widely used as an effective technique for feature extraction and dimension reduction. In the High Dimension Low Sample Size (HDLSS) setting, one may prefer modified principal components with penalized loadings, selecting the penalty automatically through model selection over models with varying penalties. Earlier work (Zou et al. in J Comput Graph Stat 15(2):265–286, 2006; Gaynanova et al. in J Comput Graph Stat 26(2):379–387, 2017) proposed penalized PCA and indicated the feasibility of model selection in \(\ell _2\)-penalized PCA through the solution path of Ridge regression; however, this approach is extremely time-consuming because of the intensive computation of matrix inverses. In this paper, we propose a fast model selection method for penalized PCA, named Approximated Gradient Flow (AgFlow), which lowers the computational complexity by incorporating the implicit regularization effect introduced by (stochastic) gradient flow (Ali et al. in: The 22nd international conference on artificial intelligence and statistics, pp 1370–1378, 2019; Ali et al. in: International conference on machine learning, 2020) and obtains the complete solution path of \(\ell _2\)-penalized PCA under varying \(\ell _2\)-regularization. We perform extensive experiments on real-world datasets. AgFlow outperforms existing methods (Oja and Karhunen in J Math Anal Appl 106(1):69–84, 1985; Hardt and Price in: Advances in neural information processing systems, pp 2861–2869, 2014; Shamir in: International conference on machine learning, pp 144–152, PMLR, 2015; and the vanilla Ridge estimator) in terms of computation cost.
Introduction
Principal component analysis (PCA) (Jolliffe 1986; Dutta et al. 2019) is widely used as an effective technique for feature transformation, data processing and dimension reduction in unsupervised data analysis, with numerous applications in machine learning and statistics such as handwritten digit classification (LeCun et al. 1995; Hastie et al. 2009), human face recognition (Huang et al. 2008; Mohammed et al. 2011), and gene expression data analysis (Yeung and Ruzzo 2001; Zhu et al. 2007). Generally, given a data matrix \({\mathbf {X}}\in {\mathbb {R}}^{n\times d}\), where n refers to the number of samples and d refers to the number of variables in each sample, PCA can be formulated as the problem of projecting samples onto a lower \(d^{\prime }\)-dimensional subspace (\(d^{\prime }\ll d\)) with variance maximized. To achieve this goal, numerous algorithms, such as Oja’s algorithm (Oja and Karhunen 1985), the power iteration algorithm (Hardt and Price 2014), and stochastic/incremental algorithms (Shamir et al. 2015; Arora et al. 2012; Mitliagkas et al. 2013; De Sa et al. 2015) have been proposed, and the convergence behaviors of these algorithms have also been intensively investigated. In summary, given the matrix of raw data samples, the eigensolvers above output \(d^{\prime }\) vectors which are linear combinations of the original predictors, projecting the original samples onto the desired \(d^{\prime }\)-dimensional subspace while capturing maximal variance.
In addition to the above estimators of PCA, penalized PCA has been proposed (Zou et al. 2006; Irina et al. 2017; Witten et al. 2009; Lee et al. 2012) to improve performance using regularization. For example, Zou et al. (2006) introduced a direct estimation of \(\ell _2\)-penalized PCA using the Ridge estimator (see also Theorem 1 in Section 3.1 of Zou et al. (2006)), where an \(\ell _2\)-regularization hyperparameter (denoted as \(\lambda \)) balances the error term for fitting and the penalty term for regularization, whereas an \(\ell _1\)-regularization is usually introduced to achieve sparsity (Irina et al. 2017). Though the effect of \(\lambda \) in \(\ell _2\)-penalized PCA would be waived by normalization when the sample covariance matrix is nonsingular (i.e., \(d<n\)), the penalty term indeed regularizes the sample covariance matrix (Witten et al. 2009) for a stable inverse under High Dimension Low Sample Size (HDLSS) settings. It is therefore necessary to perform model selection to find the optimal \(\lambda \) on the solution path, where the solution path is formed by the solutions corresponding to all candidate \(\lambda \)s in the \(\ell _2\)-penalized problem, and each model is determined by the parameter \(\lambda \). Thus, given datasets for training and validation, one only needs to retrieve the complete solution path (Friedman et al. 2010; Zou and Hastie 2005) of penalized PCA on the training dataset, where each solution corresponds to a Ridge estimator, and iterate over every model on the solution path for validation and model selection. An example of \(\ell _2\)-penalized PCA for dimension reduction over the solution path is shown in Fig. 1, where both the validation and testing accuracy are heavily affected by the value of \(\lambda \), the \(\ell _2\)-regularization in \(\ell _2\)-penalized PCA, no matter what kind of classifier is employed after dimension reduction.
While a solution path for penalized PCA is highly desirable, the computational cost of estimating a large number of models by grid search over hyperparameters is usually unacceptable. Specifically, to obtain the complete solution path of models for \(\ell _2\)-penalized PCA, one needs to repeatedly solve the Ridge estimator over a wide range of values of \(\lambda \), where the matrix inverse needed for the shrunken sample covariance matrix costs \(O(d^3)\) for every possible setting of \(\lambda \). To lower the complexity, inspired by recent progress on the implicit regularization effects of gradient descent (GD) and stochastic gradient descent (SGD) in solving Ordinary Least-Squares (OLS) problems (Ali et al. 2019, 2020), we propose a fast model selection method, named Approximated Gradient Flow (AgFlow), an efficient and effective algorithm that accelerates model selection for \(\ell _2\)-penalized PCA with varying penalties.
Our contributions. We make three technical contributions as follows.

We study the problem of lowering the computational complexity while accelerating model selection for penalized PCA under varying penalties, where we particularly pay attention to \(\ell _2\)-penalized PCA via the commonly-used Ridge estimator (Zou et al. 2006) under High Dimension Low Sample Size (HDLSS) settings.

We propose AgFlow for fast model selection in \(\ell _2\)-penalized PCA with \(O(K\cdot d^2)\) complexity, where K refers to the total number of models estimated for selection, and d is the number of dimensions. More specifically, AgFlow first adopts the algorithm of Shamir et al. (2015) to sketch \(d^{\prime }\) principal subspaces, then retrieves the complete solution path of the corresponding loadings for every principal subspace. In particular, AgFlow incorporates the implicit \(\ell _2\)-regularization of the approximated (stochastic) gradient flow over Ordinary Least Squares (OLS) to screen and validate the \(\ell _2\)-penalized loadings, under \(\lambda \) varying from \(+\infty \rightarrow 0^+\), using the training and validation datasets respectively.

We conduct extensive experiments to evaluate AgFlow, comparing the proposed algorithm with vanilla PCA (Hardt and Price 2014; Shamir et al. 2015; Oja and Karhunen 1985) and \(\ell _2\)-penalized PCA via the Ridge estimator (Zou et al. 2006) on real-world datasets. The experiments are all based on HDLSS settings, where a limited number of high-dimensional samples are given for PCA estimation and model selection. The results show that the proposed algorithm significantly outperforms the vanilla PCA algorithms (Hardt and Price 2014; Shamir et al. 2015; Oja and Karhunen 1985), with better performance on validation/testing datasets gained from the flexibility of performance tuning (i.e., model selection). On the other hand, AgFlow consumes even less computation time while selecting from 50 times more models compared to the Ridge-based estimator (Zou et al. 2006).
Note that we do not intend to propose “off-the-shelf” estimators that reduce the computational complexity of PCA estimation. Instead, we study the problem of model selection for \(\ell _2\)-regularized PCA, where we combine existing algorithms (Zou et al. 2006; Ali et al. 2019; Shamir et al. 2015) to lower the complexity of model selection and accelerate the procedure. The unique contribution made here is to incorporate the novel continuous-time dynamics of gradient descent (gradient flow) (Ali et al. 2019, 2020) to obtain time-varying implicit regularization effects of \(\ell _2\)-type for PCA model selection.
Notations The following key notations are used in the rest of the paper. Let \(\mathbf {x} \in {\mathbb {R}}^{d}\) be the d-dimensional predictors and \(y \in {\mathbb {R}}\) be the response, and denote \({\mathbf {X}}=[\mathbf {x}_1, \ldots , \mathbf {x}_n]^\top = [X_1, \ldots , X_d]\) and \(Y = [y_1, \ldots , y_n]^\top \), where n is the sample size and d is the number of variables. Without loss of generality, assume \(X_{j},\ j=1, \ldots , d\) and Y are centered. Given a d-dimensional vector \(\mathbf {z} \in {\mathbb {R}}^{d}\), denote the \(\ell _2\) vector norm \( \Vert \mathbf {z}\Vert _2 = \left( \sum _{i=1}^d z_i^2 \right) ^{1/2} \).
Preliminaries
In this section, we first briefly introduce ordinary least squares and Ridge regression, then show how PCA can be rewritten as a regression-type optimization problem with an explicit \(\ell _2\)-regularization parameter \(\lambda \), and finally introduce the implicit regularization effect of the (stochastic) gradient flow.
Ordinary least squares and ridge regression
Let \({\mathbf {X}}\in {\mathbb {R}}^{n \times d}\) and \(Y \in {\mathbb {R}}^{n}\) be a matrix of predictors (or features) and a response vector, respectively, with n observations and d predictors. Assume the columns of \({\mathbf {X}}\) and Y are centered. Consider the ordinary least squares (linear) regression problem
$$\begin{aligned} \hat{\beta }_{\text {OLS}} = \mathop {\arg \min }_{\beta \in {\mathbb {R}}^{d}} \frac{1}{2n} \Vert Y - {\mathbf {X}}\beta \Vert _2^2. \end{aligned}$$
(1)
To enhance the OLS solution for linear regression, regularization is commonly used in optimization problems to achieve a sparse solution or to alleviate multicollinearity (Friedman et al. 2010; Zou and Hastie 2005; Tibshirani 1996; Fan and Li 2001; Yuan and Lin 2006; Candes and Tao 2007). A large body of literature has focused on related regularization methods, such as the lasso (Tibshirani 1996), which aids interpretability through a sparse solution; the grouped lasso (Yuan and Lin 2006), where variables are included or excluded in groups; the elastic net (Zou and Hastie 2005) for correlated variables, which compromises between \(\ell _1\) and \( \ell _2\) penalties; the Dantzig selector (Candes and Tao 2007), a slightly modified version of the lasso; and other variants (Fan and Li 2001).
The ridge regression is the \(\ell _2\)-regularized version of the linear regression in Eq. (1), imposing an explicit \(\ell _2\) regularization on the coefficients (Hoerl and Kennard 1970; Hoerl et al. 1975). Thus, the ridge estimator \(\hat{\beta }_{\text {ridge}}(\lambda )\), a penalized least squares estimator, can be obtained by minimizing the ridge criterion
$$\begin{aligned} \hat{\beta }_{\text {ridge}}(\lambda ) = \mathop {\arg \min }_{\beta \in {\mathbb {R}}^{d}} \frac{1}{2n} \Vert Y - {\mathbf {X}}\beta \Vert _2^2 + \frac{\lambda }{2} \Vert \beta \Vert _2^2. \end{aligned}$$
(2)
The solution of the ridge regression has an explicit closed form,
$$\begin{aligned} \hat{\beta }_{\text {ridge}}(\lambda ) = \left( {\mathbf {X}}^{\top } {\mathbf {X}}+ n\lambda {\mathbf {I}}\right) ^{-1} {\mathbf {X}}^{\top } Y. \end{aligned}$$
(3)
We can see that the ridge estimator, Eq. (3), applies a type of shrinkage in comparison to the OLS solution \(\hat{\beta }_{\text {OLS}} = ({\mathbf {X}}^{\top } {\mathbf {X}})^{-1} {\mathbf {X}}^{\top }Y\): it shrinks the coefficients of correlated predictors towards each other and thus alleviates the multicollinearity problem.
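As a concrete illustration (our own sketch, not code from the paper), the shrinkage behavior of the Ridge estimator in Eq. (3) can be checked numerically in a few lines of NumPy; the toy data and the `ridge` helper below are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: the closed-form Ridge estimator
# (X^T X + n*lambda*I)^{-1} X^T Y shrinks coefficients relative to OLS.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
Y = X @ np.arange(1.0, d + 1.0) + 0.1 * rng.standard_normal(n)

def ridge(X, Y, lam):
    """Ridge estimator; lam = 0 recovers the OLS solution."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

beta_ols = ridge(X, Y, 0.0)
beta_reg = ridge(X, Y, 10.0)
# heavier regularization => smaller coefficient norm
assert np.linalg.norm(beta_reg) < np.linalg.norm(beta_ols)
```

Since the ridge norm is monotonically non-increasing in \(\lambda \), sweeping `lam` over a grid already traces a (discrete) solution path, at one \(O(d^3)\) solve per grid point.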
PCA as ridge regression
PCA can be formulated as a regression-type optimization problem, as first proposed by Zou et al. (2006), where the loadings can be recovered by regressing the principal components on the d variables given the principal subspace.
Consider the \(j^{th}\) principal component. Let \({\mathbf {Y}}^j\) be a given \(n \times 1\) vector referring to the estimate of the \(j^{th}\) principal subspace. For any \(\lambda \ge 0\), the Ridge-based estimator (Theorem 1 in Zou et al. (2006)) of \(\ell _2\)-penalized PCA is defined as
$$\begin{aligned} \hat{\beta }^{j}(\lambda ) = \mathop {\arg \min }_{\beta \in {\mathbb {R}}^{d}} \Vert {\mathbf {Y}}^{j} - {\mathbf {X}}\beta \Vert _2^2 + n\lambda \Vert \beta \Vert _2^2. \end{aligned}$$
(4)
Obviously, the estimator above highly depends on the estimate of the principal subspace \({\mathbf {Y}}^{j}\). Given the original data matrix \({\mathbf {X}}=[\mathbf {x}_1,\ldots ,\mathbf {x}_n]^\top \), we can obtain its singular value decomposition \({\mathbf {X}}=\mathbf {USV}^\top \), and the estimate of the subspace can be \({\mathbf {Y}}^j={\mathbf {U}}_j{\mathbf {S}}_j\), where \({\mathbf {U}}_j\) and \({\mathbf {S}}_j\) refer to the \(j^{th}\) columns of the corresponding matrices, respectively. Then the normalized vector can be used as the penalized loadings of the \(j^{th}\) principal component,
$$\begin{aligned} \hat{{\mathbf {V}}}_{j}(\lambda ) = \hat{\beta }^{j}(\lambda ) \big / \Vert \hat{\beta }^{j}(\lambda )\Vert _2. \end{aligned}$$
(5)
Note that, when the sample covariance matrix \(\frac{1}{n}{\mathbf {X}}^\top {\mathbf {X}}\) is nonsingular (\(d\le n\)), \({\hat{\beta }}^j(\lambda )\) is invariant to \(\lambda \) and \({\hat{\beta }}^j(\lambda )\propto \mathbf {V}_j\). When the sample covariance matrix is singular (\(d>n\)), the \(\ell _2\)-norm penalty regularizes the inverse of the shrunken covariance matrix (Witten et al. 2009) with respect to the strength of \(\lambda \).
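The Ridge-based recovery of loadings described above can be sketched as follows (our own illustration under an HDLSS toy setting; the tiny \(\lambda \) is only an assumption to keep the shrunken covariance invertible):

```python
import numpy as np

# Sketch of the penalized-PCA estimator: regress the principal subspace
# Y^j = U_j S_j on X via Ridge, then normalize the solution.
rng = np.random.default_rng(1)
n, d = 30, 100                       # HDLSS: d > n, sample covariance singular
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                  # center the columns

U, S, Vt = np.linalg.svd(X, full_matrices=False)
j = 0
Yj = U[:, j] * S[j]                  # estimate of the j-th principal subspace

def penalized_loading(X, Yj, lam):
    n, d = X.shape
    beta = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Yj)
    return beta / np.linalg.norm(beta)   # normalized loading vector

# For any lam > 0 the normalized loading is proportional to the right
# singular vector V_j, in line with the invariance noted above.
v_hat = penalized_loading(X, Yj, 1e-6)
assert abs(abs(v_hat @ Vt[j]) - 1.0) < 1e-6
```

Here the normalization removes the scalar shrinkage for a single exact subspace estimate; the regularization effect matters once \({\mathbf {Y}}^j\) is only an approximation and \(d>n\).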
Implicit regularization with (stochastic) gradient flow
The implicit regularization effect of an estimation method means that the method produces an estimate exhibiting a kind of regularization, even though it does not employ an explicit regularizer (Ali et al. 2019, 2020; Friedman and Popescu 2003, 2004). Consider gradient descent applied to Eq. (1), with initialization \(\beta _0 = \mathbf {0}\) and a constant step size \(\eta > 0\), which gives the iterations
$$\begin{aligned} \beta _{k} = \beta _{k-1} + \eta \cdot \frac{1}{n} {\mathbf {X}}^{\top } \left( Y - {\mathbf {X}}\beta _{k-1}\right) , \end{aligned}$$
(6)
for \(k=1, 2, 3, \ldots \). With a simple rearrangement, we get
$$\begin{aligned} \frac{\beta _{k} - \beta _{k-1}}{\eta } = \frac{1}{n} {\mathbf {X}}^{\top } \left( Y - {\mathbf {X}}\beta _{k-1}\right) . \end{aligned}$$
(7)
To adopt a continuous-time (gradient flow) view, consider an infinitesimal step size in gradient descent, i.e., \(\eta \rightarrow 0 \). The gradient flow differential equation for the OLS problem can be obtained as
$$\begin{aligned} \dot{\beta }(t) = \frac{1}{n} {\mathbf {X}}^{\top } \left( Y - {\mathbf {X}}\beta (t)\right) , \end{aligned}$$
(8)
which is a continuous-time ordinary differential equation over time \(t \ge 0\) with the initial condition \(\beta (0) = \mathbf {0}\). Setting \(\beta (t) = \beta _{k}\) at time \(t=k \eta \), the left-hand side of Eq. (7) can be viewed as the discrete derivative of \(\beta (t)\) at time t, which approaches the continuous-time derivative as \(\eta \rightarrow 0 \). To be clear, \(\beta (t)\) denotes the continuous-time view, and \(\beta _{k}\) the discrete-time view.
Lemma 1
With fixed predictor matrix \({\mathbf {X}}\) and fixed response vector Y, the gradient flow problem in Eq. (8), subject to \(\beta (0) = \mathbf {0}\), admits the following exact solution (Ali et al. 2019)
$$\begin{aligned} \hat{\beta }_{\mathrm {gf}}(t) = \left( {\mathbf {X}}^{\top } {\mathbf {X}}\right) ^{+} \left( {\mathbf {I}} - \exp \left( - t\, {\mathbf {X}}^{\top } {\mathbf {X}}/ n \right) \right) {\mathbf {X}}^{\top } Y \end{aligned}$$
(9)
for all \(t \ge 0\). Here \(A^{+}\) is the Moore–Penrose generalized inverse of a matrix A, and \(\exp (A) = I + A + A^2/2! + A^3/3! + \cdots \) is the matrix exponential.
In continuous time, \(\ell _2\)-regularization corresponds to taking the estimator \(\hat{\beta }_{\mathrm {gf}}(t)\) in Eq. (9) at any finite value of \(t \ge 0\), where smaller t corresponds to greater regularization. Specifically, the time t of gradient flow and the tuning parameter \(\lambda \) of ridge regression are related by \(\lambda = 1/t\).
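Lemma 1 can be checked numerically: gradient descent on the OLS loss with a small step size tracks the closed-form gradient-flow solution. The sketch below is our own verification (toy data; the eigendecomposition route to the matrix exponential is an implementation choice, not the paper's code):

```python
import numpy as np

# Numerical check of Lemma 1: GD on (1/(2n))||Y - X b||^2 with small eta
# tracks b_gf(t) = (X^T X)^+ (I - exp(-t X^T X / n)) X^T Y at t = k * eta.
rng = np.random.default_rng(2)
n, d = 40, 6
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

C = X.T @ X / n
w, V = np.linalg.eigh(C)             # C = V diag(w) V^T (symmetric PSD)

def beta_gf(t):
    # matrix exponential exp(-t C) via the eigendecomposition of C
    expo = V @ np.diag(np.exp(-t * w)) @ V.T
    return np.linalg.pinv(X.T @ X) @ (np.eye(d) - expo) @ (X.T @ Y)

eta, K = 1e-3, 2000                  # total time t = K * eta = 2.0
beta = np.zeros(d)                   # beta(0) = 0
for _ in range(K):
    beta += eta * X.T @ (Y - X @ beta) / n
assert np.linalg.norm(beta - beta_gf(K * eta)) < 1e-2
```

Under the calibration \(\lambda = 1/t\), early stopping at small t thus plays the role of a large Ridge penalty.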
The proposed AgFlow algorithm
In this section, we first formulate the research problem, then present the design of the proposed algorithm together with a brief analysis.
Problem definition for \(\ell _2\)penalized PCA model selection
We formulate the model selection problem as selecting the empirically best \(\ell _2\)-penalized PCA for the given dataset with respect to a performance evaluator.

\(\lambda \in \Lambda \subseteq {\mathbb {R}}^+\)—the tuning parameter and the set of candidate tuning parameters (a subset of the positive reals);

\({\mathbf {X}}_{\text {train}} \in {\mathbb {R}}^{n_{\text {train}} \times d}\) and \({\mathbf {X}}_{\text {val}} \in {\mathbb {R}}^{n_{\text {val}} \times d}\)—the training data matrix and the validation data matrix, with \(n_{\text {train}}\) samples and \(n_{\text {val}}\) samples respectively;

\({\mathbf {Y}}^j\)—a given \(n \times 1\) vector referring to the estimate of the \(j^{th}\) principal subspace;

\(\hat{\beta }^{j}(\lambda )\)—the \(j^{th}\) projection vector, or the corresponding loading vector of the \(j^{th}\) PC, \(\hat{\beta }^{j}(\lambda ) = \left( {\mathbf {X}}^{\top } {\mathbf {X}}+ n\lambda \mathbf {I}\right) ^{-1} {\mathbf {X}}^{\top } {\mathbf {Y}}^{j}\), the solution of Eq. (4);

\(\hat{\varvec{\beta }}(\lambda ) = [\hat{\beta }^{1}(\lambda ), \hat{\beta }^{2}(\lambda ), \cdots , \hat{\beta }^{d^{\prime }}(\lambda )] \in {\mathbb {R}}^{d \times d^{\prime }}\)—the projection matrix based on \(\ell _2\)-penalized PCA with the tuning parameter \(\lambda \), where each column \(\hat{\beta }^{j}(\lambda )\) is the corresponding loadings of the \(j^{th}\) principal component;

\({\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda )\) and \({\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda )\)—the dimension-reduced training data matrix and the dimension-reduced validation data matrix, respectively;

\(\mathtt {model(\cdot )}:{\mathbb {R}}^{n_{\mathrm {train}}\times d^{\prime }}\rightarrow \mathcal {H}\)—the target learner for performance tuning that outputs a model \(h\in \mathcal {H}\) using the dimensionreduced training data \({\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda )\).

\(\mathtt {evaluator(\cdot )}:{\mathbb {R}}^{n_{\mathrm {val}}\times d^{\prime }}\times \mathcal {H}\rightarrow {\mathbb {R}}\)—the evaluator that outputs the reward (a real scalar) of the input model h based on the dimensionreduced validation data \({\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda )\).
Then the model selection problem can be defined as follows:
$$\begin{aligned} \lambda ^{*} = \mathop {\arg \max }_{\lambda \in \Lambda } \; \mathtt {evaluator} \left( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}(\lambda ), h(\lambda ) \right) , \end{aligned}$$
(10)
where
$$\begin{aligned} h(\lambda ) = \mathtt {model} \left( {\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}(\lambda ) \right) . \end{aligned}$$
(11)
Note that \(\mathtt {model}(\cdot )\) can be an arbitrary target learner for the learning task and \(\mathtt {evaluator}(\cdot )\) can be any evaluation function of validation metrics. To make this concrete, take a classification problem as an example: the target learner \(\mathtt {model}(\cdot )\) can be a support vector machine (SVM) or a random forest (RF), and the evaluation function \(\mathtt {evaluator}(\cdot )\) can be the classification error. To solve the above problem for arbitrary learning tasks \(\mathtt {model}(\cdot )\) under various validation metrics \(\mathtt {evaluator}(\cdot )\), at least two technical challenges need to be addressed.

1.
Complexity  For any given and fixed \(\lambda \), the time complexity to solve the \(\ell _2\)-penalized PCA (for dimension reduction to \(d^{\prime }\)) based on Ridge regression is \(O(d^{\prime }\cdot d^3)\), as it requires solving the Ridge regression (to obtain the Ridge estimator) \(d^{\prime }\) times to obtain the corresponding loadings of the top-\(d^{\prime }\) principal components, and the complexity of calculating the Ridge estimator in one round is \(O(d^3)\).

2.
Size of \(\Lambda \)  The performance of model selection relies on evaluating models over a wide range of \(\lambda \), while the overall complexity to solve the problem is \(O(|\Lambda |\cdot d^{\prime }\cdot d^3)\). Thus, we need a well-sampled set of tuning parameters \(\Lambda \) that balances the cost and the quality of model selection.
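To make the \(\mathtt {model}(\cdot )\)/\(\mathtt {evaluator}(\cdot )\) interfaces concrete, the following dependency-free sketch uses a nearest-centroid classifier as the target learner and validation accuracy as the evaluator; the paper uses learners such as SVM or RF, so these choices (and all names below) are purely illustrative:

```python
import numpy as np

# Hypothetical plug-ins: model(.) returns a hypothesis h (here: class
# centroids), evaluator(.) returns a real-valued reward (here: accuracy).
def model(Z_train, y_train):
    classes = np.unique(y_train)
    return {c: Z_train[y_train == c].mean(axis=0) for c in classes}

def evaluator(Z_val, y_val, h):
    classes = np.array(sorted(h))
    dists = np.stack([np.linalg.norm(Z_val - h[c], axis=1) for c in classes])
    y_pred = classes[np.argmin(dists, axis=0)]
    return float(np.mean(y_pred == y_val))     # reward: validation accuracy

# toy dimension-reduced data with two well-separated classes
rng = np.random.default_rng(3)
Z0 = rng.standard_normal((20, 2)) + 4.0
Z1 = rng.standard_normal((20, 2)) - 4.0
Z = np.vstack([Z0, Z1])
y = np.array([0] * 20 + [1] * 20)
h = model(Z, y)
assert evaluator(Z, y, h) > 0.9
```

Any learner/metric pair with these signatures can be dropped into the selection problem above without changing the rest of the procedure.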
Model selection for \(\ell _2\)penalized PCA over approximated gradient flow
In this section, we present the design of the AgFlow algorithm (Algorithm 1) for obtaining the whole path of the loadings corresponding to each principal component of \(\ell _2\)-penalized PCA. Consider the \(j^{th}\) principal component. Let \({\mathbf {Y}}^j\) be the \(j^{th}\) principal subspace, which can be approximated by the Quasi-Principal Subspace Estimation Algorithm (QuasiPS) through the call \(\mathtt {QuasiPS({\mathbf {X}},j)}\) (Algorithm 2). The path of \(\ell _2\)-penalized PCA is the solution path of the Ridge regression in Eq. (4) with \(\lambda \) varying from \(+\infty \rightarrow 0^+\).
With the implicit regularization of Stochastic Gradient Descent (SGD) for Ordinary Least Squares (OLS) (Ali et al. 2020), the solution path is equivalent to the optimization path of the following OLS estimator using SGD with zero initialization, such that
$$\begin{aligned} \hat{\beta }^{j} = \mathop {\arg \min }_{\beta \in {\mathbb {R}}^{d}} \frac{1}{2n} \Vert {\mathbf {Y}}^{j} - {\mathbf {X}}\beta \Vert _2^2. \end{aligned}$$
(12)
More specifically, with a constant step size \(\eta > 0\), an initialization \(\beta ^j_0 = \mathbf {0}_d\), and mini-batch size m, every SGD iteration updates the estimate as follows,
$$\begin{aligned} \beta ^{j}_{k} = \beta ^{j}_{k-1} + \frac{\eta }{m} \sum _{i \in {\mathcal {I}}_{k}} \mathbf {x}_{i} \left( {\mathbf {Y}}^{j}_{i} - \mathbf {x}_{i}^{\top } \beta ^{j}_{k-1} \right) , \end{aligned}$$
(13)
where \({\mathcal {I}}_{k}\) denotes the random mini-batch of size m drawn at iteration k,
for \(k=1,2,\ldots , K\); thus the solutions on the (stochastic) gradient flow path for the \(j^{th}\) principal component can be obtained. According to Ali et al. (2020), the relationship between the explicit regularization \(\lambda \) and the implicit regularization effect introduced by SGD is \(\lambda \propto \frac{1}{k \sqrt{\eta }}\). Thus, with the total number of iteration steps K large enough, the proposed algorithm can cover the path of penalized PCA for a full range of \(\lambda \), but at a much lower computation cost.
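The mini-batch recursion above can be rendered as the following NumPy sketch (our own illustration; the synthetic subspace estimate and sampling-without-replacement scheme are assumptions). Every iterate is stored, since iteration k stands in for one candidate penalty \(\lambda \propto 1/(k\sqrt{\eta })\):

```python
import numpy as np

# SGD over OLS with zero initialization; the iterate sequence IS the
# (approximate) solution path of the l2-penalized problem.
rng = np.random.default_rng(4)
n, d, m = 60, 8, 10                   # m: mini-batch size
X = rng.standard_normal((n, d))
Yj = X @ rng.standard_normal(d)       # stands in for a principal-subspace estimate

eta, K = 1e-2, 500
beta = np.zeros(d)
path = []                             # one candidate model per iteration
for k in range(1, K + 1):
    idx = rng.choice(n, size=m, replace=False)
    grad = X[idx].T @ (X[idx] @ beta - Yj[idx]) / m
    beta = beta - eta * grad
    path.append(beta.copy())          # beta_k ~ ridge solution with lambda ~ 1/(k*sqrt(eta))

# early iterates are heavily regularized (small norm); later ones approach OLS
assert np.linalg.norm(path[0]) < np.linalg.norm(path[-1])
```

The whole K-model path costs \(O(K\cdot m\cdot d)\) gradient work here, with no matrix inverse at any point.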
In the model selection problem for \(\ell _2\)-penalized PCA based on the Ridge estimator, we need to select the optimal \({\hat{\beta }}(\lambda ^{*})\) corresponding to the optimal \(\lambda ^{*}\). Here we address the same model selection problem with an alternative algorithm, which uses AgFlow instead of the matrix inverse in the Ridge estimator; accordingly, we select the optimal \({\hat{\beta }}^{*}\) corresponding to the optimal \(k^{*}\), with \(k^{*} \propto \frac{1}{\lambda ^{*}\sqrt{\eta }}\). To obtain the optimal \(\lambda ^{*}\), k-fold cross-validation is usually applied over a search grid of \(\lambda \)s. Analogously, to obtain the optimal \(k^{*}\), the proposed AgFlow algorithm first computes the iterated projection vectors \({\hat{\beta }}_{k}\) on the given training data, where \({\hat{\beta }}_{k}\) corresponds to some \(\hat{\beta }(\lambda )\) with \(k \propto \frac{1}{\lambda \sqrt{\eta }}\), and then selects the optimal \(\beta ^{*}\) based on the performance on the validation data.
Finally, Algorithm 1 outputs the best projection matrix \(\hat{\varvec{\beta }}_{k^*} \in {\mathbb {R}}^{d \times d^{\prime }}\), which maximizes the evaluator \(\mathtt {Eval}_{k} = \mathtt {evaluator} \left( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}_{k}, h(k) \right) \) over \(k=1, \ldots , K\), where the index \(k \propto \frac{1}{\lambda \sqrt{\eta }}\) and each column of the projection matrix \(\hat{\varvec{\beta }}_{k}\) is a normalized projection vector with \(\Vert \hat{\beta }^{j}_{k}\Vert _2=1\). Note that, as discussed in the preliminaries in Sect. 2, when the sample covariance matrix \(\frac{1}{n}{\mathbf {X}}^\top {\mathbf {X}}\) is nonsingular (when \(n\gg d\)), there is no need to place any penalty, i.e., \(\lambda \rightarrow 0\) and \(k \rightarrow \infty \), since normalization removes the effect of the \(\ell _2\)-regularization (Zou et al. 2006) by the Karush–Kuhn–Tucker conditions. However, when \(d\gg n\), the sample covariance matrix \(\frac{1}{n}{\mathbf {X}}^\top {\mathbf {X}}\) becomes singular, and the Ridge-like estimator shrinks the covariance matrix as in Eq. (3), i.e., \(\lambda \ne 0\) and k is a finite integer, making the sample covariance matrix invertible and the results penalized in a covariance-regularization fashion (Witten et al. 2009). Even though normalization rescales the vectors to the \(\ell _2\)-ball, the regularization effect still remains.
Near-optimal initialization for the quasi-principal subspace
The goal of the QuasiPS algorithm is to approximate the principal subspace of PCA for a given data matrix at extremely low cost; AgFlow then fine-tunes the rough quasi-principal projection estimate (i.e., the loadings) and obtains the complete path of \(\ell _2\)-penalized PCA accordingly. While there are various low-complexity algorithms in this area, such as (Hardt and Price 2014; Shamir et al. 2015; De Sa et al. 2015; Balsubramani et al. 2013), we derive the Quasi-Principal Subspace (QuasiPS) estimator (Algorithm 2) using the stochastic algorithm proposed in Shamir et al. (2015). More specifically, Algorithm 2 first pursues a rough estimate of the \(j^{th}\) principal component projection (denoted as \(\tilde{w}_L\) after L iterations) using the stochastic approximation of Shamir et al. (2015), then obtains the quasi-principal subspace \({\mathbf {Y}}^j\) through the projection \({\mathbf {Y}}^j={\mathbf {X}}\tilde{w}_L\).
Note that \(\tilde{w}_L\) is not a precise estimate of the loadings corresponding to its principal component (compared to our algorithm and Oja and Karhunen (1985), etc.); however, it provides a close solution at extremely low cost. In this way, we consider \({\mathbf {Y}}^j={\mathbf {X}}\tilde{w}_L\) a reasonable estimate of the principal subspace. With a random unit initialization \(\tilde{w}_{0}\), \(\tilde{w}_{L}\) converges to the true principal projection \(\mathbf {v}^*\) at a fast rate under mild conditions, provided that \(\tilde{w}_{0}^\top \mathbf {v}^* \ge \frac{1}{\sqrt{2}}\) (Shamir et al. 2015). Thus, our setting is non-trivial.
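For intuition, a much-simplified stand-in for Algorithm 2 is a plain Oja-style stochastic update (Shamir's variance-reduced solver is more sophisticated; this sketch, with a planted dominant direction, only conveys the cheap-iteration idea and is not the paper's implementation):

```python
import numpy as np

# Rough principal-direction estimate via Oja-style stochastic iterations,
# then projection to get the quasi-principal subspace Y^j = X w.
rng = np.random.default_rng(5)
n, d = 200, 20
V = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal basis
scales = np.ones(d)
scales[0] = 5.0                                    # planted dominant direction V[:, 0]
X = rng.standard_normal((n, d)) * scales @ V.T

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
eta = 1e-2
for epoch in range(50):
    for i in rng.permutation(n):
        w += eta * X[i] * (X[i] @ w)               # cheap O(d) stochastic update
        w /= np.linalg.norm(w)                     # keep on the unit sphere

assert abs(w @ V[:, 0]) > 0.9                      # close to the true top direction
Yj = X @ w                                         # quasi-principal subspace
```

The resulting `Yj` is only a rough subspace estimate; AgFlow's path retrieval then refines the corresponding loadings over all effective penalties.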
Algorithm analysis
In this section, we analyze the proposed algorithm from the perspectives of statistical performance and computational complexity.
Statistical performance
The AgFlow algorithm consists of two steps: QuasiPS initialization and solution path retrieval. As the goal of our research is fast model selection over the complete solution path of \(\ell _2\)-penalized PCA with varying penalties, the performance analysis of the proposed algorithm can be decomposed into two parts.
Approximation of QuasiPS to the true principal subspace. In Algorithm 2, the \(\mathtt {QuasiPS}\) algorithm first obtains a QuasiPS projection \(\tilde{w}_L\) using L epochs of low-complexity stochastic approximation, then projects the samples \({\mathbf {X}}\) to get the \(j^{th}\) principal subspace \({\mathbf {Y}}^{j}\) via \({\mathbf {X}}\tilde{w}_L\).
Lemma 2
Under some mild conditions as in Shamir et al. (2015) and given the true principal projection \(w^*\), with probability at least \(1-{\mathrm {log}}_{2}(1/\varepsilon )\,\delta \), the distance between \(w^*\) and \(\tilde{w}_L\) satisfies
provided that \(L={\mathrm {log}}(1/\varepsilon )/{\mathrm {log}}(2/\delta )\).
This follows directly from Theorem 1 in Shamir et al. (2015). When \(\varepsilon \rightarrow 0\), \(\Vert \tilde{w}_L-w^*\Vert _2^2\rightarrow 0\) and the error bound becomes tight. Suppose the samples in \({\mathbf {X}}\) are i.i.d. realizations of the random variable X, and let \(\Sigma ^*={\mathbb {E}}[XX^\top ]\) denote the true covariance.
Theorem 1
Under some mild conditions as in Shamir et al. (2015), the distance between the QuasiPS and the true principal subspace satisfies
where \(\lambda _{\mathrm {max}}(\cdot )\) refers to the largest eigenvalue of a matrix.
Treating the largest eigenvalue \(\lambda _{\mathrm {max}}(\Sigma ^*)\) as a constant, QuasiPS achieves an exponential convergence rate for principal subspace approximation for every sample. Thus, the statistical performance of QuasiPS is guaranteed.
Approximation of the approximated stochastic gradient flow to the solution path of ridge. In Ali et al. (2019, 2020), the authors demonstrated that when the learning rate \(\eta \rightarrow 0\), the discrete-time SGD and GD algorithms diffuse to two continuous-time dynamics over (stochastic) gradient flows, i.e., \(\hat{\beta }_{\mathrm {sgf}}(t)\) and \(\hat{\beta }_{\mathrm {gf}}(t)\) over continuous time \(t>0\). According to Theorem 1 in Ali et al. (2019), the statistical risk between Ridge and continuous-time gradient flow is bounded by
where \({\mathrm {Risk}}(\beta _1,\beta _2)={\mathbb {E}}\Vert \beta _1-\beta _2\Vert ^2_2\), \(\beta ^*\) refers to the true estimator, and \(\lambda =1/t\) for Ridge. The stochastic gradient flow enjoys faster convergence but a slightly larger statistical risk, such that
where m refers to the batch size and \({o}\left( \frac{n}{m}\right) \) is an error term caused by the stochastic gradient noise. Under mild conditions, with discretization (\(\mathrm {d} t=\sqrt{\eta }\)), we consider the \(k^{th}\) iteration of SGD for Ordinary Least Squares, denoted \(\beta _k\), which tightly approximates \(\hat{\beta }_{\mathrm {sgf}}(t)\) up to an \({o}(\sqrt{\eta })\) approximation error with \(t=k\sqrt{\eta }\). In this way, given the learning rate \(\eta \) and the total number of iterations K, the implicit Ridge-like AgFlow screens the \(\ell _2\)-penalized PCA with \(\lambda \) varying in the range of
with bounded error in both statistics and approximation.
In this way, we conclude that under mild conditions, QuasiPS approximates the true principal subspace well (\(d^{\prime }\ll n\)) while AgFlow retrieves a tight approximation of the Ridge solution path.
Computational complexity
The proposed algorithm consists of two steps: initialization of the quasi-principal subspace and path retrieval. To obtain a fine estimate of QuasiPS and achieve the error bound in Eq. (14), one should run Shamir's algorithm (Shamir et al. 2015) with
iterations, where \({\mathrm {rank}}(\cdot )\) refers to the matrix rank, \({\mathrm {eigengap}}(\cdot )\) refers to the gap between the first and second eigenvalues, and \(\varepsilon \) has been defined in Lemma 2 referring to the error of principal subspace estimation.
Furthermore, to get the loadings corresponding to the \(j^{th}\) principal subspace, AgFlow uses K iterations of OLS to obtain the estimates of K models for \(\ell _2\)-penalized PCA, where each iteration only consumes \(O(m\cdot d^2)\) with batch size m, i.e., \(O(K\cdot m \cdot d^2)\) in total for K models, and \(O(d^{\prime } \cdot K\cdot m \cdot d^2)\) in total for the reduced dimension \(d^{\prime }\). Moreover, we also propose to run \(\texttt {AgFlow}\) with full batch size \(m=n\) using gradient descent per iteration, which only consumes \(O(d^2)\) per iteration with lazy evaluation of \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}\), i.e., \(O(K\cdot d^2)\) in total for K models and \(O(d^{\prime } \cdot K \cdot d^2)\) for the reduced dimension \(d^{\prime }\).
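The lazy-evaluation trick can be sketched as follows (a hedged NumPy illustration on toy data): after caching \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}\) once, each full-batch gradient step touches only \(d \times d\) quantities and never re-scans the data:

```python
import numpy as np

# Precompute G = X^T X / n and g = X^T Y / n once (O(n d^2)); afterwards
# every full-batch gradient step costs O(d^2), independent of n.
rng = np.random.default_rng(6)
n, d = 100, 12
X = rng.standard_normal((n, d))
Yj = X @ rng.standard_normal(d)

G = X.T @ X / n                       # cached Gram matrix
g = X.T @ Yj / n                      # cached cross term

eta, K = 1e-2, 1000
beta_lazy = np.zeros(d)
for _ in range(K):
    beta_lazy += eta * (g - G @ beta_lazy)        # O(d^2), no pass over X

beta_direct = np.zeros(d)
for _ in range(K):
    beta_direct += eta * X.T @ (Yj - X @ beta_direct) / n   # O(n d) per step

# mathematically identical paths
assert np.allclose(beta_lazy, beta_direct)
```

The two loops trace the same iterate sequence; only the per-step cost differs, which is what makes the K-model path affordable.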
To further improve AgFlow without incurring higher-order complexity, we carry out the experiments by running a mini-batch AgFlow, and a full-batch AgFlow (i.e., \(m=n\)) with lazy evaluation of \({\mathbf {X}}^\top {\mathbf {X}}\) and \({\mathbf {X}}^\top {\mathbf {Y}}\), in parallel for model selection.
Experiments
In this section, we present experiments on real-world datasets with a significantly large number of features, which fit naturally into the High Dimension Low Sample Size (HDLSS) setting. Since cancer classification remains a great challenge in microarray technology, we apply our algorithm to gene expression datasets. In particular, in addition to three publicly available gene expression datasets (Zhu et al. 2007), the well-known FACES dataset (Huang et al. 2008) is also considered in our study. A brief overview of these four datasets is summarized in Table 1.
Experiment setups
Evaluation procedure of the AgFlow algorithm. There are two regimes for demonstrating the performance of the proposed model selection method. The first is to evaluate the accuracy of the AgFlow algorithm based on k-fold cross-validation, which we call evaluation-based model selection; the second is to perform prediction based on a given training-validation-testing split, consisting of three steps, i.e., model selection, model evaluation and prediction, which we call prediction-based model selection. In real-world applications, prediction-based model selection is typically used, where the testing set is unseen in advance. The preceding step is to split the raw data: into a training-validation set for cross-validation in evaluation-based model selection, or into a training-validation-testing set for prediction-based model selection. There are then two main steps: first, obtain the projection matrix of the training data using the AgFlow algorithm; second, apply the projection matrix to the validation/testing set.
Here we take prediction-based model selection as an example. To perform model selection with the AgFlow algorithm, we first obtain the projection matrix flow of the given training set by running the AgFlow algorithm, i.e., \(\hat{\varvec{\beta }}_{k} \in {\mathbb {R}}^{d \times d^{\prime }}\) for \(k=1,\ldots , K\). Then the dimension-reduced training/validation/testing data matrix flow is obtained by matrix multiplication, e.g. \(\tilde{{\mathbf {X}}}_{\text {train}}(k) = {\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}_{k}\), where \(\hat{\varvec{\beta }}_{k} = [\hat{\beta }^{1}_{k}, \hat{\beta }^{2}_{k}, \ldots , \hat{\beta }^{d^{\prime }}_{k}] \in {\mathbb {R}}^{d \times d^{\prime }}\). Each column \(\hat{\beta }^{j}_{k}\) is the \(j^{th}\) projection vector, i.e., the loadings of the \(j^{th}\) principal component, which approximates \(\ell _2\)-penalized PCA with tuning parameter \(\lambda \) under the calibration \(\lambda \propto 1/(k \sqrt{\eta })\). The dimension-reduced training data matrix flow is then fed into the target learner \(h(k) = \mathtt {model}({\mathbf {X}}_{\text {train}} \hat{\varvec{\beta }}_{k})\) for performance tuning, which outputs models h(k) for \(k=1, \ldots , K\). Lastly, the dimension-reduced validation data matrix flow is used to choose the model with the best performance according to the evaluator \(\mathtt {evaluator}( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}_{k}, h(k))\) for \(k=1,\ldots , K\), which gives the optimal \(\hat{\varvec{\beta }}^{*}=\mathop {\arg \max }_{\hat{\varvec{\beta }}_{k}}\mathtt {evaluator}( {\mathbf {X}}_{\text {val}} \hat{\varvec{\beta }}_{k}, h(k))\) and the optimal \(k^{*}\). Note that each data matrix in the flow carries the implicit regularization introduced by the AgFlow algorithm, which corresponds to an explicit penalty in Ridge.
Under the calibration \(\lambda \propto 1/(k \sqrt{\eta })\), we have \(\hat{\varvec{\beta }}_{k}\approx \hat{\varvec{\beta }}(\lambda )\) and \(\hat{\varvec{\beta }}^{*}\approx \hat{\varvec{\beta }}(\lambda ^{*})\) with \(\lambda ^{*} \propto 1/(k^{*} \sqrt{\eta })\); thus we can perform model selection directly from the results of the AgFlow algorithm.
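The prediction-based selection loop above can be sketched as follows. All names here are illustrative, and the k-nearest-neighbour classifier merely stands in for the unspecified target learner \(\mathtt {model}(\cdot )\); the paper's experiments also use other classifiers such as Random Forests.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def select_model(path, X_train, y_train, X_val, y_val):
    """Choose the stopping index k* with the best validation accuracy.

    `path` stacks the loading matrices beta_k, shape (K, d, d'); each
    index k plays the role of one explicit ridge penalty lambda.
    """
    best_k, best_acc = 0, -1.0
    for k, beta_k in enumerate(path):
        h_k = KNeighborsClassifier(n_neighbors=3)  # h(k) = model(X_train beta_k)
        h_k.fit(X_train @ beta_k, y_train)
        # evaluator(X_val beta_k, h(k)): validation accuracy of model h(k)
        acc = accuracy_score(y_val, h_k.predict(X_val @ beta_k))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, path[best_k]        # k* and beta* = beta_{k*}
```

The returned index \(k^{*}\) then maps back to an explicit penalty via the calibration \(\lambda ^{*} \propto 1/(k^{*}\sqrt{\eta })\).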
Settings of the AgFlow algorithm.

Construction of the training-validation-testing set. For the above four datasets, we randomly split the raw data samples into training-validation-testing sets with a fixed split ratio of \(60\%\)/\(20\%\)/\(20\%\) within each class. The sample sizes of the training-validation-testing sets are then (240, 80, 80), (37, 12, 13), (43, 14, 15) and (35, 11, 14) for the FACES, Colon Tumor, ALL-AML Leukemia and Central Nervous System data, respectively. The dimension-to-sample-size ratio d/n of the training set is thus 17.1, 54.1, 165.8 and 203.7, accordingly.
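A per-class 60/20/20 split of this kind can be reproduced with a stratified two-stage split, sketched below; the helper name and random seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=0):
    """Stratified 60/20/20 train/validation/test split within each class."""
    # First carve off 40% (stratified), then halve it into val and test.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```

Stratification preserves the class proportions in each of the three subsets, which matters for the small gene expression datasets.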

Settings of default parameters. For the default parameters in AgFlow, the number of iterations is set to \(K=5000\), the step size to \(\eta = 0.5\times 10^{-4}\), the batch size to \(\min (100, n/2)\), and the reduced dimension to \(d^{\prime }=30\).
For the values of the explicit regularization parameter \(\lambda \) in Ridge, the \(\ell _2\)-penalized PCA, we take 100 values on a log scale ranging from \(10^{-4}\) to \(10^{4}\) as the search grid. For the default parameters in \(\texttt {QuasiPS}({\mathbf {X}}, j)\), we take the same default values as specified in the original paper (Shamir et al. 2015), where the step size is \(\eta _2 = \frac{1}{\bar{r}\sqrt{n}}\) with \(\bar{r} = \frac{1}{n}\sum _{i=1}^{n} \Vert {\mathbf {x}}_i\Vert _2^2\), the epoch length is \(M=n\), and the number of iterations is \(L=100\).
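The two penalty grids being matched can be sketched as follows; the proportionality constant in the calibration is left at 1, which is an assumption for illustration only.

```python
import numpy as np

eta, K = 0.5e-4, 5000
# Explicit ridge penalty grid: 100 values, log scale, 1e-4 .. 1e4.
lambdas = np.logspace(-4, 4, 100)

# Implicit grid from the AgFlow path, via lambda ∝ 1/(k * sqrt(eta)):
# early stopping (small k) corresponds to heavy penalization (large lambda).
ks = np.arange(1, K + 1)
implicit_lambdas = 1.0 / (ks * np.sqrt(eta))
```

Note that the implicit grid is K = 5000 points long, far denser than the 100-point explicit grid, which is what gives AgFlow its fine-grained screening of penalties.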
Baseline PCA algorithms. To demonstrate the performance of the AgFlow algorithm, we compare the results with other comparable methods: Oja's method (Oja and Karhunen 1985), power iteration (Golub and Van Loan 2013), Shamir's variance reduction method (Shamir et al. 2015), vanilla PCA (Jolliffe 1986), and Ridge-based PCA (Zou et al. 2006) (two variants: the closed-form ridge estimator in Eq. (3), Ridge_C, and the one based on scikit-learn solvers, Ridge_S).
Overall comparisons of model selection
In this section, we evaluate the performance of the proposed AgFlow algorithm and compare it with the baseline algorithms (especially the Ridge-based estimator) on the FACES data and three gene expression datasets: Colon Tumor, ALL-AML Leukemia, and Central Nervous System. In all these experiments, the training sets have a limited number of samples and a significantly large number of features. For example, d/n ranges from roughly 17 to 204 across the four datasets, which is significantly larger than one. The learning problem thus becomes ill-posed, and the models easily overfit the small training sets. Model selection with the validation set becomes crucial to improving performance.
Figure 2 presents the overall performance comparison on the dimension reduction problem between AgFlow and the other baseline algorithms on the FACES dataset, where classification accuracy with the dimension-reduced data is used as the metric. As only AgFlow and Ridge are capable of estimating penalized PCA models for model selection, in Fig. 2 we select the best models of both AgFlow and Ridge in terms of validation accuracy. For a fair comparison, we compare AgFlow with Ridge for model selection over a similar range of penalties (\(\lambda \)) under a similar computation-time budget, while ensuring that the time spent by the AgFlow algorithm is much shorter than that of Ridge (see Table 2 for the time consumption comparison between AgFlow and Ridge).
Under such critical HDLSS settings, all algorithms usually work poorly, while AgFlow outperforms them in most cases. Furthermore, Shamir's method (Shamir et al. 2015), Oja's method (Oja and Karhunen 1985), the power iteration method and the vanilla PCA based on SVD all achieve similar performance in these settings; these algorithms appear to have reached the best performance achievable by the unbiased PCA estimator without any regularization under ill-posed, HDLSS settings. The comparison between AgFlow and the unbiased PCA estimators demonstrates the performance improvement contributed by the implicit regularization effects (Ali et al. 2020) and the potential of model selection with validation accuracy. Furthermore, the comparison between AgFlow and Ridge indicates that the implicit regularization effect of SGD gives the model estimator higher stability than Ridge in estimating penalized PCA under HDLSS settings, as the matrix inverse used in Ridge is unstable when the model is ill-posed (Haber et al. 2008). Moreover, the continuous trace of SGD gives the model selector more flexibility than Ridge in screening massive numbers of models under varying penalties with fine-grained granularity.
Figure 3 compares the validation and testing accuracy of the different dimension reduction methods on the different datasets, including AgFlow and the baseline algorithms: vanilla PCA based on SVD (Jolliffe 1986), Oja's stochastic PCA method (Oja and Karhunen 1985), the power iteration method (Golub and Van Loan 2013), Shamir's variance reduction method (Shamir et al. 2015), and Ridge-based PCA (Zou et al. 2006). Figure 3 shows that, on the Colon Tumor and Central Nervous System gene expression datasets, the AgFlow algorithm outperforms the other baseline algorithms by an overwhelming margin in both validation and testing accuracy. On the FACES dataset, AgFlow gains little advantage because all algorithms achieve an accuracy above \(90\%\), so the improvement is less than \(5\%\). On the ALL-AML dataset, the performance of the algorithms varies considerably; AgFlow is still the best with respect to validation accuracy, though not with respect to testing accuracy. The reason may be that, with a single training-validation-testing split, there is some variability in the data splitting, made worse by the limited sample size; this also explains why the testing accuracy exceeds the validation accuracy for some algorithms.
Thus, based on the comparisons of different dimension reduction methods on the same data with a given classifier in Fig. 2 and the comparisons across datasets in Fig. 3, we conclude that AgFlow is more effective than Ridge at estimating massive numbers of models and selecting the best model for penalized PCA, under the same or even stricter budget constraints. We also present comparison results on the different datasets in Fig. 3 using various classifiers. Similar results are obtained: Ridge works well as more samples are provided, and AgFlow outperforms the Ridge estimator in most cases.
Comparisons of time consumption and performance tuning
Table 2 reports the time consumption of the AgFlow algorithm and the Ridge-based algorithms over varying penalties on the four datasets. The time used by the AgFlow algorithm is only a small fraction of that of Ridge_S and Ridge_C, the two versions of the Ridge-based algorithms. When the sample size and the number of predictors are both small, as in the Colon Tumor dataset with \((n,d) =(62, 2000)\), the time consumption is acceptable for both AgFlow and the Ridge-based algorithms. However, when the dimension becomes extremely large, as in the ALL-AML dataset with \((n,d) =(72, 7129)\) or the Central Nervous System data with \((n,d) =(60, 7129)\), the time consumption of Ridge_S and Ridge_C becomes dramatically large. For example, when \(d^{\prime }=30\) on the ALL-AML dataset, Ridge_S requires more than 12.39 hours, which is unacceptable in practical applications, whereas the AgFlow algorithm requires 14 minutes, a dramatic reduction in computation time.
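For contrast, a minimal sketch of the closed-form Ridge path is shown below (the helper name is illustrative). Each penalty value requires its own \(O(d^3)\) linear solve, which is exactly the per-model cost that AgFlow avoids.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Closed-form ridge estimates over an explicit penalty grid.

    Every lambda needs its own solve of (X^T X + lambda * I) beta = X^T y,
    an O(d^3) operation, so a grid of G penalties costs O(G * d^3):
    the expense that makes the Ridge-based path slow when d is large.
    """
    d = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    return np.stack([np.linalg.solve(XtX + lam * np.eye(d), Xty)
                     for lam in lambdas])
```

With \(d=7129\) as in the ALL-AML data, a single such solve is already expensive, and the cost is paid again for every point on the penalty grid.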
More specifically, when considering the time consumption of AgFlow and Ridge-Path for the above performance tuning procedure, AgFlow is much more efficient. Table 2 shows that AgFlow consumes only 246 seconds to obtain the estimates of 10,000 penalized PCA models when \(d^{\prime }=30\) on the Colon Tumor data with \(d=2000\) genes, and 851 seconds on the ALL-AML Leukemia data with \(d=7129\) genes, while Ridge-Path needs 1226/1276 seconds to obtain only 100 penalized PCA models on the Colon Tumor data and 44,606/43,268 seconds on the ALL-AML Leukemia data, whether using closed-form Ridge estimators or solver-based ones.
Figure 4 illustrates examples of performance tuning using Ridge-Path and AgFlow over varying penalties with Random Forest classifiers. While AgFlow estimates \(\ell _2\)-penalized PCA with varying penalty by stopping the SGD optimizer after different numbers of iterations, Ridge-Path needs to shrink the sample covariance matrix with varying \(\lambda \) and estimate \(\ell _2\)-penalized PCA through the time-consuming matrix inverse. Clearly, both AgFlow and Ridge-Path have a certain capacity to screen models with different penalties.
In conclusion, AgFlow demonstrates both efficiency and effectiveness in model selection for penalized PCA, in comparison with a wide range of classic and recent algorithms (Zou et al. 2006; Shamir et al. 2015; Oja and Karhunen 1985; Golub and Van Loan 2013). Note that the classification accuracy on some tasks here may not be as good as that reported in Zhu et al. (2007). Our goal is to compare the performance of \(\ell _2\)-penalized PCA model selection with classification accuracy as the selection objective, whereas Zhu et al. (2007) focus on selecting a discriminative set of features for classification.
Conclusions
Since PCA is widely used for data processing, feature extraction and dimension reduction in unsupervised data analysis, we have proposed the AgFlow algorithm for fast model selection in \(\ell _2\)-penalized PCA with much lower complexity; the regularization is usually incorporated to handle the multicollinearity and singularity issues encountered under HDLSS settings. Experiments show that our AgFlow algorithm beats the existing methods with overwhelming improvements in accuracy and computational complexity, especially compared with the Ridge-based estimator, which requires a time-consuming model estimation and selection procedure over a wide range of penalties with matrix inverses. Meanwhile, the proposed AgFlow algorithm naturally retrieves the complete solution path of each principal component, which exhibits implicit regularization and enables model estimation and selection simultaneously. Thus we can identify the best model through an end-to-end optimization procedure with low computational complexity. In addition to the advantages in accuracy and computational complexity, AgFlow enlarges the capacity of performance tuning in a more intuitive and easier way. The observations back up our claims.
Future work
Although the AgFlow algorithm naturally retrieves the complete solution path of each principal component and enables model selection under the implicit \(\ell _2\)-norm regularization effect, a linear combination of all the original variables is often hard to interpret. New methods with implicit or explicit \(\ell _1\)-norm regularization (lasso penalty) are in great demand, as \(\ell _1\)-norm regularization produces sparse solutions with which we can perform variable estimation and selection simultaneously.
In addition to the approximated gradient flow, we are also interested in the implicit regularization introduced by other (stochastic) optimizers, such as Adam and/or Nesterov's momentum methods, with potential new applications to Markov Chain Monte Carlo and other statistical computations. Furthermore, the implicit regularization of AgFlow applied to nonlinear models for statistical inference would also be interesting.
References
Ali, A., Dobriban, E., & Tibshirani, R. J. (2020). The implicit regularization of stochastic gradient flow for least squares. In International conference on machine learning (pp. 233–244). PMLR.
Ali, A., Kolter, J. Z., & Tibshirani, R. J. (2019). A continuoustime view of early stopping for least squares regression. In The 22nd international conference on artificial intelligence and statistics (pp 1370–1378).
Arora, R., Cotter, A., Livescu, K., & Srebro, N. (2012). Stochastic optimization for PCA and PLS. In 2012 50th annual allerton conference on communication, control, and computing (allerton) (pp. 861–868). IEEE.
Balsubramani, A., Dasgupta, S., & Freund, Y. (2013). The fast convergence of incremental PCA. In F. Bach, & D. Blei (Ed.), Advances in neural information processing systems (pp. 3174–3182).
Candes, E., Tao, T., et al. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
De Sa, C., Re, C., & Olukotun, K. (2015). Global convergence of stochastic gradient descent for some nonconvex matrix problems. In International conference on machine learning (pp. 2332–2341).
Dutta, A., Hanzely, F., & Richtárik, P. (2019). A nonconvex projection method for robust PCA. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 1468–1476).
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Friedman, J., & Popescu, B. E. (2003). Gradient directed regularization for linear regression and classification. Technical report, Statistics Department, Stanford University.
Friedman, J., & Popescu, B. E. (2004). Gradient directed regularization. Unpublished manuscript. http://wwwstat.stanford.edu/hf/ftp/pathlite.pdf. Accessed 24 June 2021.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Gaynanova, I., Booth, J. G., & Wells, M. T. (2017). Penalized versus constrained generalized eigenvalue problems. Journal of Computational and Graphical Statistics, 26(2), 379–387.
Golub, G. H., & Van Loan, C. F. (2013). Matrix computations (4th ed.). Baltimore: Johns Hopkins University Press.
Haber, E., Horesh, L., & Tenorio, L. (2008). Numerical methods for experimental design of largescale linear illposed inverse problems. Inverse Problems, 24(5), 055012.
Hardt, M., & Price, E. (2014). The noisy power method: a meta algorithm with applications. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2861–2869). MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Berlin: Springer.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
Hoerl, A. E., Kannard, R. W., & Baldwin, K. F. (1975). Ridge regression: Some simulations. Communications in Statistics - Theory and Methods, 4(2), 105–123.
Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition.
Jolliffe, I. T. (1986). Principal components in regression analysis. In I. T. Jolliffe (Ed.), Principal component analysis (pp. 129–155). Springer.
LeCun, Y., Jackel, L. D., Bottou, L., Brunot, A., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Muller, U. A., Sackinger, E., & Simard, P. (1995). Comparison of learning algorithms for handwritten digit recognition. In International conference on artificial neural networks (Vol. 60, pp. 53–60). Australia: Perth.
Lee, Y. K., Lee, E. R., & Park, B. U. (2012). Principal component analysis in very highdimensional spaces. Statistica Sinica, 22(3), 933–956.
Mitliagkas, I., Caramanis, C., & Jain, P. (2013). Memory limited, streaming PCA. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani & K.Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2886–2894).
Mohammed, A. A., Minhas, R., Jonathan Wu, Q. M., & SidAhmed, M. A. (2011). Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recognition, 44(10–11), 2588–2597.
Oja, E., & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1), 69–84.
Shamir, O. (2015). A stochastic PCA and SVD algorithm with an exponential convergence rate. In International conference on machine learning (pp. 144–152). PMLR.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Witten, D. M., & Tibshirani, R. (2009). Covarianceregularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 615–636.
Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.
Yeung, K. Y., & Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17(9), 763–774.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zhu, Z., Ong, Y.S., & Dash, M. (2007). Markov blanketembedded genetic algorithm for gene selection. Pattern Recognition, 40(11), 3236–3248.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.
Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Jiang, H., Xiong, H., Wu, D. et al. AgFlow: fast model selection of penalized PCA via implicit regularization effects of gradient flow. Mach Learn 110, 2131–2150 (2021). https://doi.org/10.1007/s10994021060253
Keywords
 Model selection
 Gradient flow
 Implicit regularization
 Penalized PCA
 Ridge