Abstract
We study two learning algorithms generated by kernel partial least squares (KPLS) and kernel minimal residual (KMR) methods. In these algorithms, regularization against overfitting is obtained by early stopping, which makes stopping rules crucial to their learning capabilities. We propose a stopping rule for determining the number of iterations based on cross-validation, without assuming a priori knowledge of the underlying probability measure, and show that optimal learning rates can be achieved. Our novel analysis rests on a bound for the number of iterations in the a priori knowledge-based stopping rule for KMR and on a stepping stone from KMR to KPLS. Technical tools include a recently developed integral operator approach based on a second-order decomposition of inverse operators and an orthogonal polynomial argument.
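To make the early stopping scheme concrete, the sketch below pairs a simple kernel iteration with a hold-out cross-validation stopping rule. This is an illustration only, not the authors' KPLS/KMR algorithm: plain kernel gradient descent (as studied in [26]) stands in for the iterative method, and all function names, the Gaussian kernel, and the parameter values are our own choices.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # Pairwise Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - z_j||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gradient_descent_path(K, y, step, max_iter):
    # Kernel gradient descent: alpha_{t+1} = alpha_t + step * (y - K alpha_t) / n.
    # Returns the whole coefficient path; the stopping rule picks one iterate.
    n = len(y)
    alpha = np.zeros(n)
    path = []
    for _ in range(max_iter):
        alpha = alpha + step * (y - K @ alpha) / n
        path.append(alpha.copy())
    return path

def cv_stopping_rule(X, y, gamma, step, max_iter, val_frac=0.3, seed=0):
    # Hold-out cross-validation: run the iteration on a training split and
    # choose the iteration number minimizing the validation error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    K_tr = gaussian_kernel(X[tr], X[tr], gamma)
    K_val = gaussian_kernel(X[val], X[tr], gamma)
    path = gradient_descent_path(K_tr, y[tr], step, max_iter)
    errs = [np.mean((K_val @ a - y[val]) ** 2) for a in path]
    return int(np.argmin(errs)) + 1  # number of iterations chosen by CV

# Toy regression problem: y = sin(2*pi*x) + noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
m_hat = cv_stopping_rule(X, y, gamma=10.0, step=1.0, max_iter=200)
print("iterations chosen by CV:", m_hat)
```

Because the iterates overfit as the iteration count grows, the validation error typically first decreases and then increases, so its minimizer serves as a data-driven substitute for a priori knowledge of the regularity of the target function.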
References
Bauer, F., Pereverzev, S., Rosasco, L.: On regularization algorithms in learning theory. J. Complex. 23, 52–72 (2007)
Blanchard, G., Krämer, N.: Kernel partial least squares is universally consistent. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR Workshop &amp; Conference Proceedings, vol. 9, pp. 57–64 (2010)
Blanchard, G., Krämer, N.: Optimal learning rates for kernel conjugate gradient regression. In: NIPS, pp. 226–234 (2010)
Caponnetto, A., De Vito, E.: Optimal rates for the regularized least squares algorithm. Found. Comput. Math. 7, 331–368 (2007)
Caponnetto, A., Yao, Y.: Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8, 161–183 (2010)
Chun, H., Keles, S.: Sparse partial least squares for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B 72, 3–25 (2010)
De Vito, E., Pereverzyev, S., Rosasco, L.: Adaptive kernel methods using the balancing principle. Found. Comput. Math. 10, 455–479 (2010)
Engl, H., Hanke, M., Neubauer, A.: Regularization of Inverse Problems. Kluwer Academic, Amsterdam (2000)
Evgeniou, T., Pontil, M., Poggio, T.: Regularization networks and support vector machines. Adv. Comput. Math. 13, 1–50 (2000)
Guo, X., Zhou, D.X.: An empirical feature-based learning algorithm producing sparse approximations. Appl. Comput. Harmon. Anal. 32, 389–400 (2012)
Guo, Z.C., Lin, S.B., Zhou, D.X.: Optimal learning rates for spectral algorithm. Inverse Problems (minor revision under review) (2016)
Hanke, M.: Conjugate gradient type methods for ill-posed problems. Pitman Research Notes in Mathematics Series, vol. 327 (1995)
Hu, T., Fan, J., Wu, Q., Zhou, D.X.: Regularization schemes for minimum error entropy principle. Anal. Appl. 13, 437–455 (2015)
Li, S., Liao, C., Kwok, J.: Gene feature extraction using T-test statistics and kernel partial least squares. In: Neural Information Processing, pp. 11–20. Springer, Berlin (2006)
Lin, S.B., Guo, X., Zhou, D.X.: Distributed learning with regularization schemes. J. Mach. Learn. Res. (Revision under review) (2016). arXiv:1608.03339
Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E., Verri, A.: Spectral algorithms for supervised learning. Neural Comput. 20, 1873–1897 (2008)
Raskutti, G., Wainwright, M., Yu, B.: Early stopping and non-parametric regression: an optimal data-dependent stopping rule. J. Mach. Learn. Res. 15, 335–366 (2014)
Rosipal, R., Trejo, L.: Kernel partial least squares regression in reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 2, 97–123 (2001)
Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26, 153–172 (2007)
Wold, H.: Path models with latent variables: the NIPALS approach. In: Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pp. 307–357. Academic Press, New York (1975)
Wu, Q., Ying, Y.M., Zhou, D.X.: Learning rates of least square regularized regression. Found. Comput. Math. 6, 171–192 (2006)
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26, 289–315 (2007)
Acknowledgements
The work described in this paper is partially supported by the NSFC/RGC Joint Research Scheme [RGC Project No. N_CityU120/14 and NSFC Project No. 11461161006] and by the National Natural Science Foundation of China [Grant Nos. 61502342 and 11471292]. The paper was written while the second author visited Shanghai Jiao Tong University, a visit jointly sponsored by the Ministry of Education of China; the hospitality is gratefully acknowledged.
Additional information
Communicated by Massimo Fornasier.
Appendix
This appendix provides five technical lemmas and proofs of two propositions concerning the a priori knowledge-based learning algorithms. The first lemma, concerning the range of the operator \(L_{K,D}\), is well known in the literature [1, 3, 10, 16].
Lemma 1
Denote \({\mathscr {H}}_{K, \mathbf{x}} = \hbox {span}\{K(\cdot , x_i)\}_{i=1}^N\) for \(D \in {\mathscr {Z}}^N\). Then \(f_{K,D} \in {\mathscr {H}}_{K, \mathbf{x}}\). The space \({\mathscr {H}}_{K, \mathbf{x}}\) equals the range of \(L_{K,D}\) and is spanned by all eigenfunctions of \(L_{K,D}\) with positive eigenvalues. Its dimension equals the rank \(d_\mathbf{x}\) of the Gramian matrix \({\mathbb K}\).
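For orientation, Lemma 1 can be read through the standard definition of the empirical integral operator used in this literature (e.g. [23]); the display below is our sketch of that standard identity, not a formula reproduced from the paper:

```latex
L_{K,D} f \;=\; \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, K(\cdot, x_i),
\qquad f \in \mathscr{H}_K .
```

Every image \(L_{K,D}f\) is thus a linear combination of the kernel sections \(K(\cdot ,x_i)\), so the range of \(L_{K,D}\) lies in \({\mathscr {H}}_{K,\mathbf{x}}\); since both the rank of \(L_{K,D}\) and the dimension of \({\mathscr {H}}_{K,\mathbf{x}}\) equal the rank \(d_\mathbf{x}\) of \({\mathbb K}\), the two spaces coincide.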
The following two lemmas found in [12] describe some properties of \(p_m^{[u]}\) and \(q_{m-1}^{[u]}\).
Lemma 2
Let \(m\in \mathbb {N}_0\). The following identities hold:
where \((p^{[1]}_{m+1})'(0)\ne (p^{[0]}_{m+1})'(0)\).
The above three identities are stated in Corollary 2.6, Corollary 2.9, and Proposition 2.8 of [12].
Lemma 3
Let \(u\in \{0,1,2\}\), \(m\in \mathbb {N}\), and let \(\{t^{[u]}_{k,m}\}_{k=1}^m\) be the simple zeros of \(p^{[u]}_m\) in increasing order. Then the following statements hold:
The first two statements above are stated in Corollary 2.7 of [12], while the last two follow from the first statement and the representation of \(p_m^{[u]}\) in terms of its constant term 1 and zeros as \(p_m^{[u]}(t)=\prod _{k=1}^m \bigl(1 - t/t^{[u]}_{k,m}\bigr)\).
The fourth lemma focuses on bounding \(\mathscr {Q}_{D,\lambda }, \mathscr {P}_{D,\lambda }\) and \(\mathscr {R}_D\).
Lemma 4
Let D be a sample drawn independently according to \(\rho \) and \(0< \delta <1\). Then each of the following estimates holds with confidence at least \(1-\delta \),
The proofs of (45) and (46) can be found in [4, 22], and that of (44) in [11].
The last lemma refers to a concentration inequality stated in [5, Proposition 11].
Lemma 5
Let \(\{\xi _i\}_{i=1}^n\) be a sequence of real valued independent random variables with mean \(\mu \), satisfying \(|\xi _i|\le B\) and \( E[(\xi _i-\mu )^2]\le \tau ^2\) for \(i\in \{1,2,\dots ,n\}\). Then for any \(a>0\) and \(\varepsilon >0\), there hold
and
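The two inequalities displayed in Lemma 5 are not reproduced here. For reference, the classical one-sided Bernstein inequality under these assumptions takes the form below; this is our sketch of the standard bound (assuming, as in the classical statement, that the centered variables are bounded by \(B\)), not necessarily the exact variant involving the parameter \(a\) stated in [5, Proposition 11]:

```latex
\mathrm{Prob}\left\{ \frac{1}{n}\sum_{i=1}^{n} \xi_i - \mu \ge \varepsilon \right\}
\;\le\; \exp\!\left( - \frac{n\varepsilon^2}{2\bigl(\tau^2 + \tfrac{1}{3} B \varepsilon \bigr)} \right),
```

together with the same bound for the lower deviation \(\mu - \frac{1}{n}\sum_{i=1}^{n}\xi _i\).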
With the help of the above lemmas, we can now prove Propositions 1 and 2.
Proof of Proposition 1
We start with proving the first statement (24). Since \(p^{[1]}_{0}(t)=1\) and \((p^{[1]}_{0})'(t)=0\) for all \(t\in [0,\kappa ^2]\), (24) holds obviously for \(\hat{m}=1\). It then suffices to prove (24) for \(\hat{m}\ge 2\). It was presented in [12, p. 41] (see also [3, p. 16]) that
Here \(F_{t^{[1]}_{1,\hat{m}-1}} \phi ^{[1]}_{\hat{m}-1}(L_{K,D})\) is the linear operator on \({\mathscr {H}}_K\) defined in terms of the orthonormal basis \(\{\phi _j^\mathbf{x}\}_j\) and the orthogonal projection \(F_{t^{[1]}_{1,\hat{m}-1}}\) by spectral calculus as
where \(\phi ^{[1]}_{\hat{m}-1}(t)\) is the function defined on \([0, t^{[1]}_{1, \hat{m}-1})\) by
Then we decompose \(f_{K,D}\) as \(f_{K,D} - L_{K,D }f_\rho + L_{K,D }f_\rho \) and bound the norm as
We continue our estimates by bounding the first term I. Applying (26) with \(\alpha =1/2\) gives
Furthermore, the representation (43) for \(p^{[1]}_{\hat{m}-1}\) and the definition of \(\phi ^{[1]}_{\hat{m}-1}\) yield
It was shown in [12, Equation (3.10)] that for an arbitrary \(\nu >0\),
Combining the above three bounds yields an estimate for the first term of (47) as
We now turn to the second term II of (47). By the regularity condition (7) for \(f_\rho = L_K^r h_\rho \) and the identity \(\Vert L_K^{1/2} h_\rho \Vert _K = \Vert h_\rho \Vert _\rho \), we find
where for simplicity we denote the norm as
When \(1/2\le r\le 3/2\), we express \((L_K + \lambda I)^{r-1/2}\) as \((L_{K, D} + \lambda I)^{r-1/2} (L_{K, D} + \lambda I)^{1/2 -r}(L_K + \lambda I)^{r-1/2}\) and apply (26) with \(\alpha =r-1/2\) to get
When \(r>3/2\), we decompose the operator \((L_K + \lambda I)^{r-1/2}\) in \(\widetilde{II}\) as
The bounds \(\Vert L_{K,D}\Vert \le \kappa ^2\), \(\Vert L_K\Vert \le \kappa ^2\), and the Lipschitz property of the function \(x\mapsto x^{r-1/2}\) on \([0,\kappa ^2]\) (valid with constant \((r-1/2)\kappa ^{2r-3}\) since \(r-1/2>1\)) imply
Hence
Combining this with (51) and the following norm estimate with \(\gamma , \beta \ge 0\),
derived from spectral calculus and the inequality (48), we have with the notation \({\mathscr {I}} = \lambda |(p^{[1]}_{{\hat{m}-1}})' (0)|\),
This together with (50), the bound (49) for I, (47), and the definition (16) of the quantity \(\Lambda _{\rho , \lambda , r}\) tells us that
On the other hand, \(\hat{m}\ge 2\) is the smallest nonnegative integer satisfying (16), so for the smaller integer \(\hat{m}-1\), we must have
This together with (53) implies
and thereby
One of the terms in the above summation is at least \(\frac{2}{3}\), which forces \({\mathscr {I}} \le \frac{9}{4}<3\) and hence \(|(p_{\hat{m}-1}^{[1]})'(0)| = \frac{{\mathscr {I}}}{\lambda } <\frac{3}{\lambda }\). This proves the first statement (24).
To prove the second statement, we first claim that for \(v\in \{1, 2\}\),
The claim obviously holds with equality for \(\hat{m} =1\), since in this case \(p_{\hat{m}-1}^{[v]} \equiv 1\) and \(p_{\hat{m}-1}^{[v]}(L_{K,D})\) is the identity operator.
Consider the case \(\hat{m}\ge 2\). Since \(\varepsilon =\lambda /3\), we have from (24), (41) and (40) that
It follows from (43) that for \(v\in \{1, 2\}\), there holds
Recall the eigenpairs \(\{(\sigma _i^\mathbf{x},\phi _i^\mathbf{x})\}_{i}\) of \(L_{K,D}\). Expressing \(f_{K,D}=\sum _jc_j\phi _j^\mathbf{x}\) implies
So the claim (54) is also true in the case \(\hat{m}\ge 2\). This proves the claim.
To prove the second statement of the proposition, we estimate the norm \(\Vert F_\varepsilon [f_{K,D}]\Vert _K\). Under the condition (7),
where the operators \(F_\varepsilon (L_{K,D}+\lambda I)^{1/2} \) and \(F_\varepsilon L_{K,D}L_K^{r-1/2}\) are defined by spectral calculus.
When \(1/2\le r\le 3/2\), we have
When \(r>3/2\), it follows from (52) that
Combining the above bounds for \(\Vert F_\varepsilon L_{K,D}L_K^{r-1/2}\Vert \) with (56) and noticing the choice \(\varepsilon = \lambda /3\) and the definition (17) of the quantity \(\Lambda _{\rho , \lambda , r}\), we find
Since \(\hat{m}\) is the smallest nonnegative integer satisfying (16), the integer \(\hat{m}-1\) does not satisfy (16). Hence (22) implies
This verifies the desired statement of the proposition. The proof of Proposition 1 is complete. \(\square \)
Proof of Proposition 2
We first prove (30). Since \(|(p^{[1]}_{0})'(0)|=0\), (30) obviously holds for \(\hat{m}=0\). We then consider the case \(\hat{m}\ge 1\). By (41),
From (36), we have
Therefore,
Then, it follows from (55), (22) and (5) that
But Proposition 1 with \(v=2\) gives
Hence
Therefore,
which together with (57), (58) and Proposition 1 yields
This proves the first statement (30) of Proposition 2.
To prove the second statement, we denote \(\varepsilon _0=\lambda /15\) and
We can decompose \(\Vert f^{[1]}_{D,\hat{m}}-f_\rho \Vert _\rho \) as
Note that \(A_1 =0\) and \(f^{[1]*}_{D,\hat{m}}-f_\rho = -f_\rho =- p^{[1]}_{\hat{m}}(L_{K,D}) f_\rho \) when \(\hat{m}=0\). If \(\hat{m} \ge 1\), we use (20), (59), (26), the definitions of \(\mathscr {P}_{D,\lambda }\) and \(\mathscr {Q}_{D,\lambda }\) to bound \(A_1\) as
By (43), for \(0\le t < {\varepsilon _0} \le t^{[1]}_{1,\hat{m}}\), we have
Furthermore, (43), (41), (61) and (30) imply
Therefore, the first term in (60) can be bounded as
We then bound the second term \(A_2\) in two cases involving r.
When \(1/2\le r\le 3/2\), we have \(r-1/2\le 1\), and the bound \(\sup _{0\le t < {\varepsilon _0}\le t^{[1]}_{1,\hat{m}}} |p^{[1]}_{\hat{m}}(t)| \le 1\) together with (26) and the regularity condition (7) yields
When \(r>3/2\), we use (21), (59) and the regularity condition (7) to get
Since \(|p^{[1]}_{\hat{m}}(t)|\le 1\) for all \(0\le t < {\varepsilon _0}\le t^{[1]}_{1,\hat{m}}\), we get
Combining these with (52) and the definition of \(\mathscr {R}_{D}\) yields
Finally, we turn to bound \(A_3\). From Lemma 1 and \(f_\rho \in \mathscr {H}_K\), we obtain that \(f_\rho \) is in the range of \(L_{K,D}\). Since \((a+b)^{1/2}\le a^{1/2}+b^{1/2}\) for \(a,b>0\), we then have
But \(\hat{m}\) satisfies (16). It follows that
Inserting (62), (63), (64) and (65) into (60) and noticing \(\varepsilon _0=\lambda /15\), we obtain
This verifies the second statement of the proposition. The proof of Proposition 2 is complete. \(\square \)
Lin, S.B., Zhou, D.X.: Optimal learning rates for kernel partial least squares. J. Fourier Anal. Appl. 24, 908–933 (2018). https://doi.org/10.1007/s00041-017-9544-8