Abstract
We study two learning algorithms generated by kernel partial least squares (KPLS) and kernel minimal residual (KMR) methods. In these algorithms, regularization against overfitting is obtained by early stopping, which makes stopping rules crucial to their learning capabilities. We propose a stopping rule for determining the number of iterations based on cross-validation, without assuming a priori knowledge of the underlying probability measure, and show that optimal learning rates can be achieved. Our novel analysis rests on a bound for the number of iterations in the a priori knowledge-based stopping rule for KMR and on a stepping stone from KMR to KPLS. Technical tools include a recently developed integral operator approach based on a second-order decomposition of inverse operators and an orthogonal polynomial argument.
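To make the early stopping scheme concrete, the sketch below pairs a simple kernel iteration with a hold-out cross-validation stopping rule. This is an illustration only, not the authors' KPLS/KMR algorithm: plain kernel gradient descent (as studied in [26]) stands in for the iterative method, and all function names, the Gaussian kernel, and the parameter values are our own choices.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # Pairwise Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - z_j||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gradient_descent_path(K, y, step, max_iter):
    # Kernel gradient descent: alpha_{t+1} = alpha_t + step * (y - K alpha_t) / n.
    # Returns the whole coefficient path; the stopping rule picks one iterate.
    n = len(y)
    alpha = np.zeros(n)
    path = []
    for _ in range(max_iter):
        alpha = alpha + step * (y - K @ alpha) / n
        path.append(alpha.copy())
    return path

def cv_stopping_rule(X, y, gamma, step, max_iter, val_frac=0.3, seed=0):
    # Hold-out cross-validation: run the iteration on a training split and
    # choose the iteration number minimizing the validation error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    K_tr = gaussian_kernel(X[tr], X[tr], gamma)
    K_val = gaussian_kernel(X[val], X[tr], gamma)
    path = gradient_descent_path(K_tr, y[tr], step, max_iter)
    errs = [np.mean((K_val @ a - y[val]) ** 2) for a in path]
    return int(np.argmin(errs)) + 1  # number of iterations chosen by CV

# Toy regression problem: y = sin(2*pi*x) + noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
m_hat = cv_stopping_rule(X, y, gamma=10.0, step=1.0, max_iter=200)
print("iterations chosen by CV:", m_hat)
```

Because the iterates overfit as the iteration count grows, the validation error typically first decreases and then increases, so its minimizer serves as a data-driven substitute for a priori knowledge of the regularity of the target function.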
References
Bauer, F., Pereverzev, S., Rosasco, L.: On regularization algorithms in learning theory. J. Complex. 23, 52–72 (2007)
Blanchard, G., Krämer, N.: Kernel partial least squares is universally consistent. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR Workshop &amp; Conference Proceedings, vol. 9, pp. 57–64 (2010)
Blanchard, G., Krämer, N.: Optimal learning rates for kernel conjugate gradient regression. In: NIPS, pp. 226–234 (2010)
Caponnetto, A., De Vito, E.: Optimal rates for the regularized least squares algorithm. Found. Comput. Math. 7, 331–368 (2007)
Caponnetto, A., Yao, Y.: Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8, 161–183 (2010)
Chun, H., Keles, S.: Sparse partial least squares for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B 72, 3–25 (2010)
De Vito, E., Pereverzyev, S., Rosasco, L.: Adaptive kernel methods using the balancing principle. Found. Comput. Math. 10, 455–479 (2010)
Engl, H., Hanke, M., Neubauer, A.: Regularization of Inverse Problems. Kluwer Academic, Amsterdam (2000)
Evgeniou, T., Pontil, M., Poggio, T.: Regularization networks and support vector machines. Adv. Comput. Math. 13, 1–50 (2000)
Guo, X., Zhou, D.X.: An empirical feature-based learning algorithm producing sparse approximations. Appl. Comput. Harmon. Anal. 32, 389–400 (2012)
Guo, Z.C., Lin, S.B., Zhou, D.X.: Optimal learning rates for spectral algorithm. Inverse Problems (minor revision under review) (2016)
Hanke, M.: Conjugate gradient type methods for ill-posed problems. Pitman Research Notes in Mathematics Series, vol. 327 (1995)
Hu, T., Fan, J., Wu, Q., Zhou, D.X.: Regularization schemes for minimum error entropy principle. Anal. Appl. 13, 437–455 (2015)
Li, S., Liao, C., Kwok, J.: Gene feature extraction using T-test statistics and kernel partial least squares. In: Neural Information Processing, pp. 11–20. Springer, Berlin (2006)
Lin, S.B., Guo, X., Zhou, D.X.: Distributed learning with regularization schemes. J. Mach. Learn. Res. (Revision under review) (2016). arXiv:1608.03339
Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E., Verri, A.: Spectral algorithms for supervised learning. Neural Comput. 20, 1873–1897 (2008)
Raskutti, G., Wainwright, M., Yu, B.: Early stopping and non-parametric regression: an optimal data-dependent stopping rule. J. Mach. Learn. Res. 15, 335–366 (2014)
Rosipal, R., Trejo, L.: Kernel partial least squares regression in reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 2, 97–123 (2001)
Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26, 153–172 (2007)
Wold, H.: Path models with latent variables: the NIPALS approach. In: Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pp. 307–357. Academic Press, New York (1975)
Wu, Q., Ying, Y.M., Zhou, D.X.: Learning rates of least square regularized regression. Found. Comput. Math. 6, 171–192 (2006)
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26, 289–315 (2007)
Acknowledgements
The work described in this paper is partially supported by the NSFC/RGC Joint Research Scheme [RGC Project No. N_CityU120/14 and NSFC Project No. 11461161006] and by the National Natural Science Foundation of China [Grant Nos. 61502342 and 11471292]. The paper was written while the second author visited Shanghai Jiao Tong University, a visit jointly sponsored by the Ministry of Education of China; the hospitality is gratefully acknowledged.
Additional information
Communicated by Massimo Fornasier.
Appendix
This appendix provides five technical lemmas and proofs of two propositions concerning the a priori knowledge-based learning algorithms. The first lemma, concerning the range of the operator \(L_{K,D}\), is well known in the literature [1, 3, 10, 16].
Lemma 1
Denote \({\mathscr {H}}_{K, \mathbf{x}} = \hbox {span}\{K(\cdot , x_i)\}_{i=1}^N\) for \(D \in {\mathscr {Z}}^N\). Then \(f_{K,D} \in {\mathscr {H}}_{K, \mathbf{x}}\). The space \({\mathscr {H}}_{K, \mathbf{x}}\) equals the range of \(L_{K,D}\) and is spanned by all eigenfunctions of \(L_{K,D}\) with positive eigenvalues. Its dimension equals the rank \(d_\mathbf{x}\) of the Gramian matrix \({\mathbb K}\).
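For orientation, Lemma 1 can be read through the standard definition of the empirical integral operator used in this literature (e.g. [23]); the display below is our sketch of that standard identity, not a formula reproduced from the paper:

```latex
L_{K,D} f \;=\; \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, K(\cdot, x_i),
\qquad f \in \mathscr{H}_K .
```

Every image \(L_{K,D}f\) is thus a linear combination of the kernel sections \(K(\cdot ,x_i)\), so the range of \(L_{K,D}\) lies in \({\mathscr {H}}_{K,\mathbf{x}}\); since both the rank of \(L_{K,D}\) and the dimension of \({\mathscr {H}}_{K,\mathbf{x}}\) equal the rank \(d_\mathbf{x}\) of \({\mathbb K}\), the two spaces coincide.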
The following two lemmas found in [12] describe some properties of \(p_m^{[u]}\) and \(q_{m-1}^{[u]}\).
Lemma 2
Let \(m\in \mathbb {N}_0\). The following identities hold:
where \((p^{[1]}_{m+1})'(0)\ne (p^{[0]}_{m+1})'(0)\).
The above three identities are stated in Corollary 2.6, Corollary 2.9, and Proposition 2.8 of [12].
Lemma 3
Let \(u\in \{0,1,2\}\), \(m\in \mathbb {N}\), and let \(\{t^{[u]}_{k,m}\}_{k=1}^m\) be the simple zeros of \(p^{[u]}_m\) in increasing order. Then the following statements hold:
The first two statements above are stated in Corollary 2.7 of [12], while the last two follow from the first statement and the representation of \(p_m^{[u]}\) in terms of its constant term 1 and zeros as \(p_m^{[u]}(t)=\prod _{k=1}^m \bigl(1 - t/t^{[u]}_{k,m}\bigr)\).
The fourth lemma focuses on bounding \(\mathscr {Q}_{D,\lambda }, \mathscr {P}_{D,\lambda }\) and \(\mathscr {R}_D\).
Lemma 4
Let D be a sample drawn independently according to \(\rho \) and \(0< \delta <1\). Then each of the following estimates holds with confidence at least \(1-\delta \),
The proofs of (45) and (46) can be found in [4, 22], and that of (44) in [11].
The last lemma refers to a concentration inequality stated in [5, Proposition 11].
Lemma 5
Let \(\{\xi _i\}_{i=1}^n\) be a sequence of real valued independent random variables with mean \(\mu \), satisfying \(|\xi _i|\le B\) and \( E[(\xi _i-\mu )^2]\le \tau ^2\) for \(i\in \{1,2,\dots ,n\}\). Then for any \(a>0\) and \(\varepsilon >0\), there hold
and
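The two inequalities displayed in Lemma 5 are not reproduced here. For reference, the classical one-sided Bernstein inequality under these assumptions takes the form below; this is our sketch of the standard bound (assuming, as in the classical statement, that the centered variables are bounded by \(B\)), not necessarily the exact variant involving the parameter \(a\) stated in [5, Proposition 11]:

```latex
\mathrm{Prob}\left\{ \frac{1}{n}\sum_{i=1}^{n} \xi_i - \mu \ge \varepsilon \right\}
\;\le\; \exp\!\left( - \frac{n\varepsilon^2}{2\bigl(\tau^2 + \tfrac{1}{3} B \varepsilon \bigr)} \right),
```

together with the same bound for the lower deviation \(\mu - \frac{1}{n}\sum_{i=1}^{n}\xi _i\).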
With the help of the above lemmas, we can now prove Propositions 1 and 2.
Proof of Proposition 1
We start with proving the first statement (24). Since \(p^{[1]}_{0}(t)=1\) and \((p^{[1]}_{0})'(t)=0\) for all \(t\in [0,\kappa ^2]\), (24) holds obviously for \(\hat{m}=1\). It then suffices to prove (24) for \(\hat{m}\ge 2\). It was presented in [12, p. 41] (see also [3, p. 16]) that
Here \(F_{t^{[1]}_{1,\hat{m}-1}} \phi ^{[1]}_{\hat{m}-1}(L_{K,D})\) is the linear operator on \({\mathscr {H}}_K\) defined in terms of the orthonormal basis \(\{\phi _j^\mathbf{x}\}_j\) and the orthogonal projection \(F_{t^{[1]}_{1,\hat{m}-1}}\) by spectral calculus as
where \(\phi ^{[1]}_{\hat{m}-1}(t)\) is the function defined on \([0, t^{[1]}_{1, \hat{m}-1})\) by
Then we decompose \(f_{K,D}\) as \(f_{K,D} - L_{K,D }f_\rho + L_{K,D }f_\rho \) and bound the norm as
We continue our estimates by bounding the first term I. Applying (26) with \(\alpha =1/2\) gives
Furthermore, the representation (43) for \(p^{[1]}_{\hat{m}-1}\) and the definition of \(\phi ^{[1]}_{\hat{m}-1}\) yield
It was shown in [12, Equation (3.10)] that for an arbitrary \(\nu >0\),
Combining the above three bounds yields an estimate for the first term of (47) as
We now turn to the second term II of (47). By the regularity condition (7) for \(f_\rho = L_K^r h_\rho \) and the identity \(\Vert L_K^{1/2} h_\rho \Vert _K = \Vert h_\rho \Vert _\rho \), we find
where for simplicity we denote the norm as
When \(1/2\le r\le 3/2\), we express \((L_K + \lambda I)^{r-1/2}\) as \((L_{K, D} + \lambda I)^{r-1/2} (L_{K, D} + \lambda I)^{1/2 -r}(L_K + \lambda I)^{r-1/2}\) and apply (26) with \(\alpha =r-1/2\) to get
When \(r>3/2\), we decompose the operator \((L_K + \lambda I)^{r-1/2}\) in \(\widetilde{II}\) as
The bounds \(\Vert L_{K,D}\Vert \le \kappa ^2\), \(\Vert L_K\Vert \le \kappa ^2\), and the Lipschitz property of the function \(x\mapsto x^{r-1/2}\) on \([0,\kappa ^2]\) (valid with constant \((r-1/2)\kappa ^{2r-3}\) since \(r-1/2>1\)) imply
Hence
Combining this with (51) and the following norm estimate with \(\gamma , \beta \ge 0\),
derived from spectral calculus and the inequality (48), we have with the notation \({\mathscr {I}} = \lambda |(p^{[1]}_{{\hat{m}-1}})' (0)|\),
This together with (50), the bound (49) for I, (47), and the definition (16) of the quantity \(\Lambda _{\rho , \lambda , r}\) tells us that
On the other hand, \(\hat{m}\ge 2\) is the smallest nonnegative integer satisfying (16), so for the smaller integer \(\hat{m}-1\), we must have
This together with (53) implies
and thereby
One of the terms in the above summation is at least \(\frac{2}{3}\), which forces \({\mathscr {I}} \le \frac{9}{4}<3\) and hence \(|(p_{\hat{m}-1}^{[1]})'(0)| = \frac{{\mathscr {I}}}{\lambda } <\frac{3}{\lambda }\). This proves the first statement (24).
To prove the second statement, we first claim that for \(v\in \{1, 2\}\),
The claim obviously holds with equality for \(\hat{m} =1\), since in this case \(p_{\hat{m}-1}^{[v]} \equiv 1\) and \(p_{\hat{m}-1}^{[v]}(L_{K,D})\) is the identity operator.
Consider the case \(\hat{m}\ge 2\). Since \(\varepsilon =\lambda /3\), we have from (24), (41) and (40) that
It follows from (43) that for \(v\in \{1, 2\}\), there holds
Recall the eigenpairs \(\{(\sigma _i^\mathbf{x},\phi _i^\mathbf{x})\}_{i}\) of \(L_{K,D}\). Expressing \(f_{K,D}=\sum _jc_j\phi _j^\mathbf{x}\) implies
So the claim (54) is also true in the case \(\hat{m}\ge 2\). This proves the claim.
To prove the second statement of the proposition, we estimate the norm \(\Vert F_\varepsilon [f_{K,D}]\Vert _K\). Under the condition (7),
where the operators \(F_\varepsilon (L_{K,D}+\lambda I)^{1/2} \) and \(F_\varepsilon L_{K,D}L_K^{r-1/2}\) are defined by spectral calculus.
When \(1/2\le r\le 3/2\), we have
When \(r>3/2\), it follows from (52) that
Combining the above bounds for \(\Vert F_\varepsilon L_{K,D}L_K^{r-1/2}\Vert \) with (56) and noticing the choice \(\varepsilon = \lambda /3\) and the definition (17) of the quantity \(\Lambda _{\rho , \lambda , r}\), we find
Since \(\hat{m}\) is the smallest nonnegative integer satisfying (16), the integer \(\hat{m}-1\) does not satisfy (16). Hence (22) implies
This verifies the desired statement of the proposition. The proof of Proposition 1 is complete. \(\square \)
Proof of Proposition 2
We first prove (30). Since \(|(p^{[1]}_{0})'(0)|=0\), (30) obviously holds for \(\hat{m}=0\). We then consider the case \(\hat{m}\ge 1\). By (41),
From (36), we have
Therefore,
Then, it follows from (55), (22) and (5) that
But Proposition 1 with \(v=2\) gives
Hence
Therefore,
which together with (57), (58) and Proposition 1 yields
This proves the first statement (30) of Proposition 2.
To prove the second statement, we denote \(\varepsilon _0=\lambda /15\) and
We can decompose \(\Vert f^{[1]}_{D,\hat{m}}-f_\rho \Vert _\rho \) as
Note that \(A_1 =0\) and \(f^{[1]*}_{D,\hat{m}}-f_\rho = -f_\rho =- p^{[1]}_{\hat{m}}(L_{K,D}) f_\rho \) when \(\hat{m}=0\). If \(\hat{m} \ge 1\), we use (20), (59), (26), the definitions of \(\mathscr {P}_{D,\lambda }\) and \(\mathscr {Q}_{D,\lambda }\) to bound \(A_1\) as
By (43), for \(0\le t < {\varepsilon _0} \le t^{[1]}_{1,\hat{m}}\), we have
Furthermore, (43), (41), (61) and (30) imply
Therefore, the first term in (60) can be bounded as
We then bound the second term \(A_2\) in two cases involving r.
When \(1/2\le r\le 3/2\), we have \(r-1/2\le 1\), and the bound \(\sup _{0\le t < {\varepsilon _0}\le t^{[1]}_{1,\hat{m}}} |p^{[1]}_{\hat{m}}(t)| \le 1\) together with (26) and the regularity condition (7) yields
When \(r>3/2\), we use (21), (59) and the regularity condition (7) to get
Since \(|p^{[1]}_{\hat{m}}(t)|\le 1\) for all \(0\le t < {\varepsilon _0}\le t^{[1]}_{1,\hat{m}}\), we get
Combining these with (52) and the definition of \(\mathscr {R}_{D}\) yields
Finally, we turn to bound \(A_3\). From Lemma 1 and \(f_\rho \in \mathscr {H}_K\), we obtain that \(f_\rho \) is in the range of \(L_{K,D}\). Since \((a+b)^{1/2}\le a^{1/2}+b^{1/2}\) for \(a,b>0\), we then have
But \(\hat{m}\) satisfies (16). It follows that
Inserting (62), (63), (64) and (65) into (60) and noticing \(\varepsilon _0=\lambda /15\), we obtain
This verifies the second statement of the proposition. The proof of Proposition 2 is complete. \(\square \)
Lin, S.B., Zhou, D.X.: Optimal learning rates for kernel partial least squares. J. Fourier Anal. Appl. 24, 908–933 (2018). https://doi.org/10.1007/s00041-017-9544-8