1 Introduction

The applications of greedy algorithms to supervised learning have attracted great research interest because they offer appealing generalization capability with a lower computational burden than typical regularized methods, particularly in large-scale dictionary learning problems [16]. Large data sets frequently cause slow performance for most traditional learning algorithms. To tackle this problem, many researchers [1–3, 7, 8] advocate greedy learning algorithms, which have greatly improved learning performance.

The approximation abilities of greedy-type algorithms for frames or more general dictionaries \(\mathcal {D}\) were investigated in [7, 9–12], as well as various applications, see [3, 7, 13–19]. The pure greedy algorithm (PGA) can realize the best bilinear approximation, see [20, 21]. Although the PGA is computationally attractive, its main drawback is that it lacks optimal convergence properties for a general dictionary; consequently, its convergence rate, which is slower than that of the best nonlinear approximation [11, 21–23], degrades its learning performance. To improve the approximation rate, the orthogonal greedy algorithm (OGA), the relaxed greedy algorithm (RGA), the stepwise projection algorithm (SPA), and their weak versions have been proposed. It was shown that these greedy algorithms all achieve the optimal rate \(\mathcal {O}(m^{-\frac{1}{2}})\) for approximating the elements in the class \(\mathcal {A}_{1}(\mathcal {D})\), which will be defined in (14), where m is the iteration number, see [9, 11].

Both the OGA and the RGA have recently been employed successfully in machine learning [1–3, 7, 8]. For example, Barron et al. [7] established the optimal convergence rate \(\mathcal {O} ((n/\log n)^{-\frac{1}{2}} )\), where n is the sample size. To reduce the OGA’s computational load, Fang et al. [1] investigated the learning performance of the orthogonal super greedy algorithm (OSGA) and derived almost the same rate as that of the orthogonal greedy learning algorithm (OGLA). All these results demonstrate that each greedy learning algorithm has its advantages and disadvantages.

We study the applications of weak greedy algorithms to least squares regression in supervised learning. It is well known that weak-type algorithms are easier to implement than the usual greedy algorithms, see [12]. Specifically, the weak rescaled pure greedy algorithm (WRPGA), a fairly simple modification of the PGA, is the goal of our investigation, see [24, 25]. Compared with the OGA and the RGA, the WRPGA can further reduce the computational load. The best rate \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the basic sparse class has been proved [24]. Motivated by the results of [24], we use the same method employed for the RPGA there to deduce a K-functional error bound for the WRPGA in a Hilbert space \(\mathcal {H}\). The WRPGA is a simple greedy algorithm with good approximation ability. Based on this, we propose the weak rescaled pure greedy learning algorithm (WRPGLA) for solving kernel-based regression problems in supervised learning. Using the proven approximation result for the WRPGA, we derive that the WRPGLA attains almost the same learning rate as the OGLA. Our results show that the WRPGLA further reduces the computational complexity without sacrificing generalization capability.

The paper is organized as follows. In Sect. 2, we review least squares regression learning theory and the WRPGA. In Sect. 3, we propose the WRPGLA and state the main theorems on the error estimates. Section 4 is devoted to proofs of the main results. We present the convergence rates under two smoothness assumptions on the regression function \(f_{\rho}\) in the last section.

2 Preliminaries

Some preliminaries are presented in this section. Sections 2.1 and 2.2 provide a brief overview of least squares regression learning and the WRPGA, respectively.

2.1 Least squares regression

In this paper, the approximation problem is addressed in the following statistical learning context. Let X be a compact metric space and \(Y=\mathbb{R}\). Let ρ be a Borel probability measure on \(Z= X\times Y \). The generalization error for a function \(f:X\rightarrow Y\) is defined by

$$ \mathcal {E}(f)= \int _{Z}\bigl(f(x)-y\bigr)^{2}\,d\rho ,$$
(1)

which is minimized by the following regression function:

$$ f_{\rho}(x)= \int _{Y} y \,d\rho (y\vert x), $$

where \(\rho (\cdot \vert x)\) is the conditional distribution induced by ρ at \(x\in X\). In regression learning, ρ is unknown; what is available is a set of samples \({\mathbf{z}}=\{z_{i}\}_{i=1}^{n}=\{(x_{i}, y_{i})\}_{i=1}^{n} \in Z^{n}\) drawn independently and identically according to ρ. The goal of learning is to find a good approximation \(f_{{\mathbf{z}}}\) of \(f_{\rho}\), which minimizes the empirical error

$$ \mathcal {E}_{{\mathbf{z}}}(f)= \Vert y-f \Vert _{n}^{2}:=\frac{1}{n}\sum_{i=1}^{n} \bigl(f(x_{i})-y_{i}\bigr)^{2}.$$
(2)

Denote the Hilbert space of the square integrable functions defined on X with respect to the measure \(\rho _{X}\) by \(L_{\rho _{X}} ^{2}(X)\), where \(\rho _{X}\) is the marginal measure of ρ on X. It is clear from the definition of \(f_{\rho}(x)\) that for each \(x\in X\), \(\int _{Y}(f_{\rho}(x)-y)\,d\rho (y\vert x)=0\). For any \(f\in L_{\rho _{X}} ^{2}(X)\), it holds that

$$\begin{aligned} \mathcal {E}(f)={}& \int _{Z}\bigl(f(x)-f_{\rho}(x)+f_{\rho}(x)-y \bigr)^{2}\,d\rho \\ ={}& \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr)^{2}\,d\rho _{X}+ \int _{Z}\bigl(f_{\rho}(x)-y\bigr)^{2}\,d\rho \\ &{}+2 \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr) \biggl( \int _{Y}\bigl(f_{\rho}(x)-y\bigr)\,d\rho (y\vert x) \biggr)\,d\rho _{X} \\ ={}& \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr)^{2}\,d\rho _{X}+\mathcal {E}(f_{\rho}). \end{aligned}$$

Therefore,

$$ \mathcal {E}(f)-\mathcal {E}(f_{\rho})= \Vert f-f_{\rho} \Vert ^{2} $$
(3)

with the norm \(\|\cdot \|\)

$$ \Vert f \Vert = \biggl( \int _{X} \bigl\vert f(x) \bigr\vert ^{2}d{\rho _{X}} \biggr)^{ \frac{1}{2}}. $$
(4)

The prediction accuracy of learning algorithms is measured by \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\).
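For illustration, the following short numerical sketch checks the identity (3) by Monte Carlo. The choice of ρ (a uniform marginal \(\rho _{X}\) with Gaussian noise), the regression function, and the candidate function f are arbitrary and serve only as an example.

```python
# Monte Carlo check of (3): E(f) - E(f_rho) should equal ||f - f_rho||^2.
# The distribution rho and the candidate function f are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)                      # rho_X = uniform on X = [0, 1]
f_rho = lambda t: t**2                            # regression function E[y | x]
y = f_rho(x) + 0.3 * rng.standard_normal(n)       # samples drawn according to rho

f = lambda t: 0.5 * t                             # an arbitrary candidate function
lhs = np.mean((f(x) - y)**2) - np.mean((f_rho(x) - y)**2)   # E(f) - E(f_rho)
rhs = np.mean((f(x) - f_rho(x))**2)                         # ||f - f_rho||^2
print(lhs, rhs)   # both should be close to 1/30 ≈ 0.0333, up to Monte Carlo error
```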

We will assume \(\vert y\vert \leq B\) for a positive real number \(B<\infty \) almost surely. In this paper, we construct the learning estimator \(f_{{\mathbf{z}}}\) by applying the WRPGA and estimate \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\). So, in the following subsection, we recall this algorithm.

2.2 Weak rescaled pure greedy algorithm

We shall restrict our analysis to the situation in which approximation takes place in a real, separable Hilbert space \(\mathcal {H}\) with the inner product \(\langle \cdot ,\cdot \rangle _{\mathcal {H}}\) and the norm \(\|\cdot \|:=\|\cdot \|_{\mathcal {H}}=\langle \cdot ,\cdot \rangle _{ \mathcal {H}}^{\frac{1}{2}}\). Let \(\mathcal {D}\subset \mathcal {H}\) be a given dictionary satisfying \(\|g\|=1\) for every \(g\in \mathcal {D}\), \(g\in \mathcal {D}\) implies \(-g\in \mathcal {D}\) and \(\overline{\operatorname{Span}(\mathcal {D})}=\mathcal {H}\).

Petrova developed the rescaled pure greedy algorithm (RPGA) to enhance the PGA’s convergence rate, which simply rescales \(f_{m}\) at the mth greedy step, see [24]. We begin by describing the weak rescaled pure greedy algorithm (WRPGA) also introduced by Petrova in [24].

\({\mathbf{{WRPGA}}}(\{t_{m}\},\mathcal {D})\):

Step 0: Let \(f_{0} :=0\).

Step m (\(m \geq 1\)):

(1) If \(f=f_{m-1}\), then terminate the iterative process and define \(f_{k}=f_{m-1}=f\) for \(k \geq m\).

(2) If \(f\neq f_{m-1}\), then choose a direction \(\varphi _{m}\in \mathcal {D}\) such that

$$ \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert \geq t_{m}\sup_{ \varphi \in \mathcal {D}} \bigl\vert \langle f-f_{m-1},\varphi \rangle \bigr\vert , $$
(5)

where \(\{t_{m}\}_{m=1}^{\infty}\) is a weakness sequence and \(t_{m}\in (0,1]\).

Let

$$\begin{aligned}& \lambda _{m}:= \langle f-f_{m-1},\varphi _{m} \rangle , \end{aligned}$$
(6)
$$\begin{aligned}& \hat{f}_{m}:= f_{m-1}+\lambda _{m} \varphi _{m}, \end{aligned}$$
(7)
$$\begin{aligned}& s_{m}:= \frac { \langle f, \hat{f}_{m}\rangle }{ \Vert \hat{f }_{m} \Vert ^{2}}. \end{aligned}$$
(8)

The m step approximation \(f_{m}\) is defined as

$$ f_{m}=s_{m}\hat{f}_{m} , $$
(9)

and proceed to Step \(m+1\).

Remark 1

When \(t_{m}=1\), this algorithm is the RPGA. Note that if the supremum in (5) is not attained, one can select \(t_{m}<1\) and proceed with the algorithm; in this case, it is easier to choose \(\varphi _{m}\). If the output at the mth greedy step were \(\hat{f}_{m}\) rather than \(f_{m}=s_{m}{\hat{f}_{m}}\), this would be the PGA. The WRPGA uses \(s_{m}{\hat{f}_{m}}\), which is just a suitable rescaling of \({\hat{f}_{m}}\), and thus improves the rate to \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the closure of the convex hull of \(\mathcal {D}\).
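For illustration, the following sketch implements the above steps in the finite-dimensional Hilbert space \(\mathcal {H}=\mathbb{R}^{d}\) with a finite dictionary given by the columns of a matrix. The dictionary, the target f, the number of steps, and the constant weakness parameter \(t_{0}\) are arbitrary illustrative choices, not prescribed by the paper.

```python
# A minimal numerical sketch of the WRPGA in H = R^d with a finite dictionary.
import numpy as np

def wrpga(f, D, n_steps, t0=1.0):
    """Run n_steps of the WRPGA for the target vector f and dictionary D.

    D is a (d, N) array whose columns are unit-norm dictionary elements;
    0 < t0 <= 1 is a constant weakness parameter t_m = t0.
    Returns the approximants f_0, f_1, ..., f_{n_steps}.
    """
    f_m = np.zeros_like(f)                     # Step 0: f_0 = 0
    approximants = [f_m]
    for _ in range(n_steps):
        r = f - f_m                            # residual f - f_{m-1}
        if np.allclose(r, 0.0):                # termination rule of Step m, part (1)
            approximants.append(f_m)
            continue
        inner = D.T @ r                        # <f - f_{m-1}, phi> for every phi
        abs_inner = np.abs(inner)
        idx = int(np.argmax(abs_inner >= t0 * abs_inner.max()))  # first index satisfying (5)
        lam = inner[idx]                       # lambda_m as in (6)
        f_hat = f_m + lam * D[:, idx]          # hat f_m as in (7)
        s = (f @ f_hat) / (f_hat @ f_hat)      # rescaling s_m as in (8)
        f_m = s * f_hat                        # f_m = s_m * hat f_m as in (9)
        approximants.append(f_m)
    return approximants

# toy usage: random unit-norm dictionary, target in its span
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D[:, :5] @ rng.standard_normal(5)
errs = [np.linalg.norm(f - g) for g in wrpga(f, D, 30, t0=0.8)]
print(errs[::10])   # the residual norms should decrease
```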

3 Weak rescaled pure greedy learning

We shall now provide the WRPGLA for regression. From the definition of the WRPGA, computing \(\sup_{\varphi \in \mathcal {D}} \vert \langle f-f_{m-1},\varphi \rangle \vert \) may be computationally demanding. Therefore we compute only over a truncation of the dictionary, that is, a finite subset of \(\mathcal {D}\). Let \(\mathcal {D}_{1}\subset \mathcal {D}_{2}\subset \cdots\subset \mathcal {D}\), where \(\mathcal {D}_{m}\) is the truncation of \(\mathcal {D}\) with cardinality \(\#(\mathcal {D}_{m})=m\). Here we assume that

$$ m\leq m(n):=\bigl\lfloor n^{a}\bigr\rfloor \quad \text{for some fixed }a \geq 1. $$
(10)

Then the WRPGLA is defined by the following simple procedure.

WRPGLA:

Step 1: We apply the WRPGA with the dictionary \(\mathcal {D}_{m}\) to the data function \(y(x_{i})=y_{i}\), using the norm \(\|\cdot \|_{n}\) associated with the empirical inner product, that is,

$$ \Vert f \Vert _{n}:= \Biggl(\frac{1}{n}\sum _{i=1}^{n} \bigl\vert f(x_{i}) \bigr\vert ^{2} \Biggr)^{\frac{1}{2}}. $$

Step 2: The algorithm produces the approximation \(f_{{\mathbf{z}},k}:=f_{k}\) to the data at the kth greedy step. Then we define our estimator as \(f_{{\mathbf{z}}}:=Tf_{{\mathbf{z}},k^{*}}\), where \(Tu:=T_{B}u:=\min \{B,\vert u\vert \}\operatorname{sgn}(u)\) is the truncation operator at level B and

$$ k^{*}:=\arg \min_{k>0} \biggl\{ \Vert y-Tf_{{\mathbf{z}},k} \Vert _{n}^{2}+\kappa \frac{k\log n}{n} \biggr\} ,$$
(11)

where the constant \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\), which will be discussed in the proof of Theorem 1.

Remark 2

First, comparing \(k=0\) with \(k=k^{*}\) in (11), it follows from \(f_{0}=0\) and \(\vert y\vert \leq B\) that \(\kappa \frac{k^{*}\log n}{n}\leq B^{2}\). This suggests that \(k^{*}\) is not larger than \(\frac{Bn}{\kappa}\). Second, from the definition of the estimator, we observe further that the computational cost of the kth greedy step is less than \(O(n^{a})\). Compared with the PGA, the WRPGA only requires the additional computation of \(s_{m}\).
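For illustration, the following sketch runs the two steps above on simulated data. It assumes the wrpga routine from the sketch in Sect. 2.2; the dictionary of sine functions, the sample size, B, \(t_{0}\), and in particular the small value of κ are arbitrary illustrative choices (the theory requires \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\)).

```python
# A minimal sketch of the WRPGLA on simulated data; all concrete choices are
# illustrative, and `wrpga` refers to the sketch given after Remark 1.
import numpy as np

def wrpgla(x, y, dict_fns, B, kappa, k_max, t0=1.0):
    """Return k* from (11) and the truncated estimate T f_{z,k*} at the x_i."""
    n = len(y)
    Phi = np.column_stack([g(x) for g in dict_fns])   # truncated dictionary D_m
    # Euclidean normalization of the columns; the Euclidean and empirical inner
    # products differ only by the common factor 1/n, so this is equivalent to
    # running the WRPGA with ||.||_n on the ||.||_n-normalized dictionary.
    Phi /= np.linalg.norm(Phi, axis=0)
    # Step 1: WRPGA applied to the data vector y
    estimates = wrpga(y, Phi, k_max, t0)              # f_{z,0}, ..., f_{z,k_max}
    # Step 2: truncate at level B and minimize the penalized empirical error (11)
    T = lambda u: np.clip(u, -B, B)
    crit = [np.mean((y - T(fk))**2) + kappa * k * np.log(n) / n
            for k, fk in enumerate(estimates)]
    k_star = int(np.argmin(crit[1:])) + 1             # minimize over k > 0
    return k_star, T(estimates[k_star])

# toy usage with a sine dictionary and f_rho(x) = sin(2*pi*x)
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
dict_fns = [lambda t, j=j: np.sin(np.pi * j * t) for j in range(1, 21)]
k_star, fit = wrpgla(x, y, dict_fns, B=2.0, kappa=0.5, k_max=30)
print(k_star, np.sqrt(np.mean((fit - np.sin(2 * np.pi * x))**2)))
```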

To discuss the approximation properties of WRPGLA, we introduce the class of functions

$$ \mathcal {A}_{1}^{0}(\mathcal {D},M):= \biggl\{ f= \sum _{k \in \Lambda}c_{k}(f)\varphi _{k}: \varphi _{k} \in \mathcal {D},\#(\Lambda )< \infty ,\sum _{k \in \Lambda } \bigl\vert c_{k}(f) \bigr\vert \leq M \biggr\} , $$
(12)

and

$$ \mathcal {A} _{1}(\mathcal {D},M) = \overline {\mathcal {A}_{1}^{0}(\mathcal {D},M) }. $$
(13)

Then

$$ \mathcal {A} _{1}(\mathcal {D})=\bigcup _{M>0}\mathcal {A} _{1}( \mathcal {D},M)$$
(14)

and

$$ \Vert f \Vert _{\mathcal {A} _{1}(\mathcal {D})}:= \inf \bigl\{ M: f\in \mathcal {A} _{1}( \mathcal {D},M)\bigr\} .$$
(15)

We also use the following K-functional:

$$ K(f,t):= K\bigl(f,t,\mathcal {H},\mathcal {A}_{1}(\mathcal {D}) \bigr):=\inf_{h \in \mathcal {A}_{1}(\mathcal {D})}\bigl\{ \Vert f-h \Vert _{\mathcal {H}}+t \Vert h \Vert _{ \mathcal {A}_{1}(\mathcal {D})}\bigr\} , \quad t>0. $$
(16)
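For orientation, we note two elementary bounds that follow directly from (16) by taking \(h=0\) and, when \(f\in \mathcal {A}_{1}(\mathcal {D})\), \(h=f\):

$$ K(f,t)\leq \Vert f \Vert _{\mathcal {H}} \quad \text{and}\quad K(f,t)\leq t \Vert f \Vert _{\mathcal {A}_{1}(\mathcal {D})}\quad \text{if } f\in \mathcal {A}_{1}(\mathcal {D}). $$

Thus \(K(f,t)\) simultaneously controls how well f can be approximated by elements of \(\mathcal {A}_{1}(\mathcal {D})\) and the size of the approximant.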

Since all the constants in this work depend at most on \(\kappa _{0}\), B, and a, we denote all of them by C for simplicity of notation. Now we take \(\mathcal {H}=L_{\rho _{X}} ^{2}(X)\) with the norm defined by (4).

Then, we provide our main results on the generalization error bounds for the WRPGLA.

Theorem 1

There exists \(\kappa _{0}\) depending only on B and a such that, if \(\kappa \geq \kappa _{0}\), then for all \(k>0\) and all \(h\in \operatorname{Span}(\mathcal {D}_{m})\), the learning estimator obtained by applying the WRPGA satisfies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{\rho}-h \Vert ^{2}+C \frac{k\log n}{n}. $$
(17)

Furthermore, we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\leq 2K \Biggl(f_{\rho},2 \Biggl(\sum_{i=1}^{k}t_{i}^{2} \Biggr)^{-\frac{1}{2}} \Biggr)+C\frac{k\log n}{n}. $$
(18)

Applying Theorem 1 with \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), we get the following theorem.

Theorem 2

Under the assumptions of Theorem 1, if \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{kt_{0}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+C \frac{k\log n}{n}. $$
(19)

Furthermore, we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\leq 2K \bigl(f_{\rho},2k^{-\frac{1}{2}}t_{0}^{-1} \bigr)+C \frac{k\log n}{n}. $$
(20)

4 Proofs of the main results

To prove Theorem 1, we establish a lemma on the upper error bound for the WRPGA.

Lemma 4.1

Let \(f\in \mathcal {H}\). Then the output \((f_{m})_{m\geq 0}\) of the WRPGA satisfies

$$\begin{aligned} e_{m}:= \Vert f-f_{m} \Vert \leq 2K \Biggl(f, \Biggl( \sum_{k=1}^{m}t_{k}^{2} \Biggr)^{-1/2} \Biggr),\quad m=0,1,2,\ldots . \end{aligned}$$
(21)

Proof

By the definition (16) of the K-functional, it suffices to prove that for \(f\in \mathcal {H}\) and any \(h\in \mathcal {A}_{1}(\mathcal {D})\),

$$\begin{aligned} e_{m}^{2}\leq \Vert f-h \Vert ^{2}+\frac{4}{\sum_{k=1}^{m}t_{k}^{2}} \Vert h \Vert _{ \mathcal {A}_{1}(\mathcal {D})}^{2},\quad m=1,2, \ldots . \end{aligned}$$
(22)

Since \(\mathcal {A}_{1}^{0}(\mathcal {D},M)\) is dense in \(\mathcal {A}_{1}(\mathcal {D},M)\), it suffices to prove (22) for functions h that are finite sums \(\sum_{j}c_{j}\varphi _{j}\) with \(\sum_{j}\vert c_{j}\vert \leq M\). We fix \(\epsilon >0\) and select a representation for \(h=\sum_{\varphi \in \mathcal {D}}c_{\varphi}\varphi \), such that

$$\begin{aligned} \sum_{\varphi \in \mathcal {D}} \vert c_{\varphi} \vert < M+ \epsilon . \end{aligned}$$
(23)

Denote

$$\begin{aligned} a_{m}:=e_{m}^{2}- \Vert f-h \Vert ^{2},\quad m=1,2,\ldots . \end{aligned}$$
(24)

Since \(\{e_{m}\}_{m=0}^{\infty}\) is nonincreasing, \(\{a_{m}\}_{m=0}^{\infty}\) is also a nonincreasing sequence.

We now distinguish two cases.

Case 1: \(a_{0}:=\|f\|^{2}-\|f-h\|^{2}\leq 0\). Then, for every \(m\geq 1\), we have \(a_{m}\leq 0\). Therefore inequality (22) holds true.

Case 2: \(a_{0}>0\). Assume that \(a_{m-1}>0\), \(m\geq 1\). Since \(f_{m}\) is the orthogonal projection of f onto the linear space spanned by \(\hat{f}_{m}\), we have

$$\begin{aligned} \langle f-f_{m},f_{m}\rangle =0,\quad m\geq 0. \end{aligned}$$
(25)

This together with the selection of \(\varphi _{m}\) implies

$$\begin{aligned} e_{m-1}^{2}&=\langle f-f_{m-1},f-f_{m-1} \rangle \\ &=\langle f-f_{m-1},f \rangle \\ &=\langle f-f_{m-1},f-h\rangle +\langle f-f_{m-1},h\rangle \\ &\leq e_{m-1} \Vert f-h \Vert +\sum_{\varphi \in \mathcal {D}}c_{ \varphi} \langle f-f_{m-1},\varphi \rangle \\ &\leq e_{m-1} \Vert f-h \Vert +t_{m}^{-1} \bigl\vert \langle f-f_{m-1},\varphi _{m} \rangle \bigr\vert \sum_{\varphi \in \mathcal {D}} \vert c_{ \varphi} \vert . \end{aligned}$$
(26)

By (23), we get

$$\begin{aligned} e_{m-1}^{2}\leq \frac{1}{2}\bigl(e_{m-1}^{2}+ \Vert f-h \Vert ^{2}\bigr)+t_{m}^{-1} \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert (M+ \epsilon ). \end{aligned}$$
(27)

Letting \(\epsilon \rightarrow 0\) and rearranging, we obtain

$$\begin{aligned} \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert \geq \frac{t_{m}(e_{m-1}^{2}- \Vert f-h \Vert ^{2})}{2M}. \end{aligned}$$
(28)

It has been proved in [24] that

$$\begin{aligned} e_{m}^{2}\leq e_{m-1}^{2}-\langle f-f_{m-1},\varphi _{m}\rangle ^{2},\quad m=1,2, \ldots . \end{aligned}$$
(29)

Then, using the assumption that \(a_{m-1}>0\), we have

$$\begin{aligned} e_{m}^{2}\leq e_{m-1}^{2}- \frac{t_{m}^{2}a_{m-1}^{2}}{4M^{2}}. \end{aligned}$$
(30)

It yields

$$\begin{aligned} a_{m}\leq a_{m-1} \biggl(1-\frac{t_{m}^{2}a_{m-1}}{4M^{2}} \biggr). \end{aligned}$$
(31)

In particular, for \(m=1\), we have

$$\begin{aligned} a_{1}\leq a_{0} \biggl(1-\frac{t_{1}^{2}a_{0}}{4M^{2}} \biggr). \end{aligned}$$
(32)

Case 2.1: \(0< a_{0}<\frac{4M^{2}}{t_{1}^{2}}\). Since \(\psi (t):=t (1-\frac{t_{1}^{2}t}{4M^{2}} )\) on \((0,\frac{4M^{2}}{t_{1}^{2}} )\) has maximum \(\frac{M^{2}}{t_{1}^{2}}\), it follows that

$$ a_{m}\leq \psi (a_{0})\leq \frac{M^{2}}{t_{1}^{2}}\leq \frac{4M^{2}}{t_{1}^{2}}. $$

Therefore, either all terms of \(\{a_{m}\}_{m=0}^{\infty}\) lie in \((0,\frac{4M^{2}}{t_{1}^{2}})\) and satisfy (31), or \(a_{m^{\ast}}\leq 0\) for some \(m^{\ast}\geq 1\), in which case the analysis for \(m\geq m^{\ast}\) is the same as in Case 1. For the positive terms of \(\{a_{m}\}_{m=0}^{\infty}\), applying Lemma 2.2 from [24] with \(l=1\), \(r_{m}=t_{m}^{2}\), \(B=\frac{4M^{2}}{t_{1}^{2}}\), \(J=0\), and \(r=4M^{2}\), we obtain

$$\begin{aligned} a_{m}\leq \frac{4M^{2}}{t_{1}^{2}+\sum_{k=1}^{m}t_{k}^{2}}\leq \frac{4M^{2}}{\sum_{k=1}^{m}t_{k}^{2}}, \end{aligned}$$
(33)

which gives inequality (22).

Case 2.2: \(a_{0}\geq \frac{4M^{2}}{t_{1}^{2}}\). It follows from (32) that \(a_{1}\leq 0\). That is, \(e_{1}^{2}\leq \|f-h\|^{2}\), which yields (22) for all \(m\geq 1\) by monotonicity. Lemma 4.1 is proved. □
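As a quick numerical sanity check of the recursion (31) and the resulting bound (33) (not needed for the proof), one can iterate the extremal case of equality in (31); the values of M, the constant weakness sequence, and \(a_{0}\) below are arbitrary illustrative choices.

```python
# Iterate the worst case of equality in (31) and compare with the bound (33).
# M, the weakness sequence t_k, and a_0 are arbitrary illustrative choices.
import numpy as np

M = 1.5
t = 0.7 * np.ones(200)              # constant weakness sequence t_k = t_0
a = 0.9 * 4.0 * M**2 / t[0]**2      # a_0 in (0, 4M^2/t_1^2), i.e., Case 2.1

cum = 0.0
for tm in t:
    a = a * (1.0 - tm**2 * a / (4.0 * M**2))   # equality case of (31)
    cum += tm**2
    assert a <= 4.0 * M**2 / cum + 1e-12       # bound (33)
print("a_m =", a, " bound 4M^2 / sum t_k^2 =", 4.0 * M**2 / cum)
```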

Now we prove Theorem 1.

Proof of Theorem 1

As shown in [5], \(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2}\) can be decomposed as

$$\begin{aligned} \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2} \leq{}& \mathcal {S}_{1}+\mathcal {S}_{2}+ \mathcal {S}_{3} \\ &{}+2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert ^{2}_{n}+ \kappa \frac{k^{\ast}\log n}{n}- \Vert y-Tf_{{ \mathbf{z}},k} \Vert ^{2}_{n}-\kappa \frac{k\log n}{n} \biggr), \end{aligned}$$
(34)

where

$$\begin{aligned} &\mathcal {S}_{1}:= \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}-2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n}+\kappa \frac{k^{\ast}\log n}{n} \biggr), \\ &\mathcal {S}_{2}:=2 \bigl( \Vert y-f_{{\mathbf{z}},k} \Vert ^{2}_{n}- \Vert y-h \Vert ^{2}_{n} \bigr), \\ &\mathcal {S}_{3}:=2 \biggl( \Vert y-h \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n}+ \kappa \frac{k\log n}{n} \biggr), \end{aligned}$$
(35)

and \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\). Note that the last term in (34) is nonpositive by the definition (11) of \(k^{*}\), since \(f_{{\mathbf{z}}}=Tf_{{\mathbf{z}},k^{*}}\); hence it can be dropped in what follows.

We first estimate \(\mathcal {S}_{1}\). To do this, we introduce the set Ω,

$$\begin{aligned} \Omega = \biggl\{ \textbf{z}:\textbf{z}\in Z^{n}, \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2} \geq 2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}+\kappa \frac{k^{\ast}\log n}{n} \biggr) \biggr\} . \end{aligned}$$
(36)

Let \(\operatorname{Prob}(\Omega )\) be the probability that the sample point is a member of the set Ω. Then from \(\vert y\vert \leq B\) and the definition of \(f_{\rho}\) and \(f_{{\mathbf{z}}}\), we have

$$\begin{aligned} E(\mathcal {S}_{1})\leq 6B^{2} \operatorname{Prob}(\Omega ). \end{aligned}$$
(37)

For \(\mathcal {S}_{2}\), applying (22) of Lemma 4.1 with the empirical norm \(\|\cdot \|_{n}\) to the k-step output \(f_{{\mathbf{z}},k}\), we get

$$\begin{aligned} \Vert y-f_{{\mathbf{z}},k} \Vert ^{2}_{n}- \Vert y-h \Vert _{n}^{2}\leq 4 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}^{n}}}{\sum_{i=1}^{k}t_{i}^{2}}, \end{aligned}$$
(38)

where

$$ \mathcal {A}_{1}^{n}(\mathcal {D}):= \biggl\{ h:h= \sum _{i \in \Lambda}c_{i}^{n} \Vert g_{i} \Vert _{n}\frac{g_{i}}{ \Vert g_{i} \Vert _{n}},h\in \mathcal {A}_{1}(\mathcal {D}) \biggr\} $$
(39)

and

$$ \Vert h \Vert _{\mathcal {A}_{1}^{n}(\mathcal {D})}:= \inf_{h} \biggl\{ \sum _{i \in \Lambda} \bigl\vert c_{i}^{n} \bigr\vert \cdot \Vert g_{i} \Vert _{n}, h \in \mathcal {A} _{1}^{n}(\mathcal {D}) \biggr\} . $$
(40)

It has been proved in Lemma 3.4 of [7] that

$$\begin{aligned} E\bigl( \Vert h \Vert _{\mathcal {A}_{1}^{n}}^{2}\bigr)\leq \Vert h \Vert _{\mathcal {A}_{1}}^{2}, \end{aligned}$$
(41)

which implies

$$\begin{aligned} E(\mathcal {S}_{2})\leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}. \end{aligned}$$
(42)

For \(\mathcal {S}_{3}\), from the property of mathematical expectation and (1), we have

$$\begin{aligned} E \bigl( \Vert y-h \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n} \bigr)&=E\bigl( \bigl\vert y-h(x) \bigr\vert ^{2}\bigr)-E\bigl( \bigl\vert y-f_{\rho}(x) \bigr\vert ^{2}\bigr) \\ &=\mathcal {E}(h)-\mathcal {E}(f_{\rho}). \end{aligned}$$
(43)

This together with (3) yields

$$\begin{aligned} E(\mathcal {S}_{3})&=2 \Vert f_{\rho}-h \Vert ^{2}+2\kappa \frac{k\log n}{n}. \end{aligned}$$
(44)

Combining (37), (42), and (44), we obtain

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq 6B^{2}\operatorname{Prob}(\Omega )+8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+2\kappa \frac{k\log n}{n}. \end{aligned}$$
(45)

Next we bound \(\operatorname{Prob}(\Omega )\). To this end, we need the following known result from [10].

Lemma 4.2

Let \(\mathcal {F}\) be a class of functions satisfying \(\vert f\vert \leq B\) for all \(f\in \mathcal {F}\), for some fixed constant B. For all n and \(\alpha ,\beta >0\), we have

$$\begin{aligned} &\operatorname{Prob}\bigl\{ \exists f\in \mathcal {F}: \Vert f-f_{\rho} \Vert _{\rho _{X}}^{2}\geq 2\bigl( \Vert y-f \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}\bigr)+\alpha +\beta \bigr\} \\ &\quad \leq 14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{\beta}{40B}, \mathcal {F},L_{1}(\vec{v}_{\textbf{x}}) \biggr)\exp \biggl(- \frac{\alpha n}{2568B^{4}} \biggr), \end{aligned}$$
(46)

where \({\textbf{x}}=(x_{1},\ldots,x_{n})\in X^{n}\) and \(\mathscr{N}(t,\mathcal {F},L_{1}(\vec{v}_{\textbf{x}}))\) is the covering number for the class \(\mathcal {F}\) by balls of radius t in \(L_{1}(\vec{v}_{\textbf{x}})\), with \(\vec{v}_{\textbf{x}}:=\frac{1}{n}\sum_{i=1}^{n}\delta _{x_{i}}\) the empirical discrete measure.

We define \(\mathcal {G}_{\Lambda}:=\operatorname{Span}\{g:g\in \Lambda \subset \mathcal {D}\}\) and \(\mathcal {F}_{k}:=\bigcup_{\Lambda \subset \mathcal {D}_{m},\#( \Lambda )\leq k} \{Tf:f\in \mathcal {G}_{\Lambda} \}\). Consider the probability

$$ p_{k}=\operatorname{Prob} \biggl\{ \exists f\in \mathcal {F}_{k}: \Vert f-f_{\rho} \Vert ^{2} \geq 2 \biggl( \Vert y-f \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}+\kappa \frac{k\log n}{n} \biggr) \biggr\} . $$

Applying Lemma 4.2 to \(\mathcal {F}_{k}\) with \(\alpha =\kappa \frac{k\log n}{n}\), \(\beta =\frac{1}{n}\), and \(\kappa >1\), we get

$$\begin{aligned} p_{k}&\leq 14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{1}{40Bn}, \mathcal {F}_{k},L_{1}( \vec{v}_{\textbf{x}}) \biggr)\exp \biggl(-\kappa \frac{k\log n}{2568B^{4}} \biggr) \\ &=14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{1}{40Bn},\mathcal {F}_{k},L_{1}( \vec{v}_{\textbf{x}}) \biggr)n^{-\frac{\kappa k}{2568B^{4}}}. \end{aligned}$$
(47)

Lemma 3.3 of [7] provides an upper bound for \(\mathscr{N}(t,\mathcal {F}_{k},L_{1}(\vec{v}_{\textbf{x}}))\), which implies

$$\begin{aligned} p_{k}&\leq Cn^{ak}n^{2(k+1)}n^{-\frac{\kappa k}{2568B^{4}}}. \end{aligned}$$
(48)

Let \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\). Then the exponent in the above inequality is at most \(ak+2(k+1)-(a+5)k=2-3k\), which yields

$$\begin{aligned} p_{k}\leq Cn^{-3k+2}. \end{aligned}$$
(49)

Summing over k and using \(\sum_{k\geq 1}n^{2-3k}\leq n^{-1}\sum_{k\geq 1}n^{-3(k-1)}\leq 2n^{-1}\) for \(n\geq 2\), we have

$$\begin{aligned} \operatorname{Prob}(\Omega )\leq \sum _{1\leq k\leq \frac{Bn}{\kappa}}p_{k}\leq \frac{C}{n}. \end{aligned}$$
(50)

By substituting the bound (50) of \(\operatorname{Prob}(\Omega )\) into (45), we get

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+C\frac{k\log n}{n}. \end{aligned}$$
(51)

Next we derive the K-functional form (18) of the upper bound (51). Since the variance of \(\Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \) is nonnegative, we have

$$\begin{aligned} \bigl(E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\bigr)^{2}\leq E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr). \end{aligned}$$
(52)

Combining (51) with (52), we have

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr) \leq & \sqrt { 8 \frac{ \Vert h \Vert ^{2} _{\mathcal {A}_{1} }}{\sum_{i=1}^{k}t_{i}^{2}} +2 \Vert f_{\rho}-h \Vert ^{2} +C\frac{k\log n}{n} } \\ \leq & 2 \biggl( \frac{2 \Vert h \Vert _{\mathcal {A}_{1} }}{(\sum_{i=1}^{k}t_{i}^{2})^{1/2}} + \Vert f_{\rho}-h \Vert \biggr)+C\frac{k\log n}{n} \\ \leq & 2K \Biggl(f_{\rho},2 \Biggl(\sum_{i=1}^{k}t_{i}^{2} \Biggr)^{-1/2} \Biggr)+C\frac{k\log n}{n}. \end{aligned}$$
(53)

This completes the proof of Theorem 1.  □

5 Convergence rate and universal consistency

In this section, we analyze Theorem 2 under two different prior assumptions on \(f_{\rho}\). We begin with the definitions of \(\mathcal {A}_{1}(\mathcal {D}_{m})\), \(\mathcal {A}_{1,r}\), and \(\mathcal {B}_{p,r}\).

We define the space \(\mathcal {A}_{1}(\mathcal {D}_{m})\) to be the space \(\operatorname{Span}\{\mathcal {D}_{m}\}\) with the norm \(\|\cdot \|_{\mathcal {A}_{1}(\mathcal {D}_{m})}\) defined by (15). Note that now \(\mathcal {D}\) is replaced by \(\mathcal {D}_{m}\).

For \(r>0\), we then introduce the space

$$\begin{aligned} \mathcal {A}_{1,r}=\bigl\{ f:\forall m, \exists h=h(m)\in \operatorname{Span}\{ \mathcal {D}_{m}\}, \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq C, \Vert f-h \Vert \leq Cm^{-r}\bigr\} , \end{aligned}$$
(54)

where the norm \(\|f\|_{\mathcal {A}_{1,r}}\) is defined as the smallest constant C such that (54) holds.

Furthermore, we present the following space:

$$ \mathcal {B}_{p,r}:=[\mathcal {H},\mathcal {A}_{1,r}]_{\theta ,\infty},\quad 0< \theta < 1, $$
(55)

with \(\frac{1}{p}=\frac{1+\theta}{2}\). From the definition of interpolation spaces in [26], we know that \(f\in [\mathcal {H},\mathcal {A}_{1,r}]_{\theta ,\infty}\) if and only if for any \(t>0\),

$$ K(f,t,\mathcal {H},\mathcal {A}_{1,r}):=\inf _{h\in \mathcal {A}_{1,r}} \bigl\{ \Vert f-h \Vert _{\mathcal {H}}+t \Vert h \Vert _{\mathcal {A}_{1,r}}\bigr\} \leq Ct^{\theta}. $$
(56)

The smallest constant C such that (56) holds is taken as the norm \(\|f\|_{\mathcal {B}_{p,r}}\).

Now we first consider \(f_{\rho}\in \mathcal {A}_{1,r}\).

Corollary 5.1

Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {A}_{1,r}\) with \(r>\frac{1}{2a}\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\bigl(1+ \Vert f_{\rho} \Vert _{\mathcal {A}_{1,r}}\bigr)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}. $$
(57)

Proof

From the definition of \(\mathcal {A}_{1,r}\), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq M $$

and

$$ \Vert f_{\rho}-h \Vert \leq Mm^{-r}, $$

where \(M:=\|f_{\rho}\|_{\mathcal {A}_{1,r}}\).

Theorem 2 thus implies

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\min_{k>0} \biggl( \frac{M^{2}}{kt_{0}^{2}}+M^{2}n^{-2ar}+\frac{k\log n}{n} \biggr). \end{aligned}$$
(58)

Moreover, since \(r>\frac{1}{2a}\) (and a can be taken arbitrarily large), the term \(M^{2}n^{-2ar}\) in (58) is negligible compared with the remaining two terms and can be removed. To balance these two terms, we take \(k:= \lceil ((M+1)^{2}t_{0}^{-2}\frac{n}{\log n} )^{\frac{1}{2}} \rceil \). Then the desired result (57) can be obtained. □
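For the reader's convenience, we record the elementary computation behind this choice of k: since \(k\geq (M+1)t_{0}^{-1}(n/\log n)^{\frac{1}{2}}\) and \(t_{0}\leq 1\),

$$ \frac{M^{2}}{kt_{0}^{2}}\leq \frac{M^{2}}{(M+1)t_{0}} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}\leq (1+M)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}} \quad \text{and}\quad \frac{k\log n}{n}\leq C(1+M)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}, $$

which together give the bound (57).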

Next we consider \(f_{\rho}\in \mathcal {B}_{p,r}\).

Corollary 5.2

Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {B}_{p,r}\) with \(r>\frac{1}{2a}\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C t_{0}^{-p}\bigl(1+ \Vert f_{\rho} \Vert _{ \mathcal {B}_{p,r}}\bigr)^{p} \biggl( \frac{n}{\log n} \biggr)^{-1+\frac{p}{2}}.$$
(59)

Proof

By (56), if \(f\in \mathcal {B}_{p,r}\), then for any \(t>0\), we can find a function \(\tilde{f}\in \mathcal {A}_{1,r}\) that satisfies

$$ \Vert \tilde{f} \Vert _{\mathcal {A}_{1,r}}\leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{ \theta -1} $$
(60)

and

$$ \Vert f-\tilde{f} \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{\theta}. $$
(61)

For \(\tilde{f}\in \mathcal {A}_{1,r}\), according to (54), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq \Vert \tilde{f} \Vert _{ \mathcal {A}_{1,r}} $$
(62)

and

$$ \Vert \tilde{f}-h \Vert \leq \Vert \tilde{f} \Vert _{\mathcal {A}_{1,r}}m^{-r}. $$
(63)

The relations (60), (62), and (63) imply

$$\begin{aligned} \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{ \theta -1} \end{aligned}$$
(64)

and

$$\begin{aligned} \Vert \tilde{f}-h \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{\theta -1}m^{-r}. \end{aligned}$$
(65)

Then combining (61) with (65), we obtain

$$ \Vert f-h \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}} \bigl(t^{\theta}+t^{\theta -1}m^{-r}\bigr). $$
(66)

From (64) and (66), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m and \(t>0\) that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq Mt^{\theta -1} $$

and

$$ \Vert f_{\rho}-h \Vert \leq M\bigl(t^{\theta}+t^{\theta -1}m^{-r} \bigr), $$

where \(M=\|f_{\rho}\|_{\mathcal {B}_{p,r}}\).

Therefore, Theorem 2 with \(t=k^{-\frac{1}{2}}\) implies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\min_{k>0} \biggl(M^{2}t_{0}^{-2}k^{1- \frac{2}{p}}+M^{2} \bigl(k^{\frac{1}{2}-\frac{1}{p}}+k^{1-\frac{1}{p}}n^{-ar} \bigr)^{2}+ \frac{k\log n}{n} \biggr). $$
(67)

The condition \(2ar\geq 1\) also enables us to eliminate the term involving \(n^{-ar}\). Then, by taking \(k:= \lceil ((M+1)^{2}t_{0}^{-2}\frac{n}{\log n} )^{\frac{p}{2}} \rceil \) in (67), we obtain the desired result (59). □

Then we show the universal consistency of the WRPGLA.

Theorem 3

Under the assumptions of Theorem 2, if the dictionary \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), then for any \(f_{\rho}\) we have

$$ \lim_{n\rightarrow +\infty}E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)=0. $$
(68)

Proof

Since \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), for any \(\varepsilon >0\) we can find, for all n (and hence \(m=\lfloor n^{a}\rfloor \)) large enough, a function \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\) satisfying \(\|f_{\rho}-h\|\leq \varepsilon \). It follows from Theorem 2 that

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq C\min_{k>0} \biggl( \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{kt_{0}^{2}}+ \varepsilon ^{2}+\frac{k\log n}{n} \biggr). $$
(69)

To balance the first and third error terms, we choose \(k:=\lceil n^{\frac{1}{2}}t_{0}^{-1}\rceil \), which implies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq C\bigl(\varepsilon ^{2}+t_{0}^{-1}n^{- \frac{1}{2}} \log n\bigr). $$
(70)

Thus, for n sufficiently large,

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 2C\varepsilon ^{2}. $$
(71)

This completes the proof of Theorem 3. □

Remark 3

It is known from [11] that the OGA and the RGA can achieve the optimal convergence rate \(\mathcal {O}(m^{-\frac{1}{2}})\) on \(\mathcal {A}_{1}(\mathcal {D})\). When \(t_{k}=1\), Lemma 4.1 shows that the WRPGA also attains this best rate. Meanwhile, we compare the WRPGLA with the OGLA and the relaxed greedy learning algorithm (RGLA). For \(f_{\rho}\in \mathcal {A}_{1,r}\), we derive the same convergence rate \(\mathcal {O}((n/\log n)^{-1/2})\) for the WRPGLA as that of the OGLA and the RGLA in Ref. [7]. For \(f_{\rho}\in \mathcal {B}_{p,r}\), as \(p\rightarrow 1\), the rate \(\mathcal {O}((n/\log n)^{-1+\frac{p}{2}})\) of the WRPGLA can be arbitrarily close to \(\mathcal {O}((n/\log n)^{-1/2})\).

Moreover, from the viewpoint of computational complexity, for the WRPGLA the approximant \(f_{k}\) is constructed by solving a one-dimensional optimization problem, since \(f_{k}\) is the orthogonal projection of f onto \(\operatorname{Span}\{\hat{f}_{k}\}\). On the other hand, the OGLA is more expensive to implement, since at each step the algorithm requires the evaluation of an orthogonal projection onto a k-dimensional space, and the output is constructed by solving a k-dimensional optimization problem. It is also clear that the WRPGLA is simpler than the RGLA. Thus, the WRPGLA essentially reduces the complexity and accelerates the learning process.

In future research, it would be an interesting project to deduce the error bound of the WRPGLA in Banach spaces with modulus of smoothness \(\rho (u)\leq \gamma u^{q}\), \(1< q\leq 2\), as in [24, 27]. Furthermore, Guo and Ye [28, 29] derived the convergence rates of the moving least-squares learning algorithm for weakly dependent and nonidentical samples. It remains open to explore greedy learning algorithms in the non-i.i.d. and nonidentical sampling setting.