Abstract
We investigate the regression problem in supervised learning by means of the weak rescaled pure greedy algorithm (WRPGA). We construct a learning estimator by applying the WRPGA and derive tight upper bounds of the K-functional error estimate for the corresponding greedy learning algorithms in Hilbert spaces. Satisfactory learning rates are obtained under two prior assumptions on the regression function. Compared with other greedy learning algorithms, the application of the WRPGA in supervised learning considerably reduces the computational cost while maintaining powerful generalization capability.
1 Introduction
The applications of greedy algorithms to supervised learning have sparked great research interest because they offer appealing generalization capability with a lower computational burden than typical regularized methods, particularly in large-scale dictionary learning problems [1–6]. For most traditional learning algorithms, big data sets frequently cause slow performance. To tackle this problem, many researchers [1–3, 7, 8] advocate greedy learning algorithms, which have greatly improved learning performance.
The approximation abilities of greedy-type algorithms for frames or more general dictionaries \(\mathcal {D}\) were investigated in [7, 9–12], along with various applications, see [3, 7, 13–19]. The pure greedy algorithm (PGA) can realize the best bilinear approximation, see [20, 21]. Although the PGA is computationally attractive, it lacks optimal convergence properties for a general dictionary: its convergence rate is slower than that of the best nonlinear approximation [11, 21–23], which degrades its learning performance. To improve the approximation rate, the orthogonal greedy algorithm (OGA), the relaxed greedy algorithm (RGA), the stepwise projection algorithm (SPA), and their weak versions have been proposed. These greedy algorithms all achieve the optimal rate \(\mathcal {O}(m^{-\frac{1}{2}})\), where m is the iteration number, for approximating elements of the class \(\mathcal {A}_{1}(\mathcal {D})\), which will be defined in (14), see [9, 11].
Both the OGA and the RGA have recently been employed successfully in machine learning [1–3, 7, 8]. For example, Barron et al. [7] established the optimal convergence rate \(\mathcal {O}((n/\log n)^{-\frac{1}{2}})\), where n is the sample size. To reduce the OGA’s computational load, Fang et al. [1] investigated the learning performance of the orthogonal super greedy algorithm (OSGA) and derived almost the same rate as the orthogonal greedy learning algorithm (OGLA). All these results demonstrate that each greedy learning algorithm has its advantages and disadvantages.
We study the applications of weak greedy algorithms to least squares regression in supervised learning. It is well known that weak-type algorithms are easier to implement than the usual greedy algorithms, see [12]. Specifically, the weak rescaled pure greedy algorithm (WRPGA), a fairly simple modification of the PGA, is the object of our investigation, see [24, 25]. Compared with the OGA and the RGA, the WRPGA further reduces the computational load. The optimal rate \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the basic sparse class has been proved [24]. Motivated by [24], we use the method employed there for the RPGA to deduce the error bound of the K-functional estimate in the Hilbert space \(\mathcal {H}\) for the WRPGA. The WRPGA is thus a simple greedy algorithm with good approximation ability. Based on this, we propose the weak rescaled pure greedy learning algorithm (WRPGLA) for solving kernel-based regression problems in supervised learning. Using the WRPGA’s approximation result, we derive that the WRPGLA attains almost the same learning rate as the OGLA. Our results show that the WRPGLA further cuts down the computational complexity without sacrificing generalization capability.
The paper is organized as follows. In Sect. 2, we review least squares regression learning theory and the WRPGA. In Sect. 3, we propose the WRPGLA and state the main theorems on the error estimates. Section 4 is devoted to proofs of the main results. We present the convergence rates under two smoothness assumptions on the regression function \(f_{\rho}\) in the last section.
2 Preliminaries
Some preliminaries are presented in this section. Sections 2.1 and 2.2 provide a brief overview of least squares regression learning and the WRPGA, respectively.
2.1 Least squares regression
In this paper, the approximation problem is addressed in the following statistical learning context. Let X be a compact metric space and \(Y=\mathbb{R}\). Let ρ be a Borel probability measure on \(Z= X\times Y \). The generalization error for a function \(f:X\rightarrow Y\) is defined by
which is minimized by the following regression function:
where \(\rho (\cdot \vert x)\) is the conditional distribution induced by ρ at \(x\in X\). In regression learning, ρ is unknown, and what one can know is a set of samples \({\mathbf{z}}=\{z_{i}\}_{i=1}^{n}=\{(x_{i}, y_{i})\}_{i=1}^{n} \in Z^{n}\) that are drawn independently and identically according to ρ. The goal of learning is to find a good approximation \(f_{{\mathbf{z}}}\) of \(f_{\rho}\), which minimizes the empirical error
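The displayed formula defining the empirical error is omitted above; a minimal sketch, assuming the standard least squares form \(\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2}\):

```python
def empirical_error(f, samples):
    """Empirical least squares error (1/n) * sum_i (f(x_i) - y_i)^2.

    samples: list of (x_i, y_i) pairs drawn i.i.d. according to rho.
    """
    n = len(samples)
    return sum((f(x) - y) ** 2 for x, y in samples) / n
```

The learning goal is then to produce an estimator with small expected excess risk over this empirical criterion.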
Denote the Hilbert space of the square integrable functions defined on X with respect to the measure \(\rho _{X}\) by \(L_{\rho _{X}} ^{2}(X)\), where \(\rho _{X}\) is the marginal measure of ρ on X. It is clear from the definition of \(f_{\rho}(x)\) that for each \(x\in X\), \(\int _{Y}(f_{\rho}(x)-y)\,d\rho (y\vert x)=0\). For any \(f\in L_{\rho _{X}} ^{2}(X)\), it holds that
Therefore,
with the norm \(\|\cdot \|\)
The prediction accuracy of learning algorithms is measured by \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\).
We will assume that \(\vert y\vert \leq B\) almost surely for some positive real number \(B<\infty \). In this paper, we construct the learning estimator \(f_{{\mathbf{z}}}\) by applying the WRPGA and estimate \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\). So, in the following subsection, we recall this algorithm.
2.2 Weak rescaled pure greedy algorithm
We shall restrict our analysis to the situation in which approximation takes place in a real, separable Hilbert space \(\mathcal {H}\) with the inner product \(\langle \cdot ,\cdot \rangle _{\mathcal {H}}\) and the norm \(\|\cdot \|:=\|\cdot \|_{\mathcal {H}}=\langle \cdot ,\cdot \rangle _{ \mathcal {H}}^{\frac{1}{2}}\). Let \(\mathcal {D}\subset \mathcal {H}\) be a given dictionary satisfying \(\|g\|=1\) for every \(g\in \mathcal {D}\), \(g\in \mathcal {D}\) implies \(-g\in \mathcal {D}\) and \(\overline{\operatorname{Span}(\mathcal {D})}=\mathcal {H}\).
Petrova developed the rescaled pure greedy algorithm (RPGA) to enhance the PGA’s convergence rate, which simply rescales \(f_{m}\) at the mth greedy step, see [24]. We begin by describing the weak rescaled pure greedy algorithm (WRPGA) also introduced by Petrova in [24].
\({\mathbf{{WRPGA}}}(\{t_{m}\},\mathcal {D})\):
Step 0: Let \(f_{0} :=0\).
Step m (\(m \geq 1\)):
(1) If \(f=f_{m-1}\), then terminate the iterative process and define \(f_{k}=f_{m-1}=f\) for \(k \geq m\).
(2) If \(f\neq f_{m-1}\), then choose a direction \(\varphi _{m}\in \mathcal {D}\) such that
where \(\{t_{m}\}_{m=1}^{\infty}\) is a weakness sequence and \(t_{m}\in (0,1]\).
Let
The mth-step approximation \(f_{m}\) is defined as
and proceed to Step \(m+1\).
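The steps above can be sketched numerically. The displayed update formulas are omitted in the text, so this sketch reconstructs them from [24] as an assumption: the weak selection \(\vert \langle f-f_{m-1},\varphi _{m}\rangle \vert \geq t_{m}\sup_{\varphi \in \mathcal {D}}\vert \langle f-f_{m-1},\varphi \rangle \vert \), the PGA-style update \(\hat{f}_{m}=f_{m-1}+\langle f-f_{m-1},\varphi _{m}\rangle \varphi _{m}\), and the rescaling \(f_{m}=s_{m}\hat{f}_{m}\) with \(s_{m}=\langle f,\hat{f}_{m}\rangle /\|\hat{f}_{m}\|^{2}\), i.e., the orthogonal projection of f onto \(\operatorname{Span}\{\hat{f}_{m}\}\). Here \(\mathcal {H}=\mathbb{R}^{d}\) with the Euclidean inner product:

```python
import numpy as np

def wrpga(f, dictionary, weakness, steps):
    """Weak rescaled pure greedy algorithm in R^d (Euclidean inner product).

    dictionary: finite list of unit-norm vectors (a stand-in for D);
    weakness:   sequence t_m in (0, 1]; steps: number of greedy iterations.
    """
    f = np.asarray(f, dtype=float)
    fm = np.zeros_like(f)                       # Step 0: f_0 = 0
    for m in range(steps):
        r = f - fm                              # current residual f - f_{m-1}
        if not r.any():                         # f = f_{m-1}: terminate
            break
        inner = [abs(np.dot(r, g)) for g in dictionary]
        best = max(inner)
        # weak greedy selection: any phi with |<r, phi>| >= t_m * sup
        idx = next(i for i, v in enumerate(inner) if v >= weakness[m] * best)
        phi = dictionary[idx]
        f_hat = fm + np.dot(r, phi) * phi       # PGA-style intermediate update
        s = np.dot(f, f_hat) / np.dot(f_hat, f_hat)  # rescaling coefficient s_m
        fm = s * f_hat                          # f_m = s_m * f_hat_m
    return fm
```

With the canonical basis of \(\mathbb{R}^{2}\) as dictionary and \(t_{m}=1\), two steps recover any target exactly, illustrating why the rescaling costs only one extra scalar per step.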
Remark 1
When \(t_{m}=1\), this algorithm is the RPGA. Note that if the supremum is not attained, one can select \(t_{m}<1\) and proceed with the algorithm; in this case, it is easier to choose \(\varphi _{m}\). If the output at the mth greedy step were \(\hat{f}_{m}\) rather than \(f_{m}=s_{m}{\hat{f}_{m}}\), this would be the PGA. The WRPGA uses \(s_{m}{\hat{f}_{m}}\), which is simply a suitable rescaling of \({\hat{f}_{m}}\), and thereby improves the rate to \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the closure of the convex hull of \(\mathcal {D}\).
3 Weak rescaled pure greedy learning
We shall provide the WRPGLA for regression. From the definition of the WRPGA, computing \(\sup_{\varphi \in \mathcal {D}} \vert \langle f-f_{m-1},\varphi \rangle \vert \) may be computationally difficult. Therefore we compute only over a truncation of the dictionary, that is, a finite subset of \(\mathcal {D}\). Let \(\mathcal {D}_{1}\subset \mathcal {D}_{2}\subset \cdots \subset \mathcal {D}\), where \(\mathcal {D}_{m}\) is the truncation of \(\mathcal {D}\) with cardinality \(\#(\mathcal {D}_{m})=m\). Here we assume that
Then the WRPGLA is defined by the following simple processes.
WRPGLA:
Step 1: We apply the WRPGA for \(\mathcal {D}_{m}\) to the function \(y(x_{i})=y_{i}\) by utilizing the norm \(\|\cdot \|_{n}\) associated with the empirical inner product, that is,
Step 2: The algorithm produces the approximation \(f_{{\mathbf{z}},k}:=f_{k}\) to the data at the kth greedy step. Then we define our estimator as \(f_{{\mathbf{z}}}:=Tf_{{\mathbf{z}},k^{*}}\), where \(Tu:=T_{B}u:=\min \{B,\vert u\vert \}\operatorname{sgn}(u)\) and
where the constant \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\), which will be discussed in the proof of Theorem 1.
Remark 2
First, when \(k=0\), it follows from \(f_{0}=0\) and \(\vert y\vert \leq B\) that \(\kappa \frac{k^{*}\log n}{n}\leq B^{2}\). This implies that \(k^{*}\) is not larger than \(\frac{B^{2}n}{\kappa}\). Second, from the definition of the estimator, we further observe that the computational cost of the kth greedy step is less than \(O(n^{a})\). The WRPGA only requires the additional computation of \(s_{m}\).
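Steps 1–2 amount to running the WRPGA under the empirical norm and then selecting a stopping index by a complexity penalty. A sketch of the selection rule, assuming (as Remark 2 suggests, since the display defining \(k^{*}\) is omitted) that \(k^{*}\) minimizes the penalized empirical error \(e_{{\mathbf{z}}}(k)+\kappa \frac{k\log n}{n}\); the helper names `select_k` and `truncate` are hypothetical:

```python
import math

def select_k(errors, kappa, n):
    """Choose k* minimizing errors[k] + kappa * k * log(n) / n.

    errors[k] is the empirical error of the k-th WRPGA iterate,
    with errors[0] corresponding to f_0 = 0.
    """
    penalized = [e + kappa * k * math.log(n) / n for k, e in enumerate(errors)]
    return min(range(len(errors)), key=penalized.__getitem__)

def truncate(u, B):
    """Truncation operator T = T_B: clip a prediction to [-B, B]."""
    return max(-B, min(B, u))
```

Because the empirical errors are nonincreasing while the penalty grows linearly in k, the selected \(k^{*}\) stops the iteration once further greedy steps no longer pay for their complexity.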
To discuss the approximation properties of WRPGLA, we introduce the class of functions
and
Then
and
We also use the following K-functional:
Since all the constants in this work depend at most on \(\kappa _{0}\), B, and a, we denote all of them by C for simplicity of notation. Now we take \(\mathcal {H}=L_{\rho _{X}} ^{2}(X)\) with the norm defined by (4).
Then, we provide our main results on the generalization error bounds for the WRPGLA.
Theorem 1
There exists \(\kappa _{0}\) depending only on B and a such that if \(\kappa \geq \kappa _{0}\), then for all \(k>0\) and \(h\in \operatorname{Span}(\mathcal {D}_{m})\), the learning estimator obtained by applying the WRPGA satisfies
Furthermore, we have
Applying Theorem 1 with \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), we get the following theorem.
Theorem 2
Under the assumptions of Theorem 1, if \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), then we have
Furthermore, we have
4 Proofs of the main results
To prove Theorem 1, we establish a lemma on the upper error bound for the WRPGA.
Lemma 4.1
If \(f\in \mathcal {H}\), \(h\in \mathcal {A}_{1}(\mathcal {D})\), then the output \((f_{m})_{m\geq 0}\) of the WRPGA satisfies
Proof
By the definition of the K-functional, it suffices to prove that for \(f\in \mathcal {H}\) and \(h\in \mathcal {A}_{1}(\mathcal {D})\),
Since \(\mathcal {A}_{1}^{0}(\mathcal {D},M)\) is dense in \(\mathcal {A}_{1}(\mathcal {D},M)\), it suffices to prove (22) for functions h that are finite sums \(\sum_{j}c_{j}\varphi _{j}\) with \(\sum_{j}\vert c_{j}\vert \leq M\). We fix \(\epsilon >0\) and select a representation for \(h=\sum_{\varphi \in \mathcal {D}}c_{\varphi}\varphi \), such that
Denote
Since \(\{e_{m}\}_{m=0}^{\infty}\) is nonincreasing, \(\{a_{m}\}_{m=0}^{\infty}\) is also a nonincreasing sequence.
Then we discuss these two cases separately.
Case 1: \(a_{0}:=\|f\|^{2}-\|f-h\|^{2}\leq 0\). Then, for every \(m\geq 1\), we have \(a_{m}\leq 0\). Therefore inequality (22) holds true.
Case 2: \(a_{0}>0\). Assume that \(a_{m-1}>0\), \(m\geq 1\). Since \(f_{m}\) is the orthogonal projection of f onto the linear space spanned by \(\hat{f}_{m}\), we have
This together with the selection of \(\varphi _{m}\) implies
By (23), we get
Let \(\epsilon \rightarrow 0\). Therefore
It has been proved in [24] that
Then, using the assumption that \(a_{m-1}>0\), we have
It yields
In particular, for \(m=1\), we have
Case 2.1: \(0< a_{0}<\frac{4M^{2}}{t_{1}^{2}}\). Since \(\psi (t):=t (1-\frac{t_{1}^{2}t}{4M^{2}} )\) on \((0,\frac{4M^{2}}{t_{1}^{2}} )\) has maximum \(\frac{M^{2}}{t_{1}^{2}}\), it follows that
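The stated maximum of \(\psi \) can be checked directly:

```latex
\psi'(t) = 1 - \frac{t_{1}^{2} t}{2M^{2}} = 0
\quad \Longrightarrow \quad
t^{*} = \frac{2M^{2}}{t_{1}^{2}},
\qquad
\psi(t^{*}) = \frac{2M^{2}}{t_{1}^{2}} \Bigl( 1 - \frac{1}{2} \Bigr)
            = \frac{M^{2}}{t_{1}^{2}}.
```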
Therefore, either all \(a_{m}\in (0,\frac{4M^{2}}{t_{1}^{2}})\), \(m\geq 0\), and then satisfy (31), or \(a_{m^{\ast}}\leq 0\) for some \(m^{\ast}\geq 1\), in which case the analysis for \(m\geq m^{\ast}\) is the same as in Case 1. For the positive elements of \(\{a_{m}\}_{m=0}^{\infty}\), applying Lemma 2.2 from [24] with \(l=1\), \(r_{m}=t_{m}^{2}\), \(B=\frac{4M^{2}}{t_{1}^{2}}\), \(J=0\), and \(r=4M^{2}\), we obtain
which gives inequality (22).
Case 2.2: \(a_{0}\geq \frac{4M^{2}}{t_{1}^{2}}\). It follows from (32) that \(a_{1}<0\). That is, \(e_{1}^{2}<\|f-h\|^{2}\), which yields (22) due to monotonicity. Lemma 4.1 is proved. □
Now we prove Theorem 1.
Proof of Theorem 1
As shown in [5], \(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2}\) can be decomposed as
where
and \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\).
We first estimate the bound of \(\mathcal {S}_{1}\). To do this, we introduce the set Ω,
Let \(\operatorname{Prob}(\Omega )\) be the probability that the sample point is a member of the set Ω. Then from \(\vert y\vert \leq B\) and the definition of \(f_{\rho}\) and \(f_{{\mathbf{z}}}\), we have
For \(\mathcal {S}_{2}\), according to Lemma 4.1, we get
where
and
It has been proved in Lemma 3.4 of [7] that
which implies
For \(\mathcal {S}_{3}\), from the property of mathematical expectation and (1), we have
This together with (3) yields
Combining (37), (42), and (44), we obtain
Next we bound \(\operatorname{Prob}(\Omega )\). To this end, we need the following known result in [10].
Lemma 4.2
Let \(\mathcal {F}\) be the class of functions \(\mathcal {F}=\{\vert f\vert \leq B\}\) for some fixed constant B. For all n and \(\alpha ,\beta >0\), we have
where \({\textbf{x}}=(x_{1},\ldots,x_{n})\in X^{n}\) and \(\mathcal {N}(t,\mathcal {F},L_{1}(\vec{v}_{\textbf{x}}))\) is the covering number for the class \(\mathcal {F}\) by balls of radius t in \(L_{1}(\vec{v}_{\textbf{x}})\), with \(\vec{v}_{\textbf{x}}:=\frac{1}{n}\sum_{i=1}^{n}\delta _{x_{i}}\) the empirical discrete measure.
We define \(\mathcal {G}_{\Lambda}:=\operatorname{Span}\{g:g\in \Lambda \subset \mathcal {D}\}\) and \(\mathcal {F}_{k}:=\bigcup_{\Lambda \subset \mathcal {D}_{m},\#( \Lambda )\leq k} \{Tf:f\in \mathcal {G}_{\Lambda} \}\). Consider the probability
Applying Lemma 4.2 to \(\mathcal {F}_{k}\) with \(\alpha =\kappa \frac{k\log n}{n}\), \(\beta =\frac{1}{n}\), and \(\kappa >1\), we get
Lemma 3.3 of [7] provides the upper bound for \(\mathcal {N}(t,\mathcal {F}_{k},L_{1}(\vec{v}_{\textbf{x}}))\), which implies
Let \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\). Then the above inequality yields
So we have
By substituting the bound (50) of \(\operatorname{Prob}(\Omega )\) into (45), we get
Next we derive the K-functional form of the upper bound (51). It is known from the properties of the variance that
Combining (51) with (52), we have
This completes the proof of Theorem 1. □
5 Convergence rate and universal consistency
In this section, we analyze Theorem 2 under two different prior assumptions on \(f_{\rho}\). We begin with the definitions of \(\mathcal {A}_{1}(\mathcal {D}_{m})\), \(\mathcal {A}_{1,r}\), and \(\mathcal {B}_{p,r}\).
We define the space \(\mathcal {A}_{1}(\mathcal {D}_{m})\) to be the space \(\operatorname{Span}\{\mathcal {D}_{m}\}\) with the norm \(\|\cdot \|_{\mathcal {A}_{1}(\mathcal {D}_{m})}\) defined by (15). Note that now \(\mathcal {D}\) is replaced by \(\mathcal {D}_{m}\).
For \(r>0\), we then introduce the space
where \(\|\cdot \|_{\mathcal {A}_{1,r}}\) is the minimum value of C such that (54) holds.
Furthermore, we present the following space:
with \(\frac{1}{p}=\frac{1+\theta}{2}\). From the definition of interpolation spaces in [26], we know that \(f\in [\mathcal {H},\mathcal {A}_{1,r}]_{\theta ,\infty}\) if and only if for any \(t>0\),
The minimum C such that (56) holds true is defined as the norm on \(\mathcal {B}_{p,r}\).
Now we first consider \(f_{\rho}\in \mathcal {A}_{1,r}\).
Corollary 5.1
Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {A}_{1,r}\) with \(r>\frac{1}{2a}\), then we have
Proof
From the definition of \(\mathcal {A}_{1,r}\), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies
and
where \(M:=\|f_{\rho}\|_{\mathcal {A}_{1,r}}\).
Theorem 2 thus implies
Moreover, the mild restriction \(2ar\geq 1\) with a arbitrarily large allows us to remove the term \(M^{2}n^{-2ar}\) in (58). To balance the errors in (58), we take \(k:= \lceil (\frac{(M+1)^{2}}{t_{0}^{2}}\frac{n}{\log n} )^{\frac{1}{2}} \rceil \). Then the desired result (57) can be obtained. □
Next we consider \(f_{\rho}\in \mathcal {B}_{p,r}\).
Corollary 5.2
Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {B}_{p,r}\) with \(r>\frac{1}{2a}\), then we have
Proof
By (56), if \(f\in \mathcal {B}_{p,r}\), then for any \(t>0\), we can find a function \(\tilde{f}\in \mathcal {A}_{1,r}\) that satisfies
and
For \(\tilde{f}\in \mathcal {A}_{1,r}\), according to (54), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies
and
The relations (60), (62), and (63) imply
and
Then combining (61) with (65), we obtain
From (64) and (66), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m and \(t>0\) that satisfies
and
where \(M=\|f_{\rho}\|_{\mathcal {B}_{p,r}}\).
Therefore, Theorem 2 with \(t=k^{-\frac{1}{2}}\) implies
The condition \(2ar\geq 1\) also enables us to eliminate the term involving \(n^{-ar}\). Then, by taking \(k:= \lceil (\frac{(M+1)^{2}}{t_{0}^{2}}\frac{n}{\log n} )^{\frac{p}{2}} \rceil \) in (67), we obtain the desired result (59). □
Then we show the universal consistency of the WRPGLA.
Theorem 3
Under the assumptions of Theorem 2, if the dictionary \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), for any \(f_{\rho}\), we have
Proof
Since \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), for any \(\varepsilon >0\) and sufficiently large n, we can find \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\) satisfying \(\|f_{\rho}-h\|\leq \varepsilon \). It follows from Theorem 2 that
To balance the first and third error term, we choose \(k:=n^{\frac{1}{2}}t_{0}^{-1}\), which implies
Thus, for n sufficiently large,
This completes the proof of Theorem 3. □
Remark 3
It is known from [11] that the OGA and the RGA can achieve the optimal convergence rate \(\mathcal {O}(m^{-\frac{1}{2}})\) on \(\mathcal {A}_{1}(\mathcal {D})\). When \(t_{k}=1\), Lemma 4.1 shows that the WRPGA also attains this optimal rate. Meanwhile, we compare the WRPGLA with the OGLA and the relaxed greedy learning algorithm (RGLA). For \(f_{\rho}\in \mathcal {A}_{1,r}\), we derive the same convergence rate \(\mathcal {O}((n\log n)^{-1/2})\) for the WRPGLA as that of the OGLA and the RGLA in [7]. For \(f_{\rho}\in \mathcal {B}_{p,r}\), when \(p\rightarrow 1\), the rate \(\mathcal {O}((n\log n)^{-1+\frac{p}{2}})\) of the WRPGLA can be arbitrarily close to \(\mathcal {O}((n\log n)^{-1/2})\).
Moreover, from the viewpoint of computational complexity, for the WRPGLA the approximant \(f_{k}\) is constructed by solving a one-dimensional optimization problem, since \(f_{k}\) is an orthogonal projection of f onto \(\operatorname{Span}\{\hat{f}_{k}\}\). The OGLA, in contrast, is more expensive to implement since at each step it requires the evaluation of an orthogonal projection onto a k-dimensional space, and the output is constructed by solving a k-dimensional optimization problem. It is also clear that the WRPGLA is simpler than the RGLA. Thus, the WRPGLA substantially reduces the complexity and accelerates the learning process.
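The cost gap can be made concrete: per step the (W)RPGA needs only the scalar projection coefficient onto \(\operatorname{Span}\{\hat{f}_{k}\}\), whereas the OGA must solve a k-variable least squares system over the selected atoms. A sketch in \(\mathbb{R}^{d}\) (an illustration of the two projections, not the authors' implementation):

```python
import numpy as np

def rpga_projection(f, f_hat):
    """RPGA step: one-dimensional projection of f onto span{f_hat}, O(d) work."""
    return np.dot(f, f_hat) / np.dot(f_hat, f_hat) * f_hat

def oga_projection(f, atoms):
    """OGA step: projection of f onto span of the k selected atoms.

    Solving the k x k normal equations costs O(k^2 d + k^3) per step.
    """
    A = np.column_stack(atoms)                  # d x k matrix of selected atoms
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)
    return A @ coef
```

The one-dimensional projection is a single inner-product ratio regardless of the iteration count, which is the source of the WRPGLA's per-step savings.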
In future research, it would be interesting to deduce the error bound of the WRPGLA in Banach spaces with modulus of smoothness \(\rho (u)\leq \gamma u^{q}\), \(1< q\leq 2\), as in [24, 27]. Furthermore, Guo and Ye [28, 29] derived the convergence rates of the moving least-squares learning algorithm for weakly dependent and nonidentical samples. It remains open to explore greedy learning algorithms in the non-i.i.d. and nonidentical sampling setting.
Data availability
All data, models, and code generated or used during the study appear in the submitted article.
References
Fang, J., Lin, S.B., Xu, Z.B.: Learning and approximation capabilities of orthogonal super greedy algorithm. Knowl.-Based Syst. 95, 86–98 (2016)
Chen, H., Li, L.Q., Pan, Z.B.: Learning rates of multi-kernel regression by orthogonal greedy algorithm. J. Stat. Plan. Inference 143, 276–282 (2013)
Lin, S.B., Rong, Y.H., Sun, X.P., Xu, Z.B.: Learning capability of relaxed greedy algorithms. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1598–1608 (2013)
Xu, L., Lin, S.B., Xu, Z.B.: Learning capability of the truncated greedy algorithm. Sci. China Inf. Sci. 59(5), 052103 (2016). https://doi.org/10.1007/s11432-016-5536-6
Barron, A.R.: Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930–945 (1993)
Guo, Q.: Distributed semi-supervised regression learning with coefficient regularization. Results Math. 77, 1–19 (2022)
Barron, A.R., Cohen, A., Dahmen, W., DeVore, R.A.: Approximation and learning by greedy algorithms. Ann. Stat. 36(1), 64–94 (2008)
Xu, L., Lin, S.B., Zeng, J.S., Liu, X., Xu, Z.B.: Greedy criterion in orthogonal greedy learning. IEEE Trans. Cybern. 48(3), 955–966 (2018)
Jones, L.K.: A simple lemma on greedy approximation in Hilbert spaces and convergence rates for projection pursuit regression and neural network training. Ann. Stat. 20(1), 608–613 (1992)
Lee, W.S., Bartlett, P.L., Williamson, R.C.: Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theory 42(6), 2118–2132 (1996)
DeVore, R.A., Temlyakov, V.N.: Some remarks on greedy algorithms. Adv. Comput. Math. 5, 173–187 (1996)
Temlyakov, V.N.: Greedy Approximation. Cambridge University Press, Cambridge (2011)
Dai, W., Milenkovic, O.: Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory 55(5), 2230–2249 (2009)
Kunis, S., Rauhut, H.: Random sampling of sparse trigonometric polynomials, II: Orthogonal matching pursuit versus basis pursuit. Found. Comput. Math. 8, 737–763 (2008)
Donoho, D.L., Tsaig, Y., Drori, I., Starck, J.L.: Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Trans. Inf. Theory 58(2), 1094–1121 (2012)
Tropp, J.A., Wright, S.: Computational methods for sparse solution of linear inverse problems. Proc. IEEE 98(6), 948–958 (2010)
Temlyakov, V.N., Zheltov, P.: On performance of greedy algorithms. J. Approx. Theory 163(9), 1134–1145 (2011)
Donoho, D.L., Elad, M., Temlyakov, V.N.: On Lebesgue-type inequalities for greedy approximation. J. Approx. Theory 147(2), 185–195 (2007)
Chen, H., Zhou, Y.C., Tang, Y.Y., Li, L.Q., Pan, Z.B.: Convergence rate of the semi-supervised greedy algorithm. Neural Netw. 44, 44–50 (2013)
Schmidt, E.: Zur Theorie der linearen und nichtlinearen Integralgleichungen. Zweite Abhandlung. Math. Ann. 64, 161–174 (1907)
Temlyakov, V.N.: Greedy approximation. Acta Numer. 17, 235–409 (2008)
Livshitz, E.D., Temlyakov, V.N.: Two lower estimates in greedy approximation. Constr. Approx. 19, 509–523 (2003)
Livshits, E.D.: Lower bounds for the rate of convergence of greedy algorithms. Izv. Math. 73, 1197–1215 (2009)
Petrova, G.: Rescaled pure greedy algorithm for Hilbert and Banach spaces. Appl. Comput. Harmon. Anal. 41, 852–866 (2016)
Jiang, B., Ye, P.X.: Efficiency of the weak rescaled pure greedy algorithm. Int. J. Wavelets Multiresolut. Inf. Process. 19(4), 2150001 (2021)
Bergh, J., Löfström, J.: Interpolation Spaces. Springer, Berlin (1976)
Temlyakov, V.N.: Greedy algorithms in Banach spaces. Adv. Comput. Math. 14, 277–292 (2001)
Guo, Q., Ye, P.X.: Error analysis of the moving least-squares method with non-identical sampling. Int. J. Comput. Math. 96(4), 767–781 (2019)
Guo, Q., Ye, P.X.: Error analysis of the moving least-squares regression learning algorithm with β-mixing and non-identical sampling. Int. J. Comput. Math. 97(8), 1586–1602 (2020)
Funding
This research is supported by the National Science Foundation for Young Scientists of China (Grant No. 12001328), the National Natural Science Foundation of China (Grant No. 11671213), the Development Plan of Youth Innovation Team of University in Shandong Province (No. 2021KJ067), the Shandong Provincial Natural Science Foundation of China (No. ZR2022MF223), and the Qilu University of Technology (Shandong Academy of Sciences) Talent Research Project (No. 2023RCKY133).
Author information
Contributions
All authors contributed substantially to this paper, participated in drafting and checking the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Guo, Q., Liu, X. & Ye, P. The learning performance of the weak rescaled pure greedy algorithms. J Inequal Appl 2024, 30 (2024). https://doi.org/10.1186/s13660-024-03077-6