1 Introduction

The applications of greedy algorithms to supervised learning have attracted great research interest because they offer appealing generalization capability with a lower computational burden than typical regularized methods, particularly in large-scale dictionary learning problems [16]. Large data sets frequently cause slow performance for most traditional learning algorithms. To tackle this problem, many researchers [1–3, 7, 8] advocate greedy learning algorithms, which have greatly improved learning performance.

The approximation abilities of greedy-type algorithms for frames or more general dictionaries \(\mathcal {D}\) were investigated in [7, 9–12], as well as various applications, see [3, 7, 13–19]. The pure greedy algorithm (PGA) can realize the best bilinear approximation, see [20, 21]. Although the PGA is computationally attractive, its main drawback is that it lacks optimal convergence properties for a general dictionary; consequently, its convergence rate, which is slower than that of the best nonlinear approximation [11, 21–23], degrades its learning performance. To improve the approximation rate, the orthogonal greedy algorithm (OGA), the relaxed greedy algorithm (RGA), the stepwise projection algorithm (SPA), and their weak versions have been proposed. It was shown that these greedy algorithms all achieve the optimal rate \(\mathcal {O}(m^{-\frac{1}{2}})\) for approximating the elements in the class \(\mathcal {A}_{1}(\mathcal {D})\), which will be defined in (14), where m is the iteration number, see [9, 11].

Both the OGA and the RGA have recently been employed successfully in machine learning [1–3, 7, 8]. For example, Barron et al. [7] established the optimal convergence rate \(\mathcal {O} ((n/\log n)^{-\frac{1}{2}} )\), where n is the sample size. To reduce the OGA’s computational load, Fang et al. [1] investigated the learning performance of the orthogonal super greedy algorithm (OSGA) and derived almost the same rate as that of the orthogonal greedy learning algorithm (OGLA). All these results demonstrate that each greedy learning algorithm has its advantages and disadvantages.

We study the applications of weak greedy algorithms to least squares regression in supervised learning. It is well known that weak-type algorithms are easier to implement than the usual greedy algorithms, see [12]. Specifically, the weak rescaled pure greedy algorithm (WRPGA), a fairly simple modification of the PGA, is the goal of our investigation, see [24, 25]. Compared with the OGA and the RGA, the WRPGA can further reduce the computational load. The best rate \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the basic sparse class has been proved [24]. Motivated by the results of [24], we use the same method employed for the RPGA there to deduce a K-functional error bound for the WRPGA in a Hilbert space \(\mathcal {H}\). The WRPGA is a simple greedy algorithm with good approximation ability. Based on this, we propose the weak rescaled pure greedy learning algorithm (WRPGLA) for solving kernel-based regression problems in supervised learning. Using the proven approximation result for the WRPGA, we derive that the WRPGLA attains almost the same learning rate as the OGLA. Our results show that the WRPGLA further reduces the computational complexity without sacrificing generalization capability.

The paper is organized as follows. In Sect. 2, we review least squares regression learning theory and the WRPGA. In Sect. 3, we propose the WRPGLA and state the main theorems on the error estimates. Section 4 is devoted to proofs of the main results. We present the convergence rates under two smoothness assumptions on the regression function \(f_{\rho}\) in the last section.

2 Preliminaries

Some preliminaries are presented in this section. Sections 2.1 and 2.2 provide a brief overview of least squares regression learning and the WRPGA, respectively.

2.1 Least squares regression

In this paper, the approximation problem is addressed in the following statistical learning context. Let X be a compact metric space and \(Y=\mathbb{R}\). Let ρ be a Borel probability measure on \(Z= X\times Y \). The generalization error for a function \(f:X\rightarrow Y\) is defined by

$$ \mathcal {E}(f)= \int _{Z}\bigl(f(x)-y\bigr)^{2}\,d\rho ,$$
(1)

which is minimized by the following regression function:

$$ f_{\rho}(x)= \int _{Y} y \,d\rho (y\vert x), $$

where \(\rho (\cdot \vert x)\) is the conditional distribution induced by ρ at \(x\in X\). In regression learning, ρ is unknown; what is available is a set of samples \({\mathbf{z}}=\{z_{i}\}_{i=1}^{n}=\{(x_{i}, y_{i})\}_{i=1}^{n} \in Z^{n}\) drawn independently and identically according to ρ. The goal of learning is to find a good approximation \(f_{{\mathbf{z}}}\) of \(f_{\rho}\), which minimizes the empirical error

$$ \mathcal {E}_{{\mathbf{z}}}(f)= \Vert y-f \Vert _{n}^{2}:=\frac{1}{n}\sum_{i=1}^{n} \bigl(f(x_{i})-y_{i}\bigr)^{2}.$$
(2)

Denote the Hilbert space of the square integrable functions defined on X with respect to the measure \(\rho _{X}\) by \(L_{\rho _{X}} ^{2}(X)\), where \(\rho _{X}\) is the marginal measure of ρ on X. It is clear from the definition of \(f_{\rho}(x)\) that for each \(x\in X\), \(\int _{Y}(f_{\rho}(x)-y)\,d\rho (y\vert x)=0\). For any \(f\in L_{\rho _{X}} ^{2}(X)\), it holds that

$$\begin{aligned} \mathcal {E}(f)={}& \int _{Z}\bigl(f(x)-f_{\rho}(x)+f_{\rho}(x)-y \bigr)^{2}\,d\rho \\ ={}& \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr)^{2}\,d\rho _{X}+ \int _{Z}\bigl(f_{\rho}(x)-y\bigr)^{2}\,d\rho \\ &{}+2 \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr) \biggl( \int _{Y}\bigl(f_{\rho}(x)-y\bigr)\,d\rho (y\vert x) \biggr)\,d\rho _{X} \\ ={}& \int _{X}\bigl(f(x)-f_{\rho}(x)\bigr)^{2}\,d\rho _{X}+\mathcal {E}(f_{\rho}). \end{aligned}$$

Therefore,

$$ \mathcal {E}(f)-\mathcal {E}(f_{\rho})= \Vert f-f_{\rho} \Vert ^{2} $$
(3)

with the norm \(\|\cdot \|\)

$$ \Vert f \Vert = \biggl( \int _{X} \bigl\vert f(x) \bigr\vert ^{2}d{\rho _{X}} \biggr)^{ \frac{1}{2}}. $$
(4)

The prediction accuracy of learning algorithms is measured by \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\).
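For illustration, the following short numerical sketch checks the identity (3) by Monte Carlo. The choice of ρ (a uniform marginal \(\rho _{X}\) with Gaussian noise), the regression function, and the candidate function f are arbitrary and serve only as an example.

```python
# Monte Carlo check of (3): E(f) - E(f_rho) should equal ||f - f_rho||^2.
# The distribution rho and the candidate function f are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)                      # rho_X = uniform on X = [0, 1]
f_rho = lambda t: t**2                            # regression function E[y | x]
y = f_rho(x) + 0.3 * rng.standard_normal(n)       # samples drawn according to rho

f = lambda t: 0.5 * t                             # an arbitrary candidate function
lhs = np.mean((f(x) - y)**2) - np.mean((f_rho(x) - y)**2)   # E(f) - E(f_rho)
rhs = np.mean((f(x) - f_rho(x))**2)                         # ||f - f_rho||^2
print(lhs, rhs)   # both should be close to 1/30 ≈ 0.0333, up to Monte Carlo error
```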

We will assume \(\vert y\vert \leq B\) for a positive real number \(B<\infty \) almost surely. In this paper, we construct the learning estimator \(f_{{\mathbf{z}}}\) by applying the WRPGA and estimate \(E(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2})\). So, in the following subsection, we recall this algorithm.

2.2 Weak rescaled pure greedy algorithm

We shall restrict our analysis to the situation in which approximation takes place in a real, separable Hilbert space \(\mathcal {H}\) with the inner product \(\langle \cdot ,\cdot \rangle _{\mathcal {H}}\) and the norm \(\|\cdot \|:=\|\cdot \|_{\mathcal {H}}=\langle \cdot ,\cdot \rangle _{ \mathcal {H}}^{\frac{1}{2}}\). Let \(\mathcal {D}\subset \mathcal {H}\) be a given dictionary satisfying \(\|g\|=1\) for every \(g\in \mathcal {D}\), \(g\in \mathcal {D}\) implies \(-g\in \mathcal {D}\) and \(\overline{\operatorname{Span}(\mathcal {D})}=\mathcal {H}\).

Petrova developed the rescaled pure greedy algorithm (RPGA) to enhance the PGA’s convergence rate, which simply rescales \(f_{m}\) at the mth greedy step, see [24]. We begin by describing the weak rescaled pure greedy algorithm (WRPGA) also introduced by Petrova in [24].

\({\mathbf{{WRPGA}}}(\{t_{m}\},\mathcal {D})\):

Step 0: Let \(f_{0} :=0\).

Step m (\(m \geq 1\)):

(1) If \(f=f_{m-1}\), then terminate the iterative process and define \(f_{k}=f_{m-1}=f\) for \(k \geq m\).

(2) If \(f\neq f_{m-1}\), then choose a direction \(\varphi _{m}\in \mathcal {D}\) such that

$$ \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert \geq t_{m}\sup_{ \varphi \in \mathcal {D}} \bigl\vert \langle f-f_{m-1},\varphi \rangle \bigr\vert , $$
(5)

where \(\{t_{m}\}_{m=1}^{\infty}\) is a weakness sequence and \(t_{m}\in (0,1]\).

Let

$$\begin{aligned}& \lambda _{m}:= \langle f-f_{m-1},\varphi _{m} \rangle , \end{aligned}$$
(6)
$$\begin{aligned}& \hat{f}_{m}:= f_{m-1}+\lambda _{m} \varphi _{m}, \end{aligned}$$
(7)
$$\begin{aligned}& s_{m}:= \frac { \langle f, \hat{f}_{m}\rangle }{ \Vert \hat{f }_{m} \Vert ^{2}}. \end{aligned}$$
(8)

The m step approximation \(f_{m}\) is defined as

$$ f_{m}=s_{m}\hat{f}_{m} , $$
(9)

and proceed to Step \(m+1\).

Remark 1

When \(t_{m}=1\), this algorithm is the RPGA. Note that if the supremum in (5) is not attained, one can select \(t_{m}<1\) and proceed with the algorithm; in this case, it is easier to choose \(\varphi _{m}\). If the output at the mth greedy step were \(\hat{f}_{m}\) rather than \(f_{m}=s_{m}{\hat{f}_{m}}\), this would be the PGA. The WRPGA uses \(s_{m}{\hat{f}_{m}}\), which is just a suitable rescaling of \({\hat{f}_{m}}\), and thus improves the rate to \(\mathcal {O}(m^{-\frac{1}{2}})\) for functions in the closure of the convex hull of \(\mathcal {D}\).
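For illustration, the following sketch implements the above steps in the finite-dimensional Hilbert space \(\mathcal {H}=\mathbb{R}^{d}\) with a finite dictionary given by the columns of a matrix. The dictionary, the target f, the number of steps, and the constant weakness parameter \(t_{0}\) are arbitrary illustrative choices, not prescribed by the paper.

```python
# A minimal numerical sketch of the WRPGA in H = R^d with a finite dictionary.
import numpy as np

def wrpga(f, D, n_steps, t0=1.0):
    """Run n_steps of the WRPGA for the target vector f and dictionary D.

    D is a (d, N) array whose columns are unit-norm dictionary elements;
    0 < t0 <= 1 is a constant weakness parameter t_m = t0.
    Returns the approximants f_0, f_1, ..., f_{n_steps}.
    """
    f_m = np.zeros_like(f)                     # Step 0: f_0 = 0
    approximants = [f_m]
    for _ in range(n_steps):
        r = f - f_m                            # residual f - f_{m-1}
        if np.allclose(r, 0.0):                # termination rule of Step m, part (1)
            approximants.append(f_m)
            continue
        inner = D.T @ r                        # <f - f_{m-1}, phi> for every phi
        abs_inner = np.abs(inner)
        idx = int(np.argmax(abs_inner >= t0 * abs_inner.max()))  # first index satisfying (5)
        lam = inner[idx]                       # lambda_m as in (6)
        f_hat = f_m + lam * D[:, idx]          # hat f_m as in (7)
        s = (f @ f_hat) / (f_hat @ f_hat)      # rescaling s_m as in (8)
        f_m = s * f_hat                        # f_m = s_m * hat f_m as in (9)
        approximants.append(f_m)
    return approximants

# toy usage: random unit-norm dictionary, target in its span
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D[:, :5] @ rng.standard_normal(5)
errs = [np.linalg.norm(f - g) for g in wrpga(f, D, 30, t0=0.8)]
print(errs[::10])   # the residual norms should decrease
```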

3 Weak rescaled pure greedy learning

We shall now provide the WRPGLA for regression. From the definition of the WRPGA, computing \(\sup_{\varphi \in \mathcal {D}} \vert \langle f-f_{m-1},\varphi \rangle \vert \) may be computationally demanding. Therefore we compute only over a truncation of the dictionary, that is, a finite subset of \(\mathcal {D}\). Let \(\mathcal {D}_{1}\subset \mathcal {D}_{2}\subset \cdots\subset \mathcal {D}\), where \(\mathcal {D}_{m}\) is the truncation of \(\mathcal {D}\) with cardinality \(\#(\mathcal {D}_{m})=m\). Here we assume that

$$ m\leq m(n):=\bigl\lfloor n^{a}\bigr\rfloor \quad \text{for some fixed }a \geq 1. $$
(10)

Then the WRPGLA is defined by the following simple procedure.

WRPGLA:

Step 1: We apply the WRPGA with the dictionary \(\mathcal {D}_{m}\) to the data function \(y(x_{i})=y_{i}\), using the norm \(\|\cdot \|_{n}\) associated with the empirical inner product, that is,

$$ \Vert f \Vert _{n}:= \Biggl(\frac{1}{n}\sum _{i=1}^{n} \bigl\vert f(x_{i}) \bigr\vert ^{2} \Biggr)^{\frac{1}{2}}. $$

Step 2: The algorithm produces the approximation \(f_{{\mathbf{z}},k}:=f_{k}\) to the data at the kth greedy step. Then we define our estimator as \(f_{{\mathbf{z}}}:=Tf_{{\mathbf{z}},k^{*}}\), where \(Tu:=T_{B}u:=\min \{B,\vert u\vert \}\operatorname{sgn}(u)\) is the truncation operator at level B and

$$ k^{*}:=\arg \min_{k>0} \biggl\{ \Vert y-Tf_{{\mathbf{z}},k} \Vert _{n}^{2}+\kappa \frac{k\log n}{n} \biggr\} ,$$
(11)

where the constant \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\), which will be discussed in the proof of Theorem 1.

Remark 2

First, comparing \(k=0\) with \(k=k^{*}\) in (11), it follows from \(f_{0}=0\) and \(\vert y\vert \leq B\) that \(\kappa \frac{k^{*}\log n}{n}\leq B^{2}\). This suggests that \(k^{*}\) is not larger than \(\frac{Bn}{\kappa}\). Second, from the definition of the estimator, we observe further that the computational cost of the kth greedy step is less than \(O(n^{a})\). Compared with the PGA, the WRPGA only requires the additional computation of \(s_{m}\).
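For illustration, the following sketch runs the two steps above on simulated data. It assumes the wrpga routine from the sketch in Sect. 2.2; the dictionary of sine functions, the sample size, B, \(t_{0}\), and in particular the small value of κ are arbitrary illustrative choices (the theory requires \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\)).

```python
# A minimal sketch of the WRPGLA on simulated data; all concrete choices are
# illustrative, and `wrpga` refers to the sketch given after Remark 1.
import numpy as np

def wrpgla(x, y, dict_fns, B, kappa, k_max, t0=1.0):
    """Return k* from (11) and the truncated estimate T f_{z,k*} at the x_i."""
    n = len(y)
    Phi = np.column_stack([g(x) for g in dict_fns])   # truncated dictionary D_m
    # Euclidean normalization of the columns; the Euclidean and empirical inner
    # products differ only by the common factor 1/n, so this is equivalent to
    # running the WRPGA with ||.||_n on the ||.||_n-normalized dictionary.
    Phi /= np.linalg.norm(Phi, axis=0)
    # Step 1: WRPGA applied to the data vector y
    estimates = wrpga(y, Phi, k_max, t0)              # f_{z,0}, ..., f_{z,k_max}
    # Step 2: truncate at level B and minimize the penalized empirical error (11)
    T = lambda u: np.clip(u, -B, B)
    crit = [np.mean((y - T(fk))**2) + kappa * k * np.log(n) / n
            for k, fk in enumerate(estimates)]
    k_star = int(np.argmin(crit[1:])) + 1             # minimize over k > 0
    return k_star, T(estimates[k_star])

# toy usage with a sine dictionary and f_rho(x) = sin(2*pi*x)
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
dict_fns = [lambda t, j=j: np.sin(np.pi * j * t) for j in range(1, 21)]
k_star, fit = wrpgla(x, y, dict_fns, B=2.0, kappa=0.5, k_max=30)
print(k_star, np.sqrt(np.mean((fit - np.sin(2 * np.pi * x))**2)))
```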

To discuss the approximation properties of WRPGLA, we introduce the class of functions

$$ \mathcal {A}_{1}^{0}(\mathcal {D},M):= \biggl\{ f= \sum _{k \in \Lambda}c_{k}(f)\varphi _{k}: \varphi _{k} \in \mathcal {D},\#(\Lambda )< \infty ,\sum _{k \in \Lambda } \bigl\vert c_{k}(f) \bigr\vert \leq M \biggr\} , $$
(12)

and

$$ \mathcal {A} _{1}(\mathcal {D},M) = \overline {\mathcal {A}_{1}^{0}(\mathcal {D},M) }. $$
(13)

Then

$$ \mathcal {A} _{1}(\mathcal {D})=\bigcup _{M>0}\mathcal {A} _{1}( \mathcal {D},M)$$
(14)

and

$$ \Vert f \Vert _{\mathcal {A} _{1}(\mathcal {D})}:= \inf \bigl\{ M: f\in \mathcal {A} _{1}( \mathcal {D},M)\bigr\} .$$
(15)

We also use the following K-functional:

$$ K(f,t):= K\bigl(f,t,\mathcal {H},\mathcal {A}_{1}(\mathcal {D}) \bigr):=\inf_{h \in \mathcal {A}_{1}(\mathcal {D})}\bigl\{ \Vert f-h \Vert _{\mathcal {H}}+t \Vert h \Vert _{ \mathcal {A}_{1}(\mathcal {D})}\bigr\} , \quad t>0. $$
(16)
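For orientation, we note two elementary bounds that follow directly from (16) by taking \(h=0\) and, when \(f\in \mathcal {A}_{1}(\mathcal {D})\), \(h=f\):

$$ K(f,t)\leq \Vert f \Vert _{\mathcal {H}} \quad \text{and}\quad K(f,t)\leq t \Vert f \Vert _{\mathcal {A}_{1}(\mathcal {D})}\quad \text{if } f\in \mathcal {A}_{1}(\mathcal {D}). $$

Thus \(K(f,t)\) simultaneously controls how well f can be approximated by elements of \(\mathcal {A}_{1}(\mathcal {D})\) and the size of the approximant.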

Since all the constants in this work depend at most on \(\kappa _{0}\), B, and a, we denote all of them by C for simplicity of notation. Now we take \(\mathcal {H}=L_{\rho _{X}} ^{2}(X)\) with the norm defined by (4).

Then, we provide our main results on the generalization error bounds for the WRPGLA.

Theorem 1

There exists \(\kappa _{0}\) depending only on B and a such that, if \(\kappa \geq \kappa _{0}\), then for all \(k>0\) and all \(h\in \operatorname{Span}(\mathcal {D}_{m})\), the learning estimator obtained by applying the WRPGA satisfies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{\rho}-h \Vert ^{2}+C \frac{k\log n}{n}. $$
(17)

Furthermore, we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\leq 2K \Biggl(f_{\rho},2 \Biggl(\sum_{i=1}^{k}t_{i}^{2} \Biggr)^{-\frac{1}{2}} \Biggr)+C\frac{k\log n}{n}. $$
(18)

Applying Theorem 1 with \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), we get the following theorem.

Theorem 2

Under the assumptions of Theorem 1, if \(t_{i}=t_{0}\) for all \(i\geq 1\) and \(0< t_{0}\leq 1\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{kt_{0}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+C \frac{k\log n}{n}. $$
(19)

Furthermore, we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\leq 2K \bigl(f_{\rho},2k^{-\frac{1}{2}}t_{0}^{-1} \bigr)+C \frac{k\log n}{n}. $$
(20)

4 Proofs of the main results

To prove Theorem 1, we establish a lemma on the upper error bound for the WRPGA.

Lemma 4.1

Let \(f\in \mathcal {H}\). Then the output \((f_{m})_{m\geq 0}\) of the WRPGA satisfies

$$\begin{aligned} e_{m}:= \Vert f-f_{m} \Vert \leq 2K \Biggl(f, \Biggl( \sum_{k=1}^{m}t_{k}^{2} \Biggr)^{-1/2} \Biggr),\quad m=0,1,2,\ldots . \end{aligned}$$
(21)

Proof

By the definition (16) of the K-functional, it suffices to prove that for \(f\in \mathcal {H}\) and any \(h\in \mathcal {A}_{1}(\mathcal {D})\),

$$\begin{aligned} e_{m}^{2}\leq \Vert f-h \Vert ^{2}+\frac{4}{\sum_{k=1}^{m}t_{k}^{2}} \Vert h \Vert _{ \mathcal {A}_{1}(\mathcal {D})}^{2},\quad m=1,2, \ldots . \end{aligned}$$
(22)

Since \(\mathcal {A}_{1}^{0}(\mathcal {D},M)\) is dense in \(\mathcal {A}_{1}(\mathcal {D},M)\), it suffices to prove (22) for functions h that are finite sums \(\sum_{j}c_{j}\varphi _{j}\) with \(\sum_{j}\vert c_{j}\vert \leq M\). We fix \(\epsilon >0\) and select a representation for \(h=\sum_{\varphi \in \mathcal {D}}c_{\varphi}\varphi \), such that

$$\begin{aligned} \sum_{\varphi \in \mathcal {D}} \vert c_{\varphi} \vert < M+ \epsilon . \end{aligned}$$
(23)

Denote

$$\begin{aligned} a_{m}:=e_{m}^{2}- \Vert f-h \Vert ^{2},\quad m=1,2,\ldots . \end{aligned}$$
(24)

Since \(\{e_{m}\}_{m=0}^{\infty}\) is nonincreasing, \(\{a_{m}\}_{m=0}^{\infty}\) is also a nonincreasing sequence.

We now distinguish two cases.

Case 1: \(a_{0}:=\|f\|^{2}-\|f-h\|^{2}\leq 0\). Then, for every \(m\geq 1\), we have \(a_{m}\leq 0\). Therefore inequality (22) holds true.

Case 2: \(a_{0}>0\). Assume that \(a_{m-1}>0\), \(m\geq 1\). Since \(f_{m}\) is the orthogonal projection of f onto the linear space spanned by \(\hat{f}_{m}\), we have

$$\begin{aligned} \langle f-f_{m},f_{m}\rangle =0,\quad m\geq 0. \end{aligned}$$
(25)

This together with the selection of \(\varphi _{m}\) implies

$$\begin{aligned} e_{m-1}^{2}&=\langle f-f_{m-1},f-f_{m-1} \rangle \\ &=\langle f-f_{m-1},f \rangle \\ &=\langle f-f_{m-1},f-h\rangle +\langle f-f_{m-1},h\rangle \\ &\leq e_{m-1} \Vert f-h \Vert +\sum_{\varphi \in \mathcal {D}}c_{ \varphi} \langle f-f_{m-1},\varphi \rangle \\ &\leq e_{m-1} \Vert f-h \Vert +t_{m}^{-1} \bigl\vert \langle f-f_{m-1},\varphi _{m} \rangle \bigr\vert \sum_{\varphi \in \mathcal {D}} \vert c_{ \varphi} \vert . \end{aligned}$$
(26)

By (23), we get

$$\begin{aligned} e_{m-1}^{2}\leq \frac{1}{2}\bigl(e_{m-1}^{2}+ \Vert f-h \Vert ^{2}\bigr)+t_{m}^{-1} \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert (M+ \epsilon ). \end{aligned}$$
(27)

Letting \(\epsilon \rightarrow 0\) and rearranging, we obtain

$$\begin{aligned} \bigl\vert \langle f-f_{m-1},\varphi _{m}\rangle \bigr\vert \geq \frac{t_{m}(e_{m-1}^{2}- \Vert f-h \Vert ^{2})}{2M}. \end{aligned}$$
(28)

It has been proved in [24] that

$$\begin{aligned} e_{m}^{2}\leq e_{m-1}^{2}-\langle f-f_{m-1},\varphi _{m}\rangle ^{2},\quad m=1,2, \ldots . \end{aligned}$$
(29)

Then, using the assumption that \(a_{m-1}>0\), we have

$$\begin{aligned} e_{m}^{2}\leq e_{m-1}^{2}- \frac{t_{m}^{2}a_{m-1}^{2}}{4M^{2}}. \end{aligned}$$
(30)

It yields

$$\begin{aligned} a_{m}\leq a_{m-1} \biggl(1-\frac{t_{m}^{2}a_{m-1}}{4M^{2}} \biggr). \end{aligned}$$
(31)

In particular, for \(m=1\), we have

$$\begin{aligned} a_{1}\leq a_{0} \biggl(1-\frac{t_{1}^{2}a_{0}}{4M^{2}} \biggr). \end{aligned}$$
(32)

Case 2.1: \(0< a_{0}<\frac{4M^{2}}{t_{1}^{2}}\). Since \(\psi (t):=t (1-\frac{t_{1}^{2}t}{4M^{2}} )\) on \((0,\frac{4M^{2}}{t_{1}^{2}} )\) has maximum \(\frac{M^{2}}{t_{1}^{2}}\), it follows that

$$ a_{m}\leq \psi (a_{0})\leq \frac{M^{2}}{t_{1}^{2}}\leq \frac{4M^{2}}{t_{1}^{2}}. $$

Therefore, either all terms of \(\{a_{m}\}_{m=0}^{\infty}\) lie in \((0,\frac{4M^{2}}{t_{1}^{2}})\) and satisfy (31), or \(a_{m^{\ast}}\leq 0\) for some \(m^{\ast}\geq 1\), in which case the analysis for \(m\geq m^{\ast}\) is the same as in Case 1. For the positive terms of \(\{a_{m}\}_{m=0}^{\infty}\), applying Lemma 2.2 from [24] with \(l=1\), \(r_{m}=t_{m}^{2}\), \(B=\frac{4M^{2}}{t_{1}^{2}}\), \(J=0\), and \(r=4M^{2}\), we obtain

$$\begin{aligned} a_{m}\leq \frac{4M^{2}}{t_{1}^{2}+\sum_{k=1}^{m}t_{k}^{2}}\leq \frac{4M^{2}}{\sum_{k=1}^{m}t_{k}^{2}}, \end{aligned}$$
(33)

which gives inequality (22).

Case 2.2: \(a_{0}\geq \frac{4M^{2}}{t_{1}^{2}}\). It follows from (32) that \(a_{1}\leq 0\). That is, \(e_{1}^{2}\leq \|f-h\|^{2}\), which yields (22) for all \(m\geq 1\) by monotonicity. Lemma 4.1 is proved. □
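As a quick numerical sanity check of the recursion (31) and the resulting bound (33) (not needed for the proof), one can iterate the extremal case of equality in (31); the values of M, the constant weakness sequence, and \(a_{0}\) below are arbitrary illustrative choices.

```python
# Iterate the worst case of equality in (31) and compare with the bound (33).
# M, the weakness sequence t_k, and a_0 are arbitrary illustrative choices.
import numpy as np

M = 1.5
t = 0.7 * np.ones(200)              # constant weakness sequence t_k = t_0
a = 0.9 * 4.0 * M**2 / t[0]**2      # a_0 in (0, 4M^2/t_1^2), i.e., Case 2.1

cum = 0.0
for tm in t:
    a = a * (1.0 - tm**2 * a / (4.0 * M**2))   # equality case of (31)
    cum += tm**2
    assert a <= 4.0 * M**2 / cum + 1e-12       # bound (33)
print("a_m =", a, " bound 4M^2 / sum t_k^2 =", 4.0 * M**2 / cum)
```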

Now we prove Theorem 1.

Proof of Theorem 1

As shown in [5], \(\|f_{{\mathbf{z}}}-f_{\rho}\|^{2}\) can be decomposed as

$$\begin{aligned} \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2} \leq{}& \mathcal {S}_{1}+\mathcal {S}_{2}+ \mathcal {S}_{3} \\ &{}+2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert ^{2}_{n}+ \kappa \frac{k^{\ast}\log n}{n}- \Vert y-Tf_{{ \mathbf{z}},k} \Vert ^{2}_{n}-\kappa \frac{k\log n}{n} \biggr), \end{aligned}$$
(34)

where

$$\begin{aligned} &\mathcal {S}_{1}:= \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}-2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n}+\kappa \frac{k^{\ast}\log n}{n} \biggr), \\ &\mathcal {S}_{2}:=2 \bigl( \Vert y-f_{{\mathbf{z}},k} \Vert ^{2}_{n}- \Vert y-h \Vert ^{2}_{n} \bigr), \\ &\mathcal {S}_{3}:=2 \biggl( \Vert y-h \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n}+ \kappa \frac{k\log n}{n} \biggr), \end{aligned}$$
(35)

and \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\). Note that the last term in (34) is nonpositive by the definition (11) of \(k^{*}\), since \(f_{{\mathbf{z}}}=Tf_{{\mathbf{z}},k^{*}}\); hence it can be dropped in what follows.

We first estimate \(\mathcal {S}_{1}\). To do this, we introduce the set Ω,

$$\begin{aligned} \Omega = \biggl\{ \textbf{z}:\textbf{z}\in Z^{n}, \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2} \geq 2 \biggl( \Vert y-f_{{\mathbf{z}}} \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}+\kappa \frac{k^{\ast}\log n}{n} \biggr) \biggr\} . \end{aligned}$$
(36)

Let \(\operatorname{Prob}(\Omega )\) be the probability that the sample point is a member of the set Ω. Then from \(\vert y\vert \leq B\) and the definition of \(f_{\rho}\) and \(f_{{\mathbf{z}}}\), we have

$$\begin{aligned} E(\mathcal {S}_{1})\leq 6B^{2} \operatorname{Prob}(\Omega ). \end{aligned}$$
(37)

For \(\mathcal {S}_{2}\), applying (22) of Lemma 4.1 with the empirical norm \(\|\cdot \|_{n}\) to the k-step output \(f_{{\mathbf{z}},k}\), we get

$$\begin{aligned} \Vert y-f_{{\mathbf{z}},k} \Vert ^{2}_{n}- \Vert y-h \Vert _{n}^{2}\leq 4 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}^{n}}}{\sum_{i=1}^{k}t_{i}^{2}}, \end{aligned}$$
(38)

where

$$ \mathcal {A}_{1}^{n}(\mathcal {D}):= \biggl\{ h:h= \sum _{i \in \Lambda}c_{i}^{n} \Vert g_{i} \Vert _{n}\frac{g_{i}}{ \Vert g_{i} \Vert _{n}},h\in \mathcal {A}_{1}(\mathcal {D}) \biggr\} $$
(39)

and

$$ \Vert h \Vert _{\mathcal {A}_{1}^{n}(\mathcal {D})}:= \inf_{h} \biggl\{ \sum _{i \in \Lambda} \bigl\vert c_{i}^{n} \bigr\vert \cdot \Vert g_{i} \Vert _{n}, h \in \mathcal {A} _{1}^{n}(\mathcal {D}) \biggr\} . $$
(40)

It has been proved in Lemma 3.4 of [7] that

$$\begin{aligned} E\bigl( \Vert h \Vert _{\mathcal {A}_{1}^{n}}^{2}\bigr)\leq \Vert h \Vert _{\mathcal {A}_{1}}^{2}, \end{aligned}$$
(41)

which implies

$$\begin{aligned} E(\mathcal {S}_{2})\leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}. \end{aligned}$$
(42)

For \(\mathcal {S}_{3}\), from the property of mathematical expectation and (1), we have

$$\begin{aligned} E \bigl( \Vert y-h \Vert ^{2}_{n}- \Vert y-f_{\rho} \Vert ^{2}_{n} \bigr)&=E\bigl( \bigl\vert y-h(x) \bigr\vert ^{2}\bigr)-E\bigl( \bigl\vert y-f_{\rho}(x) \bigr\vert ^{2}\bigr) \\ &=\mathcal {E}(h)-\mathcal {E}(f_{\rho}). \end{aligned}$$
(43)

This together with (3) yields

$$\begin{aligned} E(\mathcal {S}_{3})&=2 \Vert f_{\rho}-h \Vert ^{2}+2\kappa \frac{k\log n}{n}. \end{aligned}$$
(44)

Combining (37), (42), and (44), we obtain

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq 6B^{2}\operatorname{Prob}(\Omega )+8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+2\kappa \frac{k\log n}{n}. \end{aligned}$$
(45)

Next we bound \(\operatorname{Prob}(\Omega )\). To this end, we need the following known result from [10].

Lemma 4.2

Let \(\mathcal {F}\) be a class of functions satisfying \(\vert f\vert \leq B\) for all \(f\in \mathcal {F}\), for some fixed constant B. For all n and \(\alpha ,\beta >0\), we have

$$\begin{aligned} &\operatorname{Prob}\bigl\{ \exists f\in \mathcal {F}: \Vert f-f_{\rho} \Vert _{\rho _{X}}^{2}\geq 2\bigl( \Vert y-f \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}\bigr)+\alpha +\beta \bigr\} \\ &\quad \leq 14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{\beta}{40B}, \mathcal {F},L_{1}(\vec{v}_{\textbf{x}}) \biggr)\exp \biggl(- \frac{\alpha n}{2568B^{4}} \biggr), \end{aligned}$$
(46)

where \({\textbf{x}}=(x_{1},\ldots,x_{n})\in X^{n}\) and \(\mathscr{N}(t,\mathcal {F},L_{1}(\vec{v}_{\textbf{x}}))\) is the covering number for the class \(\mathcal {F}\) by balls of radius t in \(L_{1}(\vec{v}_{\textbf{x}})\), with \(\vec{v}_{\textbf{x}}:=\frac{1}{n}\sum_{i=1}^{n}\delta _{x_{i}}\) the empirical discrete measure.

We define \(\mathcal {G}_{\Lambda}:=\operatorname{Span}\{g:g\in \Lambda \subset \mathcal {D}\}\) and \(\mathcal {F}_{k}:=\bigcup_{\Lambda \subset \mathcal {D}_{m},\#( \Lambda )\leq k} \{Tf:f\in \mathcal {G}_{\Lambda} \}\). Consider the probability

$$ p_{k}=\operatorname{Prob} \biggl\{ \exists f\in \mathcal {F}_{k}: \Vert f-f_{\rho} \Vert ^{2} \geq 2 \biggl( \Vert y-f \Vert _{n}^{2}- \Vert y-f_{\rho} \Vert _{n}^{2}+\kappa \frac{k\log n}{n} \biggr) \biggr\} . $$

Applying Lemma 4.2 to \(\mathcal {F}_{k}\) with \(\alpha =\kappa \frac{k\log n}{n}\), \(\beta =\frac{1}{n}\), and \(\kappa >1\), we get

$$\begin{aligned} p_{k}&\leq 14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{1}{40Bn}, \mathcal {F}_{k},L_{1}( \vec{v}_{\textbf{x}}) \biggr)\exp \biggl(-\kappa \frac{k\log n}{2568B^{4}} \biggr) \\ &=14\sup_{\textbf{x}}\mathscr{N} \biggl(\frac{1}{40Bn},\mathcal {F}_{k},L_{1}( \vec{v}_{\textbf{x}}) \biggr)n^{-\frac{\kappa k}{2568B^{4}}}. \end{aligned}$$
(47)

Lemma 3.3 of [7] provides an upper bound for \(\mathscr{N}(t,\mathcal {F}_{k},L_{1}(\vec{v}_{\textbf{x}}))\), which implies

$$\begin{aligned} p_{k}&\leq Cn^{ak}n^{2(k+1)}n^{-\frac{\kappa k}{2568B^{4}}}. \end{aligned}$$
(48)

Let \(\kappa \geq \kappa _{0}=2568B^{4}(a+5)\). Then the exponent in the above inequality is at most \(ak+2(k+1)-(a+5)k=2-3k\), which yields

$$\begin{aligned} p_{k}\leq Cn^{-3k+2}. \end{aligned}$$
(49)

Summing over k and using \(\sum_{k\geq 1}n^{2-3k}\leq n^{-1}\sum_{k\geq 1}n^{-3(k-1)}\leq 2n^{-1}\) for \(n\geq 2\), we have

$$\begin{aligned} \operatorname{Prob}(\Omega )\leq \sum _{1\leq k\leq \frac{Bn}{\kappa}}p_{k}\leq \frac{C}{n}. \end{aligned}$$
(50)

By substituting the bound (50) of \(\operatorname{Prob}(\Omega )\) into (45), we get

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq 8 \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}}}{\sum_{i=1}^{k}t_{i}^{2}}+2 \Vert f_{ \rho}-h \Vert ^{2}+C\frac{k\log n}{n}. \end{aligned}$$
(51)

Next we derive the K-functional form (18) of the upper bound (51). Since the variance of \(\Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \) is nonnegative, we have

$$\begin{aligned} \bigl(E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr)\bigr)^{2}\leq E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr). \end{aligned}$$
(52)

Combining (51) with (52), we have

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert \bigr) \leq & \sqrt { 8 \frac{ \Vert h \Vert ^{2} _{\mathcal {A}_{1} }}{\sum_{i=1}^{k}t_{i}^{2}} +2 \Vert f_{\rho}-h \Vert ^{2} +C\frac{k\log n}{n} } \\ \leq & 2 \biggl( \frac{2 \Vert h \Vert _{\mathcal {A}_{1} }}{(\sum_{i=1}^{k}t_{i}^{2})^{1/2}} + \Vert f_{\rho}-h \Vert \biggr)+C\frac{k\log n}{n} \\ \leq & 2K \Biggl(f_{\rho},2 \Biggl(\sum_{i=1}^{k}t_{i}^{2} \Biggr)^{-1/2} \Biggr)+C\frac{k\log n}{n}. \end{aligned}$$
(53)

This completes the proof of Theorem 1.  □

5 Convergence rate and universal consistency

In this section, we analyze Theorem 2 under two different prior assumptions on \(f_{\rho}\). We begin with the definitions of \(\mathcal {A}_{1}(\mathcal {D}_{m})\), \(\mathcal {A}_{1,r}\), and \(\mathcal {B}_{p,r}\).

We define the space \(\mathcal {A}_{1}(\mathcal {D}_{m})\) to be the space \(\operatorname{Span}\{\mathcal {D}_{m}\}\) with the norm \(\|\cdot \|_{\mathcal {A}_{1}(\mathcal {D}_{m})}\) defined by (15). Note that now \(\mathcal {D}\) is replaced by \(\mathcal {D}_{m}\).

For \(r>0\), we then introduce the space

$$\begin{aligned} \mathcal {A}_{1,r}=\bigl\{ f:\forall m, \exists h=h(m)\in \operatorname{Span}\{ \mathcal {D}_{m}\}, \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq C, \Vert f-h \Vert \leq Cm^{-r}\bigr\} , \end{aligned}$$
(54)

where the norm \(\|f\|_{\mathcal {A}_{1,r}}\) is defined as the smallest constant C such that (54) holds.

Furthermore, we present the following space:

$$ \mathcal {B}_{p,r}:=[\mathcal {H},\mathcal {A}_{1,r}]_{\theta ,\infty},\quad 0< \theta < 1, $$
(55)

with \(\frac{1}{p}=\frac{1+\theta}{2}\). From the definition of interpolation spaces in [26], we know that \(f\in [\mathcal {H},\mathcal {A}_{1,r}]_{\theta ,\infty}\) if and only if for any \(t>0\),

$$ K(f,t,\mathcal {H},\mathcal {A}_{1,r}):=\inf _{h\in \mathcal {A}_{1,r}} \bigl\{ \Vert f-h \Vert _{\mathcal {H}}+t \Vert h \Vert _{\mathcal {A}_{1,r}}\bigr\} \leq Ct^{\theta}. $$
(56)

The smallest constant C such that (56) holds is taken as the norm \(\|f\|_{\mathcal {B}_{p,r}}\).

Now we first consider \(f_{\rho}\in \mathcal {A}_{1,r}\).

Corollary 5.1

Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {A}_{1,r}\) with \(r>\frac{1}{2a}\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\bigl(1+ \Vert f_{\rho} \Vert _{\mathcal {A}_{1,r}}\bigr)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}. $$
(57)

Proof

From the definition of \(\mathcal {A}_{1,r}\), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq M $$

and

$$ \Vert f_{\rho}-h \Vert \leq Mm^{-r}, $$

where \(M:=\|f_{\rho}\|_{\mathcal {A}_{1,r}}\).

Theorem 2 thus implies

$$\begin{aligned} E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\min_{k>0} \biggl( \frac{M^{2}}{kt_{0}^{2}}+M^{2}n^{-2ar}+\frac{k\log n}{n} \biggr). \end{aligned}$$
(58)

Moreover, since \(r>\frac{1}{2a}\) (and a can be taken arbitrarily large), the term \(M^{2}n^{-2ar}\) in (58) is negligible compared with the remaining two terms and can be removed. To balance these two terms, we take \(k:= \lceil ((M+1)^{2}t_{0}^{-2}\frac{n}{\log n} )^{\frac{1}{2}} \rceil \). Then the desired result (57) can be obtained. □
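For the reader's convenience, we record the elementary computation behind this choice of k: since \(k\geq (M+1)t_{0}^{-1}(n/\log n)^{\frac{1}{2}}\) and \(t_{0}\leq 1\),

$$ \frac{M^{2}}{kt_{0}^{2}}\leq \frac{M^{2}}{(M+1)t_{0}} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}\leq (1+M)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}} \quad \text{and}\quad \frac{k\log n}{n}\leq C(1+M)t_{0}^{-1} \biggl(\frac{n}{\log n} \biggr)^{-\frac{1}{2}}, $$

which together give the bound (57).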

Next we consider \(f_{\rho}\in \mathcal {B}_{p,r}\).

Corollary 5.2

Under the assumptions of Theorem 2, if \(f_{\rho}\in \mathcal {B}_{p,r}\) with \(r>\frac{1}{2a}\), then we have

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C t_{0}^{-p}\bigl(1+ \Vert f_{\rho} \Vert _{ \mathcal {B}_{p,r}}\bigr)^{p} \biggl( \frac{n}{\log n} \biggr)^{-1+\frac{p}{2}}.$$
(59)

Proof

By (56), if \(f\in \mathcal {B}_{p,r}\), then for any \(t>0\), we can find a function \(\tilde{f}\in \mathcal {A}_{1,r}\) that satisfies

$$ \Vert \tilde{f} \Vert _{\mathcal {A}_{1,r}}\leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{ \theta -1} $$
(60)

and

$$ \Vert f-\tilde{f} \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{\theta}. $$
(61)

For \(\tilde{f}\in \mathcal {A}_{1,r}\), according to (54), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq \Vert \tilde{f} \Vert _{ \mathcal {A}_{1,r}} $$
(62)

and

$$ \Vert \tilde{f}-h \Vert \leq \Vert \tilde{f} \Vert _{\mathcal {A}_{1,r}}m^{-r}. $$
(63)

The relations (60), (62), and (63) imply

$$\begin{aligned} \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{ \theta -1} \end{aligned}$$
(64)

and

$$\begin{aligned} \Vert \tilde{f}-h \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}}t^{\theta -1}m^{-r}. \end{aligned}$$
(65)

Then combining (61) with (65), we obtain

$$ \Vert f-h \Vert \leq \Vert f \Vert _{\mathcal {B}_{p,r}} \bigl(t^{\theta}+t^{\theta -1}m^{-r}\bigr). $$
(66)

From (64) and (66), there exists \(h:=h(m)\in \operatorname{Span}\{\mathcal {D}_{m}\}\) for every m and \(t>0\) that satisfies

$$ \Vert h \Vert _{\mathcal {A}_{1}(\mathcal {D}_{m})}\leq Mt^{\theta -1} $$

and

$$ \Vert f_{\rho}-h \Vert \leq M\bigl(t^{\theta}+t^{\theta -1}m^{-r} \bigr), $$

where \(M=\|f_{\rho}\|_{\mathcal {B}_{p,r}}\).

Therefore, Theorem 2 with \(t=k^{-\frac{1}{2}}\) implies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)\leq C\min_{k>0} \biggl(M^{2}t_{0}^{-2}k^{1- \frac{2}{p}}+M^{2} \bigl(k^{\frac{1}{2}-\frac{1}{p}}+k^{1-\frac{1}{p}}n^{-ar} \bigr)^{2}+ \frac{k\log n}{n} \biggr). $$
(67)

The condition \(2ar\geq 1\) also enables us to eliminate the term involving \(n^{-ar}\). Then, by taking \(k:= \lceil ((M+1)^{2}t_{0}^{-2}\frac{n}{\log n} )^{\frac{p}{2}} \rceil \) in (67), we obtain the desired result (59). □

Then we show the universal consistency of the WRPGLA.

Theorem 3

Under the assumptions of Theorem 2, if the dictionary \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), then for any \(f_{\rho}\) we have

$$ \lim_{n\rightarrow +\infty}E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr)=0. $$
(68)

Proof

Since \(\mathcal {D}\) is complete in \(L_{\rho _{X}} ^{2}(X)\), for any \(\varepsilon >0\) we can find, for all n (and hence \(m=\lfloor n^{a}\rfloor \)) large enough, a function \(h\in \operatorname{Span}\{\mathcal {D}_{m}\}\) satisfying \(\|f_{\rho}-h\|\leq \varepsilon \). It follows from Theorem 2 that

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq C\min_{k>0} \biggl( \frac{ \Vert h \Vert ^{2}_{\mathcal {A}_{1}(\mathcal {D}_{m})}}{kt_{0}^{2}}+ \varepsilon ^{2}+\frac{k\log n}{n} \biggr). $$
(69)

To balance the first and third error terms, we choose \(k:=\lceil n^{\frac{1}{2}}t_{0}^{-1}\rceil \), which implies

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq C\bigl(\varepsilon ^{2}+t_{0}^{-1}n^{- \frac{1}{2}} \log n\bigr). $$
(70)

Thus, for n sufficiently large,

$$ E\bigl( \Vert f_{{\mathbf{z}}}-f_{\rho} \Vert ^{2}\bigr) \leq 2C\varepsilon ^{2}. $$
(71)

This completes the proof of Theorem 3. □

Remark 3

It is known from [11] that the OGA and the RGA can achieve the optimal convergence rate \(\mathcal {O}(m^{-\frac{1}{2}})\) on \(\mathcal {A}_{1}(\mathcal {D})\). When \(t_{k}=1\), Lemma 4.1 shows that the WRPGA also attains this best rate. Meanwhile, we compare the WRPGLA with the OGLA and the relaxed greedy learning algorithm (RGLA). For \(f_{\rho}\in \mathcal {A}_{1,r}\), we derive the same convergence rate \(\mathcal {O}((n/\log n)^{-1/2})\) for the WRPGLA as that of the OGLA and the RGLA in Ref. [7]. For \(f_{\rho}\in \mathcal {B}_{p,r}\), as \(p\rightarrow 1\), the rate \(\mathcal {O}((n/\log n)^{-1+\frac{p}{2}})\) of the WRPGLA can be arbitrarily close to \(\mathcal {O}((n/\log n)^{-1/2})\).

Moreover, from the viewpoint of computational complexity, for the WRPGLA the approximant \(f_{k}\) is constructed by solving a one-dimensional optimization problem, since \(f_{k}\) is the orthogonal projection of f onto \(\operatorname{Span}\{\hat{f}_{k}\}\). On the other hand, the OGLA is more expensive to implement, since at each step the algorithm requires the evaluation of an orthogonal projection onto a k-dimensional space, and the output is constructed by solving a k-dimensional optimization problem. It is also clear that the WRPGLA is simpler than the RGLA. Thus, the WRPGLA essentially reduces the complexity and accelerates the learning process.

In future research, it would be an interesting project to deduce the error bound of the WRPGLA in Banach spaces with modulus of smoothness \(\rho (u)\leq \gamma u^{q}\), \(1< q\leq 2\), as in [24, 27]. Furthermore, Guo and Ye [28, 29] derived the convergence rates of the moving least-squares learning algorithm for weakly dependent and nonidentical samples. It remains open to explore greedy learning algorithms in the non-i.i.d. and nonidentical sampling setting.