# Convex Optimization on Banach Spaces

## Abstract

Greedy algorithms which use only function evaluations are applied to convex optimization in a general Banach space \(X\). Along with algorithms that use exact evaluations, algorithms with approximate evaluations are treated. A priori upper bounds for the convergence rate of the proposed algorithms are given. These bounds depend on the smoothness of the objective function and the sparsity or compressibility (with respect to a given dictionary) of a point in \(X\) where the minimum is attained.

### Keywords

Sparse Optimization Greedy Banach space Convergence rate Approximate evaluation

### Mathematics Subject Classification

Primary: 41A46; Secondary: 65K05, 41A65, 46B20

## 1 Introduction

Convex optimization is an important and well-studied subject of numerical analysis. The canonical setting for such problems is to find the minimum of a convex function \(E\) over a domain in \({\mathbb {R}}^d\). Various numerical algorithms have been developed for minimization problems, and a priori bounds for their performance have been proven. We refer the reader to [1, 9, 10, 11] for the core results in this area.

Solving the minimization problem (1.1) for \(E\) over a domain \(D\subset X\) is an example of a high-dimensional problem and is known to suffer from the curse of dimensionality without additional assumptions on \(E\) which serve to reduce its dimensionality. These additional assumptions take the form of smoothness restrictions on \(E\) and assumptions which imply that the minimum in (1.1) is attained on a subset of \(D\) with additional structure. Typical assumptions for the latter involve notions of sparsity or compressibility, which are by now heavily employed concepts for high-dimensional problems. We will always assume that there is a point \(x^*\in D\) where the minimum \(E^*\) is attained, \(E(x^*)=E^*\). We do not assume \(x^*\) is unique. Clearly, the set \(D^*=D^*(E)\subset D\) of all points where the minimum is attained is convex.

The algorithms studied in this paper utilize dictionaries \(\mathcal{D}\) of \(X\). A set of elements \(\mathcal{D}\subset X\) whose closed linear span coincides with \(X\) is called a *symmetric dictionary* if \(\Vert g\Vert :=\Vert g\Vert _X=1\) for all \(g\in \mathcal{D}\), and in addition \(g\in \mathcal{D}\) implies \(-g\in \mathcal{D}\). The simplest example of a dictionary is \(\mathcal{D}=\{\pm \varphi _j\}_{j\in \Gamma }\) where \(\{\varphi _j\}_{j\in \Gamma }\) is a Schauder basis for \(X\). In particular, for \(X={\mathbb {R}}^d\), one can take the canonical basis \(\{e_j\}_{j=1}^d\).

Given such a dictionary \(\mathcal{D}\), there are several types of domains \(D\) that are employed in applications. Sometimes, these domains are the natural domain of the physical problem. Other times, these are constraints imposed on the minimization problem to ameliorate high dimensionality. We mention the following three common settings.

**Sparsity constraints**

The set \(\Sigma _n(\mathcal{D})\) of functions

$$\begin{aligned} f=\sum _{g\in \Lambda }c_g g,\quad \Lambda \subset \mathcal{D},\quad |\Lambda |\le n, \end{aligned}$$

is called the set of sparse functions of order \(n\) with respect to the dictionary \(\mathcal{D}\). One common assumption is to minimize \(E\) on the domain \(D=\Sigma _n(\mathcal{D})\), i.e., to look for an \(n\)-sparse minimizer of (1.1).

**\(\ell _1\) constraints**

A more general setting is to minimize \(E\) over the closure \(A_1(\mathcal{D})\) (in \(X\)) of the convex hull of \(\mathcal{D}\). A slightly more general setting is to minimize \(E\) over one of the sets

$$\begin{aligned} \mathcal{L}_M:=\{g\in X:\ g/M\in A_1(\mathcal{D})\}. \end{aligned}$$

Sometimes \(M\) is allowed to vary, as in model selection or regularization algorithms from statistics. This is often referred to as \(\ell _1\) minimization.

**Unconstrained optimization**

Imposed constraints, such as sparsity or assuming \(D=A_1(\mathcal{D})\), are sometimes artificial and may not reflect the original optimization problem. We therefore also consider the unconstrained minimization where \(D=X\). We always make the assumption that the minimum of \(E\) is actually attained. Therefore, there is a point \(x^*\in X\) where

$$\begin{aligned} E(x^*)=E^*:=\inf _{x\in X}E(x). \end{aligned}$$

We do not require that \(x^*\) is unique. Notice that in this case the minimum \(E^*\) is attained on the set

$$\begin{aligned} D_0:=\{x\in X:\ E(x)\le E(0)\}. \end{aligned}$$

In what follows, we refer to minimization over \(D_0\) as the unconstrained minimization problem.

A typical greedy optimization algorithm builds approximations to \(E^*\) of the form \(E(G_m), m=1,2,\ldots \) where the elements \(G_m\) are built recursively using the dictionary \(\mathcal{D}\) and typically are in \(\Sigma _m(\mathcal{D})\). We will always assume that the initial point \(G_0\) is chosen as the 0 element. Given that \(G_{m-1}\) has been defined, one first searches for a direction \(\varphi _m\in \mathcal{D}\) for which \(E(G_{m-1}+\alpha \varphi _m)\) decreases significantly as \(\alpha \) moves away from zero. Once \(\varphi _m\) is chosen, one selects \(G_{m}=G_{m-1}+\alpha _m\varphi _m\) or more generally \(G_{m}=\alpha _m'G_{m-1}+\alpha _m\varphi _m\), using some recipe for choosing \(\alpha _m\) or more generally \(\alpha _m,\alpha _m'\). Algorithms of this type are referred to as *greedy algorithms* and will be the object of study in this paper.

There are different strategies for choosing \(\varphi _m\) and \(\alpha _m,\alpha _m'\) (see, for instance, [2, 3, 4, 6, 8, 13, 19, 20] and [7]). One possibility to choose \(\varphi _m\) is to use the Fréchet derivative \(E'(G_{m-1})\) of \(E\) to choose a steepest descent direction. This approach has been amply studied and various convergence results for steepest descent algorithms have been proven, even for the general Banach space setting. We refer the reader to the papers [17, 18, 20] which are representative of the convergence results known in this case. The selection of \(\alpha _m,\alpha _m'\) is commonly referred to as relaxation and is well studied in numerical analysis, although the Banach space setting needs additional attention.

Our interest in the present paper is in greedy algorithms that do not utilize \(E'\). They are preferred since \(E'\) is not given to us and therefore, in numerical implementations, must typically be approximated at any given step of the algorithm. We will analyze several different algorithms of this type which are distinguished from one another by how \(G_{m}\) is obtained from \(G_{m-1}\), both in the selection of \(\varphi _m\) and the parameters \(\alpha _m,\alpha _m'\). Our algorithms are built with ideas similar to the analogous, well-studied, greedy algorithms for approximation of a given element \(f\in X\). We refer the reader to [16] for a comprehensive description of greedy approximation algorithms.

In this introduction, we limit ourselves to two of the main algorithms studied in this paper. The first of these, which we call the relaxed \(E\)-Greedy algorithm (REGA(co)) was introduced in [20] under the name sequential greedy approximation.

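**Relaxed \(E\)-Greedy Algorithm (REGA(co))** We define \(G_0:=0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _m \in \mathcal{D}\) and \(0\le \lambda _m \le 1\) such that

$$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) = \inf _{0\le \lambda \le 1;\, g\in \mathcal{D}}E((1-\lambda )G_{m-1} + \lambda g) \end{aligned}$$

and define

$$\begin{aligned} G_m:= (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$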

We note that the REGA(co) is a modification of the classical Frank–Wolfe algorithm [5]. For convenience, we have assumed the existence of a minimizing \(\varphi _m\) and \(\lambda _m\). However, we also analyze algorithms with only approximate implementation which avoids this assumption.
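To make the relaxation step concrete, the following is a minimal sketch of REGA(co) in the very special case \(X={\mathbb {R}}^d\) with a finite dictionary. The helper names `combine`, `line_min` and `rega`, the use of ternary search for the one-dimensional minimization over \(\lambda \), and the quadratic test function are all assumptions made for illustration; the sketch only approximates the exact infimum required by the definition.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def combine(a: float, x: Sequence[float], b: float, y: Sequence[float]) -> Vector:
    """Return the vector a*x + b*y."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


def line_min(phi: Callable[[float], float], iters: int = 60) -> Tuple[float, float]:
    """Approximately minimize a convex function phi on [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if phi(a) <= phi(b):
            hi = b  # a convex phi has a minimizer in [lo, b]
        else:
            lo = a  # otherwise a minimizer lies in [a, hi]
    lam = 0.5 * (lo + hi)
    return lam, phi(lam)


def rega(E: Callable[[Vector], float], dictionary: Sequence[Vector], steps: int) -> Vector:
    """Sketch of REGA(co): G_m = (1 - lam) G_{m-1} + lam * phi, using values of E only."""
    G: Vector = [0.0] * len(dictionary[0])  # G_0 := 0
    for _ in range(steps):
        best_val, best_G = E(G), G  # lam = 0 is always admissible
        for g in dictionary:
            # One univariate convex problem per dictionary element.
            lam, val = line_min(lambda t: E(combine(1.0 - t, G, t, g)))
            if val < best_val:
                best_val, best_G = val, combine(1.0 - lam, G, lam, g)
        G = best_G
    return G


# Hypothetical example: X = R^2, dictionary {+-e_1, +-e_2}, minimizer inside A_1(D).
target = [0.3, -0.2]
E = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
D = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
G5 = rega(E, D, steps=5)
```

Each iterate is a convex combination of \(0\) and dictionary elements, so in this example it stays in the \(\ell _1\) unit ball, which is \(A_1(\mathcal{D})\) for this dictionary.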

Observe that this algorithm is in a sense built for \(A_1(\mathcal{D})\) because each \(G_m\) is obviously in \(A_1(\mathcal{D})\). The next algorithm, called the \(E\)-Greedy algorithm with free relaxation (EGAFR(co)), makes some modifications in the relaxation step that will allow it to be applied to the more general unconstrained minimization problem on \(D_0\).

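**\(E\)-Greedy Algorithm with free relaxation (EGAFR(co))** We define \(G_0:= 0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _m \in \mathcal{D}, \alpha _m, \beta _m\in {\mathbb {R}}\) satisfying (assuming existence)

$$\begin{aligned} E(\alpha _m G_{m-1} + \beta _m\varphi _m) = \inf _{\alpha ,\beta \in {\mathbb {R}};\, g\in \mathcal{D}}E(\alpha G_{m-1} + \beta g) \end{aligned}$$

and define

$$\begin{aligned} G_m:= \alpha _m G_{m-1} + \beta _m\varphi _m. \end{aligned}$$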

The following theorem for REGA(co) is a prototype of the results proved in this paper.

**Theorem 1.1**

- (i) If \(E\) is uniformly smooth on \( A_1(\mathcal{D})\), then the REGA(co) converges: $$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =E^*. \end{aligned}$$ (1.7)
- (ii) If in addition, \(\rho (E,A_1(\mathcal{D}), u) \le \gamma u^q, 1<q\le 2\), then $$\begin{aligned} E(G_m)-E^*\le C(q,\gamma )m^{1-q}, \end{aligned}$$ (1.8)

The case \(q=2\) of this theorem was proved in [20]. We prove this theorem in Sect. 2.

The following theorem states the convergence properties of the EGAFR(co).

**Theorem 1.2**

- (i) The EGAFR(co) converges: $$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in X}E(x)=\inf _{x\in D_0} E(x)=E^*. \end{aligned}$$
- (ii) If the modulus of smoothness of \(E\) satisfies \(\rho (E,u)\le \gamma u^q, 1<q\le 2\), then the EGAFR(co) satisfies $$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ) \epsilon _m, \end{aligned}$$ (1.11)

- (1) Choose \(\varphi _m \in {\mathcal {D}}\) as any element satisfying $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal {D}}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$
- (2) Find \(w_m\) and \( \lambda _m\) such that $$\begin{aligned} E((1-w_m)G_{m-1} + \lambda _m\varphi _m) = \inf _{ \lambda ,w}E((1-w)G_{m-1} + \lambda \varphi _m) \end{aligned}$$

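**Relaxed \(E\)-Greedy Algorithm with error \(\delta \) (REGA(\(\delta \)))** Let \(\delta \in (0,1]\). We define \(G_0:=0\). Then, for each \(m\ge 1\) we have the following inductive definition: We take any \(\varphi _m \in \mathcal{D}\) and \(0\le \lambda _m \le 1\) satisfying

$$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{0\le \lambda \le 1;\, g\in \mathcal{D}}E((1-\lambda )G_{m-1} + \lambda g)+\delta \end{aligned}$$

and define

$$\begin{aligned} G_m:= (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$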

**Theorem 1.3**

In the case \(q=2\), Theorem 1.3 was proved in [20]. We note that our analysis is different from that in [20].

## 2 Analysis of Greedy Algorithms

We begin this section by showing how to prove the results for REGA(co) and EGAFR(co) stated in the introduction, namely Theorems 1.1 and 1.2. The proof of convergence results for greedy algorithms is typically done by establishing a recursive inequality for the error \(E(G_n)-E^*\). To analyze the decay of this sequence of errors, we will need the following lemma.

**Lemma 2.1**

*Proof*

In the case \(p\ge 1\), which is used in this paper, this follows from Lemma 2.16 of [16]. In the case \(p\ge 1\), Lemma 2.1 was often used in greedy approximation in Banach spaces (see [16, Chapter 6]). For the general case \(p>0\), see Lemma 4.2 of [12]. \(\square \)

To establish a recursive inequality for the error in REGA(co), we will use the following lemma about REGA(co).

**Lemma 2.2**

*Proof*

*Proof of Theorem 1.1*

*Proof of Theorem 1.2*

This proof is derived from results in [17] in a similar way to how we have proved Theorem 1.1 for REGA(co). An algorithm, called WGAFR(co), was introduced in [17] which differs from EGAFR(co) only in how each \(\varphi _m\) is chosen. One then uses the analysis of WGAFR(co). Also, part (ii) of Theorem 1.2 follows from Theorem 3.8 with \(\delta =0\). \(\square \)

The above-discussed algorithms REGA(co) and EGAFR(co) provide sparse approximate solutions to the corresponding optimization problems. These approximate solutions are sparse with respect to the given dictionary \(\mathcal{D}\), but they are not obtained as an expansion with respect to \(\mathcal{D}\). This means that at each iteration of these algorithms we update all the coefficients of sparse approximants. Sometimes it is important to build an approximant in the form of an expansion with respect to \(\mathcal{D}\). The reader can find a discussion of greedy expansions in [16, Section 6.7]. For comparison with the algorithms we have already introduced, we recall a greedy-type algorithm for unconstrained optimization, introduced and analyzed in [18], which uses only function values and builds sparse approximants in the form of an expansion. Let \(\mathcal{C}:=\{c_m\}_{m=1}^\infty \) be a fixed sequence of positive numbers.

**\(E\)-Greedy Algorithm with coefficients \(\mathcal{C}\) (EGA(\(\mathcal{C}\)))** We define \(G_0:=0\). Then, for each \(m\ge 1\) we have the following inductive definition:

- (i) Let \(\varphi _m\in \mathcal{D}\) be such that (assuming existence) $$\begin{aligned} E(G_{m-1}+c_m\varphi _m)=\inf _{g\in \mathcal{D}}E(G_{m-1}+c_m g). \end{aligned}$$
- (ii) Then define $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m. \end{aligned}$$
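The two steps above translate almost directly into code when \(X={\mathbb {R}}^d\) and the dictionary is finite. The sketch below is illustrative only: the function name `ega`, the coefficient sequence \(c_m=0.5/m\), and the quadratic objective are assumptions chosen for the example, not choices made in the text. A counter records how often \(E\) is evaluated.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def ega(E: Callable[[Vector], float],
        dictionary: Sequence[Vector],
        coeffs: Sequence[float]) -> Vector:
    """Sketch of EGA(C): G_m = G_{m-1} + c_m * phi_m with prescribed steps c_m."""
    G: Vector = [0.0] * len(dictionary[0])  # G_0 := 0
    for c in coeffs:
        # Step (i): one evaluation of E per dictionary element.
        values = [E([Gi + c * gi for Gi, gi in zip(G, g)]) for g in dictionary]
        phi = dictionary[values.index(min(values))]
        # Step (ii): the coefficient c_m is fixed in advance, no relaxation.
        G = [Gi + c * pi for Gi, pi in zip(G, phi)]
    return G


# Hypothetical example in R^2 with dictionary {+-e_1, +-e_2}.
target = [0.3, -0.2]
calls = [0]  # mutable counter of evaluations of E


def E(x: Vector) -> float:
    calls[0] += 1
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


D = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
coeffs = [0.5 / m for m in range(1, 201)]  # assumed sequence, for illustration
G = ega(E, D, coeffs)
```

With \(N\) dictionary elements and \(m\) iterations, the loop evaluates \(E\) exactly \(Nm\) times, here \(4\cdot 200\).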

**Theorem 2.3**

**Theorem 2.4**

Let us now turn to a brief comparison of the above algorithms and their known convergence rates. The REGA(co) is designed for solving optimization problems on domains \(D\subset A_1(\mathcal{D})\) and requires that \(D^*\cap A_1(\mathcal{D})\ne \emptyset \). The EGAFR(co) is not limited to \(A_1(\mathcal{D})\) but applies to any optimization domain as long as \(E\) achieves its minimum on a bounded domain. As we have noted earlier, if \(D^*\cap A_1(\mathcal{D})\ne \emptyset \), then EGAFR(co) provides the same convergence rate \(O(m^{1-q})\) as REGA(co). Thus, EGAFR(co) is more robust and requires the solution of only a slightly more involved minimization at each iteration.

The advantage of \(\hbox {EGA}(\mathcal{C})\) is that it solves a simpler minimization problem at each iteration since the relaxation parameters are set in advance. However, it requires knowledge of the smoothness order \(q\) of \(E\) and also gives a poorer rate of convergence than REGA(co) and the EGAFR(co).

To continue this discussion, let us consider the very special case where \(X=\ell ^d_p\) and the dictionary \(\mathcal{D}\) is finite, say \(\mathcal{D}=\{g_j\}_{j=1}^N\). In such a case, the existence of \(\varphi _m\) in all the above algorithms is easily proven. The \(\hbox {EGA}(\mathcal{C})\) simply uses \(Nm\) function evaluations to make \(m\) iterations. The REGA(co) solves a one-dimensional optimization problem at each iteration for each dictionary element, thus \(N\) such problems. We discuss this problem in Sect. 4 and show that each such problem can be solved with exponential accuracy with respect to the number of evaluations needed from \(E\).

## 3 Approximate Greedy Algorithms for Convex Optimization

We turn now to the main topic of this paper, which is modifications of the above greedy algorithms to allow imprecise calculations or less strenuous choices for descent directions and relaxation parameters. We begin with a discussion of the weak relaxed greedy algorithm WRGA(co), which was introduced and analyzed in [17] and which we already referred to in Sect. 2. The WRGA(co) uses the gradient to choose a steepest descent direction at each iteration. The interesting aspect of WRGA(co), relative to imprecise calculations, is that it uses a weakness parameter \(t_m<1\) to allow some relative error in estimating \(\sup _{g\in \mathcal{D}} \langle -E'(G_{m-1}),g - G_{m-1}\rangle \). Here and below we use a convenient bracket notation: for a functional \(F\in X^*\) and an element \(f\in X\) we write \(F(f)= \langle F,f\rangle \). We concentrate on a modification of the second step of WRGA(co). Very often, we cannot calculate values of \(E\) exactly. Even when we can evaluate \(E\) exactly, we may not be able to find the exact value of \(\inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda \varphi _m)\). This motivates us to study the following modification of the WRGA(co). Let \(\tau :=\{t_k\}_{k=1}^\infty , t_k\in [0,1], k=1,2,\ldots \), be a weakness sequence.

**Weak Relaxed Greedy Algorithm with error \(\delta \) (WRGA(\(\delta \)))**. Let \(\delta \in (0,1]\). We define \(G_0:= 0\). Then, for each \(m\ge 1\), we have the following inductive definition.

- (1) \(\varphi _m:= \varphi ^{\delta ,\tau }_m \in \mathcal{D}\) is any element satisfying $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m - G_{m-1}\rangle \ge t_m \sup _{g\in \mathcal{D}} \langle -E'(G_{m-1}),g - G_{m-1}\rangle . \end{aligned}$$
- (2) Then \(0\le \lambda _m \le 1\) is chosen as any number such that $$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda \varphi _m)+\delta . \end{aligned}$$
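The two steps can be sketched in \({\mathbb {R}}^d\) with a finite dictionary as follows. Several things here are assumptions made for illustration: the name `wrga`, taking the first dictionary element that meets the weak criterion in step (1), updating \(G_m:=(1-\lambda _m)G_{m-1}+\lambda _m\varphi _m\) as in the relaxed algorithms, and a uniform grid over \([0,1]\) as a stand-in for any procedure that reaches the infimum in step (2) up to \(\delta \). The value of \(\delta \) actually achieved by the grid depends on its resolution and on how fast \(E\) varies along the segment.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def wrga(E: Callable[[Vector], float],
         grad_E: Callable[[Vector], Vector],
         dictionary: Sequence[Vector],
         weakness: Callable[[int], float],
         steps: int,
         grid: int = 100) -> Vector:
    """Sketch of WRGA(delta) with a grid search standing in for step (2)."""
    G: Vector = [0.0] * len(dictionary[0])  # G_0 := 0
    for m in range(1, steps + 1):
        minus_grad = [-d for d in grad_E(G)]
        # Step (1): scores <-E'(G), g - G> for every g in the dictionary.
        scores = [sum(a * (gi - Gi) for a, gi, Gi in zip(minus_grad, g, G))
                  for g in dictionary]
        best = max(scores)
        threshold = weakness(m) * best
        # Any element meeting the weak criterion is allowed; take the first one,
        # falling back to the maximizer if rounding leaves none.
        phi = next((g for g, s in zip(dictionary, scores) if s >= threshold),
                   dictionary[scores.index(best)])
        # Step (2): approximate minimization over lambda in [0, 1] on a grid.
        lam = min((k / grid for k in range(grid + 1)),
                  key=lambda t: E([(1.0 - t) * Gi + t * pi for Gi, pi in zip(G, phi)]))
        G = [(1.0 - lam) * Gi + lam * pi for Gi, pi in zip(G, phi)]
    return G


# Hypothetical example: quadratic objective in R^2, constant weakness t_m = 1/2.
target = [0.3, -0.2]
E = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
grad_E = lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)]
D = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
G = wrga(E, grad_E, D, weakness=lambda m: 0.5, steps=30)
```

Because \(\lambda =0\) belongs to the grid, the values \(E(G_m)\) produced by this sketch never increase from one iteration to the next.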

**Theorem 3.1**

**Lemma 3.2**

We use these remarks to prove the following.

**Lemma 3.3**

*Proof*

Finally, for the proof of Theorem 3.1, we will need the following result about sequences.

**Lemma 3.4**

*Proof*

*Proof of Theorem 3.1*

We can establish a similar convergence result for the \(\hbox {REGA}(\delta )\).

**Theorem 3.5**

*Proof*

We now introduce and analyze an approximate version of the WGAFR(co).

**Weak Greedy Algorithm with free relaxation and error \(\delta \) (WGAFR(\(\delta \)))**. Let \(\tau :=\{t_m\}_{m=1}^\infty , t_m\in [0,1]\), be a weakness sequence. We define \(G_0 := 0\). Then, for each \(m\ge 1\), we have the following inductive definition.

- (1) \(\varphi _m \in \mathcal{D}\) is any element satisfying $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in \mathcal{D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$ (3.19)
- (2) Find \(w_m\) and \( \lambda _m\) such that $$\begin{aligned} E((1-w_m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{ \lambda ,w}E((1-w)G_{m-1} + \lambda \varphi _m) +\delta \end{aligned}$$ and define $$\begin{aligned} G_m:= (1-w_m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$

**Theorem 3.6**

*Proof*

We begin with a lemma.

**Lemma 3.7**

*Proof*

We have discussed above two algorithms, the \(\hbox {WRGA}(\delta )\) and the \(\hbox {REGA}(\delta )\). Results for the \(\hbox {REGA}(\delta )\) (see Theorem 3.5) were derived from the proof of the corresponding results for the \(\hbox {WRGA}(\delta )\) (see Theorem 3.1). We now discuss a companion algorithm for the \(\hbox {WGAFR}(\delta )\) that uses only function evaluations.

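**\(E\)-Greedy Algorithm with free relaxation and error \(\delta \) (EGAFR(\(\delta \)))**. We define \(G_0:=0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _{m} \in \mathcal {D}\) and \(\alpha _m\), \(\beta _{m} \in \mathbb {R}\) satisfying

$$\begin{aligned} E(\alpha _m G_{m-1} + \beta _m\varphi _m) \le \inf _{\alpha ,\beta \in \mathbb {R};\, g\in \mathcal{D}}E(\alpha G_{m-1} + \beta g)+\delta \end{aligned}$$

and define

$$\begin{aligned} G_m:= \alpha _m G_{m-1} + \beta _m\varphi _m. \end{aligned}$$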

**Theorem 3.8**

Theorem 2.4 provides the rate of convergence of the \(\hbox {EGA}(\mathcal{C})\) where we assume that function evaluations are exact and we can find \(\inf _{g\in \mathcal{D}}\) exactly. However, in practice, we very often cannot evaluate functions exactly or cannot find the exact value of \(\inf _{g\in \mathcal{D}}\). In order to address this issue, we modify the \(\hbox {EGA}(\mathcal{C})\) into the following algorithm \(\hbox {EGA}(\mathcal{C},\delta )\).

**\(E\)-Greedy Algorithm with coefficients \(\mathcal{C}\) and error \(\delta \) (EGA(\(\mathcal{C},\delta \)))**. Let \(\delta \in (0,1]\). We define \(G_0:=0\). Then, for each \(m\ge 1\), we have the following inductive definition.

- (1) \(\varphi _m^\delta \in \mathcal{D}\) is such that $$\begin{aligned} E(G_{m-1}+c_m\varphi _m^\delta )\le \inf _{g\in \mathcal{D}}E(G_{m-1}+c_m g)+\delta . \end{aligned}$$
- (2) Let $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m^\delta . \end{aligned}$$

**Theorem 3.9**

We first accumulate some results that we will use in the proof of this theorem. Let \(N:=[\delta ^{-\frac{1}{1+r}}]\), where \([a]\) is the integer part of \(a\) and let \(G_m, m\ge 0\) be the sequence generated by the \(\hbox {EGA}(\mathcal{C}_s,\delta )\).

**Claim 1**

\(G_m\in D_3\), i.e., \(E(G_m)\le E(0)+3\), for all \(0\le m\le N\).

We also need the following lemma from [18].

**Lemma 3.10**

*Proof of Theorem 3.9*

**Lemma 3.11**

## 4 Univariate Convex Optimization

The relaxation step in each of the above algorithms involves either a univariate or bivariate optimization of a convex function. The univariate optimization problem, called *line search*, is well studied in optimization theory (see [10]). The purpose of the remaining two sections of this paper is to show that such problems can be solved efficiently. The results of these two sections are known. We present them here for completeness.

**Proposition 4.1**

We next analyze what happens if we do not receive the exact values of \(f\) when we query in the above algorithm. We assume that when we query \(f\) at a point \(x\), we receive the corrupted value \(y(x)\) where \(|f(x)-y(x)|\le \delta \) for each \(x\in [0,1]\). We assume that we know \(\delta \).

**Proposition 4.2**

*Proof*

**A.** Suppose \(y(0)\le y(1/2)\). Then \(f(0) \le f(1/2) +2\delta \) and by (4.3) with \(a=0, b=1/2, c=1/2, d=x, x\in (1/2,1]\) we obtain

**B.** Suppose \(y(1/2)<y(0)\). In this case, we make an additional evaluation of the function at \(1/4\).

**Ba.** Suppose \(y(1/4)< y(1/2)-2\delta \). Then \(f(1/4)<f(1/2)\) and by (4.3), we obtain that

**Bb.** Suppose \(y(1/4)\ge y(1/2)-2\delta \). In this case, we make an additional evaluation of the function at \(3/4\). If \(y(3/4)< y(1/2)-2\delta \) then as in **Ba** we can restrict our search to the interval \([1/2,1]\). If \(y(3/4)\ge y(1/2)-2\delta \) we argue as in the case **A** and obtain

At each iteration, we add two evaluations and then find that we can restrict our search to an interval of half the size of the original while incurring an additional error of at most \(4\delta \). Finally, replacing the value of \(f\) by the observed value of \(y\) incurs an additional error of at most \(\delta \). \(\square \)
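The effect of corrupted evaluations on an interval-shrinking search can be illustrated with a short sketch. The code below is not the halving scheme of the proof above; it uses a standard ternary search driven by the corrupted values \(y\), and the function name `noisy_ternary_min`, the test function \(f(x)=|x-0.37|\) and the uniform noise model are assumptions for illustration. For a convex \(f\), a comparison of two corrupted values can discard only points whose value is within \(2\delta \) of a retained value, so the loss grows at most linearly in the number of iterations while the interval shrinks geometrically.

```python
import random
from typing import Callable, Tuple


def noisy_ternary_min(y: Callable[[float], float], iters: int) -> Tuple[float, float]:
    """Ternary search on [0, 1] using only corrupted evaluations y of a convex f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if y(a) <= y(b):
            hi = b  # points in (b, hi] have f >= f(b) - 2*delta by convexity
        else:
            lo = a  # points in [lo, a) have f >= f(a) - 2*delta by convexity
    x = 0.5 * (lo + hi)
    return x, y(x)


# Hypothetical example: f is convex with Lipschitz constant 1, noise level delta.
delta = 1e-3
rng = random.Random(0)
f = lambda x: abs(x - 0.37)
y = lambda x: f(x) + rng.uniform(-delta, delta)
iters = 20
x_hat, y_hat = noisy_ternary_min(y, iters)
```

In this example the excess \(f(\hat{x})-\min f\) is bounded by \(2\delta \cdot 20\) plus half the length \((2/3)^{20}\) of the final interval, whatever the realization of the noise.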

We note that convexity of functions from \(F\) plays a dominating role in obtaining exponential decay of error in Proposition 4.1. For instance, the following simple known statement holds for the \(\mathrm{Lip}_1 1\) class.

**Proposition 4.3**

*Proof*

The upper bound follows from evaluating \(f\) at the midpoints \(x_j\) of the intervals \([(j-1)/m,j/m], j=1,\ldots ,m\) and giving the approximate value \(\min \nolimits _{j}f(x_j)-\frac{1}{4m}\). The lower bound follows from the following observation. For any \(m\) points \(0\le \xi _1<\xi _2<\cdots <\xi _m\le 1\) there are two functions \(f_1,f_2\in \mathrm{Lip}_11\) such that \(f_1(\xi _j)=f_2(\xi _j) =0\) for all \(j\) and \(\min \nolimits _{x}f_1(x) -\min \nolimits _xf_2(x) \ge \frac{1}{2m}\). \(\square \)
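The upper-bound construction in this proof is easy to write down. The sketch below, with the assumed name `lipschitz_min_estimate` and an assumed test function, evaluates \(f\) at the \(m\) midpoints and shifts the smallest value down by \(\frac{1}{4m}\). Since every point of \([0,1]\) is within \(\frac{1}{2m}\) of a midpoint, the true minimum lies between \(\min _j f(x_j)-\frac{1}{2m}\) and \(\min _j f(x_j)\), so the shifted value is off by at most \(\frac{1}{4m}\).

```python
from typing import Callable


def lipschitz_min_estimate(f: Callable[[float], float], m: int) -> float:
    """Estimate min of a Lip_1 1 function on [0, 1] from m midpoint evaluations."""
    midpoints = [(2 * j - 1) / (2 * m) for j in range(1, m + 1)]
    return min(f(x) for x in midpoints) - 1.0 / (4 * m)


# Hypothetical example: f(x) = |x - 0.3| has Lipschitz constant 1 and minimum 0.
f = lambda x: abs(x - 0.3)
m = 10
estimate = lipschitz_min_estimate(f, m)
```

The error decays only like \(1/m\) here, in contrast with the exponential accuracy available for convex functions.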

## 5 Multivariate Convex Optimization

In this section, we discuss an analog of Proposition 4.1 for \(d\)-variate convex functions on \([0,1]^d\). The \(d\)-variate algorithm is a coordinatewise application of the algorithm from Proposition 4.1 with an appropriate \(\delta \). We begin with a simple lemma.

**Lemma 5.1**

*Proof*

**Proposition 5.1**

*Proof*
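Separately from the proof, the coordinatewise idea described at the start of this section can be illustrated for \(d=2\). In the sketch below, which is an assumption-laden illustration rather than the algorithm of Proposition 5.1, an inner univariate search supplies approximate values of \(g(x_1)=\min _{x_2}f(x_1,x_2)\), which is again convex, and an outer univariate search minimizes \(g\). Ternary search is used for both levels, and the names `ternary_min`, `nested_min_2d` and the quadratic test function are hypothetical.

```python
from typing import Callable, Tuple


def ternary_min(phi: Callable[[float], float], iters: int = 40) -> Tuple[float, float]:
    """Approximately minimize a convex function phi on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if phi(a) <= phi(b):
            hi = b
        else:
            lo = a
    x = 0.5 * (lo + hi)
    return x, phi(x)


def nested_min_2d(f: Callable[[float, float], float],
                  iters: int = 40) -> Tuple[float, float, float]:
    """Minimize a convex f on [0, 1]^2 by an outer search over inner searches."""

    def inner_value(x1: float) -> float:
        # Approximate value of g(x1) = min over x2 of f(x1, x2).
        return ternary_min(lambda x2: f(x1, x2), iters)[1]

    x1, value = ternary_min(inner_value, iters)
    x2, _ = ternary_min(lambda t: f(x1, t), iters)
    return x1, x2, value


# Hypothetical convex example with minimizer (11/30, 29/60) and minimum 49/1200.
f = lambda x1, x2: (x1 - 0.25) ** 2 + (x2 - 0.6) ** 2 + (x1 - x2) ** 2
x1, x2, value = nested_min_2d(f)
```

Each outer evaluation costs one full inner search, so the number of evaluations of \(f\) is roughly the square of the univariate count, in line with the \((\log 1/\delta )^2\) count quoted in the concluding section.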

## 6 Conclusion

One more important feature of this paper is that we use function evaluations and do not utilize the gradient \(E'\). Clearly, in the setting of an infinite-dimensional Banach space, the proposed algorithms are not algorithms in a strict sense. However, when \(X\) is finite dimensional and \({\mathcal {D}}\) is finite, they are algorithms in a strict sense. In such a situation we can compare the complexities of, say, the \(\hbox {WGAFR}(\delta )\), which utilizes \(E'\), and the \(\hbox {EGAFR}(\delta )\), which does not. At the greedy step (1) of the \(\hbox {WGAFR}(\delta )\), generally speaking, we need to evaluate all \(\langle -E'(G_{m-1}),g\rangle , g\in {\mathcal {D}}\), in order to choose \(\varphi _m\). At the greedy step of the \(\hbox {EGAFR}(\delta )\), we need to solve \(N:=|{\mathcal {D}}|\) two-dimensional convex optimization problems. Proposition 5.1 shows that this extra work of optimization requires about \((\log 1/\delta )^2\) function evaluations per dictionary element. Therefore, roughly, for the \(\hbox {WGAFR}(\delta )\), we need to evaluate \(N\) inner products, and for the \(\hbox {EGAFR}(\delta )\), we need to make \(N(\log 1/\delta )^2\) function evaluations. This comparison is made under the assumption that in the case of the \(\hbox {WGAFR}(\delta )\) the gradient \(E'(G_{m-1})\) is known.

The most important results of the paper are in Sect. 3, where we allow approximate evaluations. Theorems 3.1, 3.5, 3.6, and 3.8, proved in that section, demonstrate that for the number of iterations \(m\le \delta ^{-1/q}\) the error \(\delta \) in approximate evaluations does not affect the upper bound of the error of the optimization algorithm. We do not know if the restriction \(m\le \delta ^{-1/q}\) is the best possible in these theorems. It is known and easy to check on examples from approximation (see, for instance, [16, p. 346]) that the error rate \(m^{1-q}\) for optimization over \(A_1({\mathcal {D}})\) is the best possible.

### References

- 1. J.M. Borwein and A.S. Lewis, Convex Analysis and Nonlinear Optimization. Theory and Examples, Canadian Mathematical Society, Springer, 2006.
- 2. V. Chandrasekaran, B. Recht, P.A. Parrilo, and A.S. Willsky, The convex geometry of linear inverse problems, Proceedings of the 48th Annual Allerton Conference on Communication, Control and Computing, 2010, 699–703.
- 3. K.L. Clarkson, Coresets, Sparse Greedy Approximation, and the Frank–Wolfe Algorithm, ACM Transactions on Algorithms, **6** (2010), Article No. 63.
- 4. M. Dudik, Z. Harchaoui, and J. Malick, Lifted coordinate descent for learning with trace-norm regularization, In AISTATS, 2012.
- 5. M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, 3 (1956), 95–110.
- 6. M. Jaggi, Sparse Convex Optimization Methods for Machine Learning, PhD thesis, ETH Zürich, 2011.
- 7. M. Jaggi, Revisiting Frank–Wolfe: Projection-Free Sparse Convex Optimization, Proceedings of the \(30^{th}\) International Conference on Machine Learning, Atlanta, Georgia, USA, 2013.
- 8. M. Jaggi and M. Sulovský, A Simple Algorithm for Nuclear Norm Regularized Problems, ICML, 2010.
- 9. V.G. Karmanov, Mathematical Programming, Mir Publishers, Moscow, 1989.
- 10. A. Nemirovski, Optimization II: Numerical methods for nonlinear continuous optimization, Lecture Notes, Israel Institute of Technology, 1999.
- 11. Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Boston, 2004.
- 12. H. Nguyen and G. Petrova, *Greedy strategies for convex optimization*, arXiv:1401.1754v1 [math.NA] 8 Jan 2014.
- 13. S. Shalev-Shwartz, N. Srebro, and T. Zhang, Trading accuracy for sparsity in optimization problems with sparsity constraints, SIAM Journal on Optimization, 20(6) (2010), 2807–2832.
- 14. V.N. Temlyakov, Greedy Algorithms and \(m\)-term Approximation With Regard to Redundant Dictionaries, J. Approx. Theory 98 (1999), 117–145.
- 15. V.N. Temlyakov, Greedy-Type Approximation in Banach Spaces and Applications, Constr. Approx., 21 (2005), 257–292.
- 16. V.N. Temlyakov, Greedy approximation, Cambridge University Press, 2011.
- 17. V.N. Temlyakov, Greedy approximation in convex optimization, IMI Preprint, 2012:03, 1–25; arXiv:1206.0392v1, 2 Jun 2012 (to appear in Constructive Approximation).
- 18. V.N. Temlyakov, Greedy expansions in convex optimization, Proceedings of the Steklov Institute of Mathematics, **284** (2014), 244–262 (arXiv:1206.0393v1, 2 Jun 2012).
- 19. A. Tewari, P. Ravikumar, and I.S. Dhillon, Greedy Algorithms for Structurally Constrained High Dimensional Problems, preprint, (2012), 1–10.
- 20. T. Zhang, Sequential greedy approximation for certain convex optimization problems, IEEE Transactions on Information Theory, 49(3) (2003), 682–691.