1 Introduction

In this paper, we consider the general difference-of-convex (DC) optimization problem,

$$\begin{aligned}&\inf \;f(x):=f_1(x)-f_2(x)\nonumber \\&\text {s.t.}\; x\in \mathbb {R}^n, \end{aligned}$$
(1)

where \(f_1, f_2\) are extended convex functions on \(\mathbb {R}^n\) and f is an extended lower-semicontinuous function on \(\mathbb {R}^n\). Throughout the paper, we assume that the infimum in problem (1) is finite, and denote by \(f^\star \) a lower bound of f on \(\mathbb {R}^n\).

DC problems appear naturally in many applications, e.g., power allocation in digital communication systems [4], production-transportation planning [22], location planning [13], image processing [31], sparse signal recovery [17], cluster analysis [7, 8], and supervised data classification [6, 29], to name but a few.

This wide range of applications is to be expected, since some important classes of non-convex functions may be represented as DC functions. For instance, twice continuously differentiable functions on any convex subset of \(\mathbb {R}^n\) [20] and continuous piece-wise linear functions [34] may be written as DC functions. Furthermore, every continuous function on a compact and convex set can be approximated by a DC function [23, 44]. We refer the interested reader to Hiriart–Urruty [21] and Tuy [44] for more information on DC representable functions.

The celebrated difference-of-convex algorithm (DCA), also known as the convex–concave procedure, has been applied extensively to problem (1); see [28, 30, 40] and the references therein. Algorithm 1 presents the basic form of the DCA.

Algorithm 1 (DCA) Choose a starting point \(x^1\in \mathbb {R}^n\). For \(k=1, 2, \ldots \) (until a termination criterion is satisfied): pick \(g_2^k\in \partial f_2(x^k)\) and compute

$$\begin{aligned} x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n}\ f_1(x)-f_2(x^k)-\left\langle g_2^k, x-x^k\right\rangle . \end{aligned}$$

(2)

In the description of the DCA in Algorithm 1, (sub)gradients of \(f_1\) and \(f_2\) are assumed to be available at given points, the so-called black-box formulation. The DCA is sometimes also presented as a primal-dual method, where a dual sub-problem is solved to obtain the required (sub)gradients; see [28, 30] for further discussions of this topic. In recent years, several extensions and new variants of the DCA have also been proposed; see [19, 32, 33, 36, 39].
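To make the black-box formulation concrete, the following minimal Python sketch (ours, not from the original sources; the toy decomposition \(f_1(x)=\tfrac{1}{2}\Vert x\Vert ^2\), \(f_2(x)=\Vert x\Vert _1\) is chosen purely for illustration) implements Algorithm 1 with the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \) used later in the paper.

```python
import numpy as np

def dca(grad_f1, subgrad_f2, argmin_step, x1, eps=1e-6, max_iter=100):
    """Basic DCA (Algorithm 1) in black-box form: at x^k, pick g_2^k in the
    subdifferential of f_2, then minimize the convex model
    f_1(x) - f_2(x^k) - <g_2^k, x - x^k> to obtain x^{k+1}."""
    x = x1
    for k in range(max_iter):
        g2 = subgrad_f2(x)                    # g_2^k
        g1 = grad_f1(x)                       # g_1^k
        if np.linalg.norm(g1 - g2) <= eps:    # termination criterion
            return x, k
        x = argmin_step(g2)                   # x^{k+1} minimizes f_1(x) - <g_2^k, x>
    return x, max_iter

# Toy DC decomposition (illustration only): f(x) = 0.5*||x||^2 - ||x||_1.
grad_f1 = lambda x: x                         # gradient of f_1(x) = 0.5*||x||^2
subgrad_f2 = lambda x: np.sign(x)             # a subgradient of f_2(x) = ||x||_1
argmin_step = lambda g2: g2                   # argmin_x 0.5*||x||^2 - <g2, x>

x_final, iters = dca(grad_f1, subgrad_f2, argmin_step, x1=np.array([0.3, -2.0]))
print(x_final, iters)                         # -> [ 1. -1.] 1, a critical point of f
```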

The first convergence results for Algorithm 1 were given in [40, Theorem 3(iv)]. The authors showed that, if the sequence of iterates \(\{x^k\}\) is bounded, then each accumulation point of this sequence is a critical point of f.

Le Thi et al. [27] established an asymptotic linear convergence rate of \(\{x^k\}\) under some conditions, in particular under the assumption that f satisfies the Łojasiewicz gradient inequality at all stationary points. Recall that a differentiable function f is said to satisfy this inequality at a stationary point a (\(\nabla f(a) = 0\)), if there exist constants \(\theta \in (0,1)\), \(C > 0\) and \(\epsilon >0\) such that

$$\begin{aligned} |f(x) - f(a)|^\theta \le C\Vert \nabla f(x)\Vert \text{ if } \Vert x-a\Vert \le \epsilon , \end{aligned}$$
(3)

where the constant \(\theta \) is called the Łojasiewicz exponent. This inequality is known to hold, for example, for real analytic functions, but has been extended to include classes of non-smooth functions as well by considering general sub-differentials instead of gradients; see [10, 11] and the references therein.

The convergence rates established by Le Thi et al. [27] depend on the value of the Łojasiewicz exponent, as the following theorem shows. The theorem stated here is a special case of Theorems 3.4 and 3.5 in [27], to give a flavor of the convergence results in [27].

Theorem 1.1

(Theorems 3.4 and 3.5 in Le Thi et al. [27]) Let \(f_1\) and \(f_2\) be proper convex functions and let the domain of f be closed. Also assume that at least one of \(f_1\) and \(f_2\) is strongly convex, and that \(f_1\) or \(f_2\) is differentiable with locally Lipschitz gradient at every critical point of the DC problem. Finally, assume the sequence \(\{x^k\}\) is bounded, and let \(x^\infty \) be a limit point of \(\{x^k\}\). Then \(x^\infty \) is also a stationary point. Moreover, if f satisfies the Łojasiewicz gradient inequality (3) at all stationary points, then

  1. if \(\theta \in (1/2,1)\), then \(\Vert x^k-x^\infty \Vert \le ck^{\tfrac{1-\theta }{1-2\theta }}\) for some \(c>0\).

  2. if \(\theta \in (0,1/2]\), then \(\Vert x^k-x^\infty \Vert \le cq^k\) for some \(c>0\) and \(q\in (0,1)\).

In particular, item 2 shows a linear convergence rate when \(\theta \in (0,1/2]\). Yen et al. [45] had already shown linear convergence earlier for a much smaller class of DC functions. We will present a complementary result to this theorem (see Theorem 5.1), for the case \(\theta = 1/2\), where we show linear convergence of the objective function values and give explicit expressions for the constants that determine the linear convergence rate. Moreover, we will relax the assumption of a bounded sequence of iterates, and the assumption of strong convexity.

In the absence of conditions like the Łojasiewicz gradient inequality (3), only weaker convergence rates are known for the DCA. In particular, Tao and An [40, Proposition 2] and Le Thi et al. [26, Corollary 1] have shown an \(O\left( \frac{1}{\sqrt{N}} \right) \) convergence rate after N iterations under suitable assumptions, as given in the next theorem.

Theorem 1.2

(Corollary 1 in [26], Proposition 2 in [40]) If \(x^\infty \) is a limit point of the iteration sequence generated by the DCA, and at least one of \(f_1\) and \(f_2\) is strongly convex, i.e.  for some \(\mu _1, \mu _2 \ge 0\) such that \(\mu _1 + \mu _2 > 0\),

$$\begin{aligned} x \mapsto f_i(x) - \frac{\mu _i}{2}\Vert x\Vert ^2 \text{ is } \text{ convex } \text{ for } i \in \{1,2\}, \end{aligned}$$

then the series \(\sum _{k}\Vert x^{k+1}-x^k\Vert ^2\) converges, and, after \(N+1\) iterations,

$$\begin{aligned} \sum _{k=1}^{N} \Vert x^{k+1}-x^k\Vert ^2\le \tfrac{2(f(x^{1})-f(x^{N+1}))}{\mu _1+\mu _2}, \end{aligned}$$

and, consequently,

$$\begin{aligned} \min _{1 \le k \le N} \Vert x^{k+1}-x^k\Vert \le \sqrt{\frac{2(f(x^1) - f^\star )}{(\mu _1+\mu _2)N} } = O\left( \frac{1}{\sqrt{N}} \right) . \end{aligned}$$

We will derive some variants on this \(O\left( \frac{1}{\sqrt{N}} \right) \) convergence result in Corollary 3.1 and in Sect. 3.2, where we improve the constants in the \(O\left( \frac{1}{\sqrt{N}} \right) \) bounds. We also show that we obtain the best possible constants, by demonstrating an example where our bound in Corollary 3.1 is tight.

1.1 Outline and Further Contributions of this Paper

The novel aspect of the analysis in this paper is that we will apply performance estimation to derive convergence rates. Drori and Teboulle, in the seminal paper [16], introduced performance estimation as a powerful tool for the worst-case analysis of first-order methods. The underlying idea of performance estimation is that the worst-case complexity may be cast as an optimization problem. Furthermore, this optimization problem can often be reformulated as a semidefinite programming problem. It is worth noting that performance estimation has been employed extensively for the analysis of worst-case convergence rates of first-order methods; see, e.g., [1, 14–16, 41, 42] and the references therein.

This paper is organized as follows: In Sect. 2, we review some definitions and notions from convex analysis, which will be used in the following sections. We study the DCA for sufficiently smooth DC decompositions in Sect. 3. By using performance estimation, we give a convergence rate of \(O(1/\sqrt{N})\) in Corollary 3.1, without any strong convexity assumption, thus extending and complementing Le Thi et al. [26, Corollary 1]. We construct an example that shows this \(O(1/\sqrt{N})\) bound is tight. Since the first termination criterion is not suitable for the analysis of non-smooth DC decompositions, we investigate the DCA with another stopping criterion in Sect. 4, and we show a convergence rate of O(1/N). This result is completely new to the best of our knowledge.

In Sect. 5, we study the DCA when the objective function satisfies the Polyak–Łojasiewicz inequality, and we derive a linear convergence rate in Theorem 5.1, thereby refining some linear convergence results in Le Thi et al. [27] as described above.

2 Basic Definitions and Preliminaries

In this section, we recall some notions and definitions from convex analysis. Throughout the paper, \(\Vert \cdot \Vert \) and \(\langle \cdot ,\cdot \rangle \) denote the Euclidean norm and the dot product, respectively. \(I_{\mathbb {R}_+}\) stands for the indicator function of \(\mathbb {R}_+\cup \{\infty \}\), i.e.,

$$\begin{aligned} I_{\mathbb {R}_+}(x)={\left\{ \begin{array}{ll} 1 &{} x\in [0, \infty ]\\ 0 &{} x\in [-\infty , 0). \end{array}\right. } \end{aligned}$$

Let \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) be an extended convex function. The domain of f is denoted and defined as \(\text {dom}(f):=\{x: f(x)<\infty \}\). The function f is called proper if it does not attain the value \(-\infty \) and its domain is non-empty. We call f closed if its epigraph \(\{(x, r): f(x)\le r\}\) is a closed subset of \(\mathbb {R}^{n+1}\). We denote the convex hull of \(X\subseteq \mathbb {R}^n\) by \(\text {co}(X)\). We adopt the conventions that, for \(a, b, c, d\in \mathbb {R}\) with \(c\ne d\) and \(a\ne 0\), \(\frac{b}{\infty }=0, 0\times \infty =0\) and \(\frac{a\infty +b}{c\infty -d\infty }=\frac{a}{c-d}\). For the function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\), the conjugate function \(f^*:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) is defined as \(f^*(g)=\sup _{x\in \mathbb {R}^n} \langle g, x\rangle -f(x)\). Moreover, we denote the set of subgradients of f at \(x\in \text {dom}(f)\) by \(\partial f(x)\),

$$\begin{aligned} \partial f(x)=\{g: f(y)\ge f(x)+\langle g, y-x\rangle , \forall y\in \mathbb {R}^n\}. \end{aligned}$$

Let \(L\in (0, \infty ]\) and \(\mu \in [0, \infty )\). We call an extended convex function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) L-smooth if for any \(x_1, x_2\in \mathbb {R}^n\),

$$\begin{aligned} \Vert g_1-g_2\Vert \le L\Vert x_1-x_2\Vert \ \ \forall g_1\in \partial f(x_1),\ g_2\in \partial f(x_2). \end{aligned}$$

Note that if \(L<\infty \), then f must be differentiable on \(\mathbb {R}^n\). In addition, any extended convex function is \(\infty \)-smooth. Also recall that the function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) is called \(\mu \)-strongly convex if the function \(x \mapsto f(x)-\tfrac{\mu }{2}\Vert x\Vert ^2\) is convex. Clearly, any convex function is 0-strongly convex. We denote the set of closed proper convex functions which are L-smooth and \(\mu \)-strongly convex by \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\).

Let \(\mathcal {I}\) be a finite index set. A set \(\{(x^i; g^i; f^i)\}_{i\in \mathcal {I}}\subseteq \mathbb {R}^n\times \mathbb {R}^n\times \mathbb {R}\) is called \(\mathcal {F}_{\mu ,L}\)-interpolable if there exists \(f\in \mathcal {F}_{\mu ,L}(\mathbb {R}^n)\) with

$$\begin{aligned} f(x^i)=f^i, \ g^i\in \partial f(x^i) \ \ i\in \mathcal {I}. \end{aligned}$$

The next theorem gives necessary and sufficient conditions for \(\mathcal {F}_{\mu ,L}\)-interpolability.

Theorem 2.1

[41, Theorem 4] Let \(L\in (0, \infty ]\) and \(\mu \in [0, \infty )\) and let \(\mathcal {I}\) be a finite index set. The set \(\{(x^i; g^i; f^i)\}_{i\in \mathcal {I}}\subseteq \mathbb {R}^n\times \mathbb {R}^n \times \mathbb {R}\) is \(\mathcal {F}_{\mu ,L}\)-interpolable if and only if for any \(i, j\in \mathcal {I}\), we have

$$\begin{aligned}&\tfrac{1}{2(1-\tfrac{\mu }{L})}\left( \tfrac{1}{L}\left\| g^i-g^j\right\| ^2+\mu \left\| x^i-x^j\right\| ^2-\tfrac{2\mu }{L}\left\langle g^j-g^i,x^j-x^i\right\rangle \right) \\&\quad \le f^i-f^j-\left\langle g^j, x^i-x^j\right\rangle . \end{aligned}$$
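As an aside, the interpolation conditions of Theorem 2.1 are easy to check numerically for a finite data set. The following sketch (ours, assuming numpy, with \(\mu <L\) and with \(L=\infty \) handled through the convention \(1/\infty =0\)) simply evaluates the inequality for all ordered pairs.

```python
import numpy as np
from itertools import permutations

def is_interpolable(data, mu, L):
    """Check the conditions of Theorem 2.1 for data = [(x_i, g_i, f_i)],
    with mu in [0, inf), L in (0, inf] and mu < L (for L = np.inf the
    terms 1/L and mu/L vanish, matching the convention 1/inf = 0)."""
    for (xi, gi, fi), (xj, gj, fj) in permutations(data, 2):
        lhs = (1.0 / (2.0 * (1.0 - mu / L))) * (
            (1.0 / L) * np.dot(gi - gj, gi - gj)
            + mu * np.dot(xi - xj, xi - xj)
            - (2.0 * mu / L) * np.dot(gj - gi, xj - xi))
        rhs = fi - fj - np.dot(gj, xi - xj)
        if lhs > rhs + 1e-12:          # small tolerance for rounding errors
            return False
    return True

# Data generated from f(x) = 0.5*||x||^2, which lies in F_{0,1}(R^2).
pts = [np.array([0.0, 1.0]), np.array([2.0, -1.0]), np.array([-1.0, 3.0])]
data = [(x, x, 0.5 * np.dot(x, x)) for x in pts]   # gradient of f at x is x
print(is_interpolable(data, mu=0.0, L=1.0))        # True
```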

In the next lemma, we extend the descent lemma to the DC setting; it will be of use when \(L_1\) or \(L_2\) is finite.

Lemma 2.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f=f_1-f_2\). If \(g_1\in \partial f_1(x)\) and \(g_2\in \partial f_2(x)\), then

$$\begin{aligned} f^\star \le f(x)-\tfrac{1}{2\left( L_1-\mu _2\right) }\Vert g_1-g_2\Vert ^2. \end{aligned}$$

Proof

If \(L_1=\infty \), the proof is immediate. Let \(L_1<\infty \). By the \(L_1\)-smoothness of \(f_1\) and the \(\mu _2\)-strong convexity of \(f_2\), we have

$$\begin{aligned} f_1(y)\le f_1(x)+\langle g_1, y-x\rangle +\tfrac{L_1}{2}\Vert y-x\Vert ^2,\\ f_2(y)\ge f_2(x)+\langle g_2, y-x\rangle +\tfrac{\mu _2}{2}\Vert y-x\Vert ^2, \end{aligned}$$

for \(y\in \mathbb {R}^n\). By the above inequalities, we get

$$\begin{aligned} f(y)\le f(x)+\langle g_1-g_2, y-x\rangle +\tfrac{L_1-\mu _2}{2}\Vert y-x\Vert ^2. \end{aligned}$$

Hence, by minimizing both sides of the last inequality over y for fixed x, and using that \(f^\star \le \inf _{y\in \mathbb {R}^n} f(y)\), we get

$$\begin{aligned} f^\star \le f(x)-\tfrac{1}{2(L_1-\mu _2)}\Vert g_1-g_2\Vert ^2. \end{aligned}$$
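As a quick sanity check (our own illustration, not part of the paper), the bound of Lemma 2.1 can be verified numerically; for the separable pair \(f_1(x)=\tfrac{L_1}{2}\Vert x\Vert ^2\) and \(f_2(x)=\tfrac{\mu _2}{2}\Vert x\Vert ^2+\Vert x\Vert _1\), for which \(f^\star =-\tfrac{n}{2(L_1-\mu _2)}\), it even holds with equality at points with non-zero coordinates.

```python
import numpy as np

L1, mu2 = 4.0, 1.0                       # L1 > mu2, in line with assumption (7)
a = L1 - mu2

f1 = lambda x: 0.5 * L1 * np.dot(x, x)
f2 = lambda x: 0.5 * mu2 * np.dot(x, x) + np.sum(np.abs(x))
g1 = lambda x: L1 * x                    # gradient of f1
g2 = lambda x: mu2 * x + np.sign(x)      # a subgradient of f2

x = np.array([0.7, -1.3, 2.0])
f_star = -x.size / (2 * a)               # global minimum of f = f1 - f2

lhs = f_star
rhs = f1(x) - f2(x) - np.dot(g1(x) - g2(x), g1(x) - g2(x)) / (2 * a)
print(lhs, rhs, lhs <= rhs + 1e-12)      # equality holds for this particular pair
```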

Since the DC optimization problem (1) may have a non-convex and non-smooth objective function f, we will also need a more general notion of subgradients than in the convex case.

Definition 2.1

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be lower semi-continuous and let \(f(\bar{x})\) be finite.

  • The vector g is called a regular subgradient of f at \(\bar{x}\), written \(g\in \hat{\partial }_l f(\bar{x})\), if for all x in some neighborhood of \(\bar{x}\)

    $$\begin{aligned} f(x)\ge f(\bar{x})+\langle g, x-\bar{x}\rangle +o(\Vert x-\bar{x}\Vert ). \end{aligned}$$
  • The vector g is called a general subgradient of f at \(\bar{x}\), written \(g\in \partial _l f(\bar{x})\), if there exist sequences \(\{x^i\}\) and \(\{g^i\}\) with \(g^i\in \hat{\partial }_l f(x^i)\) such that

    $$\begin{aligned} x^i\rightarrow \bar{x}, \ f(x^i)\rightarrow f(\bar{x}), \ g^i\rightarrow g. \end{aligned}$$

It is worth mentioning that \(\hat{\partial }_l f(\bar{x})\) is a closed convex set. In addition, \(\partial _l f(\bar{x})\) is also closed but not necessarily convex. Note that when f is closed proper convex, then \(\partial f(x)=\hat{\partial }_l f(x)=\partial _l f(x)\) for \(x\in \text {dom} (f)\). We refer the interested reader to Rockafellar and Wets [38] for more discussions on regular and general subdifferentials.

Definition 2.2

Let \(f_1, f_2\) be closed proper convex functions, and let f be lower semi-continuous.

  • The point \(\bar{x}\in \text {dom}(f)\) is called a critical point of problem (1) if

    $$\begin{aligned} \partial f_1(\bar{x})\cap \partial f_2(\bar{x}) \ne \emptyset . \end{aligned}$$
    (4)
  • The point \(\bar{x}\in \text {dom}(f)\) is called a stationary point of problem (1) if

    $$\begin{aligned} 0\in \partial _l f(\bar{x}). \end{aligned}$$
    (5)

Obviously, the stationarity condition is stronger than criticality. We recall that a convex function is locally Lipschitz around \(\bar{x}\) provided it takes finite values in a neighborhood of \(\bar{x}\); see Theorem 35.1 in [37]. Consequently, if \(f_1\) or \(f_2\) takes finite values in a neighborhood of a stationary point \(\bar{x}\), then \(\bar{x}\) is a critical point; see Corollary 10.9 in [38]. However, the converse does not hold in general. For instance, consider \(f:\mathbb {R}\rightarrow \mathbb {R}\) given as \(f(x)=x\). The function f may be written as \(f=f_1-f_2\) where \(f_1(x)=\max (x, 0)\) and \(f_2(x)=\max (-x, 0)\). Suppose that \(\bar{x}=0\). It is readily seen that \(\partial f_1(\bar{x})\cap \partial f_2(\bar{x}) \ne \emptyset \) (indeed, \(\partial f_1(0)=[0,1]\) and \(\partial f_2(0)=[-1,0]\), so their intersection is \(\{0\}\)), but \(\bar{x} = 0\) is not a stationary point of f, since \(\partial _l f(0)=\{1\}\). It is worth noting that, if \(f_2\) is strictly differentiable at \(\bar{x}\), these definitions are equivalent; see Example 10.10 in [38]. Recall that a function f is strictly differentiable at \(\bar{x}\) if

$$\begin{aligned} \lim _{\begin{array}{c} (x, x^\prime )\rightarrow (\bar{x}, \bar{x})\\ x\ne x^\prime \end{array}} \frac{f(x)-f(x^\prime )-\langle \nabla f(\bar{x}), x-x^\prime \rangle }{\Vert x-x^\prime \Vert }=0. \end{aligned}$$

We refer the interested reader to An and Tao [5], Joki et al. [24] and Pang et al. [36] and references therein for more discussions on optimality conditions for DC problems.

2.1 The DC Problem

In this section, we consider

$$\begin{aligned} \min ~&f(x)=f_1(x)-f_2(x)\nonumber \\ \text {s.t. }&x\in \mathbb {R}^n, \end{aligned}$$
(6)

where \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Here, we assume that \(L_1, L_2\in (0, \infty ]\) and \(\mu _1, \mu _2\in [0, \infty )\), and consequently, f may be non-differentiable. We may assume without loss of generality that \(f_1\) and \(f_2\) satisfy the following assumptions:

$$\begin{aligned} L_1>\mu _2, \ \ \ \ L_2>\mu _1. \end{aligned}$$
(7)

Indeed, if \(L_1\le \mu _2\), then for \(x,y\in \mathbb {R}^n\) and \(\lambda \in [0, 1]\), we have

$$\begin{aligned}&\lambda f_1(x)+(1-\lambda )f_1(y)\le f_1(\lambda x+(1-\lambda )y)+\lambda (1-\lambda )\tfrac{L_1}{2}\Vert x-y\Vert ^2\\&-\lambda f_2(x)-(1-\lambda )f_2(y)\le -f_2(\lambda x+(1-\lambda )y)-\lambda (1-\lambda )\tfrac{\mu _2}{2}\Vert x-y\Vert ^2; \end{aligned}$$

see Theorem 2.15 and Theorem 2.19 in [35]. By summing the above inequalities, we obtain

$$\begin{aligned}&\lambda f(x)+(1-\lambda )f(y)\le f(\lambda x+(1-\lambda )y)+\lambda (1-\lambda )\tfrac{L_1-\mu _2}{2}\Vert x-y\Vert ^2, \end{aligned}$$

which implies concavity of f on \(\mathbb {R}^n\). In this case, problem (6) will be unbounded from below. This follows from the fact that a concave function on \(\mathbb {R}^n\) is unbounded from below unless it is constant. Likewise, one can show that problem (6) will be convex provided \(L_2\le \mu _1\).

The Toland dual [43] of problem (6) may be written as

$$\begin{aligned} \min ~&f_2^*(x)-f_1^*(x)\\ \nonumber \text {s.t. }&x\in \mathbb {R}^n. \end{aligned}$$
(8)

It is known that problems (6) and (8) share the same optimal value [43].

In what follows, we investigate the convergence rate of Algorithm 1 with the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \). As motivation for this criterion, recall that \(\Vert g_1^k-g_2^k\Vert = 0\) implies that \(x^k\) is a critical point of (1) in the non-smooth case, and a stationary point of f if \(f_2\) is strictly differentiable; see our discussion following Definition 2.2. In Sect. 3, we will derive results for the case that at least one of \(f_1\) or \(f_2\) is differentiable, and we will consider the more general situation in Sect. 4.

For well-definedness of the DCA (Algorithm 1), throughout the paper, we assume that

$$\begin{aligned} x^k\in \text {dom}(\partial f_1)\cap \text {dom}(\partial f_2) \quad k=1, 2, \ldots , \end{aligned}$$

where \(\text {dom}(\partial f_1)=\{x: \partial f_1(x)\ne \emptyset \}\). It is worth noting that a similar algorithm has been developed for the dual problem in [28], and that (2) is equivalent to \(x^{k+1}\in \partial f^*_1(g_2^k)\).
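Indeed, for closed proper convex \(f_1\), this equivalence follows from the standard conjugate subgradient inversion rule \(g\in \partial f_1(x)\Leftrightarrow x\in \partial f_1^*(g)\); we spell it out for convenience:

$$\begin{aligned} x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n}\ f_1(x)-\left\langle g_2^k, x\right\rangle \;\Longleftrightarrow \; g_2^k\in \partial f_1(x^{k+1}) \;\Longleftrightarrow \; x^{k+1}\in \partial f_1^*(g_2^k). \end{aligned}$$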

3 Performance Analysis of the DCA for Smooth \(f_1\) or \(f_2\)

In this section, we apply performance estimation to the analysis of Algorithm 1 for the case that at least one of \(f_1\) or \(f_2\) is L-smooth for some finite \(L>0\). The worst-case convergence rate of Algorithm 1 can be obtained by solving the following abstract optimization problem:

$$\begin{aligned} \max&\ \left( \min _{1\le k\le N+1} \left\| g_1^k-g_2^k\right\| ^2\right) \nonumber \\&g_1^{N+1}, g_2^{N+1}, x^{N+1}, \ldots , x^2 \ \text {are generated by Algorithm~1 w.r.t.}\ f_1, f_2, x^1\nonumber \\&f(x)\ge f^\star \ \ \ \forall x\in \mathbb {R}^n\nonumber \\ \nonumber&\ f_1\in \mathcal {F}_{\mu _1,L_1}(\mathbb {R}^n), f_2\in \mathcal {F}_{\mu _2,L_2}(\mathbb {R}^n)\\ \nonumber&f_1(x^1)-f_2(x^1)-f^\star \le \Delta \\&\ x^1\in \mathbb {R}^n, \end{aligned}$$
(9)

where \(\Delta \ge 0\) denotes the gap between the value of f at the starting point and the lower bound \(f^\star \). Here, \(f_1, f_2\) and \(x^k\), \(g_1^k\) and \(g_2^k\) (\(k\in \{1,..., N+1\}\)) are decision variables, and \(\Delta ,\mu _1,L_1,\mu _2,L_2\) and N are fixed parameters.

Problem (9) is an intractable infinite-dimensional optimization problem with an infinite number of constraints. In what follows, we provide a semidefinite programming relaxation of the problem.

By Theorem 2.1, problem (9) can be written as,

$$\begin{aligned} \nonumber \max&\ \left( \min _{1\le k\le N+1} \left\| g_1^k-g_2^k\right\| ^2\right) \\ \nonumber \text {s.t.}\ {}&\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^i-g_2^j\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^j-g_2^i,x^j-x^i\right\rangle \right) \\&\ \ \ \ \ \le f_2^i-f_2^j-\left\langle g_2^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \nonumber \\ \nonumber&\ g_1^{k+1}=g_2^k \ \ k\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_2^{k}\Vert ^2\ge f^\star \ \ k\in \left\{ 1, \ldots , N+1\right\} \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(10)

In problem (10), \(f^\star \) and \(x^k,\ g_1^k, \ g_2^k, \ f_1^k, \ f_2^k\), \(k\in \left\{ 1, \ldots , N+1\right\} \), are decision variables. By virtue of Lemma 2.1, the constraints \(f(x)\ge f^\star \) for each \(x\in \mathbb {R}^n\) are replaced by \( f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_2^{k}\Vert ^2\ge f^\star , \ \ k\in \left\{ 1, \ldots , N+1\right\} \). Due to the necessary and sufficient optimality conditions for convex problems, \(x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n} f_1(x)-f_2(x^k)-\langle g_2^k, x-x^k\rangle \), \(k\in \left\{ 1, \ldots , N\right\} \) implies \(g_1^{k+1}=g_2^k\) for some \(g_1^{k+1}\in \partial f_1(x^{k+1})\); see Theorem 3.63 in [9]. By substituting \(g_2^{k}=g_1^{k+1}\), \(k\in \{1,\ldots , N\}\), the above formulation may be written as:

$$\begin{aligned} \nonumber \max&\ \ell \\ \nonumber \text {s.t.}\ {}&\left\| g_1^i-g_1^{i+1}\right\| ^2\ge \ell \ \ \ i\in \{1,\dots , N\}\\&\nonumber \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2\ge \ell \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_1^{i+1},x^j-x^i\right\rangle \right) \\&\ \ \ \ \ \le f_2^i-f_2^j-\left\langle g_1^{j+1}, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N\right\} \nonumber \\ \nonumber&\tfrac{1}{2\left( 1-\tfrac{\mu _2}{L_2}\right) }\left( \tfrac{1}{L_2}\left\| g_2^{N+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^{N+1}-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_2^{N+1},x^j-x^{N+1}\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_2^{N+1}-f_2^j-\left\langle g_1^{j+1}, x^{N+1}-x^j\right\rangle \ \ j\in \left\{ 1, \ldots , N\right\} \\&\nonumber \tfrac{1}{2\left( 1-\tfrac{\mu _2}{L_2}\right) }\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^i-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{i+1},x^{N+1}-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_2^i-f_2^{N+1}-\left\langle g_2^{N+1}, x^i-x^{N+1}\right\rangle \ \ i\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_1^{k+1}\Vert ^2\ge f^\star \ \ k\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^{N+1}-f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2\ge f^\star \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(11)

By using this formulation, the next result (Theorem 3.1) provides a convergence rate for Algorithm 1. Since the proof is quite technical, a few remarks are in order. The proof uses the performance estimation technique of Drori and Teboulle [16] that consists of the following steps:

  1. Observe that problem (11) may be rewritten as a semidefinite programming (SDP) problem (for sufficiently large dimension n) by replacing all inner products by the entries of an unknown Gram matrix; a minimal illustration is sketched after this list.

  2. Use weak duality of SDP to bound the optimal value of (11) by constructing a dual feasible solution.

  3. The dual feasible solution is constructed empirically, by first doing numerical experiments with fixed values of the parameters \(\Delta , N, \mu _1, L_1, \mu _2, L_2\), and noting the dual multipliers.

  4. Subsequently, the analytical expressions of the dual multipliers are guessed, based on the numerical values, and the guess is verified analytically.

  5. In the proof of Theorem 3.1, the conjectured dual multipliers are simply stated and then shown to provide the required bound on the optimal value of (11) through the corresponding aggregation of the constraints of (11).
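To illustrate steps 1 and 2, here is a minimal sketch (ours, not the authors' code) of the Gram-matrix SDP behind problem (11). It assumes numpy and cvxpy with an SDP-capable solver such as SCS, and is restricted for simplicity to \(\mu _1=\mu _2=0\), finite \(L_1, L_2\), \(f^\star =0\) and \(\Delta =1\); its optimal value can then be compared with the bound of Corollary 3.1(iii) below.

```python
import numpy as np
import cvxpy as cp

def pep_dca(N, L1, L2, Delta=1.0):
    """Gram-matrix relaxation of the performance estimation problem (11) for
    the special case mu1 = mu2 = 0 and f* = 0.  The vectors x^1..x^{N+1},
    g_1^1..g_1^{N+1} and g_2^{N+1} are encoded via a PSD Gram matrix G;
    the function values F1, F2 and the level ell are scalar variables."""
    d = 2 * (N + 1) + 1
    ex  = [np.eye(d)[i]         for i in range(N + 1)]    # coordinates of x^{k}
    eg1 = [np.eye(d)[N + 1 + i] for i in range(N + 1)]    # coordinates of g_1^{k}
    eg2N = np.eye(d)[2 * N + 2]                           # coordinates of g_2^{N+1}
    g2 = lambda k: eg1[k + 1] if k < N else eg2N          # g_2^k = g_1^{k+1}, k <= N

    G = cp.Variable((d, d), PSD=True)                     # Gram matrix of all vectors
    F1, F2 = cp.Variable(N + 1), cp.Variable(N + 1)
    ell = cp.Variable()
    ip = lambda a, b: a @ G @ b                           # <a, b> via the Gram matrix
    sq = lambda a: ip(a, a)

    cons = [F1[0] - F2[0] <= Delta]                       # f(x^1) - f* <= Delta
    for k in range(N + 1):
        cons += [ell <= sq(eg1[k] - g2(k))]               # objective level
        cons += [F1[k] - F2[k] - sq(eg1[k] - g2(k)) / (2 * L1) >= 0]   # Lemma 2.1
    for i in range(N + 1):                                # interpolation of f_1, f_2
        for j in range(N + 1):
            if i != j:
                cons += [F1[i] - F1[j] - ip(eg1[j], ex[i] - ex[j])
                         >= sq(eg1[i] - eg1[j]) / (2 * L1)]
                cons += [F2[i] - F2[j] - ip(g2(j), ex[i] - ex[j])
                         >= sq(g2(i) - g2(j)) / (2 * L2)]
    prob = cp.Problem(cp.Maximize(ell), cons)
    prob.solve()
    return np.sqrt(max(prob.value, 0.0))

N, L1, L2 = 5, 2.0, 3.0
print(pep_dca(N, L1, L2))                             # SDP worst case (upper bound)
print(np.sqrt(2 * L1 * L2 / ((L1 + L2) * N + L2)))    # bound of Corollary 3.1(iii)
```

Since (11) is a relaxation of (9), the printed SDP value lies between the true worst case of Algorithm 1 for this class and the analytic bound printed on the last line.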

Theorem 3.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f(x^1)-f^\star = \Delta \). Suppose that \(L_1\) or \(L_2\) is finite. Then after N iterations of Algorithm 1, one has:

$$\begin{aligned}&\min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{\mathcal {A}\Delta }{\mathcal {B} N+\mathcal {C}}}, \end{aligned}$$
(12)

where

$$\begin{aligned}&\mathcal {A}=2\left( L_1L_2-\mu _1L_2I_{\mathbb {R}_+}(L_1-L_2)-\mu _2L_1I_{\mathbb {R}_+}({L_2}-{L_1})\right) ,\\&\mathcal {B}=L_1+L_2+\mu _1\left( \tfrac{L_1}{L_2}-3\right) I_{\mathbb {R}_+}\left( {L_1}-{L_2}\right) + \mu _2\left( \tfrac{L_2}{L_1}-3\right) I_{\mathbb {R}_+}\left( {L_2}-{L_1}\right) , \end{aligned}$$

and

$$\begin{aligned} \mathcal {C}=\frac{L_1L_2-\mu _1L_2I_{\mathbb {R}_+}\left( {L_1}-{L_2}\right) -\mu _2L_1I_{\mathbb {R}_+}\left( {L_2}-{L_1}\right) }{L_1-\mu _2}. \end{aligned}$$

Proof

We investigate the two cases \(L_1\ge L_2\) and \(L_1<L_2\). Let U denote the square of the right-hand side of inequality (12) and let \(B=\tfrac{U}{\Delta }\). To prove the bound, we show that U is an upper bound for problem (11). First, we consider \(L_1\ge L_2\). Let

$$\begin{aligned}&\bar{\lambda }=\frac{2\left( L_1L_2-\mu _1(2L_2-L_1)\right) }{N\left( L_1+L_2+\mu _1\left( \tfrac{L_1}{L_2}-3\right) \right) +\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}\\&\bar{\eta }_1=\frac{L_2-\mu _1}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}\\&\bar{\eta }_k=\frac{\tfrac{L_1\mu _1}{L_2}+(L_1+L_2-3\mu _1)}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}, \ \ k\in \{2,\ldots ,N\}\\&\bar{\eta }_{N+1}=1-\bar{\eta }_1-\sum _{k=2}^{N}\bar{\eta }_k=\frac{\tfrac{L_1\mu _1}{L_2}+L_1-2\mu _1+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}. \end{aligned}$$

By direct calculation, one can verify that

$$\begin{aligned}&\ell -U+\bar{\eta }_1\left( \left\| g_1^1-g_1^2\right\| ^2-\ell \right) +\sum _{k=2}^{N}\bar{\eta }_k \left( \left\| g_1^k-g_1^{k+1}\right\| ^2-\ell \right) +\bar{\eta }_{N+1}\left( \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2-\ell \right) \\&\qquad +\,B\left( f^\star -f_1^1+f^1_2+\Delta \right) +B\left( f_1^{N+1}- f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2-f^\star \right) \\&\qquad + B\sum _{k=1}^{N} \Bigg ( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\Bigg (\tfrac{1}{L_1}\left\| g_1^k-g_1^{k+1}\right\| ^2+\mu _1\left\| x^k-x^{k+1}\right\| ^2\\&\qquad -\tfrac{2\mu _1}{L_1}\left\langle g_1^{k+1}-g_1^k,x^{k+1}-x^k\right\rangle \Bigg )\Bigg )+\bar{\lambda }\sum _{k=1}^{N-1} \Bigg ( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2+\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\rangle \right) \Bigg )\\&\qquad +(\bar{\lambda }-B)\sum _{k=1}^{N-1} \Bigg ( f_2^k-f_2^{k+1}-\left\langle g_1^{k+2}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}(\tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2\\&\qquad +\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\rangle \Bigg )+(\bar{\lambda }-B)\Bigg ( f_2^N-f_2^{N+1}-\left\langle g_2^{N+1}, x^N-x^{N+1}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\rangle \right) \Bigg )\\&\qquad +\bar{\lambda }\Bigg ( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\Bigg (\tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2\\&\qquad -\tfrac{2\mu _2}{L_2}\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\rangle \Bigg )\Bigg )\\&\quad =-\bar{\beta }_{1}^{-1}\sum _{i=1}^{N}\left\| \bar{\beta }_{1}g^{i}_1-\bar{\beta }_{1}g_1^{i+1}-\bar{\alpha }_{1}x^{i}+\bar{\alpha }_{1}x^{i+1}\right\| ^2-\bar{\alpha }_{2}^{-1}\sum _{i=1}^{N-1}\left\| \bar{\alpha }_{2}x^{i}-\bar{\alpha }_{2}x^{i+1}-\bar{\beta }_{2}g^{i+1}_1+\bar{\beta }_{2}g^{i+2}_1\right\| ^2\\&\qquad -\bar{\alpha }_{2}^{-1}\left\| \bar{\alpha }_{2}x^{N}-\bar{\alpha }_{2}x^{N+1}-\bar{\beta }_{2}g^{N+1}_1+\bar{\beta }_{2}g^{N+1}_2\right\| ^2 \le 0, \end{aligned}$$

where

$$\begin{aligned}&\bar{\alpha }_1=\frac{\mu _1 B}{2(L_1-\mu _1)}, \ \ \ \ \bar{\beta }_1=\frac{\mu _1B}{2L_2(L_1-\mu _1)}, \\&\bar{\alpha }_2=\frac{(-\mu _1L_2^2-2\mu _1\mu _2L_2+\mu _1L_1L_2+\mu _1\mu _2L_1+\mu _2L_1L_2)B }{2(L_1-\mu _1)(L_2-\mu _2)},\\&\bar{\beta }_2=\frac{(L_1L_2\mu _2-2\mu _1\mu _2L_2+\mu _1\mu _2L_1-\mu _1L_2^2+\mu _1L_1L_2)B}{2L_2(L_1-\mu _1)(L_2-\mu _2)}. \end{aligned}$$

It is readily seen that \(\bar{\lambda }, \bar{\eta }_k\ (k\in \{1, \ldots , N+1\}), \bar{\lambda }-B, \bar{\beta }_1, \bar{\alpha }_2\ge 0\). Thus, we have \(\ell \le U\) for any feasible point of problem (11). Now, we consider \(L_1<L_2\). In this case, because bound (12) does not depend on \(\mu _1\), we may assume \(\mu _1=0\) in problem (11). Let

$$\begin{aligned}&\hat{\lambda }=\frac{2\left( L_1L_2-\mu _2(2L_1-L_2)\right) }{\left( L_1+L_2+\mu _2\left( \tfrac{L_2}{L_1}-3\right) \right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}\\&\hat{\eta }_1=\frac{\tfrac{L_2(L_1+\mu _2)}{L_1}-2\mu _2}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}\\&\hat{\eta }_k=\frac{\tfrac{L_2(L_1+\mu _2)}{L_1}+(L_1-3\mu _2)}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}, \ \ k\in \{2,\ldots ,N\}\\&\hat{\eta }_{N+1}=1-\hat{\eta }_1-\sum _{k=2}^{N}\hat{\eta }_k=\frac{\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}+L_1-\mu _2}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}. \end{aligned}$$

With some calculation, one can establish that

$$\begin{aligned}&\ell -U+\hat{\eta }_1\left( \left\| g_1^1-g_1^2\right\| ^2-\ell \right) +\sum _{k=2}^{N} \hat{\eta }_k\left( \left\| g_1^k-g_1^{k+1}\right\| ^2-\ell \right) +\hat{\eta }_{N+1}\left( \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2-\ell \right) \\&\qquad + B\left( f^\star -f_1^1+f^1_2+\Delta \right) +B\left( f_1^{N+1}- f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2-f^\star \right) \\&\qquad +(\hat{\lambda }-B)\sum _{k=1}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k+1}-g_1^{k}\right\| ^2\right) \\&\qquad +\hat{\lambda }\sum _{k=1}^{N} \left( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k}-g_1^{k+1}\right\| ^2\right) \\&\qquad +B\sum _{k=1}^{N-1} \Bigg ( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2+\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\right\rangle \right) \Bigg )\\&\qquad +B\Bigg ( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle \\&\qquad - \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\right\rangle \right) \Bigg )\\&\quad = -\hat{\beta }_{1}^{-1}\sum _{i=1}^{N}\left\| \hat{\beta }_{1}g^{i}_1-\hat{\beta }_{1}g_1^{i+1}-\hat{\alpha }_{1}x^{i}_1+\hat{\alpha }_{1}x^{i+1}\right\| ^2 -\hat{\alpha }_{2}^{-1}\sum _{i=1}^{N-1}\left\| \hat{\alpha }_{2}x^{i}-\hat{\alpha }_{2}x^{i+1}-\hat{\beta }_{2}g^{i+1}_1+\hat{\beta }_{2}g^{i+2}_1\right\| ^2\\&\qquad -\hat{\alpha }_{2}^{-1}\left\| \hat{\alpha }_{2}x^{N}-\hat{\alpha }_{2}x^{N+1}-\hat{\beta }_{2}g^{N+1}_1+\hat{\beta }_{2}g^{N+1}_2\right\| ^2 \le 0, \end{aligned}$$

where

$$\begin{aligned} \hat{\alpha }_1=\tfrac{\mu _2B(1-\tfrac{L_1}{L_2})}{2L_1(1-\tfrac{\mu _2}{L_2})}, \ \ \hat{\alpha }_2=\tfrac{\mu _2L_1B}{2(L_2-\mu _2)}, \ \ \hat{\beta }_1=\tfrac{\mu _2B(1-\tfrac{L_1}{L_2})}{2L_1^2(1-\tfrac{\mu _2}{L_2})}, \ \ \hat{\beta }_2=\tfrac{\mu _2B}{2(L_2-\mu _2)}. \end{aligned}$$

It is readily seen that \(\hat{\lambda }, \hat{\eta }_k\ (k\in \{1, \ldots , N+1\}), \hat{\lambda }-B, \hat{\beta }_1, \hat{\alpha }_2\ge 0\). The rest of the proof is similar to that of the former case, and the proof is complete. \(\square \)

The theorem implies that Algorithm 1 is convergent when at least one of the Lipschitz constants is finite. In the following corollary, we simplify the inequality (12) for some special cases of \(L_1\), \(L_2\), \(\mu _1\), and \(\mu _2\).

Corollary 3.1

Suppose that \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Then, after N iterations of Algorithm 1, one has:

  (i)

    If \(L_1=\infty \), \(L_2<\infty \), then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_2^2\left( f(x^1)-f^\star \right) }{N(L_2+\mu _1)}}. \end{aligned}$$
  (ii)

    If \(L_2=\infty \), \(L_1<\infty \), then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_1^2\left( L_1-\mu _2\right) \left( f(x^1)-f^\star \right) }{\left( L_1^2-\mu _2^2\right) N+L_1^2}}. \end{aligned}$$
    (13)
  (iii)

    If \(L_1, L_2<\infty \), and \(\mu _1=\mu _2=0\) then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_1L_2\left( f(x^1)-f^\star \right) }{\left( L_1+L_2\right) N+L_2}}. \end{aligned}$$

One can compare the results in Corollary 3.1 to those of Le Thi et al. [26] as reviewed earlier in Theorem 1.2. First of all, Corollary 3.1 part (iii) does not assume strong convexity of \(f_1\) or \(f_2\), and in this sense it is more general than the result in Theorem 1.2. If we do assume \(\mu _1+\mu _2 > 0\), then, for example, if \(L_1 < \infty \), Theorem 1.2 implies,

$$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le L_1\sqrt{\frac{2\left( f(x^1)-f^\star \right) }{{\left( \mu _1+\mu _2\right) N}}}, \end{aligned}$$

which is weaker than our bound (13) since \(\mu _1 \le L_1\), although the \(O(1/\sqrt{N})\) dependence on N is the same. We will do a further, more direct, comparison of Theorem 1.2 and Corollary 3.1 in Sect. 3.2, where we consider the convergence rate of the sequence \(\Vert x^{k+1} - x^k\Vert \).
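As a concrete illustration (with sample parameter values of our own choosing), the two bounds can be compared directly:

```python
import numpy as np

L1, mu1, mu2, Delta, N = 10.0, 1.0, 0.5, 1.0, 100

bound_thm_1_2 = L1 * np.sqrt(2 * Delta / ((mu1 + mu2) * N))      # from Theorem 1.2
bound_cor_3_1 = np.sqrt(2 * L1**2 * (L1 - mu2) * Delta
                        / ((L1**2 - mu2**2) * N + L1**2))        # bound (13)
print(bound_thm_1_2, bound_cor_3_1)   # the second value, our bound (13), is smaller
```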

3.1 An Example to Prove Tightness

In what follows, we give a class of functions for which the bound in Corollary 3.1, part (ii), is attained, implying that the \(O(1/\sqrt{N})\) convergence rate is tight. This result is new to the best of our knowledge.

Example 3.1

Let \(L_1\in (0, \infty )\). Suppose that N is selected such that \(U:=\sqrt{\tfrac{2}{L_1(N+1)}}< 1\). Let \(f_1: \mathbb {R}\rightarrow \mathbb {R}\) be given as follows,

$$\begin{aligned} f_1(x) = {\left\{ \begin{array}{ll} \tfrac{L_1}{2}\left( x-i(1-U)\right) ^2+\tfrac{L_1Ui(i-1)(1-U)}{2} &{} \ \ x\in \left[ \alpha _i, \beta _{i+1}\right) \\ L_1U\beta _i(x-\beta _i)+\tfrac{\beta _iL_1U^2}{2}+\tfrac{\beta _i(\beta _i-1)L_1U}{2} &{} \ \ x\in \left[ \beta _{i}, \alpha _i\right) \\ \tfrac{L_1}{2}x^2 &{} \ \ x\in \left( -\infty , 0\right) , \end{array}\right. } \end{aligned}$$

where for \(i\in \{1, \ldots , N+1\}\), \(\alpha _i=i-U\), \(\beta _i=i-1\), and \(\beta _{N+2}=\infty \). Note that \(f_1\in \mathcal {F}_{0, L_1}({\mathbb {R}})\). Suppose that \(f_2: \mathbb {R}\rightarrow \mathbb {R}\) is given by

$$\begin{aligned} f_2(x)=\max _{1\le i\le N+1}\left\{ L_1U(i-1)(x-i)+\tfrac{i(i-1)L_1U}{2}\right\} . \end{aligned}$$

An easy computation shows that

$$\begin{aligned} {\left\{ \begin{array}{ll} \partial f_2(i)=[L_1U(i-1), L_1Ui] &{} \ \ \ i\in \{1,\dots ,N\}\\ \partial f_2(N+1)=\{L_1UN\}. \\ \end{array}\right. } \end{aligned}$$

Note that \(f_2\in \mathcal {F}_{0, \infty }({\mathbb {R}})\). One can check that, at \(x^1=N+1\), one has \(f_1(x^1)-f_2(x^1)=1\), \(\min _{x\in \mathbb {R}} f_1(x)-f_2(x)=0\) and \(\text {argmin}_{x\in \mathbb {R}} f_1(x)-f_2(x)=[0, 1-U]\). By taking \(x^1\) as a starting point, Algorithm 1 can generate the following iterates:

$$\begin{aligned} x^k=N+2-k, \ \ \ \ k\in \{1, \ldots , N+1\}. \end{aligned}$$

Here, at iteration \(k\in \{1, \ldots , N+1\}\), we set \(g_2^k=L_1U(N+1-k)\). It follows that \(|\nabla f_1(x^k)-g_2^k|=\sqrt{\tfrac{2L_1}{N+1}}\), \(k\in \{1,\ldots ,N+1\}\). Hence,

$$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| =\sqrt{\tfrac{2L_1}{N+1}}, \end{aligned}$$

which shows bound (13) in Corollary 3.1 is exact for this example.

3.2 Convergence Rates for the Iterates

In this section, we investigate the implications of our results so far on convergence rates of the iterates \(\{x^k\}\).

Proposition 3.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f(x^1)-f^\star \le \Delta \). If \(\mu _1\) or \(\mu _2\) is strictly positive, then after N iterations of Algorithm 1, one has:

$$\begin{aligned}&\min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{\mathcal {A}}{\mathcal {B} N+\mathcal {C}}\cdot \Delta \right) ^{\tfrac{1}{2}}, \end{aligned}$$

where

$$\begin{aligned}&\mathcal {A}=2\left( \mu _2^{-1}\mu _1^{-1}-L_2^{-1}\mu _1^{-1}I_{\mathbb {R}_+}(\mu _2^{-1}-\mu _1^{-1})-L_1^{-1}\mu _2^{-1}I_{\mathbb {R}_+}({\mu _1^{-1}}-{\mu _2^{-1}})\right) ,\\&\mathcal {B}=\mu _2^{-1}+\mu _1^{-1}+L_2^{-1}\left( \tfrac{\mu _1}{\mu _2}-3\right) I_{\mathbb {R}_+}\left( {\mu _2^{-1}}-{\mu _1^{-1}}\right) + L_1^{-1}\left( \tfrac{\mu _2}{\mu _1}-3\right) I_{\mathbb {R}_+}\left( {\mu _1^{-1}}-{\mu _2^{-1}}\right) ,\\&\text {and}\\&\mathcal {C}=\frac{\mu _2^{-1}\mu _1^{-1}-L_2^{-1}\mu _1^{-1}I_{\mathbb {R}_+}\left( {\mu _2^{-1}}-{\mu _1^{-1}}\right) -L_1^{-1}\mu _2^{-1}I_{\mathbb {R}_+}\left( {\mu _1^{-1}}-{\mu _2^{-1}}\right) }{\mu _2^{-1}-L_1^{-1}}. \end{aligned}$$

Proof

The proof is based on the computation of the worst-case convergence rate of DCA for problem (8) by applying Theorem 3.1. By Toland duality, \(f^\star \) is also a lower bound of problem (8). By virtue of conjugate function properties, it follows that \( f_2^*(g_2^1)-f_1^*(g_2^1)-f^\star \le \Delta \) and \(f_2^*\in \mathcal {F}_{L_2^{-1}, \mu _2^{-1}}({\mathbb {R}^n})\) and \(f_1^*\in \mathcal {F}_{L_1^{-1}, \mu _1^{-1}}({\mathbb {R}^n})\). In addition, \(x^{k+1}\in \partial f_1^*(g_2^k)\) and \(x^{k}\in \partial f_2^*(g_2^k)\) for \(k\in \{1, \ldots , N\}\). Hence, all assumptions of Theorem 3.1 hold, and subsequently the bound follows from Theorem 3.1.

Recall the known result from Theorem 1.2:

$$\begin{aligned} \min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{2(f(x^1)-f^\star )}{N(\mu _1+\mu _2)}\right) ^{\tfrac{1}{2}}. \end{aligned}$$
(14)

By employing Theorem 3.1, we get

$$\begin{aligned} \min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{2(f(x^1)-f^\star )}{N(\mu _1+\mu _2)+\mu _1}\right) ^{\tfrac{1}{2}}, \end{aligned}$$

which is tighter than the bound (14). Moreover, the bound given in Proposition 3.1 provides more information concerning the worst-case convergence rate of the DCA when \(L_1<\infty \) or \(L_2<\infty \).

4 Performance Estimation using a Convergence Criterion for Critical Points in the Non-smooth Case

Theorem 3.1 addresses the case that \(f_1\) or \(f_2\) is L-smooth with \(L<\infty \). In what follows, we investigate the case that \(f_1\) and \(f_2\) are proper convex functions and where both may be non-smooth. For this general case, we need to adopt a different termination criterion to obtain results, since the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \) may be of no use in this case. For example, suppose that a DC function \(f: \mathbb {R}\rightarrow \mathbb {R}\cup \{\infty \}\) is given by

$$\begin{aligned} f(x)={\left\{ \begin{array}{ll} f_1(x)-f_2(x) &{} x\ge 0 \\ \infty &{} x<0, \end{array}\right. } \end{aligned}$$

where

$$\begin{aligned}&f_1(x)=\max _{n\in \mathbb {N}\cup \{0\}}\{-n(x-2^{-n})+2-2^{1-n}-n2^{-n} \},\\&f_2(x)=\max _{n\in \mathbb {N}\cup \{0\}} \{-(n+1)(x-2^{-n})+2-3(2^{-n})-n2^{-n} \}. \end{aligned}$$

With \(x^1=1\) and the given DC decomposition, Algorithm 1 may generate

$$\begin{aligned} x^k=2^{-k}, \ \ \ \ g_1^k=-(k-1), \ \ \ \ g_2^k=-k, \ \ \ k\in \{1, 2, ...\}. \end{aligned}$$

As \(|g^k_1-g_2^k|=1\), Algorithm 1 never stops under this termination criterion, even though the iterates converge to the global minimum \(\bar{x}=0\). We will therefore use as termination criterion that the following quantity is sufficiently small:

$$\begin{aligned} \nonumber T(x^{k+1})&:=f_1(x^k)-f_2(x^k)-\min _{x\in \mathbb {R}^n} \left( f_1(x)-f_2(x^k)-\left\langle g_2^k, x-x^k\right\rangle \right) \\ {}&=f_1(x^k)-f_1(x^{k+1})-\left\langle g_2^k, x^k-x^{k+1}\right\rangle . \end{aligned}$$
(15)

Note that \(T(x^{k+1})\ge 0\). It follows that if \(T(x^{k+1})=0\) then \(f(x^k)= f(x^{k+1})\), and \(x^{k}\in \text {argmin}_{x\in \mathbb {R}^n} f_1(x)-f_2(x^k)-\langle g_2^k, x-x^k\rangle \). Indeed, by the optimality conditions for convex problems, we have \(\partial f_1(x^k)\cap \partial f_2(x^k)\ne \emptyset \). Consequently, \(T(x^{k+1})=0\) implies that \(x^{k}\) is a critical point of problem (6). The aforementioned stopping criterion has also been employed for the analysis of the Frank–Wolfe method for non-convex problems; see Eq. (2.6) in [18].
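In code, the criterion (15) only needs quantities that are already available in one DCA step; a minimal sketch (ours, reusing the toy decomposition \(f_1(x)=\tfrac{1}{2}\Vert x\Vert ^2\), \(f_2(x)=\Vert x\Vert _1\) used earlier purely for illustration):

```python
import numpy as np

f1 = lambda x: 0.5 * np.dot(x, x)
subgrad_f2 = lambda x: np.sign(x)        # a subgradient of ||.||_1
argmin_step = lambda g2: g2              # argmin_x f1(x) - <g2, x>

x, eps = np.array([0.3, -2.0, 0.0]), 1e-8
for k in range(100):
    g2 = subgrad_f2(x)
    x_next = argmin_step(g2)
    T = f1(x) - f1(x_next) - np.dot(g2, x - x_next)    # criterion (15)
    if T < eps:                          # x is (approximately) a critical point
        break
    x = x_next
print(k, x)
```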

In what follows, we investigate Algorithm 1 with the termination criterion \(T(x^{k+1})<\epsilon \) for the given accuracy \(\epsilon >0\). The performance estimation problem with termination criterion (15) may be written as follows,

$$\begin{aligned} \nonumber \max&\ \ell \\ \nonumber \text {s.t.}\ {}&f_1(x^k)-f_1(x^{k+1})-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle \ge \ell \ \ \ k\in \{1,\dots , N\}\\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\quad \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\ \nonumber&\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_1^{i+1},x^j-x^i\right\rangle \right) \\ \nonumber&\quad \le \ f_2^i-f_2^j-\left\langle g_1^{j+1}, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^{N+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^{N+1}-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_2^{N+1},x^j-x^{N+1}\right\rangle \right) \\ \nonumber&\quad \le f_2^{N+1}-f_2^j-\left\langle g_1^{j+1}, x^{N+1}-x^j\right\rangle \ \ j\in \left\{ 1, \ldots , N\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^i-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{i+1},x^{N+1}-x^i\right\rangle \right) \\ \nonumber&\quad \le f_2^i-f_2^{N+1}-\left\langle g_2^{N+1}, x^i-x^{N+1}\right\rangle \ \ i\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k\ge f^\star \ \ k\in \left\{ 1, \ldots , N+1\right\} \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(16)

Note that we do not employ Lemma 2.1 in this formulation because we consider a general DC problem. Using the performance estimation procedure as described before the proof of Theorem 3.1 once more, we obtain the following result.

Theorem 4.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Then, after N iterations of Algorithm 1, one has

$$\begin{aligned}&\min _{1\le k\le N} f_1(x^k)-f_1(x^{k+1})-\langle g_2^k, x^k-x^{k+1}\rangle \nonumber \\&\quad \le \min \left\{ \frac{L_1}{N(L_1+\mu _2)}, \frac{L_2}{N(L_2+\mu _1)-\mu _1} \right\} \left( f(x^1)-f^\star \right) . \end{aligned}$$
(17)

Proof

We show separately that \(\tfrac{L_1(f(x^1)-f^\star )}{N(L_1+\mu _2)}\) and \(\frac{L_2(f(x^1)-f^\star )}{N(L_2+\mu _1)-\mu _1}\) are upper bounds for problem (16). The proof is analogous to that of Theorem 3.1. First, consider the bound \(\tfrac{L_1(f(x^1)-f^\star )}{N(L_1+\mu _2)}\). Since the given bound does not depend on \(\mu _1\) and \(L_2\), we may assume without loss of generality that \(L_2=\infty \) and \(\mu _1=0\). Suppose that \(B_1=\tfrac{L_1}{N(L_1+\mu _2)}\). With some algebra, one can show that

$$\begin{aligned}&\ell -B_1\Delta +\tfrac{1}{N}\sum _{k=1}^{N} \left( f_1^k-f_1^{k+1}-\langle g_1^{k+1}, x^k-x^{k+1}\rangle -\ell \right) +B_1\left( f_1^{N+1}- f_2^{N+1}-f^\star \right) \\&\qquad + B_1\left( f^\star -f_1^1+f^1_2+\Delta \right) +(\tfrac{1}{N}-B_1)\sum _{k=1}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k+1}-g_1^{k}\right\| ^2\right) \\&\qquad +B_1\sum _{k=1}^{N} \left( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle -\tfrac{\mu _2}{2}\left\| x^{k+1}-x^{k}\right\| ^2\right) \\&\quad = -\tfrac{B_1\mu _2}{2}\sum _{k=1}^N\left\| x^k-x^{k+1}-\tfrac{1}{L_1}(g_1^k-g_1^{k+1})\right\| ^2 \le 0. \end{aligned}$$

The rest of the proof is similar to that of Theorem 3.1. Now, we consider the bound \(\tfrac{L_2(f(x^1)-f^\star )}{N(L_2+\mu _1)-\mu _1}\). Without loss of generality, we may assume that \(L_1=\infty \) and \(\mu _2=0\). By direct calculation, one can show that

$$\begin{aligned}&\ell -B_2\Delta + B_2\left( f_1^1-f_1^{2}-\left\langle g^2_1, x^1-x^{2}\right\rangle -\ell \right) + B_2\left( f_1^{N+1}- f_2^{N+1}-f^\star \right) \\&\qquad +B_2\left( f^\star -f_1^1+f^1_2+\Delta \right) +\tfrac{1-B_2}{N-1}\sum _{k=2}^{N} \left( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\ell \right) \\&\qquad +\alpha \sum _{k=2}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{\mu _1}{2}\left\| x^{k+1}-x^{k}\right\| ^2\right) \\&\qquad +B_2\sum _{k=1}^{N} \left( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_2}\left\| g_1^{k+2}-g_1^{k+1}\right\| ^2\right) \\&\qquad +B_2\left( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle -\tfrac{1}{2L_2}\left\| g_2^{N+1}-g_1^{N+1}\right\| ^2\right) \\&\quad =-\tfrac{B_2}{2L_2}\left\| g_2^{N+1}-g_1^{N+1}\right\| ^2-\tfrac{B_2}{2L_2}\sum _{k=2}^{N} \left\| g_1^k-g_1^{k+1}-\tfrac{\alpha L_2}{B_2}(x^{k}-x^{k+1}) \right\| ^2 \le 0, \end{aligned}$$

where \(B_2=\tfrac{L_2}{N(L_2+\mu _1)-\mu _1}\) and \(\alpha =\tfrac{1-B_2}{N-1}-B_2\). Since we assume \(L_2>\mu _1\), we have \(B_2, \alpha \ge 0\). The rest of the proof runs as before. \(\square \)

The important point is that the last result provides a rate of convergence even if neither \(L_1\) nor \(L_2\) is finite, and we therefore state it as a corollary.

Corollary 4.1

Let \(f_1\in \mathcal {F}_{\mu _1, \infty }({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, \infty }({\mathbb {R}^n})\), i.e. consider any DC decomposition in problem (1). Then, after N iterations of Algorithm 1, one has

$$\begin{aligned} \min _{1\le k\le N} f_1(x^k)-f_1(x^{k+1})-\langle g_2^k, x^k-x^{k+1}\rangle \le \frac{1}{N} \left( f(x^1)-f^\star \right) . \end{aligned}$$

This result is new to the best of our knowledge.

5 Linear Convergence of the DCA under the Polyak–Łojasiewicz Inequality

In this section, we provide some sufficient conditions under which the DCA is linearly convergent. As in the former sections, we employ performance estimation to obtain the convergence rate.

In recent years, the linear convergence of some optimization methods for non-convex problems has been investigated under the Polyak–Łojasiewicz (PL) inequality; see [2, 12, 25] and the references therein. We say that f satisfies the PL inequality on X if there exists \(\eta >0\) such that

$$\begin{aligned} f(x)-f^\star \le \tfrac{1}{2\eta } \Vert \xi \Vert ^2, \ \ \forall x\in X, \forall \xi \in \text {co}(\partial _l f(x)). \end{aligned}$$
(18)
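For instance (an illustration of ours, not taken from the cited references), the convex least-squares function \(f(x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2\) satisfies (18) on all of \(\mathbb {R}^n\) with \(\eta =\lambda _{\min }^{+}(A^\top A)\), the smallest non-zero eigenvalue of \(A^\top A\), even when \(A^\top A\) is singular, i.e. without strong convexity: writing \(b_r\) for the orthogonal projection of b onto the range of A,

$$\begin{aligned} f(x)-f^\star =\tfrac{1}{2}\Vert Ax-b_r\Vert ^2\le \tfrac{1}{2\lambda _{\min }^{+}(A^\top A)}\left\| A^\top (Ax-b)\right\| ^2=\tfrac{1}{2\lambda _{\min }^{+}(A^\top A)}\Vert \nabla f(x)\Vert ^2. \end{aligned}$$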

Note that, when f is differentiable, inequality (18) is a special case of (3) with \(\theta =\tfrac{1}{2}\) and a different ground set. If \(f_1\) or \(f_2\) is strictly differentiable, we have \(\text {co}(\partial _l f)=\partial f_1-\partial f_2\); see Example 10.10 in [38]. Hence, the performance estimation problem with the PL inequality may be formulated as follows:

$$\begin{aligned} \nonumber \max&\ \frac{(f_1^2-f_2^2)-f^\star }{(f_1^1-f_2^1)-f^\star }\\ \nonumber \text {s.t.}\ {}&\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\&\nonumber \quad \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, 2\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^i-g_2^j\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^j-g_2^i,x^j-x^i\right\rangle \right) \\&\nonumber \quad \le f_2^i-f_2^j-\left\langle g_2^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, 2\right\} \\ \nonumber&\ f_1^k-f_2^k\ge f^\star \ \ k\in \left\{ 1, 2\right\} \\ \nonumber&g_2^1=g_1^2\\&\left( f_1^k-f_2^k\right) -f^\star \le \tfrac{1}{2\eta }\Vert g_1^k-g_2^k\Vert ^2, \ \ k\in \left\{ 1, 2\right\} . \end{aligned}$$
(19)

By doing constraint aggregation in problem (19) as before (i.e. demonstrating a dual feasible solution and using weak duality), we obtain the following linear convergence rate for the DCA under the PL inequality.

Theorem 5.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). If \(L_1\) or \(L_2\) is finite and if f satisfies PL inequality on \(X=\{x: f(x)\le f(x^1)\}\), then for \(x^2\) from Algorithm 1, we have

$$\begin{aligned} \frac{f(x^2)-f^\star }{f(x^1)-f^\star }\le \left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) . \end{aligned}$$
(20)

Proof

Since the given bound is independent of \(\mu _1\) and \(\mu _2\), without loss of generality, we assume that \(\mu _1=\mu _2=0\). In addition, we assume that \(f^\star =0\). Direct calculation shows that

$$\begin{aligned}&{\left( f_1^2-f_2^2\right) -f^\star }-\left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \left( {\left( f_1^1-f_2^1\right) -f^\star } \right) +\left( \frac{1}{1+\frac{\eta }{L_2}}\right) \\&\quad \times \left( f_1^1-f_1^{2}-\left\langle g_1^{2}, x^1-x^{2}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^1-g_1^{2}\right\| ^2\right) \\&\quad +\left( \frac{1}{1+\frac{\eta }{L_2}}\right) \left( f_2^2-f_2^{1}-\left\langle g_1^{2}, x^2-x^{1}\right\rangle -\tfrac{1}{2L_2}\left\| g_1^2-g_2^{2}\right\| ^2\right) +\left( \frac{\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \\&\quad \times \left( \frac{1}{2\eta }\left\| g_1^1-g_1^2\right\| ^2-f_1^1+f_2^1\right) +\left( \frac{\frac{\eta }{L_2}}{1+\frac{\eta }{L_2}}\right) \left( \frac{1}{2\eta }\left\| g_1^2-g_2^2\right\| ^2-f_1^2+f_2^2\right) =0. \end{aligned}$$

As all the multipliers in the last expression are non-negative, for any feasible solution of problem (19), we have

$$\begin{aligned} f(x^2)-f^\star - \left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \left( f(x^1)-f^\star \right) \le 0, \end{aligned}$$

completing the proof.

Note that Theorem 1.1 by Le Thi et al. [27] does not imply Theorem 5.1 if inequality (3) holds on \(\{x: f(x)\le f(x^1)\}\) with \(\theta =\tfrac{1}{2}\), since we assume neither strong convexity of \(f_1\) or \(f_2\), nor boundedness of the sequence of iterates. Moreover, we give explicit expressions for the constants that determine the linear convergence rate of the sequence of objective values.

6 Conclusion

We have shown that the performance estimation framework of Drori and Teboulle [16] yields new insights into the convergence behavior of the difference-of-convex algorithm (DCA). As future work, one may also consider the convergence of the DCA on more restricted classes of DC problems, e.g. where \(f_1\) and \(f_2\) are convex polynomials, as studied in [3]. For constrained problems, even the case where \(f_1\) and \(f_2\) are quadratic polynomials is of interest, e.g. in the study of (extended) trust region problems.