1 Introduction

In this paper, we consider the general difference-of-convex (DC) optimization problem,

$$\begin{aligned}&\inf \;f(x):=f_1(x)-f_2(x)\nonumber \\&\text {s.t.}\; x\in \mathbb {R}^n, \end{aligned}$$
(1)

where \(f_1, f_2\) are extended convex functions on \(\mathbb {R}^n\) and f is an extended lower-semicontinuous function on \(\mathbb {R}^n\). Throughout the paper, we assume that the infimum in problem (1) is finite, and denote by \(f^\star \) a lower bound of f on \(\mathbb {R}^n\).

DC problems appear naturally in many applications, e.g., power allocation in digital communication systems [4], production-transportation planning [22], location planning [13], image processing [31], sparse signal recovery [17], cluster analysis [7, 8], and supervised data classification [6, 29], to name but a few.

This wide range of applications is to be expected, since some important classes of non-convex functions may be represented as DC functions. For instance, twice continuously differentiable functions on any convex subset of \(\mathbb {R}^n\) [20] and continuous piece-wise linear functions [34] may be written as DC functions. Furthermore, every continuous function on a compact and convex set can be approximated by a DC function [23, 44]. We refer the interested reader to Hiriart–Urruty [21] and Tuy [44] for more information on DC representable functions.

The celebrated difference-of-convex algorithm (DCA), also known as the convex–concave procedure, has been applied extensively to problem (1); see [28, 30, 40] and the references therein. Algorithm 1 presents the basic form of the DCA.

Algorithm 1 (DCA) Choose a starting point \(x^1\in \mathbb {R}^n\). For \(k=1, 2, \ldots \) (until a termination criterion is satisfied): pick \(g_2^k\in \partial f_2(x^k)\) and compute

$$\begin{aligned} x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n}\ f_1(x)-f_2(x^k)-\left\langle g_2^k, x-x^k\right\rangle . \end{aligned}$$

(2)

In the description of the DCA in Algorithm 1, (sub)gradients of \(f_1\) and \(f_2\) are assumed to be available at given points, the so-called black-box formulation. The DCA is sometimes also presented as a primal-dual method, where a dual sub-problem is solved to obtain the required (sub)gradients; see [28, 30] for further discussions of this topic. In recent years, several extensions and new variants of the DCA have also been proposed; see [19, 32, 33, 36, 39].
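To make the black-box formulation concrete, the following minimal Python sketch (ours, not from the original sources; the toy decomposition \(f_1(x)=\tfrac{1}{2}\Vert x\Vert ^2\), \(f_2(x)=\Vert x\Vert _1\) is chosen purely for illustration) implements Algorithm 1 with the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \) used later in the paper.

```python
import numpy as np

def dca(grad_f1, subgrad_f2, argmin_step, x1, eps=1e-6, max_iter=100):
    """Basic DCA (Algorithm 1) in black-box form: at x^k, pick g_2^k in the
    subdifferential of f_2, then minimize the convex model
    f_1(x) - f_2(x^k) - <g_2^k, x - x^k> to obtain x^{k+1}."""
    x = x1
    for k in range(max_iter):
        g2 = subgrad_f2(x)                    # g_2^k
        g1 = grad_f1(x)                       # g_1^k
        if np.linalg.norm(g1 - g2) <= eps:    # termination criterion
            return x, k
        x = argmin_step(g2)                   # x^{k+1} minimizes f_1(x) - <g_2^k, x>
    return x, max_iter

# Toy DC decomposition (illustration only): f(x) = 0.5*||x||^2 - ||x||_1.
grad_f1 = lambda x: x                         # gradient of f_1(x) = 0.5*||x||^2
subgrad_f2 = lambda x: np.sign(x)             # a subgradient of f_2(x) = ||x||_1
argmin_step = lambda g2: g2                   # argmin_x 0.5*||x||^2 - <g2, x>

x_final, iters = dca(grad_f1, subgrad_f2, argmin_step, x1=np.array([0.3, -2.0]))
print(x_final, iters)                         # -> [ 1. -1.] 1, a critical point of f
```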

The first convergence results for Algorithm 1 were given in [40, Theorem 3(iv)]. The authors showed that, if the sequence of iterates \(\{x^k\}\) is bounded, then each accumulation point of this sequence is a critical point of f.

Le Thi et al. [27] established an asymptotic linear convergence rate of \(\{x^k\}\) under some conditions, in particular under the assumption that f satisfies the Łojasiewicz gradient inequality at all stationary points. Recall that a differentiable function f is said to satisfy this inequality at a stationary point a (\(\nabla f(a) = 0\)), if there exist constants \(\theta \in (0,1)\), \(C > 0\) and \(\epsilon >0\) such that

$$\begin{aligned} |f(x) - f(a)|^\theta \le C\Vert \nabla f(x)\Vert \text{ if } \Vert x-a\Vert \le \epsilon , \end{aligned}$$
(3)

where the constant \(\theta \) is called the Łojasiewicz exponent. This inequality is known to hold, for example, for real analytic functions, but has been extended to include classes of non-smooth functions as well by considering general sub-differentials instead of gradients; see [10, 11] and the references therein.

The convergence rates established by Le Thi et al. [27] depend on the value of the Łojasiewicz exponent, as the following theorem shows. The theorem stated here is a special case of Theorems 3.4 and 3.5 in [27], to give a flavor of the convergence results in [27].

Theorem 1.1

(Theorems 3.4 and 3.5 in Le Thi et al. [27]) Let \(f_1\) and \(f_2\) be proper convex functions and let the domain of f be closed. Also assume that at least one of \(f_1\) and \(f_2\) is strongly convex, and that \(f_1\) or \(f_2\) is differentiable with locally Lipschitz gradient at every critical point of the DC problem. Finally, assume the sequence \(\{x^k\}\) is bounded, and let \(x^\infty \) be a limit point of \(\{x^k\}\). Then \(x^\infty \) is also a stationary point. Moreover, if f satisfies the Łojasiewicz gradient inequality (3) at all stationary points, then

  1. if \(\theta \in (1/2,1)\), then \(\Vert x^k-x^\infty \Vert \le ck^{\tfrac{1-\theta }{1-2\theta }}\) for some \(c>0\).

  2. if \(\theta \in (0,1/2]\), then \(\Vert x^k-x^\infty \Vert \le cq^k\) for some \(c>0\) and \(q\in (0,1)\).

In particular, item 2 shows a linear convergence rate when \(\theta \in (0,1/2]\). Yen et al. [45] had already shown linear convergence earlier for a much smaller class of DC functions. We will present a complementary result to this theorem (see Theorem 5.1), for the case \(\theta = 1/2\), where we show linear convergence of the objective function values and give explicit expressions for the constants that determine the linear convergence rate. Moreover, we will relax the assumption of a bounded sequence of iterates, and the assumption of strong convexity.

In the absence of conditions like the Łojasiewicz gradient inequality (3), only weaker convergence rates are known for the DCA. In particular, Tao and An [40, Proposition 2] and Le Thi et al. [26, Corollary 1] have shown an \(O\left( \frac{1}{\sqrt{N}} \right) \) convergence rate after N iterations under suitable assumptions, as given in the next theorem.

Theorem 1.2

(Corollary 1 in [26], Proposition 2 in [40]) If \(x^\infty \) is a limit point of the iteration sequence generated by the DCA, and at least one of \(f_1\) and \(f_2\) is strongly convex, i.e.  for some \(\mu _1, \mu _2 \ge 0\) such that \(\mu _1 + \mu _2 > 0\),

$$\begin{aligned} x \mapsto f_i(x) - \frac{\mu _i}{2}\Vert x\Vert ^2 \text{ is } \text{ convex } \text{ for } i \in \{1,2\}, \end{aligned}$$

then the series \(\sum _{k}\Vert x^{k+1}-x^k\Vert ^2\) converges, and, after \(N+1\) iterations,

$$\begin{aligned} \sum _{k=1}^{N} \Vert x^{k+1}-x^k\Vert ^2\le \tfrac{2(f(x^{1})-f(x^{N+1}))}{\mu _1+\mu _2}, \end{aligned}$$

and, consequently,

$$\begin{aligned} \min _{1 \le k \le N} \Vert x^{k+1}-x^k\Vert \le \sqrt{\frac{2(f(x^1) - f^\star )}{(\mu _1+\mu _2)N} } = O\left( \frac{1}{\sqrt{N}} \right) . \end{aligned}$$

We will derive some variants on this \(O\left( \frac{1}{\sqrt{N}} \right) \) convergence result in Corollary 3.1 and in Sect. 3.2, where we improve the constants in the \(O\left( \frac{1}{\sqrt{N}} \right) \) bounds. We also show that we obtain the best possible constants, by demonstrating an example where our bound in Corollary 3.1 is tight.

1.1 Outline and Further Contributions of this Paper

The novel aspect of the analysis in this paper is that we will apply performance estimation to derive convergence rates. Drori and Teboulle, in the seminal paper [16], introduced performance estimation as a powerful tool for the worst-case analysis of first-order methods. The underlying idea of performance estimation is that the worst-case complexity may be cast as an optimization problem. Furthermore, this optimization problem can often be reformulated as a semidefinite programming problem. It is worth noting that performance estimation has been employed extensively for the analysis of worst-case convergence rates of first-order methods; see, e.g., [1, 14–16, 41, 42] and the references therein.

This paper is organized as follows: In Sect. 2, we review some definitions and notions from convex analysis, which will be used in the following sections. We study the DCA for sufficiently smooth DC decompositions in Sect. 3. By using performance estimation, we give a convergence rate of \(O(1/\sqrt{N})\) in Corollary 3.1, without any strong convexity assumption, thus extending and complementing Le Thi et al. [26, Corollary 1]. We construct an example that shows this \(O(1/\sqrt{N})\) bound is tight. Since the first termination criterion is not suitable for the analysis of non-smooth DC decompositions, we investigate the DCA with another stopping criterion in Sect. 4, and we show a convergence rate of O(1/N). This result is completely new to the best of our knowledge.

In Sect. 5, we study the DCA when the objective function satisfies the Polyak–Łojasiewicz inequality, and we derive a linear convergence rate in Theorem 5.1, thereby refining some linear convergence results in Le Thi et al. [27] as described above.

2 Basic Definitions and Preliminaries

In this section, we recall some notions and definitions from convex analysis. Throughout the paper, \(\Vert \cdot \Vert \) and \(\langle \cdot ,\cdot \rangle \) denote the Euclidean norm and the dot product, respectively. \(I_{\mathbb {R}_+}\) stands for the indicator function of \(\mathbb {R}_+\cup \{\infty \}\), i.e.,

$$\begin{aligned} I_{\mathbb {R}_+}(x)={\left\{ \begin{array}{ll} 1 &{} x\in [0, \infty ]\\ 0 &{} x\in [-\infty , 0). \end{array}\right. } \end{aligned}$$

Let \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) be an extended convex function. The domain of f is denoted and defined as \(\text {dom}(f):=\{x: f(x)<\infty \}\). The function f is called proper if it does not attain the value \(-\infty \) and its domain is non-empty. We call f closed if its epigraph \(\{(x, r): f(x)\le r\}\) is a closed subset of \(\mathbb {R}^{n+1}\). We denote the convex hull of \(X\subseteq \mathbb {R}^n\) by \(\text {co}(X)\). We adopt the conventions that, for \(a, b, c, d\in \mathbb {R}\) with \(c\ne d\) and \(a\ne 0\), \(\frac{b}{\infty }=0, 0\times \infty =0\) and \(\frac{a\infty +b}{c\infty -d\infty }=\frac{a}{c-d}\). For the function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\), the conjugate function \(f^*:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) is defined as \(f^*(g)=\sup _{x\in \mathbb {R}^n} \langle g, x\rangle -f(x)\). Moreover, we denote the set of subgradients of f at \(x\in \text {dom}(f)\) by \(\partial f(x)\),

$$\begin{aligned} \partial f(x)=\{g: f(y)\ge f(x)+\langle g, y-x\rangle , \forall y\in \mathbb {R}^n\}. \end{aligned}$$

Let \(L\in (0, \infty ]\) and \(\mu \in [0, \infty )\). We call an extended convex function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) L-smooth if for any \(x_1, x_2\in \mathbb {R}^n\),

$$\begin{aligned} \Vert g_1-g_2\Vert \le L\Vert x_1-x_2\Vert \ \ \forall g_1\in \partial f(x_1),\ g_2\in \partial f(x_2). \end{aligned}$$

Note that if \(L<\infty \), then f must be differentiable on \(\mathbb {R}^n\). In addition, any extended convex function is \(\infty \)-smooth. Also recall that the function \(f:\mathbb {R}^n\rightarrow [-\infty , \infty ]\) is called \(\mu \)-strongly convex if the function \(x \mapsto f(x)-\tfrac{\mu }{2}\Vert x\Vert ^2\) is convex. Clearly, any convex function is 0-strongly convex. We denote the set of closed proper convex functions which are L-smooth and \(\mu \)-strongly convex by \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\).

Let \(\mathcal {I}\) be a finite index set. A set \(\{(x^i; g^i; f^i)\}_{i\in \mathcal {I}}\subseteq \mathbb {R}^n\times \mathbb {R}^n\times \mathbb {R}\) is called \(\mathcal {F}_{\mu ,L}\)-interpolable if there exists \(f\in \mathcal {F}_{\mu ,L}(\mathbb {R}^n)\) with

$$\begin{aligned} f(x^i)=f^i, \ g^i\in \partial f(x^i) \ \ i\in \mathcal {I}. \end{aligned}$$

The next theorem gives necessary and sufficient conditions for \(\mathcal {F}_{\mu ,L}\)-interpolability.

Theorem 2.1

[41, Theorem 4] Let \(L\in (0, \infty ]\) and \(\mu \in [0, \infty )\) and let \(\mathcal {I}\) be a finite index set. The set \(\{(x^i; g^i; f^i)\}_{i\in \mathcal {I}}\subseteq \mathbb {R}^n\times \mathbb {R}^n \times \mathbb {R}\) is \(\mathcal {F}_{\mu ,L}\)-interpolable if and only if for any \(i, j\in \mathcal {I}\), we have

$$\begin{aligned}&\tfrac{1}{2(1-\tfrac{\mu }{L})}\left( \tfrac{1}{L}\left\| g^i-g^j\right\| ^2+\mu \left\| x^i-x^j\right\| ^2-\tfrac{2\mu }{L}\left\langle g^j-g^i,x^j-x^i\right\rangle \right) \\&\quad \le f^i-f^j-\left\langle g^j, x^i-x^j\right\rangle . \end{aligned}$$
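As an aside, the interpolation conditions of Theorem 2.1 are easy to check numerically for a finite data set. The following sketch (ours, assuming numpy, with \(\mu <L\) and with \(L=\infty \) handled through the convention \(1/\infty =0\)) simply evaluates the inequality for all ordered pairs.

```python
import numpy as np
from itertools import permutations

def is_interpolable(data, mu, L):
    """Check the conditions of Theorem 2.1 for data = [(x_i, g_i, f_i)],
    with mu in [0, inf), L in (0, inf] and mu < L (for L = np.inf the
    terms 1/L and mu/L vanish, matching the convention 1/inf = 0)."""
    for (xi, gi, fi), (xj, gj, fj) in permutations(data, 2):
        lhs = (1.0 / (2.0 * (1.0 - mu / L))) * (
            (1.0 / L) * np.dot(gi - gj, gi - gj)
            + mu * np.dot(xi - xj, xi - xj)
            - (2.0 * mu / L) * np.dot(gj - gi, xj - xi))
        rhs = fi - fj - np.dot(gj, xi - xj)
        if lhs > rhs + 1e-12:          # small tolerance for rounding errors
            return False
    return True

# Data generated from f(x) = 0.5*||x||^2, which lies in F_{0,1}(R^2).
pts = [np.array([0.0, 1.0]), np.array([2.0, -1.0]), np.array([-1.0, 3.0])]
data = [(x, x, 0.5 * np.dot(x, x)) for x in pts]   # gradient of f at x is x
print(is_interpolable(data, mu=0.0, L=1.0))        # True
```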

In the next lemma, we extend the descent lemma to the DC setting; it will be of use when \(L_1\) or \(L_2\) is finite.

Lemma 2.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f=f_1-f_2\). If \(g_1\in \partial f_1(x)\) and \(g_2\in \partial f_2(x)\), then

$$\begin{aligned} f^\star \le f(x)-\tfrac{1}{2\left( L_1-\mu _2\right) }\Vert g_1-g_2\Vert ^2. \end{aligned}$$

Proof

If \(L_1=\infty \), the proof is immediate. Let \(L_1<\infty \). By the \(L_1\)-smoothness of \(f_1\) and the \(\mu _2\)-strong convexity of \(f_2\), we have

$$\begin{aligned} f_1(y)\le f_1(x)+\langle g_1, y-x\rangle +\tfrac{L_1}{2}\Vert y-x\Vert ^2,\\ f_2(y)\ge f_2(x)+\langle g_2, y-x\rangle +\tfrac{\mu _2}{2}\Vert y-x\Vert ^2, \end{aligned}$$

for \(y\in \mathbb {R}^n\). By the above inequalities, we get

$$\begin{aligned} f(y)\le f(x)+\langle g_1-g_2, y-x\rangle +\tfrac{L_1-\mu _2}{2}\Vert y-x\Vert ^2. \end{aligned}$$

Hence, by minimizing both sides of the last inequality over y for fixed x, and using that \(f^\star \le \inf _{y\in \mathbb {R}^n} f(y)\), we get

$$\begin{aligned} f^\star \le f(x)-\tfrac{1}{2(L_1-\mu _2)}\Vert g_1-g_2\Vert ^2. \end{aligned}$$
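As a quick sanity check (our own illustration, not part of the paper), the bound of Lemma 2.1 can be verified numerically; for the separable pair \(f_1(x)=\tfrac{L_1}{2}\Vert x\Vert ^2\) and \(f_2(x)=\tfrac{\mu _2}{2}\Vert x\Vert ^2+\Vert x\Vert _1\), for which \(f^\star =-\tfrac{n}{2(L_1-\mu _2)}\), it even holds with equality at points with non-zero coordinates.

```python
import numpy as np

L1, mu2 = 4.0, 1.0                       # L1 > mu2, in line with assumption (7)
a = L1 - mu2

f1 = lambda x: 0.5 * L1 * np.dot(x, x)
f2 = lambda x: 0.5 * mu2 * np.dot(x, x) + np.sum(np.abs(x))
g1 = lambda x: L1 * x                    # gradient of f1
g2 = lambda x: mu2 * x + np.sign(x)      # a subgradient of f2

x = np.array([0.7, -1.3, 2.0])
f_star = -x.size / (2 * a)               # global minimum of f = f1 - f2

lhs = f_star
rhs = f1(x) - f2(x) - np.dot(g1(x) - g2(x), g1(x) - g2(x)) / (2 * a)
print(lhs, rhs, lhs <= rhs + 1e-12)      # equality holds for this particular pair
```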

Since the DC optimization problem (1) may have a non-convex and non-smooth objective function f, we will also need a more general notion of subgradients than in the convex case.

Definition 2.1

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be lower semi-continuous and let \(f(\bar{x})\) be finite.

  • The vector g is called a regular subgradient of f at \(\bar{x}\), written \(g\in \hat{\partial }_l f(\bar{x})\), if for all x in some neighborhood of \(\bar{x}\)

    $$\begin{aligned} f(x)\ge f(\bar{x})+\langle g, x-\bar{x}\rangle +o(\Vert x-\bar{x}\Vert ). \end{aligned}$$
  • The vector g is called a general subgradient of f at \(\bar{x}\), written \(g\in \partial _l f(\bar{x})\), if there exist sequences \(\{x^i\}\) and \(\{g^i\}\) with \(g^i\in \hat{\partial }_l f(x^i)\) such that

    $$\begin{aligned} x^i\rightarrow \bar{x}, \ f(x^i)\rightarrow f(\bar{x}), \ g^i\rightarrow g. \end{aligned}$$

It is worth mentioning that \(\hat{\partial }_l f(\bar{x})\) is a closed convex set. In addition, \(\partial _l f(\bar{x})\) is also closed but not necessarily convex. Note that when f is closed proper convex, then \(\partial f(x)=\hat{\partial }_l f(x)=\partial _l f(x)\) for \(x\in \text {dom} (f)\). We refer the interested reader to Rockafellar and Wets [38] for more discussions on regular and general subdifferentials.

Definition 2.2

Let \(f_1, f_2\) be closed proper convex functions, and let f be lower semi-continuous.

  • The point \(\bar{x}\in \text {dom}(f)\) is called a critical point of problem (1) if

    $$\begin{aligned} \partial f_1(\bar{x})\cap \partial f_2(\bar{x}) \ne \emptyset . \end{aligned}$$
    (4)
  • The point \(\bar{x}\in \text {dom}(f)\) is called a stationary point of problem (1) if

    $$\begin{aligned} 0\in \partial _l f(\bar{x}). \end{aligned}$$
    (5)

Obviously, the stationarity condition is stronger than criticality. We recall that a convex function is locally Lipschitz around \(\bar{x}\) provided it takes finite values in a neighborhood of \(\bar{x}\); see Theorem 35.1 in [37]. Consequently, if \(f_1\) or \(f_2\) takes finite values in a neighborhood of a stationary point \(\bar{x}\), then \(\bar{x}\) is a critical point; see Corollary 10.9 in [38]. However, the converse does not hold in general. For instance, consider \(f:\mathbb {R}\rightarrow \mathbb {R}\) given as \(f(x)=x\). The function f may be written as \(f=f_1-f_2\) where \(f_1(x)=\max (x, 0)\) and \(f_2(x)=\max (-x, 0)\). Suppose that \(\bar{x}=0\). It is readily seen that \(\partial f_1(\bar{x})\cap \partial f_2(\bar{x}) \ne \emptyset \) (indeed, \(\partial f_1(0)=[0,1]\) and \(\partial f_2(0)=[-1,0]\), so their intersection is \(\{0\}\)), but \(\bar{x} = 0\) is not a stationary point of f, since \(\partial _l f(0)=\{1\}\). It is worth noting that, if \(f_2\) is strictly differentiable at \(\bar{x}\), these definitions are equivalent; see Example 10.10 in [38]. Recall that a function f is strictly differentiable at \(\bar{x}\) if

$$\begin{aligned} \lim _{\begin{array}{c} (x, x^\prime )\rightarrow (\bar{x}, \bar{x})\\ x\ne x^\prime \end{array}} \frac{f(x)-f(x^\prime )-\langle \nabla f(\bar{x}), x-x^\prime \rangle }{\Vert x-x^\prime \Vert }=0. \end{aligned}$$

We refer the interested reader to An and Tao [5], Joki et al. [24] and Pang et al. [36] and references therein for more discussions on optimality conditions for DC problems.

2.1 The DC Problem

In this section, we consider

$$\begin{aligned} \min ~&f(x)=f_1(x)-f_2(x)\nonumber \\ \text {s.t. }&x\in \mathbb {R}^n, \end{aligned}$$
(6)

where \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Here, we assume that \(L_1, L_2\in (0, \infty ]\) and \(\mu _1, \mu _2\in [0, \infty )\), and consequently, f may be non-differentiable. We may assume without loss of generality that \(f_1\) and \(f_2\) satisfy the following assumptions:

$$\begin{aligned} L_1>\mu _2, \ \ \ \ L_2>\mu _1. \end{aligned}$$
(7)

Indeed, if \(L_1\le \mu _2\), then for \(x,y\in \mathbb {R}^n\) and \(\lambda \in [0, 1]\), we have

$$\begin{aligned}&\lambda f_1(x)+(1-\lambda )f_1(y)\le f_1(\lambda x+(1-\lambda )y)+\lambda (1-\lambda )\tfrac{L_1}{2}\Vert x-y\Vert ^2\\&-\lambda f_2(x)-(1-\lambda )f_2(y)\le -f_2(\lambda x+(1-\lambda )y)-\lambda (1-\lambda )\tfrac{\mu _2}{2}\Vert x-y\Vert ^2; \end{aligned}$$

see Theorem 2.15 and Theorem 2.19 in [35]. By summing the above inequalities, we obtain

$$\begin{aligned}&\lambda f(x)+(1-\lambda )f(y)\le f(\lambda x+(1-\lambda )y)+\lambda (1-\lambda )\tfrac{L_1-\mu _2}{2}\Vert x-y\Vert ^2, \end{aligned}$$

which implies concavity of f on \(\mathbb {R}^n\). In this case, problem (6) will be unbounded from below. This follows from the fact that a concave function on \(\mathbb {R}^n\) is unbounded from below unless it is constant. Likewise, one can show that problem (6) will be convex provided \(L_2\le \mu _1\).

The Toland dual [43] of problem (6) may be written as

$$\begin{aligned} \min ~&f_2^*(x)-f_1^*(x)\\ \nonumber \text {s.t. }&x\in \mathbb {R}^n. \end{aligned}$$
(8)

It is known that problems (6) and (8) share the same optimal value [43].

In what follows, we investigate the convergence rate of Algorithm 1 with the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \). As motivation for this criterion, recall that \(\Vert g_1^k-g_2^k\Vert = 0\) implies that \(x^k\) is a critical point of (1) in the non-smooth case, and a stationary point of f if \(f_2\) is strictly differentiable; see our discussion following Definition 2.2. In Sect. 3, we will derive results for the case that at least one of \(f_1\) or \(f_2\) is differentiable, and we will consider the more general situation in Sect. 4.

For well-definedness of the DCA (Algorithm 1), throughout the paper, we assume that

$$\begin{aligned} x^k\in \text {dom}(\partial f_1)\cap \text {dom}(\partial f_2) \quad k=1, 2, \ldots , \end{aligned}$$

where \(\text {dom}(\partial f_1)=\{x: \partial f_1(x)\ne \emptyset \}\). It is worth noting that a similar algorithm has been developed for the dual problem in [28], and that (2) is equivalent to \(x^{k+1}\in \partial f^*_1(g_2^k)\).
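Indeed, for closed proper convex \(f_1\), this equivalence follows from the standard conjugate subgradient inversion rule \(g\in \partial f_1(x)\Leftrightarrow x\in \partial f_1^*(g)\); we spell it out for convenience:

$$\begin{aligned} x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n}\ f_1(x)-\left\langle g_2^k, x\right\rangle \;\Longleftrightarrow \; g_2^k\in \partial f_1(x^{k+1}) \;\Longleftrightarrow \; x^{k+1}\in \partial f_1^*(g_2^k). \end{aligned}$$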

3 Performance Analysis of the DCA for Smooth \(f_1\) or \(f_2\)

In this section, we apply performance estimation to the analysis of Algorithm 1 for the case that at least one of \(f_1\) or \(f_2\) is L-smooth for some finite \(L>0\). The worst-case convergence rate of Algorithm 1 can be obtained by solving the following abstract optimization problem:

$$\begin{aligned} \max&\ \left( \min _{1\le k\le N+1} \left\| g_1^k-g_2^k\right\| ^2\right) \nonumber \\&g_1^{N+1}, g_2^{N+1}, x^{N+1}, \ldots , x^2 \ \text {are generated by Algorithm~1 w.r.t.}\ f_1, f_2, x^1\nonumber \\&f(x)\ge f^\star \ \ \ \forall x\in \mathbb {R}^n\nonumber \\ \nonumber&\ f_1\in \mathcal {F}_{\mu _1,L_1}(\mathbb {R}^n), f_2\in \mathcal {F}_{\mu _2,L_2}(\mathbb {R}^n)\\ \nonumber&f_1(x^1)-f_2(x^1)-f^\star \le \Delta \\&\ x^1\in \mathbb {R}^n, \end{aligned}$$
(9)

where \(\Delta \ge 0\) denotes the gap between the value of f at the starting point and the lower bound \(f^\star \). Here, \(f_1, f_2\) and \(x^k\), \(g_1^k\) and \(g_2^k\) (\(k\in \{1,..., N+1\}\)) are decision variables, and \(\Delta ,\mu _1,L_1,\mu _2,L_2\) and N are fixed parameters.

Problem (9) is an intractable infinite-dimensional optimization problem with an infinite number of constraints. In what follows, we provide a semidefinite programming relaxation of the problem.

By Theorem 2.1, problem (9) can be written as,

$$\begin{aligned} \nonumber \max&\ \left( \min _{1\le k\le N+1} \left\| g_1^k-g_2^k\right\| ^2\right) \\ \nonumber \text {s.t.}\ {}&\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^i-g_2^j\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^j-g_2^i,x^j-x^i\right\rangle \right) \\&\ \ \ \ \ \le f_2^i-f_2^j-\left\langle g_2^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \nonumber \\ \nonumber&\ g_1^{k+1}=g_2^k \ \ k\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_2^{k}\Vert ^2\ge f^\star \ \ k\in \left\{ 1, \ldots , N+1\right\} \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(10)

In problem (10), \(f^\star \) and \(x^k,\ g_1^k, \ g_2^k, \ f_1^k, \ f_2^k\), \(k\in \left\{ 1, \ldots , N+1\right\} \), are decision variables. By virtue of Lemma 2.1, the constraints \(f(x)\ge f^\star \) for each \(x\in \mathbb {R}^n\) are replaced by \( f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_2^{k}\Vert ^2\ge f^\star , \ \ k\in \left\{ 1, \ldots , N+1\right\} \). Due to the necessary and sufficient optimality conditions for convex problems, \(x^{k+1}\in \text {argmin}_{x\in \mathbb {R}^n} f_1(x)-f_2(x^k)-\langle g_2^k, x-x^k\rangle \), \(k\in \left\{ 1, \ldots , N\right\} \) implies \(g_1^{k+1}=g_2^k\) for some \(g_1^{k+1}\in \partial f_1(x^{k+1})\); see Theorem 3.63 in [9]. By substituting \(g_2^{k}=g_1^{k+1}\), \(k\in \{1,\ldots , N\}\), the above formulation may be written as:

$$\begin{aligned} \nonumber \max&\ \ell \\ \nonumber \text {s.t.}\ {}&\left\| g_1^i-g_1^{i+1}\right\| ^2\ge \ell \ \ \ i\in \{1,\dots , N\}\\&\nonumber \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2\ge \ell \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_1^{i+1},x^j-x^i\right\rangle \right) \\&\ \ \ \ \ \le f_2^i-f_2^j-\left\langle g_1^{j+1}, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N\right\} \nonumber \\ \nonumber&\tfrac{1}{2\left( 1-\tfrac{\mu _2}{L_2}\right) }\left( \tfrac{1}{L_2}\left\| g_2^{N+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^{N+1}-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_2^{N+1},x^j-x^{N+1}\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_2^{N+1}-f_2^j-\left\langle g_1^{j+1}, x^{N+1}-x^j\right\rangle \ \ j\in \left\{ 1, \ldots , N\right\} \\&\nonumber \tfrac{1}{2\left( 1-\tfrac{\mu _2}{L_2}\right) }\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^i-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{i+1},x^{N+1}-x^i\right\rangle \right) \\ \nonumber&\ \ \ \ \ \le f_2^i-f_2^{N+1}-\left\langle g_2^{N+1}, x^i-x^{N+1}\right\rangle \ \ i\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k-\frac{1}{2(L_1-\mu _2)}\Vert g_1^k-g_1^{k+1}\Vert ^2\ge f^\star \ \ k\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^{N+1}-f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2\ge f^\star \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(11)

By using this formulation, the next result (Theorem 3.1) provides a convergence rate for Algorithm 1. Since the proof is quite technical, a few remarks are in order. The proof uses the performance estimation technique of Drori and Teboulle [16] that consists of the following steps:

  1. Observe that problem (11) may be rewritten as a semidefinite programming (SDP) problem (for sufficiently large dimension n) by replacing all inner products by the entries of an unknown Gram matrix; a minimal illustration is sketched after this list.

  2. Use weak duality of SDP to bound the optimal value of (11) by constructing a dual feasible solution.

  3. The dual feasible solution is constructed empirically, by first doing numerical experiments with fixed values of the parameters \(\Delta , N, \mu _1, L_1, \mu _2, L_2\), and noting the dual multipliers.

  4. Subsequently, the analytical expressions of the dual multipliers are guessed, based on the numerical values, and the guess is verified analytically.

  5. In the proof of Theorem 3.1, the conjectured dual multipliers are simply stated and then shown to provide the required bound on the optimal value of (11) through the corresponding aggregation of the constraints of (11).
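To illustrate steps 1 and 2, here is a minimal sketch (ours, not the authors' code) of the Gram-matrix SDP behind problem (11). It assumes numpy and cvxpy with an SDP-capable solver such as SCS, and is restricted for simplicity to \(\mu _1=\mu _2=0\), finite \(L_1, L_2\), \(f^\star =0\) and \(\Delta =1\); its optimal value can then be compared with the bound of Corollary 3.1(iii) below.

```python
import numpy as np
import cvxpy as cp

def pep_dca(N, L1, L2, Delta=1.0):
    """Gram-matrix relaxation of the performance estimation problem (11) for
    the special case mu1 = mu2 = 0 and f* = 0.  The vectors x^1..x^{N+1},
    g_1^1..g_1^{N+1} and g_2^{N+1} are encoded via a PSD Gram matrix G;
    the function values F1, F2 and the level ell are scalar variables."""
    d = 2 * (N + 1) + 1
    ex  = [np.eye(d)[i]         for i in range(N + 1)]    # coordinates of x^{k}
    eg1 = [np.eye(d)[N + 1 + i] for i in range(N + 1)]    # coordinates of g_1^{k}
    eg2N = np.eye(d)[2 * N + 2]                           # coordinates of g_2^{N+1}
    g2 = lambda k: eg1[k + 1] if k < N else eg2N          # g_2^k = g_1^{k+1}, k <= N

    G = cp.Variable((d, d), PSD=True)                     # Gram matrix of all vectors
    F1, F2 = cp.Variable(N + 1), cp.Variable(N + 1)
    ell = cp.Variable()
    ip = lambda a, b: a @ G @ b                           # <a, b> via the Gram matrix
    sq = lambda a: ip(a, a)

    cons = [F1[0] - F2[0] <= Delta]                       # f(x^1) - f* <= Delta
    for k in range(N + 1):
        cons += [ell <= sq(eg1[k] - g2(k))]               # objective level
        cons += [F1[k] - F2[k] - sq(eg1[k] - g2(k)) / (2 * L1) >= 0]   # Lemma 2.1
    for i in range(N + 1):                                # interpolation of f_1, f_2
        for j in range(N + 1):
            if i != j:
                cons += [F1[i] - F1[j] - ip(eg1[j], ex[i] - ex[j])
                         >= sq(eg1[i] - eg1[j]) / (2 * L1)]
                cons += [F2[i] - F2[j] - ip(g2(j), ex[i] - ex[j])
                         >= sq(g2(i) - g2(j)) / (2 * L2)]
    prob = cp.Problem(cp.Maximize(ell), cons)
    prob.solve()
    return np.sqrt(max(prob.value, 0.0))

N, L1, L2 = 5, 2.0, 3.0
print(pep_dca(N, L1, L2))                             # SDP worst case (upper bound)
print(np.sqrt(2 * L1 * L2 / ((L1 + L2) * N + L2)))    # bound of Corollary 3.1(iii)
```

Since (11) is a relaxation of (9), the printed SDP value lies between the true worst case of Algorithm 1 for this class and the analytic bound printed on the last line.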

Theorem 3.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f(x^1)-f^\star = \Delta \). Suppose that \(L_1\) or \(L_2\) is finite. Then after N iterations of Algorithm 1, one has:

$$\begin{aligned}&\min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{\mathcal {A}\Delta }{\mathcal {B} N+\mathcal {C}}}, \end{aligned}$$
(12)

where

$$\begin{aligned}&\mathcal {A}=2\left( L_1L_2-\mu _1L_2I_{\mathbb {R}_+}(L_1-L_2)-\mu _2L_1I_{\mathbb {R}_+}({L_2}-{L_1})\right) ,\\&\mathcal {B}=L_1+L_2+\mu _1\left( \tfrac{L_1}{L_2}-3\right) I_{\mathbb {R}_+}\left( {L_1}-{L_2}\right) + \mu _2\left( \tfrac{L_2}{L_1}-3\right) I_{\mathbb {R}_+}\left( {L_2}-{L_1}\right) , \end{aligned}$$

and

$$\begin{aligned} \mathcal {C}=\frac{L_1L_2-\mu _1L_2I_{\mathbb {R}_+}\left( {L_1}-{L_2}\right) -\mu _2L_1I_{\mathbb {R}_+}\left( {L_2}-{L_1}\right) }{L_1-\mu _2}. \end{aligned}$$

Proof

We investigate the two cases \(L_1\ge L_2\) and \(L_1<L_2\). Let U denote the square of the right-hand side of inequality (12) and let \(B=\tfrac{U}{\Delta }\). To prove the bound, we show that U is an upper bound for problem (11). First, we consider \(L_1\ge L_2\). Let

$$\begin{aligned}&\bar{\lambda }=\frac{2\left( L_1L_2-\mu _1(2L_2-L_1)\right) }{N\left( L_1+L_2+\mu _1\left( \tfrac{L_1}{L_2}-3\right) \right) +\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}\\&\bar{\eta }_1=\frac{L_2-\mu _1}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}\\&\bar{\eta }_k=\frac{\tfrac{L_1\mu _1}{L_2}+(L_1+L_2-3\mu _1)}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}, \ \ k\in \{2,\ldots ,N\}\\&\bar{\eta }_{N+1}=1-\bar{\eta }_1-\sum _{k=2}^{N}\bar{\eta }_k=\frac{\tfrac{L_1\mu _1}{L_2}+L_1-2\mu _1+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}{\left( L_1+L_2+\mu _1(\tfrac{L_1}{L_2}-3)\right) N+\tfrac{L_2(L_1-\mu _1)}{L_1-\mu _2}}. \end{aligned}$$

By direct calculation, one can verify that

$$\begin{aligned}&\ell -U+\bar{\eta }_1\left( \left\| g_1^1-g_1^2\right\| ^2-\ell \right) +\sum _{k=2}^{N}\bar{\eta }_k \left( \left\| g_1^k-g_1^{k+1}\right\| ^2-\ell \right) +\bar{\eta }_{N+1}\left( \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2-\ell \right) \\&\qquad +\,B\left( f^\star -f_1^1+f^1_2+\Delta \right) +B\left( f_1^{N+1}- f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2-f^\star \right) \\&\qquad + B\sum _{k=1}^{N} \Bigg ( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\Bigg (\tfrac{1}{L_1}\left\| g_1^k-g_1^{k+1}\right\| ^2+\mu _1\left\| x^k-x^{k+1}\right\| ^2\\&\qquad -\tfrac{2\mu _1}{L_1}\left\langle g_1^{k+1}-g_1^k,x^{k+1}-x^k\right\rangle \Bigg )\Bigg )+\bar{\lambda }\sum _{k=1}^{N-1} \Bigg ( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2+\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\rangle \right) \Bigg )\\&\qquad +(\bar{\lambda }-B)\sum _{k=1}^{N-1} \Bigg ( f_2^k-f_2^{k+1}-\left\langle g_1^{k+2}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}(\tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2\\&\qquad +\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\rangle \Bigg )+(\bar{\lambda }-B)\Bigg ( f_2^N-f_2^{N+1}-\left\langle g_2^{N+1}, x^N-x^{N+1}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\rangle \right) \Bigg )\\&\qquad +\bar{\lambda }\Bigg ( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\Bigg (\tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2\\&\qquad -\tfrac{2\mu _2}{L_2}\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\rangle \Bigg )\Bigg )\\&\quad =-\bar{\beta }_{1}^{-1}\sum _{i=1}^{N}\left\| \bar{\beta }_{1}g^{i}_1-\bar{\beta }_{1}g_1^{i+1}-\bar{\alpha }_{1}x^{i}+\bar{\alpha }_{1}x^{i+1}\right\| ^2-\bar{\alpha }_{2}^{-1}\sum _{i=1}^{N-1}\left\| \bar{\alpha }_{2}x^{i}-\bar{\alpha }_{2}x^{i+1}-\bar{\beta }_{2}g^{i+1}_1+\bar{\beta }_{2}g^{i+2}_1\right\| ^2\\&\qquad -\bar{\alpha }_{2}^{-1}\left\| \bar{\alpha }_{2}x^{N}-\bar{\alpha }_{2}x^{N+1}-\bar{\beta }_{2}g^{N+1}_1+\bar{\beta }_{2}g^{N+1}_2\right\| ^2 \le 0, \end{aligned}$$

where

$$\begin{aligned}&\bar{\alpha }_1=\frac{\mu _1 B}{2(L_1-\mu _1)}, \ \ \ \ \bar{\beta }_1=\frac{\mu _1B}{2L_2(L_1-\mu _1)}, \\&\bar{\alpha }_2=\frac{(-\mu _1L_2^2-2\mu _1\mu _2L_2+\mu _1L_1L_2+\mu _1\mu _2L_1+\mu _2L_1L_2)B }{2(L_1-\mu _1)(L_2-\mu _2)},\\&\bar{\beta }_2=\frac{(L_1L_2\mu _2-2\mu _1\mu _2L_2+\mu _1\mu _2L_1-\mu _1L_2^2+\mu _1L_1L_2)B}{2L_2(L_1-\mu _1)(L_2-\mu _2)}. \end{aligned}$$

It is readily seen that \(\bar{\lambda }, \bar{\eta }_k\ (k\in \{1, \ldots , N+1\}), \bar{\lambda }-B, \bar{\beta }_1, \bar{\alpha }_2\ge 0\). Thus, we have \(\ell \le U\) for any feasible point of problem (11). Now, we consider \(L_1<L_2\). In this case, because bound (12) does not depend on \(\mu _1\), we may assume \(\mu _1=0\) in problem (11). Let

$$\begin{aligned}&\hat{\lambda }=\frac{2\left( L_1L_2-\mu _2(2L_1-L_2)\right) }{\left( L_1+L_2+\mu _2\left( \tfrac{L_2}{L_1}-3\right) \right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}\\&\hat{\eta }_1=\frac{\tfrac{L_2(L_1+\mu _2)}{L_1}-2\mu _2}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}\\&\hat{\eta }_k=\frac{\tfrac{L_2(L_1+\mu _2)}{L_1}+(L_1-3\mu _2)}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}, \ \ k\in \{2,\ldots ,N\}\\&\hat{\eta }_{N+1}=1-\hat{\eta }_1-\sum _{k=2}^{N}\hat{\eta }_k=\frac{\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}+L_1-\mu _2}{\left( L_1+L_2+\mu _2(\tfrac{L_2}{L_1}-3)\right) N+\tfrac{L_1(L_2-\mu _2)}{L_1-\mu _2}}. \end{aligned}$$

With some calculation, one can establish that

$$\begin{aligned}&\ell -U+\hat{\eta }_1\left( \left\| g_1^1-g_1^2\right\| ^2-\ell \right) +\sum _{k=2}^{N} \hat{\eta }_k\left( \left\| g_1^k-g_1^{k+1}\right\| ^2-\ell \right) +\hat{\eta }_{N+1}\left( \left\| g_1^{N+1}-g_2^{N+1}\right\| ^2-\ell \right) \\&\qquad + B\left( f^\star -f_1^1+f^1_2+\Delta \right) +B\left( f_1^{N+1}- f_2^{N+1}-\frac{1}{2(L_1-\mu _2)}\Vert g_1^{N+1}-g_2^{N+1}\Vert ^2-f^\star \right) \\&\qquad +(\hat{\lambda }-B)\sum _{k=1}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k+1}-g_1^{k}\right\| ^2\right) \\&\qquad +\hat{\lambda }\sum _{k=1}^{N} \left( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k}-g_1^{k+1}\right\| ^2\right) \\&\qquad +B\sum _{k=1}^{N-1} \Bigg ( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle \\&\qquad -\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{k+1}-g_1^{k+2}\right\| ^2+\mu _2\left\| x^k-x^{k+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{k+2}-g_1^{k+1},x^{k+1}-x^k\right\rangle \right) \Bigg )\\&\qquad +B\Bigg ( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle \\&\qquad - \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{N+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^N-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{N+1},x^{N+1}-x^N\right\rangle \right) \Bigg )\\&\quad = -\hat{\beta }_{1}^{-1}\sum _{i=1}^{N}\left\| \hat{\beta }_{1}g^{i}_1-\hat{\beta }_{1}g_1^{i+1}-\hat{\alpha }_{1}x^{i}_1+\hat{\alpha }_{1}x^{i+1}\right\| ^2 -\hat{\alpha }_{2}^{-1}\sum _{i=1}^{N-1}\left\| \hat{\alpha }_{2}x^{i}-\hat{\alpha }_{2}x^{i+1}-\hat{\beta }_{2}g^{i+1}_1+\hat{\beta }_{2}g^{i+2}_1\right\| ^2\\&\qquad -\hat{\alpha }_{2}^{-1}\left\| \hat{\alpha }_{2}x^{N}-\hat{\alpha }_{2}x^{N+1}-\hat{\beta }_{2}g^{N+1}_1+\hat{\beta }_{2}g^{N+1}_2\right\| ^2 \le 0, \end{aligned}$$

where

$$\begin{aligned} \hat{\alpha }_1=\tfrac{\mu _2B(1-\tfrac{L_1}{L_2})}{2L_1(1-\tfrac{\mu _2}{L_2})}, \ \ \hat{\alpha }_2=\tfrac{\mu _2L_1B}{2(L_2-\mu _2)}, \ \ \hat{\beta }_1=\tfrac{\mu _2B(1-\tfrac{L_1}{L_2})}{2L_1^2(1-\tfrac{\mu _2}{L_2})}, \ \ \hat{\beta }_2=\tfrac{\mu _2B}{2(L_2-\mu _2)}. \end{aligned}$$

It is readily seen that \(\hat{\lambda }, \hat{\eta }_k\ (k\in \{1, \ldots , N+1\}), \hat{\lambda }-B, \hat{\beta }_1, \hat{\alpha }_2\ge 0\). The rest of the proof is similar to that of the former case, and the proof is complete. \(\square \)

The theorem implies that Algorithm 1 is convergent when at least one of the Lipschitz constants is finite. In the following corollary, we simplify the inequality (12) for some special cases of \(L_1\), \(L_2\), \(\mu _1\), and \(\mu _2\).

Corollary 3.1

Suppose that \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Then, after N iterations of Algorithm 1, one has:

  (i)

    If \(L_1=\infty \), \(L_2<\infty \), then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_2^2\left( f(x^1)-f^\star \right) }{N(L_2+\mu _1)}}. \end{aligned}$$
  (ii)

    If \(L_2=\infty \), \(L_1<\infty \), then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_1^2\left( L_1-\mu _2\right) \left( f(x^1)-f^\star \right) }{\left( L_1^2-\mu _2^2\right) N+L_1^2}}. \end{aligned}$$
    (13)
  (iii)

    If \(L_1, L_2<\infty \), and \(\mu _1=\mu _2=0\) then

    $$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le \sqrt{\frac{2L_1L_2\left( f(x^1)-f^\star \right) }{\left( L_1+L_2\right) N+L_2}}. \end{aligned}$$

One can compare the results in Corollary 3.1 to those of Le Thi et al. [26] as reviewed earlier in Theorem 1.2. First of all, Corollary 3.1 part (iii) does not assume strong convexity of \(f_1\) or \(f_2\), and in this sense it is more general than the result in Theorem 1.2. If we do assume \(\mu _1+\mu _2 > 0\), then, for example, if \(L_1 < \infty \), Theorem 1.2 implies,

$$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| \le L_1\sqrt{\frac{2\left( f(x^1)-f^\star \right) }{{\left( \mu _1+\mu _2\right) N}}}, \end{aligned}$$

which is weaker than our bound (13) since \(\mu _1 \le L_1\), although the \(O(1/\sqrt{N})\) dependence on N is the same. We will do a further, more direct, comparison of Theorem 1.2 and Corollary 3.1 in Sect. 3.2, where we consider the convergence rate of the sequence \(\Vert x^{k+1} - x^k\Vert \).
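As a concrete illustration (with sample parameter values of our own choosing), the two bounds can be compared directly:

```python
import numpy as np

L1, mu1, mu2, Delta, N = 10.0, 1.0, 0.5, 1.0, 100

bound_thm_1_2 = L1 * np.sqrt(2 * Delta / ((mu1 + mu2) * N))      # from Theorem 1.2
bound_cor_3_1 = np.sqrt(2 * L1**2 * (L1 - mu2) * Delta
                        / ((L1**2 - mu2**2) * N + L1**2))        # bound (13)
print(bound_thm_1_2, bound_cor_3_1)   # the second value, our bound (13), is smaller
```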

3.1 An Example to Prove Tightness

In what follows, we give a class of functions for which the bound in Corollary 3.1, part (ii), is attained, implying that the \(O(1/\sqrt{N})\) convergence rate is tight. This result is new to the best of our knowledge.

Example 3.1

Let \(L_1\in (0, \infty )\). Suppose that N is selected such that \(U:=\sqrt{\tfrac{2}{L_1(N+1)}}< 1\). Let \(f_1: \mathbb {R}\rightarrow \mathbb {R}\) be given as follows,

$$\begin{aligned} f_1(x) = {\left\{ \begin{array}{ll} \tfrac{L_1}{2}\left( x-i(1-U)\right) ^2+\tfrac{L_1Ui(i-1)(1-U)}{2} &{} \ \ x\in \left[ \alpha _i, \beta _{i+1}\right) \\ L_1U\beta _i(x-\beta _i)+\tfrac{\beta _iL_1U^2}{2}+\tfrac{\beta _i(\beta _i-1)L_1U}{2} &{} \ \ x\in \left[ \beta _{i}, \alpha _i\right) \\ \tfrac{L_1}{2}x^2 &{} \ \ x\in \left( -\infty , 0\right) , \end{array}\right. } \end{aligned}$$

where for \(i\in \{1, \ldots , N+1\}\), \(\alpha _i=i-U\), \(\beta _i=i-1\), and \(\beta _{N+2}=\infty \). Note that \(f_1\in \mathcal {F}_{0, L_1}({\mathbb {R}})\). Suppose that \(f_2: \mathbb {R}\rightarrow \mathbb {R}\) is given by

$$\begin{aligned} f_2(x)=\max _{1\le i\le N+1}\left\{ L_1U(i-1)(x-i)+\tfrac{i(i-1)L_1U}{2}\right\} . \end{aligned}$$

An easy computation shows that

$$\begin{aligned} {\left\{ \begin{array}{ll} \partial f_2(i)=[L_1U(i-1), L_1Ui] &{} \ \ \ i\in \{1,\dots ,N\}\\ \partial f_2(N+1)=\{L_1UN\}. \\ \end{array}\right. } \end{aligned}$$

Note that \(f_2\in \mathcal {F}_{0, \infty }({\mathbb {R}})\). One can check that, at \(x^1=N+1\), one has \(f_1(x^1)-f_2(x^1)=1\), \(\min _{x\in \mathbb {R}} f_1(x)-f_2(x)=0\) and \(\text {argmin}_{x\in \mathbb {R}} f_1(x)-f_2(x)=[0, 1-U]\). By taking \(x^1\) as a starting point, Algorithm 1 can generate the following iterates:

$$\begin{aligned} x^k=N+2-k, \ \ \ \ k\in \{1, \ldots , N+1\}. \end{aligned}$$

Here, at iteration \(k\in \{1, \ldots , N+1\}\), we set \(g_2^k=L_1U(N+1-k)\). It follows that \(|\nabla f_1(x^k)-g_2^k|=\sqrt{\tfrac{2L_1}{N+1}}\), \(k\in \{1,\ldots ,N+1\}\). Hence,

$$\begin{aligned} \min _{1\le k\le N+1}\left\| g_1^k-g_2^k\right\| =\sqrt{\tfrac{2L_1}{N+1}}, \end{aligned}$$

which shows bound (13) in Corollary 3.1 is exact for this example.

3.2 Convergence Rates for the Iterates

In this section, we investigate the implications of our results so far on convergence rates of the iterates \(\{x^k\}\).

Proposition 3.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\) and let \(f(x^1)-f^\star \le \Delta \). If \(\mu _1\) or \(\mu _2\) is strictly positive, then after N iterations of Algorithm 1, one has:

$$\begin{aligned}&\min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{\mathcal {A}}{\mathcal {B} N+\mathcal {C}}\cdot \Delta \right) ^{\tfrac{1}{2}}, \end{aligned}$$

where

$$\begin{aligned}&\mathcal {A}=2\left( \mu _2^{-1}\mu _1^{-1}-L_2^{-1}\mu _1^{-1}I_{\mathbb {R}_+}(\mu _2^{-1}-\mu _1^{-1})-L_1^{-1}\mu _2^{-1}I_{\mathbb {R}_+}({\mu _1^{-1}}-{\mu _2^{-1}})\right) ,\\&\mathcal {B}=\mu _2^{-1}+\mu _1^{-1}+L_2^{-1}\left( \tfrac{\mu _1}{\mu _2}-3\right) I_{\mathbb {R}_+}\left( {\mu _2^{-1}}-{\mu _1^{-1}}\right) + L_1^{-1}\left( \tfrac{\mu _2}{\mu _1}-3\right) I_{\mathbb {R}_+}\left( {\mu _1^{-1}}-{\mu _2^{-1}}\right) ,\\&\text {and}\\&\mathcal {C}=\frac{\mu _2^{-1}\mu _1^{-1}-L_2^{-1}\mu _1^{-1}I_{\mathbb {R}_+}\left( {\mu _2^{-1}}-{\mu _1^{-1}}\right) -L_1^{-1}\mu _2^{-1}I_{\mathbb {R}_+}\left( {\mu _1^{-1}}-{\mu _2^{-1}}\right) }{\mu _2^{-1}-L_1^{-1}}. \end{aligned}$$

Proof

The proof is based on the computation of the worst-case convergence rate of DCA for problem (8) by applying Theorem 3.1. By Toland duality, \(f^\star \) is also a lower bound of problem (8). By virtue of conjugate function properties, it follows that \( f_2^*(g_2^1)-f_1^*(g_2^1)-f^\star \le \Delta \) and \(f_2^*\in \mathcal {F}_{L_2^{-1}, \mu _2^{-1}}({\mathbb {R}^n})\) and \(f_1^*\in \mathcal {F}_{L_1^{-1}, \mu _1^{-1}}({\mathbb {R}^n})\). In addition, \(x^{k+1}\in \partial f_1^*(g_2^k)\) and \(x^{k}\in \partial f_2^*(g_2^k)\) for \(k\in \{1, \ldots , N\}\). Hence, all assumptions of Theorem 3.1 hold, and subsequently the bound follows from Theorem 3.1.

Recall the known result from Theorem 1.2:

$$\begin{aligned} \min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{2(f(x^1)-f^\star )}{N(\mu _1+\mu _2)}\right) ^{\tfrac{1}{2}}. \end{aligned}$$
(14)

By employing Theorem 3.1, we get

$$\begin{aligned} \min _{1\le k\le N}\left\| x^{k+1}-x^k\right\| \le \left( \frac{2(f(x^1)-f^\star )}{N(\mu _1+\mu _2)+\mu _1}\right) ^{\tfrac{1}{2}}, \end{aligned}$$

which is tighter than the bound (14). Moreover, the bound given in Proposition 3.1 provides more information concerning the worst-case convergence rate of the DCA when \(L_1<\infty \) or \(L_2<\infty \).

4 Performance Estimation using a Convergence Criterion for Critical Points in the Non-smooth Case

Theorem 3.1 addresses the case that \(f_1\) or \(f_2\) is L-smooth with \(L<\infty \). In what follows, we investigate the case that \(f_1\) and \(f_2\) are proper convex functions and where both may be non-smooth. For this general case, we need to adopt a different termination criterion to obtain results, since the termination criterion \(\Vert g_1^k-g_2^k\Vert \le \epsilon \) may be of no use in this case. For example, suppose that a DC function \(f: \mathbb {R}\rightarrow \mathbb {R}\cup \{\infty \}\) is given by

$$\begin{aligned} f(x)={\left\{ \begin{array}{ll} f_1(x)-f_2(x) &{} x\ge 0 \\ \infty &{} x<0, \end{array}\right. } \end{aligned}$$

where

$$\begin{aligned}&f_1(x)=\max _{n\in \mathbb {N}\cup \{0\}}\{-n(x-2^{-n})+2-2^{1-n}-n2^{-n} \},\\&f_2(x)=\max _{n\in \mathbb {N}\cup \{0\}} \{-(n+1)(x-2^{-n})+2-3(2^{-n})-n2^{-n} \}. \end{aligned}$$

With \(x^1=1\) and the given DC decomposition, Algorithm 1 may generate

$$\begin{aligned} x^k=2^{-k}, \ \ \ \ g_1^k=-(k-1), \ \ \ \ g_2^k=-k, \ \ \ k\in \{1, 2, ...\}. \end{aligned}$$

As \(|g^k_1-g_2^k|=1\), Algorithm 1 never stops under this termination criterion, even though the iterates converge to the global minimum \(\bar{x}=0\). We will therefore use as termination criterion that the following quantity is sufficiently small:

$$\begin{aligned} \nonumber T(x^{k+1})&:=f_1(x^k)-f_2(x^k)-\min _{x\in \mathbb {R}^n} \left( f_1(x)-f_2(x^k)-\left\langle g_2^k, x-x^k\right\rangle \right) \\ {}&=f_1(x^k)-f_1(x^{k+1})-\left\langle g_2^k, x^k-x^{k+1}\right\rangle . \end{aligned}$$
(15)

Note that \(T(x^{k+1})\ge 0\). It follows that if \(T(x^{k+1})=0\) then \(f(x^k)= f(x^{k+1})\), and \(x^{k}\in \text {argmin}_{x\in \mathbb {R}^n} f_1(x)-f_2(x^k)-\langle g_2^k, x-x^k\rangle \). Indeed, by the optimality conditions for convex problems, we have \(\partial f_1(x^k)\cap \partial f_2(x^k)\ne \emptyset \). Consequently, \(T(x^{k+1})=0\) implies that \(x^{k}\) is a critical point of problem (6). The aforementioned stopping criterion has also been employed for the analysis of the Frank–Wolfe method for non-convex problems; see Eq. (2.6) in [18].
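In code, the criterion (15) only needs quantities that are already available in one DCA step; a minimal sketch (ours, reusing the toy decomposition \(f_1(x)=\tfrac{1}{2}\Vert x\Vert ^2\), \(f_2(x)=\Vert x\Vert _1\) used earlier purely for illustration):

```python
import numpy as np

f1 = lambda x: 0.5 * np.dot(x, x)
subgrad_f2 = lambda x: np.sign(x)        # a subgradient of ||.||_1
argmin_step = lambda g2: g2              # argmin_x f1(x) - <g2, x>

x, eps = np.array([0.3, -2.0, 0.0]), 1e-8
for k in range(100):
    g2 = subgrad_f2(x)
    x_next = argmin_step(g2)
    T = f1(x) - f1(x_next) - np.dot(g2, x - x_next)    # criterion (15)
    if T < eps:                          # x is (approximately) a critical point
        break
    x = x_next
print(k, x)
```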

In what follows, we investigate Algorithm 1 with the termination criterion \(T(x^{k+1})<\epsilon \) for the given accuracy \(\epsilon >0\). The performance estimation problem with termination criterion (15) may be written as follows,

$$\begin{aligned} \nonumber \max&\ \ell \\ \nonumber \text {s.t.}\ {}&f_1(x^k)-f_1(x^{k+1})-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle \ge \ell \ \ \ k\in \{1,\dots , N\}\\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\ \nonumber&\quad \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N+1\right\} \\ \nonumber&\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_1^{i+1},x^j-x^i\right\rangle \right) \\ \nonumber&\quad \le \ f_2^i-f_2^j-\left\langle g_1^{j+1}, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^{N+1}-g_1^{j+1}\right\| ^2+\mu _2\left\| x^{N+1}-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_1^{j+1}-g_2^{N+1},x^j-x^{N+1}\right\rangle \right) \\ \nonumber&\quad \le f_2^{N+1}-f_2^j-\left\langle g_1^{j+1}, x^{N+1}-x^j\right\rangle \ \ j\in \left\{ 1, \ldots , N\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_1^{i+1}-g_2^{N+1}\right\| ^2+\mu _2\left\| x^i-x^{N+1}\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^{N+1}-g_1^{i+1},x^{N+1}-x^i\right\rangle \right) \\ \nonumber&\quad \le f_2^i-f_2^{N+1}-\left\langle g_2^{N+1}, x^i-x^{N+1}\right\rangle \ \ i\in \left\{ 1, \ldots , N\right\} \\ \nonumber&\ f_1^k-f_2^k\ge f^\star \ \ k\in \left\{ 1, \ldots , N+1\right\} \\&f_1^1-f_2^1-f^\star \le \Delta . \end{aligned}$$
(16)

Note that we do not employ Lemma 2.1 in this formulation because we consider a general DC problem. Using the performance estimation procedure as described before the proof of Theorem 3.1 once more, we obtain the following result.

Theorem 4.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). Then, after N iterations of Algorithm 1, one has

$$\begin{aligned}&\min _{1\le k\le N} f_1(x^k)-f_1(x^{k+1})-\langle g_2^k, x^k-x^{k+1}\rangle \nonumber \\&\quad \le \min \left\{ \frac{L_1}{N(L_1+\mu _2)}, \frac{L_2}{N(L_2+\mu _1)-\mu _1} \right\} \left( f(x^1)-f^\star \right) . \end{aligned}$$
(17)

Proof

We show separately that \(\tfrac{L_1(f(x^1)-f^\star )}{N(L_1+\mu _2)}\) and \(\frac{L_2(f(x^1)-f^\star )}{N(L_2+\mu _1)-\mu _1}\) are upper bounds for problem (16). The proof is analogous to that of Theorem 3.1. First, consider the bound \(\tfrac{L_1(f(x^1)-f^\star )}{N(L_1+\mu _2)}\). Since the given bound does not depend on \(\mu _1\) and \(L_2\), we may assume without loss of generality that \(L_2=\infty \) and \(\mu _1=0\). Suppose that \(B_1=\tfrac{L_1}{N(L_1+\mu _2)}\). With some algebra, one can show that

$$\begin{aligned}&\ell -B_1\Delta +\tfrac{1}{N}\sum _{k=1}^{N} \left( f_1^k-f_1^{k+1}-\langle g_1^{k+1}, x^k-x^{k+1}\rangle -\ell \right) +B_1\left( f_1^{N+1}- f_2^{N+1}-f^\star \right) \\&\qquad + B_1\left( f^\star -f_1^1+f^1_2+\Delta \right) +(\tfrac{1}{N}-B_1)\sum _{k=1}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^{k+1}-g_1^{k}\right\| ^2\right) \\&\qquad +B_1\sum _{k=1}^{N} \left( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle -\tfrac{\mu _2}{2}\left\| x^{k+1}-x^{k}\right\| ^2\right) \\&\quad = -\tfrac{B_1\mu _2}{2}\sum _{k=1}^N\left\| x^k-x^{k+1}-\tfrac{1}{L_1}(g_1^k-g_1^{k+1})\right\| ^2 \le 0. \end{aligned}$$

The rest of the proof is similar to that of Theorem 3.1. Now, we consider the bound \(\tfrac{L_2(f(x^1)-f^\star )}{N(L_2+\mu _1)-\mu _1}\). Without loss of generality, we may assume that \(L_1=\infty \) and \(\mu _2=0\). By direct calculation, one can show that

$$\begin{aligned}&\ell -B_2\Delta + B_2\left( f_1^1-f_1^{2}-\left\langle g^2_1, x^1-x^{2}\right\rangle -\ell \right) + B_2\left( f_1^{N+1}- f_2^{N+1}-f^\star \right) \\&\qquad +B_2\left( f^\star -f_1^1+f^1_2+\Delta \right) +\tfrac{1-B_2}{N-1}\sum _{k=2}^{N} \left( f_1^k-f_1^{k+1}-\left\langle g_1^{k+1}, x^k-x^{k+1}\right\rangle -\ell \right) \\&\qquad +\alpha \sum _{k=2}^{N} \left( f_1^{k+1}-f_1^{k}-\left\langle g_1^{k}, x^{k+1}-x^{k}\right\rangle -\tfrac{\mu _1}{2}\left\| x^{k+1}-x^{k}\right\| ^2\right) \\&\qquad +B_2\sum _{k=1}^{N} \left( f_2^{k+1}-f_2^{k}-\left\langle g_1^{k+1}, x^{k+1}-x^{k}\right\rangle -\tfrac{1}{2L_2}\left\| g_1^{k+2}-g_1^{k+1}\right\| ^2\right) \\&\qquad +B_2\left( f_2^{N+1}-f_2^{N}-\left\langle g_1^{N+1}, x^{N+1}-x^{N}\right\rangle -\tfrac{1}{2L_2}\left\| g_2^{N+1}-g_1^{N+1}\right\| ^2\right) \\&\quad =-\tfrac{B_2}{2L_2}\left\| g_2^{N+1}-g_1^{N+1}\right\| ^2-\tfrac{B_2}{2L_2}\sum _{k=2}^{N} \left\| g_1^k-g_1^{k+1}-\tfrac{\alpha L_2}{B_2}(x^{k}-x^{k+1}) \right\| ^2 \le 0, \end{aligned}$$

where \(B_2=\tfrac{L_2}{N(L_2+\mu _1)-\mu _1}\) and \(\alpha =\tfrac{1-B_2}{N-1}-B_2\). Since we assume \(L_2>\mu _1\), we have \(B_2, \alpha \ge 0\). The rest of the proof runs as before. \(\square \)

The important point is that the last result provides a rate of convergence even if neither \(L_1\) nor \(L_2\) is finite, and we therefore state it as a corollary.

Corollary 4.1

Let \(f_1\in \mathcal {F}_{\mu _1, \infty }({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, \infty }({\mathbb {R}^n})\), i.e. consider any DC decomposition in problem (1). Then, after N iterations of Algorithm 1, one has

$$\begin{aligned} \min _{1\le k\le N} f_1(x^k)-f_1(x^{k+1})-\langle g_2^k, x^k-x^{k+1}\rangle \le \frac{1}{N} \left( f(x^1)-f^\star \right) . \end{aligned}$$

This result is new to the best of our knowledge.

5 Linear Convergence of the DCA under the Polyak–Łojasiewicz Inequality

In this section, we provide some sufficient conditions under which the DCA is linearly convergent. As in the former sections, we employ performance estimation to obtain the convergence rate.

In recent years, the linear convergence of some optimization methods for non-convex problems has been investigated under the Polyak–Łojasiewicz (PL) inequality; see [2, 12, 25] and the references therein. We say that f satisfies the PL inequality on X if there exists \(\eta >0\) such that

$$\begin{aligned} f(x)-f^\star \le \tfrac{1}{2\eta } \Vert \xi \Vert ^2, \ \ \forall x\in X, \forall \xi \in \text {co}(\partial _l f(x)). \end{aligned}$$
(18)
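For instance (an illustration of ours, not taken from the cited references), the convex least-squares function \(f(x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2\) satisfies (18) on all of \(\mathbb {R}^n\) with \(\eta =\lambda _{\min }^{+}(A^\top A)\), the smallest non-zero eigenvalue of \(A^\top A\), even when \(A^\top A\) is singular, i.e. without strong convexity: writing \(b_r\) for the orthogonal projection of b onto the range of A,

$$\begin{aligned} f(x)-f^\star =\tfrac{1}{2}\Vert Ax-b_r\Vert ^2\le \tfrac{1}{2\lambda _{\min }^{+}(A^\top A)}\left\| A^\top (Ax-b)\right\| ^2=\tfrac{1}{2\lambda _{\min }^{+}(A^\top A)}\Vert \nabla f(x)\Vert ^2. \end{aligned}$$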

Note that, when f is differentiable, inequality (18) is a special case of (3) with \(\theta =\tfrac{1}{2}\) and a different ground set. If \(f_1\) or \(f_2\) is strictly differentiable, we have \(\text {co}(\partial _l f)=\partial f_1-\partial f_2\); see Example 10.10 in [38]. Hence, the performance estimation problem with the PL inequality may be formulated as follows:

$$\begin{aligned} \nonumber \max&\ \frac{(f_1^2-f_2^2)-f^\star }{(f_1^1-f_2^1)-f^\star }\\ \nonumber \text {s.t.}\ {}&\tfrac{1}{2(1-\tfrac{\mu _1}{L_1})}\left( \tfrac{1}{L_1}\left\| g_1^i-g_1^j\right\| ^2+\mu _1\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _1}{L_1}\left\langle g_1^j-g_1^i,x^j-x^i\right\rangle \right) \\&\nonumber \quad \le f_1^i-f_1^j-\left\langle g_1^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, 2\right\} \\&\nonumber \tfrac{1}{2(1-\tfrac{\mu _2}{L_2})}\left( \tfrac{1}{L_2}\left\| g_2^i-g_2^j\right\| ^2+\mu _2\left\| x^i-x^j\right\| ^2-\tfrac{2\mu _2}{L_2}\left\langle g_2^j-g_2^i,x^j-x^i\right\rangle \right) \\&\nonumber \quad \le f_2^i-f_2^j-\left\langle g_2^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, 2\right\} \\ \nonumber&\ f_1^k-f_2^k\ge f^\star \ \ k\in \left\{ 1, 2\right\} \\ \nonumber&g_2^1=g_1^2\\&\left( f_1^k-f_2^k\right) -f^\star \le \tfrac{1}{2\eta }\Vert g_1^k-g_2^k\Vert ^2, \ \ k\in \left\{ 1, 2\right\} . \end{aligned}$$
(19)

By doing constraint aggregation in problem (19) as before (i.e. demonstrating a dual feasible solution and using weak duality), we obtain the following linear convergence rate for the DCA under the PL inequality.

Theorem 5.1

Let \(f_1\in \mathcal {F}_{\mu _1, L_1}({\mathbb {R}^n})\) and \(f_2\in \mathcal {F}_{\mu _2, L_2}({\mathbb {R}^n})\). If \(L_1\) or \(L_2\) is finite and if f satisfies PL inequality on \(X=\{x: f(x)\le f(x^1)\}\), then for \(x^2\) from Algorithm 1, we have

$$\begin{aligned} \frac{f(x^2)-f^\star }{f(x^1)-f^\star }\le \left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) . \end{aligned}$$
(20)

Proof

Since the given bound is independent of \(\mu _1\) and \(\mu _2\), without loss of generality, we assume that \(\mu _1=\mu _2=0\). In addition, we assume that \(f^\star =0\). Direct calculation shows that

$$\begin{aligned}&{\left( f_1^2-f_2^2\right) -f^\star }-\left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \left( {\left( f_1^1-f_2^1\right) -f^\star } \right) +\left( \frac{1}{1+\frac{\eta }{L_2}}\right) \\&\quad \times \left( f_1^1-f_1^{2}-\left\langle g_1^{2}, x^1-x^{2}\right\rangle -\tfrac{1}{2L_1}\left\| g_1^1-g_1^{2}\right\| ^2\right) \\&\quad +\left( \frac{1}{1+\frac{\eta }{L_2}}\right) \left( f_2^2-f_2^{1}-\left\langle g_1^{2}, x^2-x^{1}\right\rangle -\tfrac{1}{2L_2}\left\| g_1^2-g_2^{2}\right\| ^2\right) +\left( \frac{\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \\&\quad \times \left( \frac{1}{2\eta }\left\| g_1^1-g_1^2\right\| ^2-f_1^1+f_2^1\right) +\left( \frac{\frac{\eta }{L_2}}{1+\frac{\eta }{L_2}}\right) \left( \frac{1}{2\eta }\left\| g_1^2-g_2^2\right\| ^2-f_1^2+f_2^2\right) =0. \end{aligned}$$

As all the multipliers in the last expression are non-negative, for any feasible solution of problem (19), we have

$$\begin{aligned} f(x^2)-f^\star - \left( \frac{1-\frac{\eta }{L_1}}{1+\frac{\eta }{L_2}}\right) \left( f(x^1)-f^\star \right) \le 0, \end{aligned}$$

completing the proof.

Note that Theorem 1.1 by Le Thi et al. [27] does not imply Theorem 5.1 if inequality (3) holds on \(\{x: f(x)\le f(x^1)\}\) with \(\theta =\tfrac{1}{2}\), since we assume neither strong convexity of \(f_1\) or \(f_2\), nor boundedness of the sequence of iterates. Moreover, we give explicit expressions for the constants that determine the linear convergence rate of the sequence of objective values.

6 Conclusion

We have shown that the performance estimation framework of Drori and Teboulle [16] yields new insights into the convergence behavior of the difference-of-convex algorithm (DCA). As future work, one may also consider the convergence of the DCA on more restricted classes of DC problems, e.g. where \(f_1\) and \(f_2\) are convex polynomials, as studied in [3]. For constrained problems, even the case where \(f_1\) and \(f_2\) are quadratic polynomials is of interest, e.g. in the study of (extended) trust region problems.