1 Introduction

We are interested in solving Difference of Convex (DC) optimization problems:

$$\begin{aligned} ({\mathcal {P}}) \quad \min \left\{ \varPsi (x) := f_1(x) - f_2(x) + g(x) \ \big | \ x \in {\overline{C}} \right\} , \end{aligned}$$

where \(f_1\) and \(f_2\) are convex functions on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\), and \({\overline{C}}\) is the closure of C, a nonempty, convex, and open set. Note that the function \(f_1 - f_2\) is in general not convex. Also, g may be nonsmooth, such as the \(\ell _1\)-norm \(\Vert x\Vert _1\) in [8, 20, 29], or alternatively, \(f_2\) may be nonsmooth [16]. Some interesting examples of \(({\mathcal {P}})\) can be found in [28]. Although we will place some assumptions on C, it can be regarded as \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) for simplicity. Applications of DC optimization are summarized in [13, 17, 26].

The DC Algorithm (DCA) (see for instance [17]) is a well-known iterative method for solving the DC optimization problems \(({\mathcal {P}})\). At each iteration, its computational burden mainly depends on the resolution of the subproblem,

$$\begin{aligned} x^{k+1} = \mathop {\text {argmin}}\limits _{x \in {\overline{C}}} \left\{ f_1(x) - \langle \xi ^k, x - x^k \rangle + g(x) \right\} , \end{aligned}$$
(1)

where \(\xi ^k \in \partial _{\mathrm {c}} f_2(x^k) := \{\xi \in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ |\ f_2(y) - f_2(x^k) - \langle \xi , y - x^k \rangle \ge 0, \forall y \in {{\,\mathrm{{\mathbb {R}}}\,}}^d\}\) is a (classical) subgradient of \(f_2\) at \(x^k\in {\overline{C}}\). Solving subproblem (1) may be computationally demanding unless \(f_1\) and g have simple structure or \(({\mathcal {P}})\) is small-scale. When g is convex, the proximal DC Algorithm (pDCA) (see for instance [28]) is an alternative method for solving large-scale DC optimization problems. However, to guarantee global convergence of its iterates to a critical point, \(f_1\) needs to be L-smooth; i.e., its gradient needs to be globally Lipschitz continuous. Each step of pDCA is given by

$$\begin{aligned}&x^{k+1} = \mathop {\text {argmin}}\limits _{x \in {\overline{C}}} \left\{ g(x) + \langle \nabla f_1(x^k) - \xi ^k, x - x^k \rangle + \frac{1}{2\lambda }\Vert x - x^k\Vert ^2 \right\} , \end{aligned}$$
(2)

where \(\xi ^k \in \partial _{\mathrm {c}} f_2(x^k)\), \(x^k\in {\overline{C}}\), \(\lambda > 0\) satisfies \(0< \lambda L < 1\), and \(\Vert \cdot \Vert\) denotes the Euclidean norm. Since \(\lambda \ (< 1/L)\) plays the role of a step size, finding a larger upper bound \(1/L\), i.e., finding a smaller L, is of fundamental importance to achieving fast convergence. Wen et al. [28] proposed the proximal DC Algorithm with extrapolation (pDCAe) to accelerate pDCA with the extrapolation technique, which is used, for instance, in the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) of Beck and Teboulle [4] and in Nesterov's accelerated gradient methods [21, 22].
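For concreteness, when \({\overline{C}} = {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(g = \Vert \cdot \Vert _1\), subproblem (2) admits the closed-form solution given by soft-thresholding. The following Python sketch of one pDCA step is purely illustrative; grad_f1 and subgrad_f2 denote user-supplied oracles returning \(\nabla f_1(x^k)\) and some \(\xi ^k\in \partial _{\mathrm {c}} f_2(x^k)\), and are not part of the original formulation.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (componentwise shrinkage toward zero).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pdca_step(x, grad_f1, subgrad_f2, lam):
    # One pDCA iteration (2) with g = ||.||_1 and the closure of C equal to R^d:
    # x^{k+1} minimizes ||x||_1 + <grad f1(x^k) - xi^k, x - x^k> + ||x - x^k||^2 / (2*lam),
    # i.e., the soft-thresholding of a gradient-type step.
    z = x - lam * (grad_f1(x) - subgrad_f2(x))
    return soft_threshold(z, lam)
```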

Bolte et al. [8], who incorporated the kernel generating distance (function) h and Bregman distances [9] into the algorithmic framework, introduced the notion of the L-smooth adaptable property (see also [2]). This property is less restrictive than L-smoothness. A variant of the Bregman Proximal Gradient algorithm (BPG) proposed by Mukkamala et al. [20] iteratively estimates a small L, while Zhang et al. [29] proposed the Bregman Proximal Gradient algorithm with extrapolation (BPGe), which combines BPG with a line search step for the extrapolation parameters.

In this paper, we propose two new algorithms, namely, the Bregman Proximal Difference of Convex Algorithm (BPDCA) and the Bregman Proximal Difference of Convex Algorithm with extrapolation (BPDCAe), which are inspired by pDCA(e) and BPG(e). These novel algorithms combine pDCA(e) with the Bregman distances. In the subproblem of BPDCA(e), the use of Bregman distances guarantees the accuracy of a linear approximation of \(f_1-f_2\).

The novelty of our contributions can be better understood by comparing them with existing work. As already mentioned, Bregman distances allow us to extend the class of admissible functions \(f_1\) from the L-smooth functions required by pDCAe [28] to the larger class of L-smooth adaptable pairs of functions \((f_1,h)\). In addition, the function g does not need to be convex in the case of BPDCA. By assuming that either \(f_2\) or g is differentiable and that its gradient is locally Lipschitz continuous, the iterates of BPDCA(e) converge globally to a limiting stationary point (Theorems 2 and 7) or a limiting critical point (Theorems 3 and 8), where these notions are defined in Definition 6. This means that either g or \(f_2\) can be nonsmooth.

Compared with BPG-type algorithms [8, 20, 29], BPDCA(e) exploits the structure of the objective function better. When applying these BPG-type algorithms to solve problem \(({\mathcal {P}})\), we decompose \(\varPsi\) into two functions, and there are two naive ways to do so. First, consider the decomposition \(\varPsi = f_1 + (g - f_2)\) to apply BPG. In this case, BPG solves the subproblem \(\min \{g(x) - f_2(x) + \langle \nabla f_1(x^k), x-x^k\rangle +\frac{1}{\lambda }D_h(x, x^k)\}\) at the kth iteration, where \(D_h\) is the Bregman distance associated with a kernel generating distance h (see Definition 2) and \(\lambda < 1/L\) is a positive parameter. In general, it is difficult to solve this subproblem efficiently because \(f_2\) often lacks simple structure such as separability; this remains true even when g is simultaneously convex and separable. With BPDCA, we only need to solve the subproblem \(\min \{g(x) + \langle \nabla f_1(x^k) - \xi ^k, x-x^k\rangle +\frac{1}{\lambda }D_{h'}(x, x^k)\}\), where \(\xi ^k\in \partial _{\mathrm {c}}f_2(x^k)\). If g is additionally convex, the subproblem becomes convex and hence can often be solved efficiently. Moreover, if g and h are also separable, the subproblem reduces to d independent one-dimensional convex optimization problems. Even without separability of h, the subproblem often has closed-form solution formulae, as we will see in Sect. 5. As an alternative decomposition of \(\varPsi\), consider \(\varPsi = (f_1 - f_2) + g\) to apply BPG. In this case, to guarantee global convergence, the L-smooth adaptability of \((f_1-f_2,h)\) is required (see Definition 3). Meanwhile, for the global convergence of BPDCA(e), the \(L'\)-smooth adaptability of \((f_1,h')\) is required. Comparing these constants, \(L'\le L\) in general, and thus we can expect a larger decrease in the objective function at each iteration (Lemma 5 and [8, Lemma 4.1]). This fact has dramatic consequences in practice, as we found in numerical experiments on phase retrieval (Sect. 5.1).

Our convergence analysis and the monotonicity of the objective values rest on standard assumptions. Our new restart scheme (Sect. 3.2) plays an important role in guaranteeing the non-increasing property of the objective values for BPDCAe without the line search used in [29]. We show global convergence under local Lipschitz continuity of the gradients together with the Kurdyka-Łojasiewicz property or subanalyticity of the objective function. Additionally, we evaluate the rates of convergence of BPDCA(e).

To evaluate the performance of BPDCA(e), we applied them to phase retrieval, a well-known application in nonconvex optimization. Phase retrieval arises in many fields of science and engineering, such as X-ray crystallography and image processing [10, 24]. It can be formulated as a nonconvex optimization problem or a DC optimization problem \(({\mathcal {P}})\), as in [14]. It cannot be solved via pDCA or proximal algorithms, since the function we want to minimize is not L-smooth. When we formulated phase retrieval as a DC optimization problem, we obtained much smaller L-smooth adaptable parameters than the existing ones [8, Lemma 5.1], [20]. Thus, our algorithms outperformed BPG(e) in our numerical experiments. Further experiments showed that, under the Gaussian model, BPDCAe had a higher success rate of phase retrieval than that of Wirtinger flow [10]. Although the kernel generating distance h we utilized does not satisfy Assumption 4 (i), the sequences generated by BPDCA(e) converged in the numerical experiments. Therefore, we conjecture that all convergence analyses can be carried out under weaker conditions.

This paper is organized as follows. Section 2 summarizes notions such as the limiting subdifferential, the Bregman distances, the L-smooth adaptable property, the Kurdyka-Łojasiewicz property, and the subanalytic functions. Section 3 introduces our Bregman proximal-type algorithms under the assumption that \((f_1,h)\) has an L-smooth adaptable property. Section 4 (and Appendix A) establishes the global convergence of BPDCA(e) to a limiting stationary point or a limiting critical point of the problem \(({\mathcal {P}})\) and analyzes its rate of convergence. Section 5 derives small values for the constant L and compares the performance of our algorithms with that of BPG(e). Section 6 summarizes our contributions and discusses future work.

2 Preliminaries

Here, we review the important notions we will need in the subsequent sections.

2.1 Subdifferentials

Definition 1

(Limiting Subdifferential [23]) For a proper and lower semicontinuous function \(f:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow (-\infty , +\infty ]\), the limiting subdifferential [23] of f at \(x\in {{\,\mathrm{dom}\,}}f\) is defined by

$$\begin{aligned} \partial f(x) = \left\{ \xi \in {{\,\mathrm{{\mathbb {R}}}\,}}^d \ \big |\ \exists x^k \xrightarrow {f} x,\xi ^k\rightarrow \xi \ \mathrm {such\ that}\ \liminf _{y\rightarrow x^k, y\ne x^k}\frac{f(y) - f(x^k) - \langle \xi ^k, y - x^k \rangle }{\Vert y - x^k\Vert }\ge 0\right\} , \end{aligned}$$

where \(x^k \xrightarrow {f} x\) means \(x^k \rightarrow x\) and \(f(x^k)\rightarrow f(x)\).

Note that when f is convex, the limiting subdifferential coincides with the (classical) subdifferential [23, Proposition 8.12], that is, \(\partial f(x) = \partial _{\mathrm {c}} f(x)\) for all \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\).

2.2 Bregman distances

First, we define kernel generating distances and Bregman distances.

Definition 2

(Kernel Generating Distances [8] and Bregman Distances [9]) Let C be a nonempty, convex, and open subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). Associated with C, a function \(h:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow (-\infty , +\infty ]\) is called a kernel generating distance if it meets the following conditions:

  1. (i)

    h is proper, lower semicontinuous, and convex, with \({{\,\mathrm{dom}\,}}h \subset {\overline{C}}\) and \({{\,\mathrm{dom}\,}}\partial h = C\).

  2. (ii)

    h is \({\mathcal {C}}^1\) on \({{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h = C\).

We denote the class of kernel generating distances by \({\mathcal {G}}(C)\). Given \(h \in {\mathcal {G}}(C)\), the Bregman distance \(D_h: {{\,\mathrm{dom}\,}}h \times {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h \rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}_+\) is defined by

$$\begin{aligned} D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle . \end{aligned}$$

From the gradient inequality, h is convex if and only if \(D_h(x, y) \ge 0\) for any \(x \in {{\,\mathrm{dom}\,}}h\) and \(y \in {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\). When h is strictly convex, \(D_h(x, y) = 0\) if and only if \(x = y\). When \(h = \frac{1}{2} \Vert \cdot \Vert ^2\), \(D_h(x, y) = \frac{1}{2} \Vert x - y\Vert ^2\), which is half the squared Euclidean distance.

In addition, the Bregman distances satisfy the three-point identity [8],

$$\begin{aligned} D_h(x, z) - D_h(x, y) - D_h(y, z) = \langle \nabla h(y) - \nabla h(z), x - y \rangle , \end{aligned}$$
(3)

for any \(y, z \in {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\), and \(x \in {{\,\mathrm{dom}\,}}h\).
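As a small numerical illustration (a sketch only, not part of the development), the Bregman distances induced by the squared Euclidean kernel and by the Shannon entropy kernel \(h(x)=\sum _i x_i\log x_i\) on the positive orthant can be computed directly from the definition; the latter yields the generalized Kullback-Leibler divergence.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>.
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Squared Euclidean kernel: D_h(x, y) = ||x - y||^2 / 2.
h_sq = lambda x: 0.5 * np.dot(x, x)
grad_h_sq = lambda x: x

# Shannon entropy kernel (positive orthant):
# D_h(x, y) = sum_i ( x_i * log(x_i / y_i) - x_i + y_i ), the generalized KL divergence.
h_ent = lambda x: np.sum(x * np.log(x))
grad_h_ent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
print(bregman(h_sq, grad_h_sq, x, y))    # equals 0.5 * np.sum((x - y) ** 2)
print(bregman(h_ent, grad_h_ent, x, y))  # equals np.sum(x * np.log(x / y) - x + y)
```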

2.3 Smooth adaptable functions

Now let us define the notion of L-smooth adaptable.

Definition 3

(L-smooth adaptable [8]) Consider a pair of functions (f, h) satisfying the following conditions:

  1. (i)

    \(h \in {\mathcal {G}}(C)\),

  2. (ii)

    \(f: {{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow (-\infty , +\infty ]\) is a proper and lower semicontinuous function with \({{\,\mathrm{dom}\,}}h \subset {{\mathrm{dom}\,}}f\), which is \({\mathcal {C}}^1\) on \(C = {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\).

The pair (f, h) is called L-smooth adaptable (L-smad) on C if there exists \(L > 0\) such that \(Lh - f\) and \(Lh + f\) are convex on C.

Since our focus is on DC optimization, the function \(f_1\) in \(({\mathcal {P}})\) is always convex. Thus, it will be enough to verify that \(Lh-f_1\) is convex on C to have \((f_1,h)\) L-smad on C.

From the L-smooth adaptable property comes the Descent Lemma [8].

Lemma 1

(Full Extended Descent Lemma [8]) A pair of functions (f, h) is L-smad on \(C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) if and only if:

$$\begin{aligned} |f(x) - f(y) - \langle \nabla f(y), x - y \rangle | \le L D_h(x, y),\quad \forall x, y \in {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h. \end{aligned}$$
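In particular, when \(h = \frac{1}{2}\Vert \cdot \Vert ^2\), so that \(C = {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(D_h(x, y) = \frac{1}{2}\Vert x - y\Vert ^2\), the inequality above becomes

$$\begin{aligned} |f(x) - f(y) - \langle \nabla f(y), x - y \rangle | \le \frac{L}{2}\Vert x - y\Vert ^2,\quad \forall x, y \in {{\,\mathrm{{\mathbb {R}}}\,}}^d, \end{aligned}$$

which is the classical two-sided descent inequality satisfied by functions with L-Lipschitz continuous gradients.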

2.4 Kurdyka-Łojasiewicz property and subanalytic functions

Given \(\eta >0\), let \(\varXi _{\eta }\) denote the set of all continuous concave functions \(\phi :[0, \eta ) \rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}_+\) that are \({\mathcal {C}}^1\) on \((0, \eta )\) with positive derivatives and which satisfy \(\phi (0) = 0\). Here, we introduce the Kurdyka-Łojasiewicz property [7, 15], which we need when analyzing our algorithms:

Definition 4

(Kurdyka-Łojasiewicz property) Let \(f: {{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow (-\infty , +\infty ]\) be a proper and lower semicontinuous function.

  1. (i)

    f is said to have the Kurdyka-Łojasiewicz (KL) property at \({\hat{x}} \in {{\,\mathrm{dom}\,}}\partial f\) if there exist \(\eta \in (0, +\infty ]\), a neighborhood U of \({\hat{x}}\), and a function \(\phi \in \varXi _{\eta }\) such that the following inequality holds:

    $$\begin{aligned} \phi '(f(x) - f({\hat{x}})) \cdot {{\,\mathrm{dist}\,}}(0, \partial f(x)) &\ge 1, \\ &\forall x\in U \cap \{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d \mid f({\hat{x}})<f(x)<f({\hat{x}})+\eta \}. \end{aligned}$$
    (4)
  2. (ii)

    If f has the KL property at each point of \({{\,\mathrm{dom}\,}}\partial f\), then it is called a KL function.
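As a simple illustration, the convex function \(f(x) = \Vert x\Vert\) has the KL property at \({\hat{x}} = 0\) with \(\phi (s) = s\), \(U = {{\,\mathrm{{\mathbb {R}}}\,}}^d\), and any \(\eta \in (0, +\infty ]\): for every x with \(0< f(x) < \eta\), we have \(\partial f(x) = \{x/\Vert x\Vert \}\), so that

$$\begin{aligned} \phi '(f(x) - f({\hat{x}})) \cdot {{\,\mathrm{dist}\,}}(0, \partial f(x)) = 1 \cdot 1 \ge 1, \end{aligned}$$

and inequality (4) holds.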

Lemma 2

(Uniformized KL property [7]) Suppose that \(f:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow (-\infty , +\infty ]\) is a proper and lower semicontinuous function and let \(\varGamma\) be a compact set. If f is constant on \(\varGamma\) and has the KL property at each point of \(\varGamma\), then there exist positive scalars \(\epsilon , \eta > 0\), and \(\phi \in \varXi _{\eta }\) such that

$$\begin{aligned} \phi '(f(x) - f({\hat{x}})) \cdot {{\,\mathrm{dist}\,}}(0, \partial f(x)) \ge 1, \end{aligned}$$

for any \({\hat{x}}\in \varGamma\) and any x satisfying \({{\,\mathrm{dist}\,}}(x, \varGamma ) < \epsilon\) and \(f({\hat{x}})< f(x) < f({\hat{x}}) + \eta\).

Next, we describe subanalytic functions.

Definition 5

(Subanalyticity [6])

  1. (i)

    A subset A of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) is called semianalytic if each point of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) admits a neighborhood V for which \(A \cap V\) assumes the following form:

    $$\begin{aligned} \bigcup _{i=1}^p\bigcap _{j=1}^q\left\{ x \in V\ \big |\ f_{ij}(x) = 0, g_{ij}(x) > 0 \right\} , \end{aligned}$$

    where the functions \(f_{ij}, g_{ij}:V\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) are real-analytic for all \(1 \le i \le p,1 \le j \le q\).

  2. (ii)

    The set A is called subanalytic if each point of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) admits a neighborhood V such that

$$\begin{aligned} A \cap V = \left\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ \big |\ \exists y\in {{\,\mathrm{{\mathbb {R}}}\,}}^m\ \mathrm {such\ that}\ (x, y) \in B \right\} , \end{aligned}$$

    where B is a bounded semianalytic subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d \times {{\,\mathrm{{\mathbb {R}}}\,}}^m\) for some \(m \ge 1\).

  3. (iii)

    A function \(f:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow (-\infty , +\infty ]\) is called subanalytic if its graph is a subanalytic subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\times {{\,\mathrm{{\mathbb {R}}}\,}}\).

For instance, given a subanalytic set S, \({{\,\mathrm{dist}\,}}(x, S)\) is subanalytic, and every analytic function is subanalytic. Note that subanalytic functions are KL functions. See [5, 6] for further properties of subanalyticity.

3 Proposed methods: Bregman Proximal DC Algorithms

We place the following assumptions on the DC optimization problem \(({\mathcal {P}})\). Recall that \(C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\).

Assumption 1

  1. (i)

    \(h \in {\mathcal {G}}(C)\) with \({\overline{C}} = \overline{{{\,\mathrm{dom}\,}}h}\).

  2. (ii)

    \(f_1:{{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow (-\infty , +\infty ]\) is proper and convex with \({{\,\mathrm{dom}\,}}h \subset {{\,\mathrm{dom}\,}}(f_1 + g)\), which is \({\mathcal {C}}^1\) on C.

  3. (iii)

    \(f_2:{{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow (-\infty , +\infty ]\) is proper and convex.

  4. (iv)

    \(g:{{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow (-\infty , +\infty ]\) is proper and lower semicontinuous with \({{\,\mathrm{dom}\,}}g \cap C \ne \emptyset\).

  5. (v)

    \(v({\mathcal {P}}) := \inf \{ \varPsi (x) \ | \ x \in {\overline{C}} \} > -\infty\).

  6. (vi)

For any \(\lambda > 0\), \(\lambda g + h\) is supercoercive, that is,

    $$\begin{aligned} \lim _{\Vert u\Vert \rightarrow \infty } \frac{\lambda g(u) + h(u)}{\Vert u\Vert } = \infty . \end{aligned}$$

Let \(x\in {{\,\mathrm{dom}\,}}(f_1 + g)\); then \(f_2(x) \le g(x) + f_1(x) - v({\mathcal {P}}) < +\infty\) due to Assumption 1 (v). Thus, \(x\in {{\,\mathrm{dom}\,}}f_2\), i.e., \({{\,\mathrm{dom}\,}}(f_1 + g) \subset {{\,\mathrm{dom}\,}}f_2\). From Assumption 1 (ii), we have \(C \subset {{\,\mathrm{dom}\,}}(f_1 + g) \subset {{\,\mathrm{dom}\,}}f_2\). Note that Assumption 1 (vi) holds when \({\overline{C}}\) is compact [8].
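For example, Assumption 1 (vi) also holds whenever g is nonnegative (e.g., \(g = \Vert \cdot \Vert _1\)) and \(h = \frac{1}{2}\Vert \cdot \Vert ^2\), since for every \(\lambda > 0\),

$$\begin{aligned} \frac{\lambda g(u) + h(u)}{\Vert u\Vert } \ge \frac{\Vert u\Vert }{2} \rightarrow \infty \quad \mathrm {as}\ \Vert u\Vert \rightarrow \infty . \end{aligned}$$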

3.1 Bregman Proximal DC Algorithm (BPDCA)

To obtain the Bregman Proximal DC Algorithm (BPDCA) mapping for some \(\lambda > 0\), we recast the objective function of \(({\mathcal {P}})\) via a DC decomposition:

$$\begin{aligned} \varPsi (u) = f_1(u) - f_2(u) + g(u) =\left( \frac{1}{\lambda } h(u) + g(u) \right) - \left( \frac{1}{\lambda } h(u) - f_1(u) + f_2(u) \right) , \end{aligned}$$

and, given \(x\in C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) and \(\xi \in \partial _{\mathrm {c}} f_2(x)\), define the mapping,

$$\begin{aligned} {\mathcal {T}}_{\lambda }(x) := \mathop {\text {argmin}}\limits _{u \in {\overline{C}}} \left\{ g(u) + \langle \nabla f_1(x) - \xi , u - x \rangle + \frac{1}{\lambda }D_h(u, x) \right\} . \end{aligned}$$

Additionally, we put the following assumption on \(({\mathcal {P}})\).

Assumption 2

For all \(x \in C\) and \(\lambda > 0\), we have

$$\begin{aligned} {\mathcal {T}}_{\lambda }(x) \subset C,\quad \forall x \in C. \end{aligned}$$

Note that Assumption 2 is not so restrictive because it holds when \(C\equiv {{\,\mathrm{{\mathbb {R}}}\,}}^d\). Under Assumptions 1 and 2, we have the following lemma [8, Lemma 3.1].

Lemma 3

Suppose that Assumptions 1 and 2 hold, and let \(x \in C = {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\). Then, the set \({\mathcal {T}}_{\lambda }(x)\) is a nonempty and compact subset of C for any \(\lambda >0\).

Note that when h is strictly convex, \({\mathcal {T}}_{\lambda }(x)\) is a singleton. Also, when g and h are separable, this mapping is easily computable since the problem defining \({\mathcal {T}}_{\lambda }(x)\) decomposes into independent one-dimensional optimization problems, which often have closed-form solutions. For instance, when \(h(x)=\frac{1}{2}\Vert x\Vert ^2\), for \(g(x)=\Vert x\Vert _1\), \({{\mathcal {T}}}_{\lambda }(x)\) becomes the soft-thresholding operator, and for \(g(x)=\Vert x\Vert _0\), the hard-thresholding operator. Other well-known examples where this mapping has a closed-form solution arise when we use an appropriate h, such as the Burg entropy [2], the Shannon entropy [3], or \(h(x)=\frac{1}{4}\Vert x\Vert ^4+\frac{1}{2}\Vert x\Vert ^2\) [8], for the corresponding g. Note that this last h(x) is not separable. For further examples, see [12, Table 2.1].
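For instance (a sketch only, assuming \({\overline{C}} = {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(h = \frac{1}{2}\Vert x\Vert ^2\)), the mapping \({\mathcal {T}}_{\lambda }\) for the nonconvex penalty \(g = \Vert \cdot \Vert _0\) can be evaluated by componentwise hard-thresholding; grad_f1 and subgrad_f2 below are placeholder oracles for \(\nabla f_1\) and a subgradient of \(f_2\).

```python
import numpy as np

def bpdca_map_l0(x, grad_f1, subgrad_f2, lam):
    # With h = ||.||^2 / 2 and g = ||.||_0, T_lambda(x) is the hard-thresholding of
    # z = x - lam * (grad f1(x) - xi) at level sqrt(2 * lam): keeping a component z_i
    # costs lam, setting it to zero costs z_i^2 / 2, so z_i survives iff |z_i| > sqrt(2*lam).
    z = x - lam * (grad_f1(x) - subgrad_f2(x))
    return np.where(np.abs(z) > np.sqrt(2.0 * lam), z, 0.0)
```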

The Bregman Proximal DC Algorithm (BPDCA), which we are proposing, is listed as Algorithm 1.

Algorithm 1 Bregman Proximal DC Algorithm (BPDCA)
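The structure of Algorithm 1 can be summarized by the following Python sketch (illustrative only); grad_f1, subgrad_f2, and solve_subproblem are placeholder oracles for \(\nabla f_1\), an element of \(\partial _{\mathrm {c}} f_2\), and a solver of subproblem (5), i.e., an evaluation of \({\mathcal {T}}_{\lambda }\).

```python
import numpy as np

def bpdca(x0, grad_f1, subgrad_f2, solve_subproblem, lam, max_iter=1000, tol=1e-8):
    # Bregman Proximal DC Algorithm (sketch).
    # solve_subproblem(x, c, lam) is assumed to return a minimizer over the closure of C of
    #     g(u) + <c, u - x> + D_h(u, x) / lam,
    # with c = grad f1(x^k) - xi^k, which is exactly subproblem (5).
    x = x0
    for _ in range(max_iter):
        xi = subgrad_f2(x)                                   # xi^k in subdifferential of f2 at x^k
        x_new = solve_subproblem(x, grad_f1(x) - xi, lam)    # x^{k+1} in T_lambda(x^k)
        if np.linalg.norm(x_new - x) <= tol:                 # simple illustrative stopping rule
            return x_new
        x = x_new
    return x
```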

As a recurring example, \(D_h(x, x^k) = \frac{1}{2}\Vert x - x^k\Vert ^2\) when \(h(x) = \frac{1}{2}\Vert x\Vert ^2\). In this case, if L is taken to be the Lipschitz constant of the gradient of \(f_1\), subproblem (5) reduces to subproblem (2). If \(f_2\) is \({\mathcal {C}}^1\) on C and the pair \((f_1 - f_2, h)\) is L-smad, BPDCA reduces to BPG [8].

3.2 Bregman Proximal DC Algorithm with extrapolation (BPDCAe)

Algorithm 2, which we are proposing, is an acceleration of BPDCA that uses the extrapolation technique [4, 21, 22] to solve the DC optimization problem \(({\mathcal {P}})\).

Algorithm 2 Bregman Proximal DC Algorithm with extrapolation (BPDCAe)

When \(\beta _k\equiv 0\) for all \(k \ge 0\), BPDCAe reduces to BPDCA. Here, we prefer the popular choice for the coefficients \(\beta _k\) (and \(\theta _k\)) given in [28] for acceleration. Accordingly, (6) guarantees that \(\{\beta _k\}_{k=0}^{\infty } \subset [0, 1)\) and \(\sup _{k\ge 0} \beta _k < 1\). These properties are needed to prove global subsequential convergence of the iterates (see Theorem 6 (ii)). Algorithm 2 introduces a new adaptive restart scheme, which resets \(\theta _k\) and \(\beta _k\) whenever

$$\begin{aligned} D_h(x^k, y^k) > \rho D_h(x^{k-1}, x^k), \end{aligned}$$
(8)

is satisfied for a fixed \(\rho \in [0,1)\). This adaptive restart scheme guarantees the non-increasing property for BPDCAe (see Lemma 6). In addition, we can enforce this resetting every K iterations for a given positive integer K. In our numerical experiments, we used both the fixed and the adaptive restart schemes for \(\{\beta _k\}_{k=0}^{\infty }\).

When \(C = {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h = {{\,\mathrm{{\mathbb {R}}}\,}}^d\), \(y^k\) always stays in C. However, when \(C \ne {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(x^k+\beta _k(x^k-x^{k-1}) \notin C\), Algorithm 2 enforces \(\beta _k=0\) and BPDCAe is not accelerated at the kth iteration. This operation guarantees that \(y^k\) always stays in C. In practice, however, the extrapolation step is usually admissible and does accelerate BPDCAe.

We define the following BPDCAe mapping for all \(x, y \in C ={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\), and \(\lambda \in (0, 1/L)\):

$$\begin{aligned} {\mathcal {S}}_{\lambda }(x, y) := \mathop {\text {argmin}}\limits _{u \in {\overline{C}}} \left\{ g(u) + \langle \nabla f_1(y) - \xi , u - y \rangle + \frac{1}{\lambda }D_h(u, y) \right\} , \end{aligned}$$

where \(\xi \in \partial _{\mathrm {c}} f_2(x)\). Similarly to the case of BPDCA, we impose Assumption 3 and can prove Lemma 4 below, the analogue of Lemma 3 for \({\mathcal {S}}_{\lambda }(x, y)\).

Assumption 3

For all \(x, y \in C\) and \(\lambda > 0\), we have

$$\begin{aligned} {\mathcal {S}}_{\lambda }(x, y) \subset C,\quad \forall x, y \in C. \end{aligned}$$

Lemma 4

Suppose that Assumptions 1 and 3 hold, and let \(x, y \in C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\). Then, the set \({\mathcal {S}}_{\lambda }(x, y)\) is a nonempty and compact subset of C for any \(\lambda >0\).
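Combining the extrapolation step, the safeguard that keeps \(y^k\) in C, the restart scheme of Sect. 3.2, and the mapping \({\mathcal {S}}_{\lambda }\), one pass of BPDCAe can be sketched in Python as follows (illustrative only). The FISTA-type updates \(\theta _{k+1} = (1+\sqrt{1+4\theta _k^2})/2\) and \(\beta _k = (\theta _{k-1}-1)/\theta _k\) with \(\theta _{-1}=\theta _0=1\) are assumed here as one admissible instance of the popular coefficients mentioned in Sect. 3.2; D_h, in_C, and solve_subproblem are placeholder oracles for the Bregman distance, membership in C, and the minimization defining \({\mathcal {S}}_{\lambda }(x^k, y^k)\).

```python
import numpy as np

def bpdcae(x0, grad_f1, subgrad_f2, solve_subproblem, D_h, in_C, lam,
           rho=0.99, K=200, max_iter=1000):
    # BPDCAe (sketch) with FISTA-type momentum, the feasibility safeguard on y^k,
    # the adaptive restart (8), and a fixed restart every K iterations.
    x_prev, x = x0, x0
    theta_prev, theta = 1.0, 1.0
    for k in range(max_iter):
        beta = (theta_prev - 1.0) / theta
        y = x + beta * (x - x_prev)
        if not in_C(y):
            y = x                                  # enforce beta_k = 0 so that y^k stays in C
        if D_h(x, y) > rho * D_h(x_prev, x) or (k + 1) % K == 0:
            theta_prev, theta = 1.0, 1.0           # reset theta_k and beta_k (adaptive/fixed restart)
            y = x
        xi = subgrad_f2(x)                         # xi^k in subdifferential of f2 at x^k
        # solve_subproblem(y, c, lam) is assumed to return a minimizer over the closure of C of
        #     g(u) + <c, u - y> + D_h(u, y) / lam,  with c = grad f1(y^k) - xi^k.
        x_next = solve_subproblem(y, grad_f1(y) - xi, lam)
        theta_prev, theta = theta, (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        x_prev, x = x, x_next
    return x
```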

4 Convergence analysis

Throughout this section, we will assume that the pair of functions \((f_1,h)\) is L-smad on C.

4.1 Properties of BPDCA

First, we show the decreasing property of the BPDCA mapping for \(0< \lambda L < 1\) (the argument is adapted from [8, Lemma 4.1]).

Lemma 5

Suppose that Assumptions 1 and 2 hold. For any \(x \in C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) and any \(x^+ \in C={{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) defined by

$$\begin{aligned} x^+ \in \mathop {\text {argmin}}\limits _{u \in {\overline{C}}} \left\{ g(u) + \langle \nabla f_1(x) - \xi , u - x \rangle + \frac{1}{\lambda }D_h(u, x) \right\} , \end{aligned}$$
(9)

where \(\xi \in \partial _{\mathrm {c}} f_2(x)\) and \(\lambda > 0\), it holds that

$$\begin{aligned} \lambda \varPsi (x^+) \le \lambda \varPsi (x) - (1 - \lambda L) D_h(x^+, x). \end{aligned}$$
(10)

In particular, a sufficient decrease in the objective function value \(\varPsi\) is ensured when \(0< \lambda L < 1\).

Proof

From the global optimality of \(x^+\) by taking \(u = x \in {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) and \(\xi \in \partial _{\mathrm {c}} f_2(x)\), we obtain

$$\begin{aligned} g(x^+) + \langle \nabla f_1(x) - \xi , x^+ - x \rangle + \frac{1}{\lambda }D_h(x^+, x) \le g(x). \end{aligned}$$

Invoking the full Extended Descent Lemma (Lemma 1) for \(f_1\), the definition of the subgradient for \(f_2\), and the above inequality, we have

$$\begin{aligned} f_1(x^+) - f_2(x^+) + g(x^+)&\le f_1(x) - f_2(x) + \langle \nabla f_1(x) - \xi , x^+ - x \rangle + LD_h(x^+, x) + g(x^+)\\&\le f_1(x) - f_2(x) + LD_h(x^+, x) + g(x) - \frac{1}{\lambda }D_h(x^+, x)\\&= f_1(x) - f_2(x) + g(x) - \left( \frac{1}{\lambda } - L \right) D_h(x^+, x), \end{aligned}$$

where \(\varPsi = f_1 - f_2 + g\); multiplying by \(\lambda\) yields (10). The last statement follows from \(0< \lambda L < 1\). \(\square\)

Proposition 1 follows immediately from Lemma 5, as in [8].

Proposition 1

Suppose that Assumptions 1 and 2 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\). Then, the following statements hold:

  1. (i)

    The sequence \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is non-increasing.

  2. (ii)

    \(\sum _{k=1}^{\infty }D_h(x^k, x^{k-1}) < \infty\); hence, the sequence \(\{D_h(x^k, x^{k-1})\}_{k=0}^{\infty }\) converges to zero.

  3. (iii)

    \(\min _{1\le k \le n}D_h(x^k, x^{k-1}) \le \frac{\lambda }{n}\left( \frac{\varPsi (x^0) - \varPsi _*}{1 - \lambda L} \right)\), where \(\varPsi _* = v({\mathcal {P}}) > -\infty\) (by Assumption 1 (v)).

4.2 Convergence analysis of BPDCA

Suppose that the following conditions hold.

Assumption 4

  1. (i)

    \({{\,\mathrm{dom}\,}}h = {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and h is \(\sigma\)-strongly convex on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\).

  2. (ii)

    \(\nabla h\) and \(\nabla f_1\) are Lipschitz continuous on any bounded subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\).

  3. (iii)

    The objective function \(\varPsi\) is level-bounded; i.e., for any \(r \in {{\,\mathrm{{\mathbb {R}}}\,}}\), the lower level sets \(\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ |\ \varPsi (x)\le r \}\) are bounded.

Since \(C = {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h = {{\,\mathrm{{\mathbb {R}}}\,}}^d\) under Assumption 4 (i), Assumptions 2 and 3 are automatically fulfilled. For nonconvex functions, we use the limiting subdifferential [23] (Definition 1). Inspired by Fermat’s rule [23, Theorem 10.1], we define the limiting critical points and the limiting stationary points of \(\varPsi\).

Definition 6

We say that \({\tilde{x}}\) is a limiting critical point of \(({\mathcal {P}})\) with \(C \equiv {{\,\mathrm{{\mathbb {R}}}\,}}^d\) if

$$\begin{aligned} 0 \in \nabla f_1({\tilde{x}}) - \partial _{\mathrm {c}} f_2({\tilde{x}}) + \partial g({\tilde{x}}). \end{aligned}$$
(11)

The set of all limiting critical points of \(({\mathcal {P}})\) is denoted by \({\mathcal {X}}\). In addition, we say that \({\tilde{x}}\) is a limiting stationary point of \(({\mathcal {P}})\) with \(C \equiv {{\,\mathrm{{\mathbb {R}}}\,}}^d\) if

$$\begin{aligned} 0 \in \partial \varPsi ({\tilde{x}}). \end{aligned}$$
(12)

Although limiting stationary points are sometimes called limiting critical points in the literature, for example in [7, Definition 1 (iv)], we distinguish between these two terms for the following reasons. When \(\varPsi\) is convex, we call \({\tilde{x}}\) a stationary point if it satisfies \(0\in \partial _{\mathrm {c}}\varPsi ({\tilde{x}})\). Because (12) is its natural extension obtained by replacing \(\partial _{\mathrm {c}}\varPsi\) with \(\partial \varPsi\), we use the terminology “limiting stationary point” following [11, Definition 6.1.4]. We name \({\tilde{x}}\) satisfying (11) similarly: when g is convex, we call \({\tilde{x}}\) a critical point if it satisfies \(0\in \nabla f_1({\tilde{x}}) - \partial _{\mathrm {c}} f_2({\tilde{x}}) + \partial _{\mathrm {c}} g({\tilde{x}})\). Because (11) is its natural extension obtained by replacing \(\partial _{\mathrm {c}}g\) with \(\partial g\), we use the terminology “limiting critical point.”

Being a limiting stationary point is a first-order necessary condition for local optimality; this relation is known as the generalized Fermat’s rule [23, Theorem 10.1]. We can deduce \(\partial (g - f_2)(x) \subseteq \partial g(x) - \partial _{\mathrm {c}} f_2(x)\) from [19, Corollary 3.4]. Plugging this into [23, Corollary 10.9], it generally holds that \(\partial \varPsi (x) \subseteq \nabla f_1(x) - \partial _{\mathrm {c}} f_2(x) + \partial g(x)\) for all \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\). This implies that the notion of a limiting critical point is weaker than that of a limiting stationary point. When \(f_2\) is \({\mathcal {C}}^1\) on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\), it holds that \(\partial \varPsi (x) \equiv \nabla f_1(x) - \nabla f_2(x) + \partial g(x)\) from [23, Corollary 10.9] or [18, Proposition 1.107 (ii)] and the definition of the limiting subdifferentials of \(f_2\) and g. Thus, every limiting critical point is a limiting stationary point when \(f_2\) is \({\mathcal {C}}^1\).
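The inclusion \(\partial \varPsi (x) \subseteq \nabla f_1(x) - \partial _{\mathrm {c}} f_2(x) + \partial g(x)\) can be strict. As a standard one-dimensional illustration, take \(d=1\), \(f_1 \equiv 0\), \(g \equiv 0\), and \(f_2(x) = |x|\), so that \(\varPsi (x) = -|x|\). At \({\tilde{x}} = 0\),

$$\begin{aligned} \nabla f_1(0) - \partial _{\mathrm {c}} f_2(0) + \partial g(0) = -[-1, 1] = [-1, 1] \ni 0, \quad \mathrm {whereas}\quad \partial \varPsi (0) = \{-1, 1\} \not\ni 0, \end{aligned}$$

so the origin is a limiting critical point but not a limiting stationary point.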

Next, using Lemma 5 and Proposition 1, we will show global subsequential convergence of the iterates to a limiting critical point of the problem \(({\mathcal {P}})\). We can easily see that Theorem 1 (i) holds from the level-boundedness of \(\varPsi\). Theorem 1 (iii) and (iv) will play an essential role in determining the global convergence and the rate of convergence of BPDCA.

Theorem 1

(Global subsequential convergence of BPDCA) Suppose that Assumptions 12, and 4 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the following statements hold:

  1. (i)

    The sequence \(\{x^k\}_{k=0}^{\infty }\) is bounded.

  2. (ii)

    The sequence \(\{\xi ^k\}_{k=0}^{\infty }\) is bounded.

  3. (iii)

    \(\lim _{k\rightarrow \infty }\Vert x^{k+1} - x^k\Vert = 0\).

  4. (iv)

    Any accumulation point of \(\{x^k\}_{k=0}^{\infty }\) is a limiting critical point of \(({\mathcal {P}})\).

Proof

(i) From Proposition 1, we obtain \(\varPsi (x^k) \le \varPsi (x^0)\) for all \(k \in {\mathbb {N}}\), which shows that \(\{x^k\}_{k=0}^{\infty }\) is bounded from Assumption 4 (iii).

(ii) From Assumptions 1 (ii) and 4 (i) and the convexity of \(f_2\), we have \({{\,\mathrm{dom}\,}}f_2={{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(\partial _{\mathrm {c}} f_2 (x^k) \ne \emptyset\). Suppose, for the sake of proof by contradiction, that \(\{\xi ^{k}\}_{k=0}^{\infty }\) is unbounded; taking a subsequence if necessary, we may assume that \(\Vert \xi ^{k}\Vert \rightarrow \infty\) as \(k\rightarrow \infty\). By the definition of the subgradients of convex functions, we have that, for any \(y\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\),

$$\begin{aligned} f_2(y) \ge f_2(x^{k}) + \langle \xi ^{k}, y - x^{k} \rangle . \end{aligned}$$
(13)

Assume for a moment that \(\Vert \xi ^k\Vert \ne 0\). Letting \(\{d^{k}\}_{k=0}^{\infty }\) be the sequence given by \(d^{k} = \xi ^{k}/\Vert \xi ^{k}\Vert\) and substituting \(x^{k} + d^{k} = x^{k} + \xi ^{k}/\Vert \xi ^{k}\Vert\) for y in (13), we obtain

$$\begin{aligned} f_2(x^{k} + d^{k}) \ge f_2(x^{k}) + \left\langle \xi ^{k}, d^k \right\rangle = f_2(x^{k}) + \Vert \xi ^{k}\Vert , \end{aligned}$$

which is also true when \(\Vert \xi ^k\Vert = 0\) by defining \(d^k = 0\). By taking \(k\rightarrow \infty\), we obtain

$$\begin{aligned} \limsup _{k\rightarrow \infty }\Vert \xi ^{k}\Vert \le \limsup _{k\rightarrow \infty }\left( f_2(x^{k} + d^{k}) - f_2(x^{k}) \right) . \end{aligned}$$
(14)

We can take a compact set S such that \(\{x^{k} + d^{k}\}_{k=0}^{\infty } \subset S\), since \(\{x^{k} + d^{k}\}_{k=0}^{\infty }\) is bounded. Take \({\bar{x}} \in \mathop {\text {argmax}}\nolimits _{x \in S} f_2(x)\), which exists since \(f_2\) is continuous owing to its convexity on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). Since \(\{x^k\}_{k=0}^{\infty }\) is bounded, it holds that

$$\begin{aligned} \limsup _{k\rightarrow \infty }\left( f_2(x^{k} + d^{k}) - f_2(x^{k}) \right) \le f_2({\bar{x}}) - {\bar{f}}_2 < \infty , \end{aligned}$$
(15)

for some value \({\bar{f}}_2 \le f_2(x^k),\ k\ge 0\). Together, (14) and (15) contradict \(\Vert \xi ^{k}\Vert \rightarrow \infty\).

(iii) From (10), we obtain

$$\begin{aligned} \varPsi (x^{k-1}) - \varPsi (x^k)&\ge \left( \frac{1}{\lambda } - L \right) D_h(x^k, x^{k-1})\nonumber \\&\ge \left( \frac{1}{\lambda } - L \right) \frac{\sigma }{2}\Vert x^k - x^{k-1}\Vert ^2, \end{aligned}$$
(16)

where the last inequality holds since h is a \(\sigma\)-strongly convex function from Assumption 4 (i). Summing the above inequality from \(k=1\) to \(\infty\), we obtain

$$\begin{aligned} \left( \frac{1}{\lambda } - L \right) \sum _{k=1}^{\infty }\frac{\sigma }{2}\Vert x^k - x^{k-1}\Vert ^2 \le \varPsi (x^0) - \liminf _{n\rightarrow \infty }\varPsi (x^n) \le \varPsi (x^0) - v({\mathcal {P}}) < \infty , \end{aligned}$$

which shows that \(\lim _{k\rightarrow \infty }\Vert x^{k+1} - x^k\Vert = 0\).

(iv) Let \({\tilde{x}}\) be an accumulation point of \(\{x^k\}_{k=0}^{\infty }\) and let \(\{x^{k_j}\}\) be a subsequence such that \(\lim _{j\rightarrow \infty }x^{k_j} = {\tilde{x}}\). Then, from the first-order optimality condition of subproblem (5) under Assumption 2, we have

$$\begin{aligned} 0 \in \partial g(x^{k_j+1}) + \nabla f_1(x^{k_j}) - \xi ^{k_j} + \frac{1}{\lambda }\left( \nabla h(x^{k_j + 1}) - \nabla h(x^{k_j}) \right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \xi ^{k_j} + \frac{1}{\lambda }\left( \nabla h(x^{k_j}) - \nabla h(x^{k_j+1}) \right) \in \partial g(x^{k_j+1}) + \nabla f_1(x^{k_j}). \end{aligned}$$
(17)

From the boundedness of \(\{x^{k_j}\}\) and the Lipschitz continuity of \(\nabla h\) on a bounded subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\), there exists \(A_0 > 0\) such that

$$\begin{aligned} \left\| \frac{1}{\lambda }\left( \nabla h(x^{k_j}) - \nabla h(x^{k_j+1}) \right) \right\|&\le \frac{A_0}{\lambda }\Vert x^{k_j+1} - x^{k_j} \Vert . \end{aligned}$$

Therefore, using \(\Vert x^{k_j+1} - x^{k_j}\Vert \rightarrow 0\), we obtain

$$\begin{aligned} \frac{1}{\lambda }\left( \nabla h(x^{k_j}) - \nabla h(x^{k_j+1}) \right) \rightarrow 0. \end{aligned}$$
(18)

Note that the sequence \(\{\xi ^{k_j}\}\) is bounded due to (ii). Thus, by taking the limit as \(j\rightarrow \infty\) (or, more precisely, along a further subsequence), we can assume without loss of generality that \(\lim _{j\rightarrow \infty }\xi ^{k_j} =:{\tilde{\xi }}\) exists, and it belongs to \(\partial _{\mathrm {c}} f_2({\tilde{x}})\) since \(f_2\) is continuous owing to its convexity on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). Using this and (18), we can take the limit of (17). Using \(\Vert x^{k_j+1} - x^{k_j}\Vert \rightarrow 0\) and invoking the lower semicontinuity of g and the continuity of \(\nabla f_1\), we obtain \({\tilde{\xi }}\in \partial g({\tilde{x}})+\nabla f_1({\tilde{x}})\). Therefore, \(0 \in \partial g({\tilde{x}}) + \nabla f_1({\tilde{x}}) - \partial _{\mathrm {c}} f_2({\tilde{x}})\), which shows that \({\tilde{x}}\) is a limiting critical point of \(({\mathcal {P}})\). \(\square\)

We can estimate the objective value at an accumulation point from \(\liminf _{j\rightarrow \infty }\varPsi (x^{k_j})\) and \(\limsup _{j\rightarrow \infty }\varPsi (x^{k_j})\). Consequently, we can prove that \(\varPsi\) is constant on the set of accumulation points of BPDCA.

Proposition 2

Suppose that Assumptions 12, and 4 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the following statements hold:

  1. (i)

    \(\zeta := \lim _{k\rightarrow \infty }\varPsi (x^k)\) exists.

  2. (ii)

    \(\varPsi \equiv \zeta\) on \(\varOmega\), where \(\varOmega\) is the set of accumulation points of \(\{x^k\}_{k=0}^{\infty }\).

Proof

(i) From Assumption 1 (v) and Proposition 1 (i), the sequence \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is bounded from below and non-increasing. Consequently, \(\zeta := \lim _{k\rightarrow \infty }\varPsi (x^k)\) exists.

(ii) Take any \({\hat{x}}\in \varOmega\), that is \(\lim _{j\rightarrow \infty }x^{k_j}={\hat{x}}\). From (5), it follows that

$$\begin{aligned} g(x^{k})&+ \langle \nabla f_1(x^{k-1}) - \xi ^{k-1}, x^{k} - x^{k-1}\rangle + \frac{1}{\lambda }D_h(x^{k}, x^{k-1})\\&\le g({\hat{x}}) + \langle \nabla f_1(x^{k-1}) - \xi ^{k-1}, {\hat{x}} - x^{k-1}\rangle + \frac{1}{\lambda }D_h({\hat{x}}, x^{k-1}). \end{aligned}$$

From the above inequality and the gradient inequality for the convex function \(f_1\) at \(x^k\), we obtain

$$\begin{aligned} g(x^{k}) + f_1(x^k) \le g({\hat{x}})&+ \langle \nabla f_1(x^{k-1}) - \xi ^{k-1}, {\hat{x}} - x^{k}\rangle + \frac{1}{\lambda }D_h({\hat{x}}, x^{k-1}) - \frac{1}{\lambda }D_h(x^{k}, x^{k-1})\\&+ f_1({\hat{x}})+\langle \nabla f_1(x^k),x^k-{\hat{x}}\rangle . \end{aligned}$$

Substituting \(k_j\) for k and letting j tend to \(\infty\), we have, from Proposition 1 (ii),

$$\begin{aligned} \limsup _{j\rightarrow \infty } \left( g(x^{k_j}) + f_1(x^{k_j}) \right) \le g({\hat{x}}) + f_1({\hat{x}}), \end{aligned}$$

which, from the continuity of \(- f_2\) (recall that \(f_2\) is convex on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\)), provides \(\limsup _{j\rightarrow \infty }\varPsi (x^{k_j})\le \varPsi ({\hat{x}})\). Combining this with the lower semicontinuity of \(\varPsi\) yields \(\varPsi (x^{k_j})\rightarrow \varPsi ({\hat{x}})\) as \(j\rightarrow \infty\); hence \(\varPsi ({\hat{x}}) = \zeta\) by (i). Since \({\hat{x}}\in \varOmega\) is arbitrary, we conclude that \(\varPsi \equiv \zeta\) on \(\varOmega\). \(\square\)

To discuss the global convergence of BPDCA, we will suppose either of the following two assumptions.

Assumption 5

\(f_2\) is continuously differentiable on an open set \({\mathcal {N}}_0\subset {{\,\mathrm{{\mathbb {R}}}\,}}^d\) that contains the set of all limiting critical points of \(\varPsi\), i.e., \({\mathcal {X}}\). Furthermore, \(\nabla f_2\) is locally Lipschitz continuous on \({\mathcal {N}}_0\).

Assumption 6

g is differentiable on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) and \(\nabla g\) is locally Lipschitz continuous on an open set \({\mathcal {N}}_0\subset {{\,\mathrm{{\mathbb {R}}}\,}}^d\) that contains the set of all limiting stationary points of \(-\varPsi\).

Assumption 5 is nonrestrictive because many functions in [28], including the \(f_2\) in our numerical experiments, satisfy it. Thus, let us discuss the global convergence of Algorithm 1 under Assumption 5 by following the argument presented in [28]. Note that every limiting critical point is a limiting stationary point from the differentiability of \(f_2\) under Assumption 5.

Theorem 2

(Global convergence of BPDCA under the local differentiability of \(f_2\)) Suppose that Assumptions 1, 2, 4, and 5 hold and that \(\varPsi\) is a KL function. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the following statements hold:

  1. (i)

    \(\lim _{k\rightarrow \infty }{{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) = 0\).

  2. (ii)

    The sequence \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting stationary point of \(({\mathcal {P}})\); moreover, \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\).

Proof

(i) Since \(\{x^k\}_{k=0}^{\infty }\) is bounded and \(\varOmega\) is the set of accumulation points of \(\{x^k\}_{k=0}^{\infty }\), we have

$$\begin{aligned} \lim _{k\rightarrow \infty }{{\,\mathrm{dist}\,}}(x^k, \varOmega ) = 0. \end{aligned}$$
(19)

From Theorem 1 (iv), we also have \(\varOmega \subseteq {\mathcal {X}}\). Thus, for any \(\mu >0\), there exists \(k_0>0\) such that \({{\,\mathrm{dist}\,}}(x^k, \varOmega )<\mu\) and \(x^k\in {\mathcal {N}}_0\) for any \(k \ge k_0\), where \({\mathcal {N}}_0\) is defined in Assumption 5. Moreover, since \(\varOmega\) is compact owing to the boundedness of \(\{x^k\}_{k=0}^{\infty }\), by decreasing \(\mu\) if needed, we can suppose without loss of generality that \(\nabla f_2\) is globally Lipschitz continuous on \({\mathcal {N}}:=\{ x\in {\mathcal {N}}_0 \mid {{\,\mathrm{dist}\,}}(x, \varOmega )<\mu \}\).

The subdifferential of \(\varPsi\) at \(x^k\) for \(k \ge k_0\) is

$$\begin{aligned} \partial \varPsi (x^k) = \nabla f_1(x^k) - \nabla f_2(x^k) + \partial g(x^k). \end{aligned}$$
(20)

Moreover, considering the first-order optimality condition of subproblem (5), we have that, for any \(k \ge k_0 + 1\),

$$\begin{aligned} \frac{1}{\lambda }\left( \nabla h(x^{k-1}) - \nabla h(x^k) \right) - \nabla f_1(x^{k-1}) + \nabla f_2(x^{k-1}) \in \partial g(x^k), \end{aligned}$$

since \(f_2\) is \({\mathcal {C}}^1\) on \({\mathcal {N}}\) and \(x^{k-1}\in {\mathcal {N}}\) for any \(k \ge k_0 + 1\). Using the above and (20), we see that

$$\begin{aligned} \frac{1}{\lambda }\left( \nabla h(x^{k-1}) - \nabla h(x^k) \right) + \nabla f_1(x^k) - \nabla f_1(x^{k-1}) + \nabla f_2(x^{k-1}) - \nabla f_2(x^k) \in \partial \varPsi (x^k). \end{aligned}$$

From the global Lipschitz continuity of \(\nabla f_1, \nabla f_2\), and \(\nabla h\), there exists \(A_1 > 0\) such that

$$\begin{aligned} {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) \le A_1\Vert x^k - x^{k-1}\Vert , \end{aligned}$$
(21)

where \(k \ge k_0 + 1\). From Theorem 1 (iii), we conclude that \(\lim _{k\rightarrow \infty }{{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) = 0.\)

(ii) From Theorem 1 (iv), it is sufficient to prove that \(\{x^k\}_{k=0}^{\infty }\) is convergent. Here, consider the case in which there exists a positive integer \(k > 0\) such that \(\varPsi (x^k) = \zeta\). From Proposition 1 (i) and Proposition 2 (i), the sequence \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is non-increasing and converges to \(\zeta\). Hence, for any \({\hat{k}}\ge 0\), we have \(\varPsi (x^{k + {\hat{k}}}) = \zeta\). By recalling (16), we conclude that there exists a positive scalar \(A_2\) such that

$$\begin{aligned} \varPsi (x^{k-1}) - \varPsi (x^k) \ge A_2\Vert x^k - x^{k-1}\Vert ^2,\quad \forall k \in {\mathbb {N}}. \end{aligned}$$
(22)

From (22), we obtain \(x^k = x^{k+{\hat{k}}}\) for any \({\hat{k}}\ge 0\), meaning that \(\{x^k\}_{k=0}^{\infty }\) is finitely convergent.

Next, consider the case in which \(\varPsi (x^k) > \zeta\) for all \(k\ge 0\). Since \(\{x^k\}_{k=0}^{\infty }\) is bounded, \(\varOmega\) is a compact subset of \({{\,\mathrm{dom}\,}}\partial \varPsi\) and \(\varPsi \equiv \zeta\) on \(\varOmega\) from Proposition 2 (ii). From Lemma 2 and since \(\varPsi\) is a KL function, there exist a positive scalar \(\epsilon > 0\) and a continuous concave function \(\phi \in \varXi _{\eta }\) with \(\eta > 0\) such that

$$\begin{aligned} \phi '(\varPsi (x) - \zeta ) \cdot {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x)) \ge 1, \end{aligned}$$
(23)

for all \(x \in U\), where

$$\begin{aligned} U = \left\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ \big |\ {{\,\mathrm{dist}\,}}(x, \varOmega )< \epsilon \right\} \cap \left\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ \big |\ \zeta< \varPsi (x) < \zeta + \eta \right\} . \end{aligned}$$

From (19), there exists \(k_1 > 0\) such that \({{\,\mathrm{dist}\,}}(x^k, \varOmega ) < \epsilon\) for any \(k \ge k_1\). Since \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is non-increasing and converges to \(\zeta\), there exists \(k_2 > 0\) such that \(\zeta< \varPsi (x^k) < \zeta + \eta\) for all \(k \ge k_2\). Taking \({\bar{k}} = \max \{k_0+1, k_1, k_2\}\), then \(\{x^k\}_{k \ge {\bar{k}}}\) belongs to U. Hence, from (23), we obtain

$$\begin{aligned} \phi '(\varPsi (x^k) - \zeta ) \cdot {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) \ge 1,\quad \forall k \ge {\bar{k}}. \end{aligned}$$
(24)

Since \(\phi\) is a concave function, we see that for any \(k \ge {\bar{k}}\),

$$\begin{aligned} \left[ \phi (\varPsi (x^k) - \zeta ) - \phi (\varPsi (x^{k+1}) - \zeta ) \right]&\cdot {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k))\\&\ge \phi '(\varPsi (x^k) - \zeta ) \cdot {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) \cdot \left( \varPsi (x^k) - \varPsi (x^{k+1}) \right) \\&\ge \varPsi (x^k) - \varPsi (x^{k+1})\\&\ge A_2\Vert x^{k+1} - x^k\Vert ^2, \end{aligned}$$

where the second inequality holds from (24) and the fact that \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is non-increasing, and the last inequality holds from (22). From the above inequality and (21), we obtain

$$\begin{aligned} \Vert x^{k+1} - x^k\Vert ^2 \le \frac{A_1}{A_2} \left( \phi (\varPsi (x^k) - \zeta ) - \phi (\varPsi (x^{k+1}) - \zeta ) \right) \Vert x^k - x^{k-1}\Vert . \end{aligned}$$
(25)

Taking the square root of (25) and using the inequality of the arithmetic and geometric means, we find that

$$\begin{aligned} \Vert x^{k+1} - x^k \Vert&\le \sqrt{\frac{A_1}{A_2} \left( \phi (\varPsi (x^k) - \zeta ) - \phi (\varPsi (x^{k+1}) - \zeta ) \right) } \cdot \sqrt{\Vert x^k - x^{k-1}\Vert }\\&\le \frac{A_1}{2A_2} \left( \phi (\varPsi (x^k) - \zeta ) - \phi (\varPsi (x^{k+1}) - \zeta ) \right) + \frac{1}{2}\Vert x^k - x^{k-1}\Vert . \end{aligned}$$

This shows that

$$\begin{aligned} \frac{1}{2}\Vert x^{k+1} - x^k \Vert \le \frac{A_1}{2A_2} \left( \phi (\varPsi (x^k) - \zeta ) - \phi (\varPsi (x^{k+1}) - \zeta ) \right) + \frac{1}{2}\Vert x^k - x^{k-1}\Vert - \frac{1}{2}\Vert x^{k+1} - x^k \Vert . \end{aligned}$$
(26)

Summing (26) from \(k={\bar{k}}\) to \(\infty\), we have

$$\begin{aligned} \sum _{k={\bar{k}}}^{\infty }\Vert x^{k+1} - x^k \Vert \le \frac{A_1}{A_2} \phi (\varPsi (x^{{\bar{k}}}) - \zeta ) + \Vert x^{{\bar{k}}} - x^{{\bar{k}}-1}\Vert < \infty , \end{aligned}$$

which implies that \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\), i.e., the sequence \(\{x^k\}_{k=0}^{\infty }\) is a Cauchy sequence. Thus, \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting critical point of \(({\mathcal {P}})\) from Theorem 1 (iv). Because every limiting critical point is a limiting stationary point from the differentiability of \(f_2\), \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting stationary point of \(({\mathcal {P}})\). \(\square\)

Next, suppose that Assumption 6 holds instead of Assumption 5. Here, we can show the global convergence of BPDCA by referring to [16, Theorem 3.4]. We will use subanalyticity instead of the KL property in the proof.

Theorem 3

(Global convergence of BPDCA under the local differentiability of g) Suppose that Assumptions 124, and 6 hold and that \(\varPsi\) is subanalytic. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the sequence \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting critical point of \(({\mathcal {P}})\); moreover, \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\).

Proof

Since g is differentiable, g is continuous on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). Since the convexity of \(f_1\) and \(f_2\) implies their continuity, \(\varPsi\) is continuous on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\).

Let \(\{\xi ^k\}_{k=0}^{\infty }\) on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) be a sequence of subgradients of \(f_2\). From Theorem 1 (i) and (ii), \(\{x^k\}_{k=0}^{\infty }\) and \(\{\xi ^k\}_{k=0}^{\infty }\) are bounded. Let \({\tilde{x}}\) be a limiting stationary point of \(-\varPsi\) and \(B({\tilde{x}}, \epsilon _0)\) be an open ball with center \({\tilde{x}}\) and radius \(\epsilon _0 > 0\). Since \(\nabla g\) is locally Lipschitz continuous, and Assumption 4 (ii) holds, for \(\lambda > 0\), there exist \(\kappa _0 > 0\) and \(\epsilon _0 > 0\) such that

$$\begin{aligned} \left\| \nabla \left( g + \frac{1}{\lambda }h\right) (u) - \nabla \left( g + \frac{1}{\lambda }h\right) (v)\right\| \le \kappa _0\Vert u-v\Vert , \quad \forall u, v \in B({\tilde{x}}, \epsilon _0). \end{aligned}$$
(27)

From Assumption 1 (v), \(-\varPsi\) is finite. In addition, by recalling the continuity and subanalyticity of \(-\varPsi\) on \(B({\tilde{x}}, \epsilon _0)\), we can apply [6, Theorem 3.1] to the subanalytic function \(-\varPsi\) and obtain \(\nu _0 > 0\) and \(\theta _0\in [0, 1)\) such that

$$\begin{aligned} |\varPsi (u) - \zeta |^{\theta _0} \le \nu _0\Vert {\hat{x}}\Vert ,\quad \forall u\in B({\tilde{x}}, \epsilon _0),\quad {\hat{x}}\in \partial (-\varPsi )(u), \end{aligned}$$
(28)

where \(\zeta = \varPsi ({\tilde{x}})\).

Let \(\varOmega\) be the set of accumulation points of \(\{x^k\}_{k=0}^{\infty }\). Since \(\varOmega\) is compact, \(\varOmega\) can be covered by a finite number of \(B({\tilde{x}}_j, \epsilon _j)\) with \({\tilde{x}}_j\in \varOmega\) and \(\epsilon _j>0\), \(j=1,\ldots ,p\). From Theorem 1 (iv), \({\tilde{x}}_j\in \varOmega , j=1,\ldots ,p\) are limiting critical points of \(({\mathcal {P}})\). Hence, (27) with \(\kappa _j>0\) and \(\epsilon _j>0\) and (28) with \(\nu _j > 0\) and \(\theta _j\in [0, 1)\) hold for \(j=1,\ldots ,p\). Letting \(\epsilon > 0\) be a sufficiently small constant, we obtain

$$\begin{aligned} \left\{ x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\ \big |\ {{\,\mathrm{dist}\,}}(x, \varOmega ) < \epsilon \right\} \subset \bigcup _{j=1}^{p}B({\tilde{x}}_j, \epsilon _j). \end{aligned}$$

From (19), there exists \(k_1 > 0\) such that \({{\,\mathrm{dist}\,}}(x^k, \varOmega ) < \epsilon\) for any \(k \ge k_1\); hence, \(x^k\in \bigcup _{j=1}^{p}B({\tilde{x}}_j, \epsilon _j)\) for any \(k \ge k_1\). From Theorem 1 (iii), letting \({\bar{\epsilon }} > 0\) be a sufficiently small constant, there exists \(k_2 > 0\) such that \(\Vert x^k - x^{k+1}\Vert \le \frac{{\bar{\epsilon }}}{2}\) for any \(k \ge k_2\). Therefore, redefining \({\bar{\epsilon }}, \epsilon _j, j=1,\ldots ,p\) and relabeling if necessary, we can assume without loss of generality that

$$\begin{aligned} x^k \in \bigcup _{j=1}^{p}B\left({\tilde{x}}_j, \frac{\epsilon _j}{2}\right) \quad \mathrm {and} \quad \Vert x^k - x^{k+1}\Vert \le \frac{{\bar{\epsilon }}}{2},\quad \forall k \ge {\bar{k}}, \end{aligned}$$

where \({\bar{\epsilon }} = \min _{j=1,\ldots ,p}\epsilon _j\) and \({\bar{k}} = \max \{k_1, k_2\}\), which implies \(x^k\in B({\tilde{x}}_{j_k}, \epsilon _{j_k}/2),\ j_k \in \{1, \ldots , p\}\) and hence \(x^{k+1}\in B({\tilde{x}}_{j_k}, \epsilon _{j_k})\). Thus, from (27) and (28), we have

$$\begin{aligned}&\left\| \nabla \left( g + \frac{1}{\lambda }h\right) (x^k) - \nabla \left( g + \frac{1}{\lambda }h\right) (x^{k+1})\right\| \le \kappa \Vert x^k-x^{k+1}\Vert , \end{aligned}$$
(29)
$$\begin{aligned}&|\varPsi (x^k) - \zeta |^{\theta } \le \nu \Vert {\hat{x}}^k\Vert ,\quad {\hat{x}}^k\in \partial (-\varPsi )(x^k), \quad \forall k \ge {\bar{k}}, \end{aligned}$$
(30)

where \(\kappa = \max _{j=1,\ldots ,p}\kappa _j, \nu = \max _{j=1,\ldots ,p}\nu _j\), and \(\theta = \max _{j=1,\ldots ,p}\theta _j\). From (5), we find that

$$\begin{aligned} 0 = \nabla g(x^{k+1}) + \nabla f_1(x^{k}) - \xi ^{k} + \frac{1}{\lambda }\left( \nabla h(x^{k + 1}) - \nabla h(x^{k}) \right) , \end{aligned}$$

which implies

$$\begin{aligned} \nabla g(x^{k+1}) - \nabla g(x^k) + \frac{1}{\lambda }\left( \nabla h(x^{k + 1}) - \nabla h(x^{k}) \right) = \xi ^k - \nabla f_1(x^{k}) - \nabla g(x^k) \in \partial (-\varPsi )(x^k), \end{aligned}$$

where we have used \(\partial (-\varPsi )(x^k) = \partial _{\mathrm {c}} f_2(x^k) - \nabla f_1(x^k) - \nabla g(x^k)\), which comes from the convexity of \(f_2\). Using (29) and (30), we obtain

$$\begin{aligned} |\varPsi (x^k) - \zeta |^{\theta }\le \nu \left\| \nabla \left( g + \frac{1}{\lambda }h\right) (x^k) - \nabla \left( g + \frac{1}{\lambda }h\right) (x^{k+1})\right\| \le \kappa \nu \Vert x^k-x^{k+1}\Vert , \quad \forall k \ge {\bar{k}}. \end{aligned}$$
(31)

Since the function \(t^{1-\theta }\) is concave on \([0, \infty )\) and \(\varPsi (x^k) - \zeta \ge 0\), using (16) and (31), we find that, for all \(k \ge {\bar{k}}\),

$$\begin{aligned} (\varPsi (x^k) - \zeta )^{1-\theta } - (\varPsi (x^{k+1}) - \zeta )^{1-\theta }&{}\ge (1-\theta )(\varPsi (x^k) - \zeta )^{-\theta }(\varPsi (x^k) - \varPsi (x^{k+1}))\nonumber \\&{}\ge \frac{1-\theta }{\kappa \nu \Vert x^k-x^{k+1}\Vert }\left( \frac{1}{\lambda } - L \right) \frac{\sigma }{2}\Vert x^k-x^{k+1}\Vert ^2\nonumber \\&{}= \frac{(1-\theta )\sigma }{2\kappa \nu }\left( \frac{1}{\lambda } - L \right) \Vert x^k-x^{k+1}\Vert . \end{aligned}$$
(32)

Summing (32) from \(k={\bar{k}}\) to \(\infty\) yields

$$\begin{aligned} \sum _{k={\bar{k}}}^{\infty }\Vert x^k-x^{k+1}\Vert \le \frac{2\kappa \nu }{(1/\lambda - L)(1-\theta )\sigma }(\varPsi (x^{{\bar{k}}}) - \zeta )^{1-\theta } < \infty , \end{aligned}$$

which implies that \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\), i.e., the sequence \(\{x^k\}_{k=0}^{\infty }\) is a Cauchy sequence. Thus, \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting critical point of \(({\mathcal {P}})\) from Theorem 1 (iv). \(\square\)

Finally, we will show the rate of convergence in a manner following [1, 28].

Theorem 4

(Rate of convergence under the local differentiability of \(f_2\)) Suppose that Assumptions 124, and 5 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\) and suppose that \(\{x^k\}_{k=0}^{\infty }\) converges to some \({\tilde{x}}\in {\mathcal {X}}\). Suppose further that \(\varPsi\) is a KL function with \(\phi\) in the KL inequality (4) taking the form \(\phi (s) = cs^{1-\theta }\) for some \(\theta \in [0, 1)\) and \(c > 0\). Then, the following statements hold:

  1. (i)

    If \(\theta = 0\), then there exists \(k_0 > 0\) such that \(x^k\) is constant for \(k > k_0\);

  2. (ii)

    If \(\theta \in (0, \frac{1}{2}]\), then there exist \(c_1> 0, k_1 > 0\), and \(\eta \in (0, 1)\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_1\eta ^k\) for \(k > k_1\);

  3. (iii)

    If \(\theta \in (\frac{1}{2}, 1)\), then there exist \(c_2 > 0\) and \(k_2 > 0\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_{2}k^{-\frac{1-\theta }{2\theta -1}}\) for \(k > k_2\).

Proof

(i) For the case of \(\theta = 0\), we will prove that there exists an integer \(k_0 > 0\) such that \(\varPsi (x^{k_0}) = \zeta\) by assuming to the contrary that \(\varPsi (x^k) > \zeta\) for all \(k > 0\) and showing a contradiction. The sequence \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) converges to \(\zeta\) due to Proposition 2 (i). In addition, from the KL inequality (24) and \(\phi '(\cdot ) = c\), we can see that for all sufficiently large k,

$$\begin{aligned} {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) \ge \frac{1}{c}, \end{aligned}$$

which contradicts Theorem 2 (i). Therefore, there exists \(k_0 > 0\) such that \(\varPsi (x^{k_0}) = \zeta\). Since \(\{\varPsi (x^k)\}_{k=0}^{\infty }\) is non-increasing and converges to \(\zeta\), we have \(\varPsi (x^{k_0+{\bar{k}}}) = \zeta\) for all \({\bar{k}} \ge 0\). This, together with (22), leads us to conclude that there exists \(k_0 > 0\) such that \(x^k\) is constant for \(k > k_0\).

(ii–iii) Next, consider the case \(\theta \in (0, 1)\). If there exists \(k_0 > 0\) such that \(\varPsi (x^{k_0}) = \zeta\), then we can show that the sequence \(\{x^k\}_{k=0}^{\infty }\) is finitely convergent in the same way as in the proof of (i). Therefore, for \(\theta \in (0, 1)\), we only need to consider the case that \(\varPsi (x^k) > \zeta\) for all \(k > 0\).

Define \(R_k = \varPsi (x^k) - \zeta\) and \(S_k = \sum _{j=k}^{\infty }\Vert x^{j+1} - x^j\Vert\), where \(S_k\) is well-defined due to Theorem 2 (ii). From (26), for any \(k \ge {\bar{k}}\), where \({\bar{k}}\) is defined in (24), we obtain

$$\begin{aligned} S_k&= 2\sum _{j=k}^{\infty }\frac{1}{2}\Vert x^{j+1}-x^j\Vert \nonumber \\&\le 2\sum _{j=k}^{\infty }\left[ \frac{A_1}{2A_2} \left( \phi (\varPsi (x^j) - \zeta ) - \phi (\varPsi (x^{j+1}) - \zeta ) \right) + \frac{1}{2}\Vert x^j - x^{j-1}\Vert - \frac{1}{2}\Vert x^{j+1} - x^j \Vert \right] \nonumber \\&\le \frac{A_1}{A_2}\phi (\varPsi (x^k) - \zeta ) + \Vert x^k - x^{k-1}\Vert \nonumber \\&= \frac{A_1}{A_2}\phi (R_k) + S_{k-1} - S_k. \end{aligned}$$
(33)

On the other hand, since \(\lim _{k\rightarrow \infty }x^k = {\tilde{x}}\) and \(\{\varPsi (x^k)\}\) is non-increasing and converges to \(\zeta\), the KL inequality (24) with \(\phi (s) = cs^{1-\theta }\) ensures that, for all sufficiently large k,

$$\begin{aligned} c(1-\theta )R_k^{-\theta }{{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k))\ge 1. \end{aligned}$$
(34)

From the definition of \(S_k\) and (21), we also have that, for all sufficiently large k,

$$\begin{aligned} {{\,\mathrm{dist}\,}}(0, \partial \varPsi (x^k)) \le A_1(S_{k-1} - S_k). \end{aligned}$$
(35)

Combining (34) and (35), we have \(R_k^{\theta } \le A_1 \cdot c(1-\theta ) (S_{k-1} - S_k)\) for all sufficiently large k. Raising the above inequality to the power of \(\frac{1-\theta }{\theta }\) and multiplying both sides by c, we find that \(cR_k^{1-\theta } \le c(A_1 \cdot c(1-\theta ) (S_{k-1} - S_k))^{\frac{1-\theta }{\theta }}\). Combining this with (33) and recalling \(\phi (R_k) = cR_k^{1-\theta }\), we find that, for all sufficiently large k,

$$\begin{aligned} S_k \le A_3 (S_{k-1} - S_k)^{\frac{1-\theta }{\theta }} + S_{k-1} - S_k, \end{aligned}$$
(36)

where \(A_3 = \frac{A_1}{A_2}c(A_1 \cdot c(1-\theta ))^{\frac{1-\theta }{\theta }}\).

(ii) When \(\theta \in (0,\frac{1}{2}]\), we have \(\frac{1-\theta }{\theta }\ge 1\). Since \(\lim _{k\rightarrow \infty }\Vert x^{k+1} - x^k\Vert = 0\) by Theorem 1 (iii), we have \(\lim _{k\rightarrow \infty }(S_{k-1} - S_k) = 0\). From these considerations and (36), we conclude that there exists \(k_1 > 0\) such that for all \(k \ge k_1\), \(S_k \le (A_3 + 1)(S_{k-1} - S_k)\), which implies \(S_k \le \frac{A_3 + 1}{A_3 + 2} S_{k-1}\). Therefore, for all \(k \ge k_1\),

$$\begin{aligned} \Vert x^k - {\tilde{x}}\Vert \le \sum _{j=k}^{\infty }\Vert x^{j+1} - x^j\Vert = S_k \le S_{k_1-1}\left( \frac{A_3 + 1}{A_3 + 2} \right) ^{k - k_1 + 1}. \end{aligned}$$

(iii) For \(\theta \in ( \frac{1}{2}, 1)\), \(\frac{1-\theta }{\theta } < 1\). From (36) and \(\lim _{k\rightarrow \infty }(S_{k-1} - S_k) = 0\), there exists \(k_2 > 0\) such that

$$\begin{aligned} S_k \le A_3 (S_{k-1} - S_k)^{\frac{1-\theta }{\theta }} + S_{k-1} - S_k&\le A_3 (S_{k-1} - S_k)^{\frac{1-\theta }{\theta }} + (S_{k-1} - S_k)^{\frac{1-\theta }{\theta }}\\&\le (A_3 + 1)(S_{k-1} - S_k)^{\frac{1-\theta }{\theta }}, \end{aligned}$$

for all \(k \ge k_2\). Raising the above inequality to the power of \(\frac{\theta }{1-\theta }\), for any \(k \ge k_2\) we find that \(S_k^{\frac{\theta }{1-\theta }} \le A_4(S_{k-1} - S_k)\), where \(A_4 = (A_3 + 1)^{\frac{\theta }{1-\theta }}\). From [1, Theorem 2], there exists \(A_5 > 0\) such that \(S_k \le A_5 k^{-\frac{1 - \theta }{2\theta - 1}}\) for all sufficiently large k. Since \(\Vert x^k - {\tilde{x}}\Vert \le S_k\), the desired bound follows. \(\square\)

Using Theorem 3, we can obtain another rate of convergence in the same way as in the proof of [1, Theorem 2] or [16, Theorem 3.5].

Theorem 5

(Rate of convergence under the local differentiability of g) Suppose that Assumptions 1, 2, 4, and 6 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCA with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\) and suppose that \(\{x^k\}_{k=0}^{\infty }\) converges to some \({\tilde{x}}\in {\mathcal {X}}\). Suppose further that \(\varPsi\) is subanalytic. Let \(\theta \in [0, 1)\) be a Łojasiewicz exponent of \({\tilde{x}}\). Then, the following statements hold:

  1. (i)

    If \(\theta = 0\), then there exists \(k_0 > 0\) such that \(x^k\) is constant for \(k > k_0\);

  2. (ii)

    If \(\theta \in (0, \frac{1}{2}]\), then there exist \(c_1> 0, k_1 > 0\), and \(\eta \in (0, 1)\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_1\eta ^k\) for \(k > k_1\);

  3. (iii)

    If \(\theta \in (\frac{1}{2}, 1)\), then there exist \(c_2 > 0\) and \(k_2 > 0\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_{2}k^{-\frac{1-\theta }{2\theta -1}}\) for \(k > k_2\).

4.3 Properties of BPDCAe

Inspired by [29], we introduce the auxiliary function,

$$\begin{aligned} H_M(x, y) = \varPsi (x) + MD_h(y, x),\quad M>0. \end{aligned}$$
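For concreteness, the auxiliary function and the Bregman distance it involves can be evaluated as in the following minimal Python sketch, assuming user-supplied callables for \(\varPsi\), h, and \(\nabla h\); the function names are ours and purely illustrative.

```python
import numpy as np

def bregman_distance(h, grad_h, y, x):
    """Bregman distance D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>."""
    return h(y) - h(x) - np.dot(grad_h(x), y - x)

def aux_H(Psi, h, grad_h, M, x, y):
    """Auxiliary function H_M(x, y) = Psi(x) + M * D_h(y, x)."""
    return Psi(x) + M * bregman_distance(h, grad_h, y, x)
```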

To establish a decrease property for \(H_M\), rather than for \(\varPsi\), along the sequence \(\{x^k\}_{k=0}^{\infty }\), we further assume the convexity of g.

Assumption 7

The function g is convex.

Under the adaptive restart scheme (see (8)), we can show that \(H_M\) is non-increasing.

Lemma 6

Suppose that Assumptions 1, 3, and 7 hold. For any \(x^k, y^k \in C = {{\,\mathrm{int}\,}}{{\mathrm{dom}\,}}h\) and any \(x^{k+1} \in C\) defined by

$$\begin{aligned} x^{k+1} \in \mathop {\text {argmin}}\limits _{y \in {{\,\mathrm{{\mathbb {R}}}\,}}^d} \left\{ g(y) + \langle \nabla f_1(y^k) - \xi ^k, y - y^k \rangle + \frac{1}{\lambda }D_h(y, y^k) \right\} , \end{aligned}$$
(37)

where \(\xi ^k \in \partial _{\mathrm {c}} f_2(x^k)\), \(y^k = x^k + \beta _k(x^k - x^{k-1}), \lambda > 0\), and \(\{\beta _k\}_{k=0}^{\infty }\subset [0,1)\), it holds that

$$\begin{aligned} \lambda \varPsi (x^{k+1}) \le \lambda \varPsi (x^k) + D_h(x^k, y^k) - D_h(x^k, x^{k+1}) - (1 - \lambda L) D_h(x^{k+1}, y^k). \end{aligned}$$
(38)

Furthermore, when \(0< \lambda L < 1\) and \(\{\beta _k\}_{k=0}^\infty\) is given by the adaptive restart scheme (8),

$$\begin{aligned} H_M(x^{k+1}, x^{k}) \le {}&H_M(x^k, x^{k-1}) - \left( \frac{1}{\lambda } - M \right) D_h(x^k, x^{k+1})\nonumber \\&- \left( M - \frac{\rho }{\lambda } \right) D_h(x^{k-1}, x^k) - \left( \frac{1}{\lambda }-L\right) D_h(x^{k+1},y^k). \end{aligned}$$
(39)

In addition, when \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for \(\rho \in [0, 1)\), the sequence \(\{H_M(x^{k+1}, x^{k})\}_{k=0}^{\infty }\) is non-increasing.

Proof

From the first-order optimality condition for (37), we obtain

$$\begin{aligned} 0 \in \partial _{\mathrm {c}} g(x^{k+1}) + \nabla f_1(y^k) - \xi ^k + \frac{1}{\lambda } (\nabla h(x^{k+1}) - \nabla h(y^k)). \end{aligned}$$

From the convexity of g, we find that

$$\begin{aligned} g(x^k) - g(x^{k+1}) \ge \langle -\nabla f_1(y^k) + \xi ^k - \frac{1}{\lambda } (\nabla h(x^{k+1}) - \nabla h(y^k)), x^k - x^{k+1} \rangle . \end{aligned}$$

Using the three-point identity (3) of the Bregman distances,

$$\begin{aligned} \frac{1}{\lambda }\langle \nabla h(x^{k+1}) - \nabla h(y^k), x^k - x^{k+1} \rangle = \frac{1}{\lambda } (D_h(x^k, y^k) - D_h(x^k, x^{k+1}) - D_h(x^{k+1}, y^k)), \end{aligned}$$

we have

$$\begin{aligned} g(x^k) - g(x^{k+1}) + f_1(x^k) - f_1(x^{k+1}) \ge {}&f_1(x^k) - f_1(x^{k+1}) + \langle - \nabla f_1(y^k) + \xi ^k, x^k - x^{k+1} \rangle \\&-\frac{1}{\lambda } \left( D_h(x^k, y^k) - D_h(x^k, x^{k+1}) - D_h(x^{k+1}, y^k)\right) . \end{aligned}$$

From the convexity of \(f_1\) and Lemma 1, we find that

$$\begin{aligned}&f_1(x^k) - f_1(x^{k+1}) - \langle \nabla f_1(y^k), x^k - x^{k+1} \rangle \\&\quad = f_1(x^k) - f_1(y^k) - \langle \nabla f_1(y^k), x^k - y^k \rangle - f_1(x^{k+1}) + f_1(y^k) + \langle \nabla f_1(y^k), x^{k+1} - y^k \rangle \\&\quad \ge -LD_h(x^{k+1}, y^k). \end{aligned}$$

The above inequalities and the definition of the subgradient for \(f_2\) lead us to

$$\begin{aligned} \varPsi (x^{k+1}) \le \varPsi (x^k) + \frac{1}{\lambda }D_h(x^k, y^k) - \frac{1}{\lambda }D_h(x^k, x^{k+1}) - \left( \frac{1}{\lambda } -L\right) D_h(x^{k+1}, y^k), \end{aligned}$$

which implies inequality (38). If \(\beta _k = 0\), then \(y^k = x^k\) and \(D_h(x^k, y^k) = 0\). If \(\beta _k \ne 0\), since we chose the adaptive restart scheme, there is a \(\rho \in [0, 1)\) satisfying \(D_h(x^k, y^k) \le \rho D_h(x^{k-1}, x^k)\). From the definition of \(H_M(x^k, x^{k-1})\) and \(0< \lambda L < 1\), we have

$$\begin{aligned} H_M(x^{k+1}, x^{k}) \le {}&H_M(x^k, x^{k-1}) + \frac{1}{\lambda }D_h(x^k, y^k) - \left( \frac{1}{\lambda } - M \right) D_h(x^k, x^{k+1})\nonumber \\&- M D_h(x^{k-1}, x^k) - \left( \frac{1}{\lambda }-L\right) D_h(x^{k+1},y^k)\nonumber \\ \le {}&H_M(x^k, x^{k-1}) - \left( \frac{1}{\lambda } - M \right) D_h(x^k, x^{k+1})\nonumber \\&- \left( M - \frac{\rho }{\lambda } \right) D_h(x^{k-1}, x^k) - \left( \frac{1}{\lambda }-L\right) D_h(x^{k+1},y^k), \end{aligned}$$
(40)

where the second inequality comes from \(D_h(x^k, y^k) \le \rho D_h(x^{k-1}, x^k)\). When \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\), we have

$$\begin{aligned} H_M(x^{k+1}, x^{k}) \le H_M(x^k, x^{k-1}), \quad \forall k \ge 0, \end{aligned}$$

which shows that the sequence \(\{H_M(x^k, x^{k-1})\}_{k=0}^{\infty }\) is non-increasing. \(\square\)
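A minimal Python sketch of one BPDCAe iteration with the adaptive restart test used above is given below; `solve_subproblem` stands for a hypothetical, problem-specific solver of (37), and `D_h` is a user-supplied Bregman distance.

```python
def bpdcae_step(x_prev, x_curr, beta, lam, rho,
                grad_f1, subgrad_f2, solve_subproblem, D_h):
    """One BPDCAe iteration (37) with the adaptive restart check.

    `solve_subproblem(y, v, lam)` is assumed to return
    argmin_z { g(z) + <v, z - y> + D_h(z, y) / lam }.
    """
    y = x_curr + beta * (x_curr - x_prev)            # extrapolated point y^k
    # Keep the extrapolation only if D_h(x^k, y^k) <= rho * D_h(x^{k-1}, x^k).
    if D_h(x_curr, y) > rho * D_h(x_prev, x_curr):
        y = x_curr.copy()                            # corresponds to beta_k = 0
    v = grad_f1(y) - subgrad_f2(x_curr)              # linearized DC part
    return solve_subproblem(y, v, lam), y
```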

We can use Lemma 6 to prove Proposition 3.

Proposition 3

Suppose that Assumptions 1, 3, and 7 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\). Assume that M in the auxiliary function \(H_M(x^k, x^{k-1})\) satisfies \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\). Then, the following statements hold:

  1. (i)

    The sequence \(\{H_M(x^k, x^{k-1})\}_{k=0}^{\infty }\) is non-increasing.

  2. (ii)

    \(\sum _{k=1}^{\infty }D_h(x^{k-1}, x^{k}) < \infty\); hence, the sequence \(\{D_h(x^{k-1}, x^{k})\}_{k=0}^{\infty }\) converges to zero.

  3. (iii)

    \(\min _{1\le k \le n}D_h(x^{k-1}, x^{k}) \le \frac{\lambda }{n(1-\rho )}\left( \varPsi (x^0) - \varPsi _* \right)\), where \(\varPsi _* = v({\mathcal {P}}) > -\infty\) (by Assumption 1 (v)).

Proof

(i) The statement was proved in Lemma 6.

(ii) Multiplying (40) by \(\lambda\) and rearranging, we obtain

$$\begin{aligned} \lambda ( H_M(x^{k+1}, x^{k}) - H_M(x^{k}, x^{k-1})) \le&-(1 - \lambda M) D_h(x^k, x^{k+1}) - (\lambda M - \rho )D_h(x^{k-1}, x^{k})\\&- (1-\lambda L) D_h(x^{k+1}, y^k)\\ \le&-(1 - \lambda M) D_h(x^k, x^{k+1}) - (\lambda M - \rho )D_h(x^{k-1}, x^{k}), \end{aligned}$$

where the last inequality comes from \((1-\lambda L) D_h(x^{k+1}, y^k) \ge 0\). Let n be a positive integer. Summing the above inequality from \(k=0\) to n and letting \(\varPsi _* = v({\mathcal {P}}) > - \infty\), we find that

$$\begin{aligned} \sum _{k=1}^n D_h(x^{k-1}, x^{k}) = \sum _{k=0}^n D_h(x^{k-1}, x^{k})&\le \frac{\lambda \left( H_M(x^0, x^{-1}) - H_M(x^{n+1}, x^{n}) \right) }{1 - \rho }\nonumber \\&\le \frac{\lambda \left( \varPsi (x^0) - \varPsi (x^{n+1}) \right) }{1 - \rho }\nonumber \\&\le \frac{\lambda \left( \varPsi (x^0) - \varPsi _* \right) }{1 - \rho }, \end{aligned}$$
(41)

where the second inequality comes from \(D_h(x^{-1}, x^0) = 0\), \(x^{-1} = x^0\), and \(D_h(x^{n}, x^{n+1}) \ge 0\). Note that \(x^{n+1} \in C\) by Assumption 3. By taking the limit as \(n \rightarrow \infty\), we arrive at the first statement of (ii). The second statement follows directly from the first.

(iii) From (41), we immediately have

$$\begin{aligned} n\min _{1\le k \le n}D_h(x^{k-1}, x^{k}) \le \sum _{k=1}^n D_h(x^{k-1}, x^{k}) \le \frac{\lambda \left( \varPsi (x^0) - \varPsi _*\right) }{1 - \rho }. \end{aligned}$$

\(\square\)

4.4 Convergence analysis of BPDCAe

The proofs of Theorems 6, 7, and 8, and Proposition 4 are given in the Appendix. They follow arguments that are similar to their BPDCA counterparts.

Theorem 6

(Global subsequential convergence of BPDCAe) Suppose that Assumptions 1, 3, 4, and 7 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Assume that M in the auxiliary function \(H_M(x^k, x^{k-1})\) satisfies \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\). Then, the following statements hold:

  1. (i)

    The sequence \(\{x^k\}_{k=0}^{\infty }\) is bounded.

  2. (ii)

    \(\lim _{k\rightarrow \infty }\Vert x^{k+1} - x^k\Vert = 0\).

  3. (iii)

    Any accumulation point of \(\{x^k\}_{k=0}^{\infty }\) is a limiting critical point of \(({\mathcal {P}})\).

Proposition 4

Suppose that Assumptions 1, 3, 4, and 7 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\), and suppose that M satisfies \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\). Then, the following statements hold:

  1. (i)

    \(\zeta := \lim _{k\rightarrow \infty }\varPsi (x^k)\) exists.

  2. (ii)

    \(\varPsi \equiv \zeta\) on \(\varOmega\), where \(\varOmega\) is the set of accumulation points of \(\{x^k\}_{k=0}^{\infty }\).

Since \(H_M(x, y)\) contains a Bregman distance term, the subdifferential of \(H_M(x, y)\) involves \(\nabla h\). To prove Theorem 7, we therefore additionally assume that the subdifferential of the gradient \(\nabla h\) is bounded on bounded sets [29].

Assumption 8

There exists a bounded u such that \(u \in \partial (\nabla h)\) on any bounded subset of \({{\,\mathrm{{\mathbb {R}}}\,}}^d\).

We can prove the following theorems by assuming either the KL property or the subanalyticity of the auxiliary function \(H_M(x, y)\) with respect to (x, y).

Theorem 7

(Global convergence of BPDCAe under the local differentiability of \(f_2\)) Suppose that Assumptions 1, 3, 4, 5, 7, and 8 hold and that the auxiliary function \(H_M(x, y)\), with M satisfying \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\), is a KL function. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the following statements hold:

  1. (i)

    \(\lim _{k\rightarrow \infty }{{\,\mathrm{dist}\,}}((0, 0), \partial H_M(x^k, x^{k-1})) = 0\).

  2. (ii)

    The set of accumulation points of \(\{(x^k, x^{k-1})\}_{k=0}^{\infty }\) is \(\varUpsilon := \left\{ (x, x)\ \big |\ x\in \varOmega \right\}\) and \(H_M\equiv \zeta\) on \(\varUpsilon\), where \(\varOmega\) is the set of accumulation points of \(\{x^k\}_{k=0}^{\infty }\).

  3. (iii)

    The sequence \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting stationary point of \(({\mathcal {P}})\); moreover, \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\).

Theorem 8

(Global convergence of BPDCAe under the local differentiability of g) Suppose that Assumptions 1, 3, 4, 6, 7, and 8 hold and that the auxiliary function \(H_M(x, y)\), with M satisfying \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\), is subanalytic. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\). Then, the sequence \(\{x^k\}_{k=0}^{\infty }\) converges to a limiting critical point of \(({\mathcal {P}})\); moreover, \(\sum _{k=1}^{\infty } \Vert x^k - x^{k-1}\Vert < \infty\).

Finally, we have theorems regarding the convergence rate of BPDCAe, whose proofs are almost identical to those of Theorems 4 and 5.

Theorem 9

(Rate of convergence under the local differentiability of \(f_2\)) Suppose that Assumptions 1, 3, 4, 5, 7, and 8 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\) and suppose that \(\{x^k\}_{k=0}^{\infty }\) converges to some \({\tilde{x}}\in {\mathcal {X}}\). Suppose further that the auxiliary function \(H_M(x, y)\), with M satisfying \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\), is a KL function with \(\phi\) in the KL inequality (4) taking the form \(\phi (s) = cs^{1-\theta }\) for some \(\theta \in [0, 1)\) and \(c > 0\). Then, the following statements hold:

  1. (i)

    If \(\theta = 0\), then there exists \(k_0 > 0\) such that \(x^k\) is constant for \(k > k_0\);

  2. (ii)

    If \(\theta \in (0, \frac{1}{2}]\), then there exist \(c_1> 0, k_1 > 0\), and \(\eta \in (0, 1)\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_1\eta ^k\) for \(k > k_1\);

  3. (iii)

    If \(\theta \in (\frac{1}{2}, 1)\), then there exist \(c_2 > 0\) and \(k_2 > 0\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_{2}k^{-\frac{1-\theta }{2\theta -1}}\) for \(k > k_2\).

Theorem 10

(Rate of convergence under the local differentiability of g) Suppose that Assumptions 1, 3, 4, 6, 7, and 8 hold. Let \(\{x^k\}_{k=0}^{\infty }\) be a sequence generated by BPDCAe with \(0< \lambda L < 1\) for solving \(({\mathcal {P}})\) and suppose that \(\{x^k\}_{k=0}^{\infty }\) converges to some \({\tilde{x}}\in {\mathcal {X}}\). Suppose further that the auxiliary function \(H_M(x, y)\), with M satisfying \(\frac{\rho }{\lambda } \le M \le \frac{1}{\lambda }\) for some \(\rho \in [0, 1)\), is subanalytic. Let \(\theta \in [0, 1)\) be a Łojasiewicz exponent of \({\tilde{x}}\). Then, the following statements hold:

  1. (i)

    If \(\theta = 0\), then there exists \(k_0 > 0\) such that \(x^k\) is constant for \(k > k_0\);

  2. (ii)

    If \(\theta \in (0, \frac{1}{2}]\), then there exist \(c_1> 0, k_1 > 0\), and \(\eta \in (0, 1)\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_1\eta ^k\) for \(k > k_1\);

  3. (iii)

    If \(\theta \in (\frac{1}{2}, 1)\), then there exist \(c_2 > 0\) and \(k_2 > 0\) such that \(\Vert x^k - {\tilde{x}}\Vert < c_{2}k^{-\frac{1-\theta }{2\theta -1}}\) for \(k > k_2\).

5 Applications

5.1 Application to phase retrieval

In phase retrieval, we are interested in finding a (parameter) vector \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\) that approximately solves the system,

$$\begin{aligned} \langle a_r, x \rangle ^2 \simeq b_r, \quad r=1, 2,\ldots ,m, \end{aligned}$$
(42)

where the vectors \(a_r\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\) describe the model and \(b=(b_1,b_2,\ldots ,b_m)^{\mathrm {T}}\) is a vector of (usually) noisy measurements. As described in [8, 10], the system (42) can be formulated as a nonconvex optimization problem:

$$\begin{aligned} \min \left\{ \varPsi (x) := \frac{1}{4}\sum _{r=1}^m\left( \langle a_r, x \rangle ^2 - b_r \right) ^2 + \theta g(x)\ \bigg|\ x \in {{\,\mathrm{{\mathbb {R}}}\,}}^d \right\} , \end{aligned}$$
(43)

where \(\theta \ge 0\) is a trade-off parameter between the data-fidelity term and the regularizer g. In particular, we take \(g:{{\,\mathrm{{\mathbb {R}}}\,}}^d \rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) to be \(g(x) = \Vert x\Vert _1\).

In this case, the underlying space of \(({\mathcal {P}})\) is \(C \equiv {{\,\mathrm{{\mathbb {R}}}\,}}^d\). Define \(f:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) as \(f(x) = \frac{1}{4} \sum _{r=1}^m \left( \langle a_r, x \rangle ^2 - b_r \right) ^2\), which is a nonconvex differentiable function whose gradient is not globally Lipschitz continuous. The objective function of the phase retrieval problem can also be reformulated as a difference of two convex functions, as in [14]. That is, \(f(x) = f_1(x) - f_2(x)\), where

$$\begin{aligned} f_1(x) = \frac{1}{4} \sum _{r=1}^m \langle a_r, x \rangle ^4 + \frac{1}{4}\Vert b\Vert ^2 \quad \mathrm {and} \quad f_2(x) = \frac{1}{2} \sum _{r=1}^m b_r \langle a_r, x \rangle ^2. \end{aligned}$$
(44)
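The following Python sketch evaluates \(f_1\), \(f_2\), and their gradients for the decomposition (44); the gradients \(\nabla f_1(x) = \sum _{r=1}^m \langle a_r, x \rangle ^3 a_r\) and \(\nabla f_2(x) = \sum _{r=1}^m b_r \langle a_r, x \rangle a_r\) follow by elementary calculus, while the function name and array layout are our own conventions.

```python
import numpy as np

def dc_parts(A, b, x):
    """DC decomposition (44) for phase retrieval: f = f1 - f2.

    A has shape (m, d) with rows a_r; b has shape (m,).
    """
    t = A @ x                                    # inner products <a_r, x>
    f1 = 0.25 * np.sum(t ** 4) + 0.25 * np.dot(b, b)
    f2 = 0.5 * np.dot(b, t ** 2)
    grad_f1 = A.T @ (t ** 3)                     # sum_r <a_r, x>^3 a_r
    grad_f2 = A.T @ (b * t)                      # sum_r b_r <a_r, x> a_r
    return f1, f2, grad_f1, grad_f2
```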

When we do not regard the phase retrieval problem (43) as a DC optimization problem, the Bregman Proximal Gradient algorithm (BPG) can be used instead [8]. Enhancements using the extrapolation technique have also been proposed: the Bregman Proximal Gradient algorithm with extrapolation (BPGe) [29] and the Convex-Concave Inertial BPG [20], which estimates L. For BPG(e), under the L-smad property of the pair \((f_1 - f_2, h)\) with \(h(x) = \frac{1}{4}\Vert x\Vert ^4 + \frac{1}{2}\Vert x\Vert ^2\), L must satisfy the following inequality [8, Lemma 5.1]:

$$\begin{aligned} L \ge \sum _{r=1}^m \left( 3\Vert a_r a_r^{\mathrm {T}}\Vert ^2 + \Vert a_r a_r^{\mathrm {T}}\Vert |b_r| \right) . \end{aligned}$$
(45)

On the other hand, for DC optimization problems, we define \(h:{{\,\mathrm{{\mathbb {R}}}\,}}^d\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\) as

$$\begin{aligned} h(x) = \frac{1}{4}\Vert x\Vert ^4. \end{aligned}$$
(46)

This kernel is simpler than the one used for the original nonconvex formulation. However, \(h(x) = \frac{1}{4} \Vert x\Vert ^4\) is not \(\sigma\)-strongly convex and therefore does not satisfy Assumption 4 (i).

Proposition 5

Let \(f_1\) and h be as defined above. Then, for any L satisfying

$$\begin{aligned} L \ge 3\left\| \sum _{r=1}^m \Vert a_r\Vert ^2 a_r a_r^{\mathrm {T}}\right\| , \end{aligned}$$
(47)

the function \(Lh-f_1\) is convex on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). Therefore, the pair \((f_1,h)\) is L-smad on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\).

Proof

Let \(x\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\). Since \(f_1\) and h are \({\mathcal {C}}^2\) on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\), to guarantee the convexity of \(Lh-f_1\), it is sufficient to find \(L > 0\) such that \(L\lambda _{\min }(\nabla ^2 h(x)) \ge \lambda _{\max }(\nabla ^2 f_1(x))\), where \(\lambda _{\min }(M)\) and \(\lambda _{\max }(M)\) denote the minimal and maximal eigenvalues of a matrix M, respectively. The Hessians of \(f_1\) and h are given by

$$\begin{aligned} \nabla ^2 f_1(x) = 3 \sum _{r=1}^m \langle a_r, x \rangle ^2 a_r a_r^{\mathrm {T}} \quad \mathrm {and} \quad \nabla ^2 h(x) = \Vert x\Vert ^2 I_d + 2xx^{\mathrm {T}}. \end{aligned}$$

Since \(\nabla ^2 h(x) \succeq \Vert x\Vert ^2 I_d\), we obtain \(\lambda _{\min }\left( \nabla ^2 h(x)\right) \ge \Vert x\Vert ^2\). From the well-known fact that \(\lambda _{\max }(M) \le \Vert M\Vert\), we have the following inequality:

$$\begin{aligned} \lambda _{\max }\left( \nabla ^2 f_1(x)\right) \le 3\left\| \sum _{r=1}^m \langle a_r, x \rangle ^2 a_r a_r^{\mathrm {T}} \right\| \le 3\left\| \sum _{r=1}^m\Vert a_r\Vert ^2 a_r a_r^{\mathrm {T}} \right\| \Vert x\Vert ^2 \le L\Vert x\Vert ^2 \le L\lambda _{\min }(\nabla ^2 h(x)). \end{aligned}$$

Therefore, we obtain the desired result. \(\square\)

Comparing the right-hand sides of (45) and (47), we can see that

$$\begin{aligned} 3\left\| \sum _{r=1}^m \Vert a_r\Vert ^2 a_r a_r^{\mathrm {T}}\right\| \le \sum _{r=1}^m \left( 3\Vert a_r a_r^{\mathrm {T}}\Vert ^2 + \Vert a_r a_r^{\mathrm {T}}\Vert |b_r| \right) . \end{aligned}$$
(48)

The constant L plays the important role of defining the step size and thereby affects the performance of the algorithms. Note that even if \(\left\| \sum _{r=1}^m \Vert a_r\Vert ^2 a_r a_r^{\mathrm {T}}\right\| = \sum _{r=1}^m \Vert a_r a_r^{\mathrm {T}}\Vert ^2\), the left-hand side of (48) never exceeds the right-hand side because \(\sum _{r=1}^m \Vert a_r a_r^{\mathrm {T}}\Vert |b_r| \ge 0\). When \(h(x)=\frac{1}{4}\Vert x\Vert ^4+\frac{1}{2}\Vert x\Vert ^2\), the subproblems of BPG(e) have a closed-form solution formula [8, Proposition 5.1]. When \(h(x) = \frac{1}{4}\Vert x\Vert ^4\), subproblems (5) and (7) also have a closed-form solution formula, which is obtained by slight modifications of the one for BPG(e).
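As a concrete illustration, the following Python sketch evaluates the two bounds (45) and (47) for given data, using \(\Vert a_r a_r^{\mathrm {T}}\Vert = \Vert a_r\Vert ^2\); the function name is ours.

```python
import numpy as np

def smad_bounds(A, b):
    """Lower bounds on L from (45) (BPG(e)) and (47) (Proposition 5).

    A has shape (m, d) with rows a_r; b has shape (m,).
    """
    sq_norms = np.sum(A ** 2, axis=1)                              # ||a_r||^2 = ||a_r a_r^T||
    L_bpg = np.sum(3.0 * sq_norms ** 2 + sq_norms * np.abs(b))     # right-hand side of (45)
    M = (A * sq_norms[:, None]).T @ A                              # sum_r ||a_r||^2 a_r a_r^T
    L_dc = 3.0 * np.linalg.norm(M, 2)                              # right-hand side of (47)
    return L_bpg, L_dc
```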

In this application, the functions \(f_1, f_2, g\), and h satisfy Assumptions 1 to 8 except Assumptions 4 (i) and 6. In particular, Assumption 4 (i) is not satisfied for our choice \(h(x)=\frac{1}{4}\Vert x\Vert ^4\), but it is satisfied if we instead use \(h(x)=\frac{1}{4}\Vert x\Vert ^4+\frac{1}{2}\Vert x\Vert ^2\). Finally, \(\varPsi\) and \(H_M\) are KL functions due to their semi-algebraicity [1]. Therefore, in this application, Assumption 6 is not required for the global convergence of BPDCAe.

5.2 Lower bound on the L-smooth adaptable parameter in the Gaussian model

We considered the following Gaussian model. The elements of the m vectors \(a_r \in {{\,\mathrm{{\mathbb {R}}}\,}}^d\) and of the ground truth \({\tilde{x}} \in {{\,\mathrm{{\mathbb {R}}}\,}}^d\), a sparse vector (5% nonzero entries), were generated independently from the standard Gaussian distribution. We then set \(b_r = \langle a_r, {\tilde{x}} \rangle ^2,\ r = 1,2,\ldots , m\).
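A minimal Python sketch of this setup is given below; the exact way the sparse ground truth is drawn (here, Gaussian values on a uniformly random 5% support) is our assumption.

```python
import numpy as np

def gaussian_instance(m, d, sparsity=0.05, seed=0):
    """Random phase-retrieval instance under the Gaussian model (noiseless)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, d))                          # rows a_r
    x_true = np.zeros(d)
    support = rng.choice(d, size=max(1, int(sparsity * d)), replace=False)
    x_true[support] = rng.standard_normal(support.size)      # sparse ground truth
    b = (A @ x_true) ** 2                                     # b_r = <a_r, x~>^2
    return A, b, x_true
```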

By the linearity of expectation, we consider the expectation of \(\nabla ^2 f_1\),

$$\begin{aligned} {\mathbb {E}}\left[ \nabla ^2 f_1(x)\right] = 3\sum _{r=1}^m {\mathbb {E}}\left[ \langle a_r, x \rangle ^2 a_r a_r^{\mathrm {T}} \right] . \end{aligned}$$

Since the elements of \(a_r\) are independently generated from the standard Gaussian distribution, the j-th diagonal element of the above matrix is given by

$$\begin{aligned} {\mathbb {E}}\left[ \langle a_r, x \rangle ^2 a_{r, j}^2 \right] = {\mathbb {E}}\left[ a_{r, j}^4 x_j^2 + \sum _{k=1, k\ne j}^d a_{r, j}^2 a_{r, k}^2 x_k^2 \right] = 3x_j^2 + \sum _{k=1, k\ne j}^d x_k^2 = 2x_j^2 + \Vert x\Vert ^2. \end{aligned}$$

The non-diagonal (jk) elements are

$$\begin{aligned} {\mathbb {E}}\left[ \langle a_r, x \rangle ^2 a_{r, j}a_{r, k} \right]&= {\mathbb {E}}\left[ 2a_{r, j}^2 a_{r, k}^2 x_j x_k \right] = 2 x_j x_k. \end{aligned}$$

Moreover, noting that \(h(x) = \frac{1}{4}\Vert x\Vert ^4\), we obtain \({\mathbb {E}}\left[ \langle a_r, x \rangle ^2 a_r a_r^{\mathrm {T}} \right] = \Vert x\Vert ^2 I_d + 2xx^{\mathrm {T}} = \nabla ^2 h(x)\). The expectation of the Hessian of \(f_1(x)\) is thus given by \({\mathbb {E}}[\nabla ^2 f_1(x)]=3m\nabla ^2 h(x)\).
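The identity \({\mathbb {E}}\left[ \langle a_r, x \rangle ^2 a_r a_r^{\mathrm {T}} \right] = \Vert x\Vert ^2 I_d + 2xx^{\mathrm {T}}\) can be sanity-checked numerically with a small Monte Carlo sketch of our own; the relative error should shrink as the number of samples m grows.

```python
import numpy as np

def check_hessian_expectation(d=5, m=200000, seed=0):
    """Relative error between the empirical average of 3<a_r,x>^2 a_r a_r^T
    and the theoretical value 3 m (||x||^2 I_d + 2 x x^T)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    A = rng.standard_normal((m, d))
    t = A @ x
    emp = 3.0 * (A * (t ** 2)[:, None]).T @ A                 # empirical Hessian of f1
    theo = 3.0 * m * (np.dot(x, x) * np.eye(d) + 2.0 * np.outer(x, x))
    return np.linalg.norm(emp - theo) / np.linalg.norm(theo)
```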

Under the Gaussian model, by applying [10, Lemma 7.4], we can reduce the lower bound on L given in Proposition 5 with high probability, as shown in the following proposition.

Proposition 6

Let the functions \(f_1\) and h be given by (44) and (46), respectively. Moreover, assume that the vectors \(a_r\) are independently distributed according to the Gaussian model with a sufficiently large number of measurements. Let \(\gamma\) and \(\delta\) be fixed positive numerical constants and \(c(\cdot )\) be a sufficiently large numerical constant that depends on \(\delta\); that is, the number of samples obeys \(m \ge c(\delta ) \cdot d\log d\) in the Gaussian model. Then, for any L satisfying

$$\begin{aligned} L \ge 9 \left\| \sum _{r=1}^m a_r a_r^{\mathrm {T}} \right\| + \delta , \end{aligned}$$
(49)

the function \(Lh-f_1\) is convex on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) and hence the pair \((f_1,h)\) is L-smad on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\) with probability at least \(1 - 5e^{-\gamma d} - 4 / d^2\).

Proof

Consider the expectation of \(\sum _{r=1}^m a_r a_r^{\mathrm {T}}\). Since the elements of \(a_r\) are independently generated from the standard Gaussian distribution, for any \(y\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\), we have

$$\begin{aligned} y^{\mathrm {T}}{\mathbb {E}}\left[ \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right] y = \sum _{r=1}^m{\mathbb {E}}\left[ \langle a_r, y \rangle ^2\right] = \sum _{r=1}^m\sum _{j=1}^d y_j^2 = \sum _{r=1}^m\Vert y\Vert ^2. \end{aligned}$$
(50)

From (50), for any \(y\in {{\,\mathrm{{\mathbb {R}}}\,}}^d\), we have

$$\begin{aligned} y^{\mathrm {T}}{\mathbb {E}}[\nabla ^2 f_1(x)]y =3\sum _{r=1}^m\left( \Vert x\Vert ^2\Vert y\Vert ^2 + 2\langle x, y \rangle ^2 \right) \le 9\Vert x\Vert ^2\sum _{r=1}^m\Vert y\Vert ^2 = 9\Vert x\Vert ^2 y^{\mathrm {T}}{\mathbb {E}}\left[ \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right] y. \end{aligned}$$
(51)

We can easily find that

$$\begin{aligned} 9\sum _{r=1}^m a_r a_r^{\mathrm {T}} \preceq 9\left\| \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right\| I_d, \end{aligned}$$

which implies that

$$\begin{aligned} 9{\mathbb {E}}\left[ \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right] \preceq 9\left\| \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right\| I_d. \end{aligned}$$
(52)

From (51) and (52), we have

$$\begin{aligned} {\mathbb {E}}[\nabla ^2 f_1(x)] \preceq 9\Vert x\Vert ^2\left\| \sum _{r=1}^m a_r a_r^{\mathrm {T}}\right\| I_d . \end{aligned}$$
(53)

From [10, Lemma 7.4], (49), and (53), we conclude that

$$\begin{aligned} \nabla ^2 f_1(x) \preceq {\mathbb {E}}[\nabla ^2 f_1(x)] + \delta \Vert x\Vert ^2 I_d \preceq L\Vert x\Vert ^2 I_d \end{aligned}$$
(54)

with probability at least \(1 - 5e^{-\gamma d} - 4 / d^2\). From \(\nabla ^2 h(x) \succeq \Vert x\Vert ^2 I_d\) and (54), we have \(\nabla ^2 f_1(x)\preceq L\nabla ^2 h(x)\), which proves that \(Lh-f_1\) is convex with probability at least \(1 - 5e^{-\gamma d} - 4 / d^2\). Therefore, the pair \((f_1,h)\) is L-smad on \({{\,\mathrm{{\mathbb {R}}}\,}}^d\). \(\square\)

Remark 1

Since each element of \(a_r\) independently follows the standard Gaussian distribution, \(\Vert a_r\Vert ^2\) follows the chi-squared distribution with d degrees of freedom. Thus, \(\Vert a_r\Vert ^2 \ge 3\) holds with high probability for sufficiently large d. This implies that the bound given in Proposition 6 is smaller than that given in Proposition 5.
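This tail probability can be checked quickly (assuming SciPy is available; the chosen dimensions are illustrative):

```python
from scipy.stats import chi2

# P(||a_r||^2 >= 3) when ||a_r||^2 follows a chi-squared distribution with d degrees of freedom
for d in (16, 64, 128, 256):
    print(d, chi2.sf(3.0, df=d))   # approaches 1 as d grows
```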

5.3 Performance results for phase retrieval with the Gaussian model

Here, we summarize the results for the Gaussian model. All numerical experiments were performed in Python 3.7 on an iMac with a 3.3 GHz Intel Core i5 Processor and 8 GB 1867 MHz DDR3 memory.

First, let us examine the results for the Bregman proximal-type algorithms, i.e., BPG [8], BPGe [29], BPDCA (Algorithm 1), and BPDCAe (Algorithm 2). We compared the averages over 100 random instances in terms of the number of iterations, CPU time, and accuracy (Tables 1 and 2). Let \({\hat{x}}\) be a recovered solution and \({\tilde{x}}\) be the ground truth generated according to the method described in Sect. 5.2. In order to compare the objective function values, we took the difference \(\log _{10}|\varPsi ({\hat{x}}) - \varPsi ({\tilde{x}})|\) to be the accuracy; in the numerical experiments, \(\varPsi ({\hat{x}}) > \varPsi ({\tilde{x}})\). The termination criterion was \(\Vert x^k - x^{k-1}\Vert /\max \{1, \Vert x^k\Vert \} \le 10^{-6}\). The equation numbers under each algorithm in Tables 1 and 2 indicate the value of \(\lambda\); that is, we set \(\lambda = 1/L\) with L given by the corresponding bound. For the restart schemes, we used the adaptive restart scheme with \(\rho = 0.99\) and the fixed restart scheme with \(K=200\). We set \(\theta = 1\) for the regularizer g in (43). We forcibly stopped the algorithms when they reached the maximum number of iterations (50,000). Table 2 compares the results of BPGe and BPDCAe under the same settings as those in Table 1. BPDCA with (49) was the fastest among the algorithms without extrapolation (Table 1), while the extrapolation technique made each algorithm faster (Table 2).
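The termination test and the accuracy measure just described can be written compactly as follows (a sketch with our own function names; `Psi` denotes a callable evaluating the objective (43)):

```python
import numpy as np

def stop_criterion(x_curr, x_prev, tol=1e-6):
    """Termination test ||x^k - x^{k-1}|| / max{1, ||x^k||} <= tol."""
    return np.linalg.norm(x_curr - x_prev) / max(1.0, np.linalg.norm(x_curr)) <= tol

def accuracy(Psi, x_hat, x_true):
    """Accuracy measure log10 |Psi(x_hat) - Psi(x_true)| used in Tables 1 and 2."""
    return np.log10(abs(Psi(x_hat) - Psi(x_true)))
```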

Table 1 Average number of iterations, CPU time, and accuracy for BPG [8] and BPDCA using 100 random instances of phase retrieval (over the Gaussian model) for different values of L
Table 2 Average number of iterations, CPU time, and accuracy for BPGe [29] and BPDCAe using 100 random instances of phase retrieval (over the Gaussian model) for different values of L
Fig. 1

Empirical probability of success based on 100 trials for BPDCAe and Wirtinger flow [10] using the same initialization step (of the Wirtinger flow). We set \(d = 128\) and varied the number m of measurements

We can conclude that, at least for phase retrieval, BPDCA has a clear advantage over BPG because of the DC reformulation (44), which permits choosing the smaller L in (47) instead of (45). In particular, for the Gaussian model, we can use the even smaller L in (49) with high probability. The extrapolation technique further enhances performance. Also, we can see that the iterates of BPDCA(e) globally converge to their optimal solutions even though the kernel generating distance h in (46) does not satisfy Assumption 4 (i). This suggests that this condition may be relaxed in some cases.

Next, we compared the empirical probability of success of BPDCAe and the Wirtinger flow [10], which is a well-known algorithm for phase retrieval. Here, we took \(x_0\) in BPDCAe to be the value calculated in the initialization step of the Wirtinger flow. The empirical probability of success in Fig. 1 is an average over 100 trials. We regarded an algorithm as successful if the relative error \(\Vert {\hat{x}} - {\tilde{x}}\Vert /\Vert {\tilde{x}}\Vert\) fell below \(10^{-5}\) after 2,500 iterations. The dimension d was fixed at 128, and we varied the number of measurements m. We used the adaptive restart scheme with \(\rho = 0.99\) and the fixed restart scheme with \(K=200\). We set \(\theta = 0\); i.e., we solved (43) without its regularizer. From the figure, we can see that BPDCAe with the initialization step of the Wirtinger flow achieved an almost 100% success rate when \(m/d \ge 6\) and obtained more stable results than the Wirtinger flow.

6 Conclusions

We proposed two Bregman proximal-type algorithms for solving DC optimization problems \(({\mathcal {P}})\): the Bregman Proximal DC Algorithm (BPDCA) and its variant with extrapolation (BPDCAe). Proximal-type algorithms, including ours, are effective on large-scale problems. In addition, our algorithms only assume that the function \(f_1\) has the L-smooth adaptable property relative to the kernel generating distance h, instead of L-smoothness. The restart condition of our adaptive restart scheme also differs from existing ones.

We conducted convergence analyses of our algorithms. Assuming the Kurdyka-Łojasiewicz property or subanalyticity of the objective function together with some standard assumptions, we established that the iterates generated by BPDCA(e) globally converge to a limiting stationary point or a limiting critical point and derived their convergence rates.

We applied our algorithms to phase retrieval. The numerical experiments demonstrated that BPDCAe is faster than the other Bregman proximal-type algorithms. For the Gaussian model, BPDCAe produced more stable results than the Wirtinger flow [10]. We conclude that BPDCAe is a powerful method for solving large-scale and structured DC optimization problems. Although the kernel generating distance h in (46) does not satisfy Assumption 4 (i), the sequences generated by BPDCA(e) converged in the numerical experiments. We therefore conjecture that most of the convergence results can be established under weaker conditions. As future work, since g in BPDCA does not need to be convex, we will attempt to prove the monotonicity of the auxiliary function of BPDCAe (Lemma 6) without Assumption 7.

Other Bregman proximal-type algorithms have also been proposed. Mukkamala et al. [20] choose the L-smad parameter by a line search. Since this parameter is generally difficult to estimate accurately, such a line search could also be incorporated into our algorithms.

For constrained problems, Wang et al. [27] proposed the Bregman alternating direction method of multipliers, and Tu et al. [25] developed a Bregman-type algorithm for solving linearly constrained DC optimization problems. These methods may inspire further improvements and extensions of our algorithms.