Throughout this paper, 〈x,y〉 denotes the inner product of \(x,y\in \mathbb {R}^{m}\), and ∥⋅∥ corresponds to the induced norm given by \(\|x\|=\sqrt {\langle x,x\rangle }\). For any extended real-valued function \(f:\mathbb {R}^{m}\to \mathbb {R}\cup \{+\infty \}\), the set \(\text {dom} f := \{x\in \mathbb {R}^{m} ~|~ f(x) < +\infty \}\) denotes the (effective) domain of f. A function f is proper if its domain is nonempty. The function f is coercive if \(f(x)\to +\infty \) whenever \(\|x\|\to +\infty \), and it is said to be convex if
$$ f(\lambda x+(1-\lambda)y)\leq \lambda f(x) +(1-\lambda)f(y) \quad \text{for all } x,y\in\mathbb{R}^{m} \text{ and } \lambda\in[0,1]. $$
Further, f is strongly convex with strong convexity parameter ρ > 0 if \(f-\frac {\rho }{2}\|\cdot \|^{2}\) is convex, i.e., when
$$ f(\lambda x+(1-\lambda)y)\leq \lambda f(x) +(1-\lambda)f(y)-\frac{\rho}{2}\lambda(1-\lambda)\|x-y\|^{2} $$
for all \(x,y\in \mathbb {R}^{m}\) and λ ∈ [0,1]. For any convex function f, the subdifferential of f at \(x\in \mathbb {R}^{m}\) is the set
$$ \partial{f}(x) := \{w\in\mathbb{R}^{m} ~|~ f(y) \geq f(x) + \langle w,y-x\rangle~\forall y\in\mathbb{R}^{m}\}. $$
If f is differentiable at x, then ∂f(x) = {∇f(x)}, where ∇f(x) denotes the gradient of f at x. The one-sided directional derivative of f at x with respect to the direction \(d\in \mathbb {R}^{m}\) is defined by
$$ f^{\prime}(x;d):=\lim_{t\searrow 0}\frac{f(x+td)-f(x)}{t}. $$
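For illustration purposes only, the following Python sketch approximates the one-sided directional derivative by its difference quotient and numerically checks the subgradient inequality defining ∂f(x); the function f, the point x, the direction d and the subgradient formula are arbitrary choices made up for this example.

```python
import numpy as np

# Illustrative convex, nonsmooth function f(x) = ||x||_1 + 0.5 ||x||^2.
f = lambda x: np.abs(x).sum() + 0.5 * x @ x

# One particular subgradient of f at x: sign(x) + x (at zero components, 0 lies in [-1,1]).
subgrad = lambda x: np.sign(x) + x

x = np.array([1.0, -2.0, 0.0])
d = np.array([0.5, 1.0, -1.0])

# One-sided directional derivative f'(x; d) approximated by a difference quotient.
t = 1e-6
fd_approx = (f(x + t * d) - f(x)) / t

# Exact value: (sign(x_i) + x_i) d_i where x_i != 0, and |d_i| where x_i = 0.
exact = sum((np.sign(xi) + xi) * di if xi != 0 else abs(di) for xi, di in zip(x, d))
print(fd_approx, exact)              # the two values should agree closely

# Subgradient inequality: f(y) >= f(x) + <w, y - x> for all y, with w in ∂f(x).
w = subgrad(x)
for y in np.random.default_rng(0).standard_normal((5, 3)):
    assert f(y) >= f(x) + w @ (y - x) - 1e-12
```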
Before presenting the main contribution of this paper in Section 3, we state the assumptions imposed on (𝓟) and recall some preliminary notions and basic results which will be used in the sequel.
Basic Assumptions
Assumption 1
Both functions g and h in (𝓟) are strongly convex on their domain with the same strong convexity parameter ρ > 0.
Assumption 2
The function h is subdifferentiable at every point in domh; that is, ∂h(x) ≠ ∅ for all x ∈ dom h.
Assumption 3
The function g is continuously differentiable on an open set containing dom h and
$$ \phi^{\star} := \inf_{x\in\mathbb{R}^{m}}\phi(x) > -\infty. $$
Assumption 1 is not restrictive, as one can always rewrite the objective function as ϕ = (g + q) − (h + q) for any strongly convex function q (e.g., \(q=\frac {\rho }{2}\|\cdot \|^{2}\)). Observe that Assumption 2 holds for all x ∈ ri dom h (by [17, Theorem 23.4]). A key property for our method is the smoothness of g in Assumption 3, which in general cannot be omitted (see [3, Example 3.2]).
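The following short Python sketch illustrates this regularization trick on made-up one-dimensional data (the functions g0 and h0 are arbitrary convex examples, not the components of (𝓟)): adding q to both components leaves ϕ unchanged while making each part strongly convex.

```python
import numpy as np

rho = 1.0
# Merely convex DC components (illustrative, one-dimensional).
g0 = lambda x: np.abs(x)           # convex, not strongly convex
h0 = lambda x: np.maximum(x, 0.0)  # convex, not strongly convex
q  = lambda x: 0.5 * rho * x**2    # strongly convex regularizer

# Regularized decomposition phi = (g0 + q) - (h0 + q): both parts are now
# rho-strongly convex, while phi itself is unchanged.
g = lambda x: g0(x) + q(x)
h = lambda x: h0(x) + q(x)
phi  = lambda x: g0(x) - h0(x)
phi2 = lambda x: g(x) - h(x)

xs = np.linspace(-3, 3, 7)
assert np.allclose(phi(xs), phi2(xs))  # same objective, different decomposition
```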
Optimality Conditions
Under Assumptions 2 and 3 the following well-known necessary condition for local optimality holds.
Fact 1
(First-order necessary optimality condition) If x⋆ ∈ dom ϕ is a local minimizer of problem (𝓟), then
$$ \partial{h}(x^{\star})=\{\nabla{g}(x^{\star})\}. $$
(1)
Proof
See [16, Theorem 3]. □
Any point satisfying condition (1) is called a d(irectional)-stationary point of (𝓟). We say that x⋆ is a critical point of (𝓟) if
$$ \nabla{g}(x^{\star}) \in \partial{h}(x^{\star}). $$
Clearly, d-stationary points are critical points, but the converse is not true in general (see, e.g., [4, Example 1]). In our setting, the notion of critical point coincides with that of Clarke stationarity, which requires that zero belongs to the Clarke subdifferential of ϕ at x⋆ (see, e.g., [5, Proposition 2]). The next result establishes that the d-stationary points of (𝓟) are precisely those points at which the directional derivative of ϕ vanishes in every direction.
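Before stating it, let us illustrate the gap between the two notions with a standard one-dimensional example (not necessarily the one in [4, Example 1]). Consider the DC decomposition
$$ g(x)=x^{2},\qquad h(x)=|x|+\tfrac{1}{2}x^{2},\qquad \phi(x)=g(x)-h(x)=\tfrac{1}{2}x^{2}-|x|, $$
for which both components are strongly convex. At x⋆ = 0 we have ∇g(0) = 0 ∈ ∂h(0) = [−1,1], so 0 is a critical point; however, ∂h(0) is not a singleton and ϕ′(0;d) = −|d| < 0 for every d ≠ 0, so 0 is not d-stationary (the d-stationary points are x⋆ = ±1, the global minimizers of ϕ).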
Proposition 1
A point x⋆ ∈ dom ϕ is a d-stationary point of (𝓟) if and only if
$$ \phi^{\prime}(x^{\star}; d) = 0\quad\text{ for all } d\in \mathbb{R}^{m}. $$
(2)
Proof
If x⋆ is a d-stationary point of (𝓟), then ∂h(x⋆) is a singleton, so by [17, Theorem 25.1] h is differentiable at x⋆ with ∇h(x⋆) = ∇g(x⋆). Therefore, for any \(d\in \mathbb {R}^{m}\), we have
$$ \phi^{\prime}(x^{\star}; d) = \langle \nabla{g}(x^{\star}), d \rangle - \langle \nabla{h}(x^{\star}), d \rangle=0. $$
For the converse implication, pick any v ∈ ∂h(x⋆) (a nonempty set by Assumption 2) and observe that, for any \(d\in \mathbb {R}^{m}\), we have
$$ \begin{array}{@{}rcl@{}} \phi^{\prime}(x^{\star}; d) & = & g^{\prime}(x^{\star};d)-h^{\prime}(x^{\star};d)\\ & = & \langle \nabla{g}(x^{\star}), d \rangle -\lim_{t\searrow 0}\frac{h(x^{\star}+td)-h(x^{\star})}{t}\\ & \leq & \langle \nabla{g}(x^{\star})-v, d \rangle. \end{array} $$
Hence, if x⋆ satisfies (2), one must have
$$ \langle \nabla{g}(x^{\star})-v, d \rangle \geq 0\quad \text{for all } d\in\mathbb{R}^{m}, $$
which is equivalent to ∇g(x⋆) − v = 0. As v was arbitrarily chosen in ∂h(x⋆), we conclude that ∂h(x⋆) = {∇g(x⋆)}. □
DCA and Boosted DCA
In this section, we recall the iterative procedure DCA and its accelerated extension, BDCA, for solving problem (𝓟). The DCA iterates by solving a sequence of approximating convex subproblems, as described next in Algorithm 1.
The key feature that makes the DCA work, stated next in Fact 2(a), is that the solution of the convex subproblem in Algorithm 1 provides a decrease in the objective value of (𝓟) along the iterations. Actually, an analogous result holds for the dual problem, see [14, Theorem 3]. In [2], the authors showed that the direction generated by the iterates of DCA, namely dk := yk − xk, provides a descent direction for the objective function at yk when the functions g and h in (𝓟) are assumed to be smooth. This result was later generalized in [3] to the case where h satisfies Assumption 2. The following result collects these properties.
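For illustration, and assuming that Algorithm 1 performs the standard DCA step \(y_k\in\arg\min_x\{g(x)-\langle u_k,x\rangle\}\) with \(u_k\in\partial h(x_k)\), the following Python sketch runs the iteration on made-up quadratic data; the test functions, the stopping tolerance and the iteration cap are illustrative choices only.

```python
import numpy as np

# Illustrative DC data: g(x) = 0.5 x^T A x with A positive definite and
# h(x) = 0.5 rho ||x - c||^2, so both g and h are rho-strongly convex.
rng = np.random.default_rng(0)
m = 5
M = rng.standard_normal((m, m))
A = M @ M.T + 2.0 * np.eye(m)
rho, c = 1.0, rng.standard_normal(m)

g   = lambda x: 0.5 * x @ A @ x
h   = lambda x: 0.5 * rho * (x - c) @ (x - c)
dh  = lambda x: rho * (x - c)          # here h is differentiable, so ∂h(x) = {dh(x)}
phi = lambda x: g(x) - h(x)

x = rng.standard_normal(m)
for k in range(100):
    u = dh(x)                          # u_k in ∂h(x_k)
    y = np.linalg.solve(A, u)          # y_k = argmin_x g(x) - <u_k, x>, i.e. A y = u_k
    if np.linalg.norm(y - x) <= 1e-10: # stop when the DCA step is (numerically) zero
        break
    x = y                              # plain DCA update: x_{k+1} = y_k (no line search)
print(k, phi(x))
```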
Fact 2
Let xk and yk be the sequences generated by Algorithm 1 and set dk := yk − xk for all \(k\in \mathbb {N}\). Then the following statements hold:
- (a)
\(\phi(y_{k})\leq \phi(x_{k})-\rho \|d_{k}\|^{2}\);
- (b)
\(\phi ^{\prime }(y_{k};d_{k})\leq -\rho \|d_{k}\|^{2}\);
- (c)
there exists some δk > 0 such that
$$ \phi(y_{k}+\lambda d_{k})\leq \phi(y_{k})-\alpha\lambda^{2}\|d_{k}\|^{2}\quad\text{for all } \lambda\in{[0,\delta_{k}[}. $$
Proof
See [3, Proposition 3.1]. □
Thanks to Fact 2, once yk has been found by DCA, one can achieve a larger decrease in the objective value of (𝓟) by moving along the descent direction dk. Indeed, observe that
$$ \phi(y_{k}+\lambda d_{k})\leq \phi(y_{k})-\alpha\lambda^{2}\|d_{k}\|^{2} \leq \phi(x_{k})-(\rho+\alpha\lambda^{2})\|d_{k}\|^{2}\quad\text{for all } \lambda\in{[0,\delta_{k}[}. $$
This fact is the main idea of the BDCA [2, 3], whose iteration is described next in Algorithm 2.
Algorithmically, the BDCA is nothing more than the classical DCA with a line search procedure based on an Armijo-type rule. Note that the backtracking step in Algorithm 2 (lines 6–9) terminates finitely thanks to Fact 2(c).
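The following Python sketch makes the line search concrete by wrapping an Armijo-type backtracking around a generic DCA step; the parameters α, β and the initial trial step size, as well as the safeguard on λ, are illustrative choices and not the exact prescriptions of Algorithm 2.

```python
import numpy as np

def bdca_step(x, dca_solve, subgrad_h, phi, alpha=0.1, beta=0.5, lam_bar=2.0):
    """One BDCA iteration (a sketch): a DCA step followed by an Armijo-type
    backtracking along d_k = y_k - x_k.  alpha, beta and lam_bar are
    illustrative parameters, not the ones prescribed in Algorithm 2."""
    u = subgrad_h(x)                 # u_k in ∂h(x_k)
    y = dca_solve(u)                 # y_k = argmin_x g(x) - <u_k, x>
    d = y - x                        # descent direction at y_k (Fact 2(b))
    lam = lam_bar
    # Backtracking: shrink lam until phi(y + lam d) <= phi(y) - alpha lam^2 ||d||^2.
    while lam > 1e-12 and phi(y + lam * d) > phi(y) - alpha * lam**2 * (d @ d):
        lam *= beta
    return y + lam * d if lam > 1e-12 else y
```

With the quadratic data of the previous sketch, one would simply replace the plain DCA update x = y by x = bdca_step(x, lambda u: np.linalg.solve(A, u), dh, phi).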
We state next the basic convergence results for the sequences generated by BDCA (for more, see [2, 3]). Observe that DCA can be seen as a particular case of BDCA if one sets \(\overline {\lambda }_{k}=0\), so the following result applies to both Algorithms 1 and 2.
Fact 3
For any \(x_{0}\in \mathbb {R}^{m}\), either Algorithm 2 (BDCA) with ε = 0 returns a critical point of (𝓟), or it generates an infinite sequence such that the following properties hold.
- (a)
{ϕ(xk)} is monotonically decreasing and convergent to some ϕ⋆.
- (b)
Any limit point of {xk} is a critical point of (𝓟). In addition, if ϕ is coercive then there exists a subsequence of {xk} which converges to a critical point of (𝓟).
- (c)
It holds that \({\sum }_{k=0}^{+\infty }\|d_{k}\|^{2}<+\infty \). Furthermore, if there is some \(\overline {\lambda }\) such that \(\lambda _{k}\leq \overline {\lambda }\) for all k ≥ 0, then \({\sum }_{k=0}^{+\infty }\|x_{k+1}-x_{k}\|^{2}<+\infty \).
Proof
See [3, Theorem 3.6]. □
Positive Spanning Sets
Most directional direct search methods are based on the use of positive spanning sets (see, e.g., [1, Section 5.6.3] and [7, Chapter 7]). Let us recall this concept here.
Definition 1
The positive span of a set of vectors \(\{v_{1},v_{2},\ldots ,v_{r}\}\subset \mathbb {R}^{m}\) is the convex cone generated by this set, i.e.,
$$ \{v\in\mathbb{R}^{m}: v=\alpha_{1}v_{1}+\cdots+\alpha_{r}v_{r}, ~\alpha_{i}\geq 0, i=1,2,\ldots,r\}. $$
A set of vectors in \(\mathbb {R}^{m}\) is said to be a positive spanning set if its positive span is the whole space \(\mathbb {R}^{m}\). A set {v1,v2, … , vr} is said to be positively dependent if one of the vectors is in the positive span generated by the remaining vectors; otherwise, the set is called positively independent. A positive basis in \(\mathbb {R}^{m}\) is a positively independent set whose positive span is \(\mathbb {R}^{m}\).
Three well-known examples of positive spanning sets are given next.
Example 1
(Positive basis) Let e1,e2, … , em be the unit vectors of the standard basis in \(\mathbb {R}^{m}\). Then the following sets are positive bases in \(\mathbb {R}^{m}\):
$$ \begin{array}{@{}rcl@{}} D_{1} &:=&\{\pm e_{1},\pm e_{2},\ldots,\pm e_{m}\}, \end{array} $$
(3a)
$$ \begin{array}{@{}rcl@{}} D_{2} &:=& \left\{e_{1},e_{2},\ldots,e_{m}, - \sum\limits_{i=1}^{m} e_{i} \right\}, \end{array} $$
(3b)
$$ \begin{array}{@{}rcl@{}} D_{3} &:=&\left\{ v_{1},v_{2},\ldots,v_{m},v_{m+1}\in\mathbb{R}^{m} \quad \text{ with }~ \begin{array}{l} {v_{i}^{T}}v_{j} =-\frac{1}{m},~\text{ if } i\neq j, \\ \|v_{i}\|=1,~i=1,2,\ldots,m+1 \end{array}\right\}. \end{array} $$
(3c)
A possible construction for D3 is given in [7, Corollary 2.6].
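The following Python sketch constructs the three sets of Example 1 for a given m and numerically verifies the positive spanning property by expressing random vectors as nonnegative combinations; the nonnegative least squares test and the eigenvalue-based construction of D3 (via the Gram matrix with unit diagonal and off-diagonal entries −1/m) are illustrative devices, not the construction of [7, Corollary 2.6].

```python
import numpy as np
from scipy.optimize import nnls

m = 4

# D1: the 2m coordinate directions ±e_i (columns of the matrix below).
I = np.eye(m)
D1 = np.hstack([I, -I])

# D2: e_1, ..., e_m together with -(e_1 + ... + e_m).
D2 = np.hstack([I, -np.ones((m, 1))])

# D3: m+1 unit vectors with pairwise inner products -1/m, obtained by factoring
# the Gram matrix G = (1 + 1/m) I - (1/m) J  (J the all-ones matrix, rank(G) = m).
G = (1.0 + 1.0 / m) * np.eye(m + 1) - (1.0 / m) * np.ones((m + 1, m + 1))
w, U = np.linalg.eigh(G)                       # eigenvalues in ascending order, w[0] ~ 0
D3 = (U[:, 1:] * np.sqrt(w[1:])).T             # drop the zero eigenvalue -> m x (m+1)

def positively_spans(D, trials=100, tol=1e-8):
    """Numerically check that sampled vectors lie in the positive span of the
    columns of D, using nonnegative least squares."""
    rng = np.random.default_rng(1)
    for _ in range(trials):
        d = rng.standard_normal(D.shape[0])
        _, residual = nnls(D, d)               # min ||D a - d|| subject to a >= 0
        if residual > tol:
            return False
    return True

print([positively_spans(D) for D in (D1, D2, D3)])  # expected: [True, True, True]
print(np.round(D3.T @ D3, 3))                       # unit diagonal, -1/m off-diagonal
```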
Recall that the BDCA provides critical points of (𝓟) which are not necessarily d-stationary points (Fact 3). In theory (see [14, Section 3.3]), if x⋆ is a critical point which is not d-stationary, one could restart BDCA by taking x0 := x⋆ and choosing y0 ∈ ∂h(x0) ∖{∇g(x0)}. Nonetheless, observe that this is only applicable when the algorithm converges to x⋆ in a finite number of iterations, which rarely happens in practice (except for polyhedral DC problems, where even a global solution can be effectively computed if h is a piecewise linear function with a reasonably small number of pieces, see [14, §4.2]). Because of that, our goal is to design a variant of BDCA that generates a sequence converging to a d-stationary point. The following key result, proved in [6, Theorem 3.1], asserts that positive spanning sets can be used to escape from points which are not d-stationary. We include its short proof.
Fact 4
Let {v1, v2, … , vr} be a positive spanning set of \(\mathbb {R}^{m}\). A point x⋆ ∈ dom ϕ is a d-stationary point of (𝓟) if and only if
$$ \phi^{\prime}(x^{\star}; v_{i})\geq 0\quad\text{for all } i=1,2,\ldots,r. $$
(4)
Proof
The direct implication is an immediate consequence of Proposition 1. For the converse implication, pick any x⋆ ∈ dom ϕ satisfying (4) and choose any \(d\in \mathbb {R}^{m}\). Since {v1,v2, … , vr} is a positive spanning set, there are α1,α2, … , αr ≥ 0 such that
$$ d=\alpha_{1}v_{1}+\alpha_{2}v_{2}+\cdots+\alpha_{r}v_{r}. $$
According to [17, Theorem 23.1], we have that
$$ h^{\prime}(x^{\star};d)\leq \alpha_{1}h^{\prime}(x^{\star};v_{1})+\cdots+\alpha_{r}h^{\prime}(x^{\star};v_{r}). $$
Hence, we obtain
$$ \begin{array}{@{}rcl@{}} \phi^{\prime}(x^{\star};d) & = & g^{\prime}(x^{\star};d)-h^{\prime}(x^{\star};d)\\ & =& \langle \nabla{g}(x^{\star}), \alpha_{1}v_{1}+\alpha_{2}v_{2}+\cdots+\alpha_{r}v_{r} \rangle-h^{\prime}(x^{\star};d) \\ & \geq& \sum\limits_{i=1}^{r} \alpha_{i} \langle \nabla{g}(x^{\star}), v_{i}\rangle - \sum\limits_{i=1}^{r} \alpha_{i} h^{\prime}(x^{\star};v_{i})\\ & =& \sum\limits_{i=1}^{r} \alpha_{i}\phi^{\prime}(x^{\star};v_{i})\geq 0. \end{array} $$
Since d was arbitrarily chosen, we deduce that \(\phi ^{\prime }(x^{\star };d)\geq 0\) for all \(d\in \mathbb {R}^{m}\). Arguing as in the proof of Proposition 1, this implies ∂h(x⋆) = {∇g(x⋆)}; hence, x⋆ is a d-stationary point of (𝓟) and (2) holds. □
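From a computational viewpoint, Fact 4 yields a cheap test for d-stationarity: it suffices to check the sign of the directional derivative along the finitely many directions of a positive spanning set. The following Python sketch applies this test, with one-sided difference quotients in place of exact directional derivatives, to the two-dimensional analogue of the example given after the definition of critical points; the tolerances are illustrative.

```python
import numpy as np

def is_d_stationary(phi, x, basis, t=1e-7, tol=1e-5):
    """Numerical version of Fact 4 (a sketch): declare x d-stationary when the
    one-sided difference quotients along every direction of a positive spanning
    set are (approximately) nonnegative.  t and tol are illustrative tolerances."""
    return all((phi(x + t * v) - phi(x)) / t >= -tol for v in basis)

# Illustration with phi(x) = 0.5 ||x||^2 - ||x||_1, i.e. g(x) = ||x||^2 and
# h(x) = ||x||_1 + 0.5 ||x||^2 (the two-dimensional analogue of the earlier
# one-dimensional example), using the positive basis D1 = {±e_1, ±e_2}.
phi = lambda x: 0.5 * x @ x - np.abs(x).sum()
I = np.eye(2)
D1 = [I[0], I[1], -I[0], -I[1]]

print(is_d_stationary(phi, np.zeros(2), D1))  # False: critical but not d-stationary
print(is_d_stationary(phi, np.ones(2), D1))   # True:  (1, 1) is d-stationary
```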