Using Positive Spanning Sets to Achieve d-Stationarity with the Boosted DC Algorithm

The Difference of Convex functions Algorithm (DCA) is widely used for minimizing the difference of two convex functions. A recently proposed accelerated version, termed BDCA for Boosted DC Algorithm, incorporates a line search step to achieve a larger decrease of the objective value at each iteration. Thanks to this step, BDCA usually converges much faster than DCA in practice. The solutions found by DCA are guaranteed to be critical points of the problem, but these may not be local minima. Although BDCA tends to improve the objective value of the solutions it finds, these are frequently just critical points as well. In this paper we combine BDCA with a simple Derivative-Free Optimization (DFO) algorithm to force the d-stationarity (lack of descent direction) at the point obtained. The potential of this approach is illustrated through some computational experiments on a Minimum-Sum-of-Squares clustering problem. Our numerical results demonstrate that the new method provides better solutions while still remains faster than DCA in the majority of test cases.


Introduction
In this paper, we are interested in solving the following unconstrained DC (difference of convex functions) optimization problem: where g : R m → R ∪ {+∞} and h : R m → R ∪ {+∞} are proper, closed and convex functions, and g is smooth, with the conventions: (+∞) − (+∞) = +∞, (+∞) − λ = +∞ and λ − (+∞) = −∞, ∀λ ∈ ] − ∞, +∞[.Problem (P) can be tackled by the well-known DC algorithm (DCA) [14,15].DC programming has become an active research field for the last few decades [9] and DCA has been successfully applied to many real-world problems arising in different fields (see, e.g., [10]).Although DCA performs well in practice, its convergence can be fairly slow for some particular problems.In order to speed up the scheme, an accelerated version of the algorithm, called Boosted DC algorithm (BDCA), has been recently proposed in [2,3].The BDCA performs a line search at the point generated by the classical DCA, which allows to achieve a larger decrease in the objective value at each iteration.In the numerical experiments reported in [2,3] it was shown that BDCA was not only faster than DCA, but also often found solutions with lower objective value.However, although both algorithms are proved to converge to critical points of (P), there is no guarantee that these points are local minima.For this reason, a simple trick to achieve better solutions consists in running the algorithms from different starting points.Another approach has been recently used in [13], where the authors incorporated an inertial term into the algorithm making it converge to better critical points.In the recent work [12], the author proposed a DC scheme which is able to compute d-stationary points.Although this algorithm permits to address problems where the function g is nonsmooth, the second component function h needs to be the pointwise maximum of finitely many differentiable functions.
The aim of this paper is to show that it is possible to combine BDCA with a simple DFO (Derivative-Free Optimization) routine to guarantee d-stationarity at the limit point obtained by the algorithm.As a representative application, we perform a set of numerical experiments on the Minimum Sum-of-Squares Clustering problem studied in [3] to illustrate this observation.This problem has many critical points, where both DCA and BDCA tend to easily get trapped in.As a byproduct of the DFO step, we observe that in some problems a single run of the new algorithm is able to provide better solutions than those obtained by multiple restarts of DCA.
The rest of this paper is organized as follows.In Section 2 we recall some preliminary results.We propose a new variant of BDCA, named BDCA+, in Section 3. The results of some numerical experiments are presented in Section 4, where we compare the performance of DCA, BDCA and BDCA+ on several test cases.We finish with some conclusions in Section 5.

Preliminaries
Throughout this paper, x, y denotes the inner product of x, y ∈ R m , and • corresponds to the induced norm given by x = x, x .For any extended real-valued function If f is differentiable at x, then ∂f (x) = {∇f (x)}, where ∇f (x) denotes the gradient of f at x.The one-sided directional derivative of f at x with respect to the direction d ∈ R m is defined by Before going to the main contribution of this paper in Section 3, we state our assumptions imposed on (P).We also recall some preliminary notions and basic results which will be used in the sequel.

Basic Assumptions
Assumption 1.Both functions g and h in (P) are strongly convex on their domain for the same strong convexity parameter ρ > 0.
Assumption 2. The function h is subdifferentiable at every point in dom h; that is, ∂h(x) = ∅ for all x ∈ dom h.Assumption 3. The function g is continuously differentiable on an open set containing dom h and φ := inf x∈R m φ(x) > −∞.Assumption 1 is not restrictive, as one can always rewrite the objective function as φ = (g + q) − (h + q) for any strongly convex function q (e.g., q = ρ 2 • 2 ).Observe that Assumption 2 holds for all x ∈ ri dom h (by [17,Theorem 23.4]).A key property for our method is the smoothness of g in Assumption 3, which cannot be in general omitted (see [3,Example 3.2]).

Optimality Conditions
Under Assumptions 2 and 3 the following well-known necessary condition for local optimality holds.
Any point satisfying condition (1) is called a d(irectional)-stationary point of (P).We say that x is a critical point of (P) if ∇g(x ) ∈ ∂h(x ).
Clearly, d-stationary points are critical points, but the converse is not true in general (see, e.g., [4,Example 1]).In our setting, the notion of critical point coincides with that of Clarke stationarity, which requires that zero belongs to the Clarke subdifferential at x (see, e.g., [5,Proposition 2]).The next result establishes that the d-stationary points of (P) are precisely those points for which the directional derivative is zero for every direction.
Proposition 2.1.A point x ∈ dom φ is a d-stationary point of (P) if and only if (2) Proof.If x is a d-stationary point of (P), then by [17,Theorem 25.1] we know that h is differentiable at x .Therefore, for any d ∈ R m , we have For the converse implication, pick any v ∈ ∂h(x ) = ∅ (by Assumption 2) and observe that, for any d ∈ R m , we have that which is equivalent to ∇g(x ) − v = 0.As v was arbitrarily chosen in ∂h(x ), we conclude that ∂h(x * ) = {∇g(x )}.

DCA and Boosted DCA
In this section, we recall the iterative procedure DCA and its accelerated extension, BDCA, for solving problem (P).The DCA iterates by solving a sequence of approximating convex subproblems, as described next in Algorithm 1.
The key feature that makes the DCA work, stated next in Fact 2.2(a), is that the solution of (P k ) provides a decrease in the objective value of (P) along the iterations.Actually, an analogous result holds for the dual problem, see [14,Theorem 3].In [2], the authors showed that the direction generated by the iterates of DCA, namely d k := y k − x k , provides a descent direction of the objective function at y k when the functions g and h in (P) are assumed to be smooth.This result was later generalized in [3] to the case where h satisfies Assumption 2. The following result collects these properties.
Algorithm 1: DCA (DC Algorithm) Input: An initial point x 0 ∈ R m and a desired tolerance ε ≥ 0; Thanks to Fact 2.2, once y k has been found by DCA, one can achieve a larger decrease in the objective value of (P) by moving along the descent direction d k .Indeed, observe that This fact is the main idea of the BDCA [2, 3], whose iteration is described next in Algorithm 2.
Algorithmically, the BDCA is nothing more than the classical DCA with a line search procedure using an Armijo type rule.Note that the backtracking step in Algorithm 2 (lines 6-9) terminates finitely thanks to Fact 2.2(c).
We state next the basic convergence results for the sequences generated by BDCA (for more, see [2,3]).Observe that DCA can be seen as a particular case of BDCA if one sets λ k = 0, so the following result applies to both Algorithms 1 and 2.
Fact 2.3.For any x 0 ∈ R m , either Algorithm 2 (BDCA) with ε = 0 returns a critical point of (P), or it generates an infinite sequence such that the following properties hold.
(b) Any limit point of {x k } is a critical point of (P).In addition, if φ is coercive then there exists a subsequence of {x k } which converges to a critical point of (P).
A set of vectors in R m is said to be a positive spanning set if its positive span is the whole space R m .A set {v 1 , v 2 , . . ., v r } is said to be positively dependent if one of the vectors is in the positive span generated by the remaining vectors; otherwise, the set is called positively independent.A positive basis in R m is a positively independent set whose positive span is R m .
Three well-known examples of positive spanning sets are given next.
Example 2.1 (Positive basis).Let e 1 , e 2 , . . ., e m be the unit vectors of the standard basis in R m .Then the following sets are positive basis in R m : A possible construction for D 3 is given in [7,Corollary 2.6].
Recall that the BDCA provides critical points of (P) which are not necessarily d-stationary points (Fact 2.3).Theoretically, see [14, Section 3.3], if x is a critical point which is not d-stationary, one could restart BDCA by taking x 0 := x and choose y 0 ∈ ∂h(x 0 ) \ {∇g(x 0 )}.Nonetheless, observe that this is only applicable when the algorithm converges in a finite number of iterations to x , which does not happen very often in practice (except for polyhedral DC problems, where even a global solution can be effectively computed if h is a piecewise linear function with a reasonable small number of pieces, see [14, §4.2]).Because of that, our goal is to design a variant of BDCA that generates a sequence converging to a d-stationary point.The following key result, proved in [6, Theorem 3.1], asserts that using positive spanning sets one can escape from points which are not d-stationary.We include its short proof.

Forcing BDCA to converge to d-stationary points
In this section we propose a new variant of BDCA to solve problem (P), called BDCA+.The idea is to combine BDCA with a basic DFO routine which uses positive spanning sets.The first scheme aims at achieving a fast minimization of the objective function φ, while the second one is used to avoid converging to critical points for which there is at least a descent direction (i.e., they are not d-stationary points and, thus, they cannot be local minima).Let us make some comments about the new scheme BDCA+, which is stated in Algorithm 3.
• Subproblem (P k ) in line 3 corresponds to the classical DCA step for solving (P).
• Lines 5 to 10 encode the boosting line search step used in BDCA.If the current iterate is (numerically) not a critical point, then the algorithm performs a line search step at y k along the direction d k to improve the objective values of (P).
• Line 11 to 19 correspond to a direct search DFO technique.It is run only when BDCA was stopped, in order to check if the point obtained is d-stationary.To this aim, it performs a backtracking search along each of the directions belonging to a positive spanning set D of R m .If it reaches a point whose objective value is smaller, then we move to that point and run BDCA again from there.Otherwise, there is not descent direction in D and, according to Fact 2.4, the point we have found must be (numerically) d-stationary.
• The choice λ k = 0 for all k is allowed, which corresponds to adding a direct search step to DCA.
In Figure 1 we show the iterations generated by DCA (Algorithm 1) and BDCA+ (Algorithm 3) from the same starting point x 0 = (0, 1).The DCA converges to the critical point (0, 0).The BDCA escapes from this point but still gets stuck at (0, −1), which is also a critical point which is not d-stationary.After applying once the DFO scheme (dashed line), we observe that BDCA successfully converges to the d-stationary point (−1, −1), which is in fact the global minimum of the problem.
To demonstrate the advantage of BDCA+ we compute the number of instances, out of one million random starting points uniformly distributed in [−1.5, 1.5] × [−1.5, 1.5], that each algorithm has converged to each of the four critical points.The results are summarized in Table 1.From Table 1, we observe that DCA converged to each of the four critical points with the same probability, while BDCA converged to the global minimum in 99.6% of the instances.The best results where obtained by BDCA+, which always converged to the global minimum (−1, −1).
We can rewrite the objective in (5) as a DC function (see [4,8,11]) with x j 2 ; where g and h satisfy Assumptions 1 to 3 for all ρ > 0 (in our tests, we took ρ =1 nk ).
Data Set and Experiments: Our data set is the same one considered in [3], which consists of the location of 4001 Spanish cities in the peninsula with more than 500 inhabitants 1 .In Figure 2 we compare the iterations generated by DCA and BDCA+ for finding a partition into 20 clusters from the same random starting point x 0 ∈ R 2×20 (marked with a black cross).We observe that DCA converges to a critical point which is far from being optimal, as there are three clusters without any cities assigned.On the other hand, although BDCA apparently converges to the same critical point, the DFO step allows BDCA+ to escape from points which are not d-stationary and reach a better solution.
To corroborate these results, we repeated the experiment for different number of clusters k ∈ {20, 40, 60, 80}.For each of these values, we run DCA and BDCA+ from 50 random starting points.The results are shown in Figure 3, where we can clearly observe that BDCA+ outperforms DCA, not only in terms of the objective value attained, but even in running time.Observe that it is not really fair to compare the running time of DCA and BDCA+, because DCA simply stops at a critical point without incorporating the time-consuming DFO step that guarantees d-stationarity.Nonetheless, the speedup obtained by the line search of BDCA allows BDCA+ to still converge faster than DCA in most of the instances.As expected, BDCA+ becomes slower as the size of the problem increases, due to the DFO step.Despite that, notice that for 80 clusters the best solution provided by DCA among the 50 instances is still worse than the worst solution obtained by BDCA+.That is, any of the runs of BDCA+ was able to obtain a better solution than 50 restarts of DCA.

Concluding remarks
We have proposed a combination between the Boosted DC Algorithm (BDCA) and a simple direct search Derivative-Free Optimization (DFO) technique for minimizing the difference of two convex functions, the first of which is assumed to be smooth.The BDCA is used for minimizing the objective function, while the DFO step permits to force the iteration to converge to d-stationary points (i.e. to points where there exists no descent direction), rather than just critical points.
The good behavior of the new algorithm, called BDCA+, has been demonstrated by numerical experiments in a clustering problem.The new scheme generates better solutions than the classical DCA in nearly all the instances tested.Moreover, this improvement in the quality of the solutions has not caused an important loss in the time spent by the algorithm.In fact, BDCA+ was faster than DCA in most of the cases, thanks to the large acceleration achieved by the line search boosting step of BDCA.For each of these values, both algorithms were run from 50 random starting points.We represent the objective value achieved in the limit point by each algorithm (left axis, in blue), as well as the ratio between the CPU time required by DCA with respect to the one needed by BDCA+ (right axis, orange crosses).Instances were sorted on the x-axis in descending order according to the gap between the objective values at the limit points found by the algorithms.

Fact 2 . 2 .
Let x k and y k be the sequences generated by Algorithm 1 and set d k := y k −x k , for all k ∈ N. Then the following statements hold:

5 stop and return y k ; 6 else 7 x k+1 = y k ; 8 end 9 k
← k + 1 and go to line 3; 10 end (c) there exists some δ k > 0 such that

Figure 2 :
Figure 2: Iterations and limit points generated by DCA and BDCA+ for grouping the Spanish cities in the peninsula into 20 clusters from the same random starting point.The DFO step in line 14 of Algorithm 3 was run 10 times (these steps are marked with a dashed line).

Figure 3 :
Figure3: Comparison between the DCA and the BDCA+ for classifying the Spanish cities in the peninsula into k clusters for k ∈ {20, 40, 60, 80}.For each of these values, both algorithms were run from 50 random starting points.We represent the objective value achieved in the limit point by each algorithm (left axis, in blue), as well as the ratio between the CPU time required by DCA with respect to the one needed by BDCA+ (right axis, orange crosses).Instances were sorted on the x-axis in descending order according to the gap between the objective values at the limit points found by the algorithms.
βλ k ; [7,nd 10 x k+1 ← y k + λ k d k ; 11 else 12 stop and return y k ; 13 end 14 k ← k + 1 and go to line 3; 15 end2.4PositiveSpanningSetsMostdirectional direct search methods are based on the use of positive spanning sets (see, e.g., [1, Section 5.6.3]and[7,Chapter 7]).Let us recall this concept here.Definition 2.1.We call positive span of a set of vectors {v 1 , v 2 , . . ., v r } ⊂ R m the convex cone generated by this set, i.e., {v ∈ R m

Table 1 :
For one million random starting points in [−1.5, 1.5] × [−1.5, 1.5], we count the sequences generated by DCA, BDCA and BDCA+ converging to each of the four d-stationary points