A clustering heuristic to improve a derivative-free algorithm for nonsmooth optimization

In this paper we propose a heuristic to improve the performance of CS-DFN, a recently proposed derivative-free method for nonsmooth optimization. The heuristic is based on a clustering-type technique to compute a search direction which relies on an estimate of the Clarke generalized gradient of the objective function. As the numerical experiments show, this direction is a good descent direction for the objective function. We report numerical results and a comparison with the original CS-DFN method on a set of well-known test problems to show the utility of the proposed improvement.


Introduction
We consider the following unconstrained minimization problem

    min_{x ∈ R^n} f(x),    (1)

where the objective function f (though nonsmooth) is assumed to be Lipschitz continuous and first-order information is unavailable or impractical to obtain. We require the following assumption.
Assumption 1 The function f(x) is coercive, i.e. all of its level sets are compact.
Needless to say, there are plenty of problems with the above features, especially in engineering applications. In the literature, many approaches have been proposed to tackle the nonsmooth problem (1) in the derivative-free framework. They can be roughly subdivided into two main classes: direct-type algorithms and model-based algorithms.
- Direct-type methods. The algorithms in this class rely on a suitable sampling of the objective function. They may occasionally use modeling techniques heuristically, but the convergence theory hinges on the sampling technique. In this class we cite the mesh adaptive direct search algorithm implemented in the software package NOMAD [2,10], the linesearch derivative-free algorithm CS-DFN proposed in [6] and the discrete gradient method [3].
- Model-based methods. This class comprises all those algorithms whose convergence is based on the strategy used to build the approximating models. Within this class we can cite the recent trust-region derivative-free method proposed in [11].
In the relatively recent paper [6], a method for the optimization of nonsmooth black-box problems, namely CS-DFN, has been proposed. CS-DFN can solve problems more general than problem (1) above, since it also handles nonlinear and bound constraints. It is based on a penalization approach: the nonlinear constraints are penalized by an exact penalization mechanism, whereas (possible) bound constraints on the variables are handled explicitly.
In this paper, we propose an improvement of CS-DFN obtained by incorporating into its main algorithmic scheme a clustering heuristic that computes efficient search directions. Starting from an approximation of the directional derivatives along a certain set of directions, we construct a polyhedral approximation of the subdifferential, which in turn is used to calculate a search direction in the steepest-descent fashion. Along such a direction we implement a linesearch procedure with extrapolation, just like the one adopted by CS-DFN to explore its directions.
To assess the potential of the proposed improvement, we carry out an experimental comparison of CS-DFN with and without the proposed heuristic. The results, in our opinion, clearly show the advantages of the improved method over the original one.
The paper is organized as follows. In Section 2 we extend the steepest-descent direction and a kind of Newton-type direction to the nonsmooth setting. In Section 3 we propose a heuristic to compute possibly efficient directions in a derivative-free context. In Section 4 we describe an improved version of the CS-DFN algorithm obtained by suitably employing the directions just described. In Section 5 we report the results of a numerical comparison between CS-DFN and the proposed improved version on a set of well-known test problems. Finally, Section 6 is devoted to some discussion and conclusions.

Definitions and notations
Definition 1 Given a point x ∈ R^n and a direction d ∈ R^n, the Clarke directional derivative of f at x along d is defined as [4]

    f°(x; d) = limsup_{y → x, t ↓ 0} ( f(y + t d) − f(y) ) / t.

Moreover, the Clarke generalized gradient (or subdifferential) of f at x is

    ∂f(x) = conv{ g ∈ R^n : g = lim_{k→∞} ∇f(x_k), x_k → x, x_k ∉ Ω_f },

Ω_f being the set (of zero measure) where f is not differentiable.
The following property holds:

    f°(x; d) = max_{g ∈ ∂f(x)} gᵀd.

In the following, we denote by e_i, i = 1, …, n, the i-th column of the canonical basis in R^n and by e a vector of all ones of appropriate dimension.

Descent type directions
In the context of nonsmooth optimization, efficient search directions can be computed by using the information provided by the subdifferential of the objective function.In the following subsections, we describe how such directions can be obtained.

Steepest-descent direction d_k^S
In this subsection we recall a classic approach [14] to compute, for nonsmooth functions, a generalization of the steepest-descent direction for continuously differentiable functions. Let us consider the vector which minimizes the following "first order-type" model of the objective function:

    min_{d ∈ R^n}  max_{g ∈ ∂f(x_k)} gᵀd + (1/2)‖d‖².    (3)

Note that, in the case of continuously differentiable functions, ∂f(x_k) = {∇f(x_k)}, so that the solution of (3) reduces to the steepest-descent direction −∇f(x_k). For nonsmooth functions, standard results [14] lead to the following proposition.

Proposition 1 Let d_k^S be the solution of Problem (3). Then i) the vector d_k^S is given by d_k^S = −g_k^S, where g_k^S is the solution of

    min_{g ∈ ∂f(x_k)} ‖g‖²;    (4)

iii) for any γ ∈ (0, 1) an ᾱ > 0 exists such that f(x_k + α d_k^S) ≤ f(x_k) − γ α ‖g_k^S‖² for all α ∈ (0, ᾱ].

The above direction d_k^S is a first-order direction which closely resembles the steepest-descent direction of the continuously differentiable case.

Newton-type direction d_k^N
In the nonsmooth case, obtaining a Newton-type direction is much more involved than in the differentiable case, where it suffices to premultiply the anti-gradient by the inverse Hessian of the objective function. In the nonsmooth case, instead of simply premultiplying direction g_k^S by some positive definite matrix, we resort to minimizing the following "second order-type" model:

    min_{d ∈ R^n}  max_{g ∈ ∂f(x_k)} gᵀd + (1/2) dᵀB_k d,    (5)

where B_k is a positive definite matrix. Let us call d_k^N the solution of problem (5). For problem (5) the following proposition can be proved.

Proposition 2 Let d_k^N be the solution of Problem (5). Then i) the vector d_k^N is given by d_k^N = −B_k⁻¹ g_k^N; ii) g_k^N is the unique solution of

    min_{g ∈ ∂f(x_k)} gᵀ B_k⁻¹ g;    (6)

iii) for any γ ∈ (0, 1) an ᾱ > 0 exists such that f(x_k + α d_k^N) ≤ f(x_k) − γ α g_k^Nᵀ B_k⁻¹ g_k^N for all α ∈ (0, ᾱ].
Proof. By repeating arguments similar to those of the proof of Theorem 5.2.8 in [14], and recalling Lemma 5.2.7 of [14], relations (7) and (8) imply that a vector g_k^N exists satisfying (9), which proves point ii). Then, (9) implies (11), which shows that the vector g_k^N is the unique solution of Problem (6). Finally, point iii) again follows from the definition of f̂ and (9).
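The Newton-type direction can be computed in the same spirit. The sketch below is our own illustration (names and solver are assumptions, not the paper's code): by the saddle-point characterization above, d_N = −B⁻¹ g_N with g_N the minimizer of gᵀB⁻¹g over the hull of the estimated subgradients, found here by Frank-Wolfe on the simplex.

```python
import numpy as np

def newton_type_direction(G, B, iters=2000, tol=1e-12):
    """Direction of the 'second order-type' model: d_N = -B^{-1} g_N,
    where g_N minimizes g^T B^{-1} g over conv{g_1, ..., g_m} (rows of G)
    and B is positive definite.  Frank-Wolfe with exact line search."""
    m = G.shape[0]
    M = G @ np.linalg.solve(B, G.T)     # Gram matrix in the B^{-1} metric
    lam = np.full(m, 1.0 / m)           # barycenter of the simplex
    for _ in range(iters):
        grad = 2.0 * (M @ lam)          # gradient of lam^T M lam
        i = int(np.argmin(grad))
        gap = lam @ grad - grad[i]      # Frank-Wolfe duality gap
        if gap <= tol:
            break
        d = -lam.copy()
        d[i] += 1.0                     # direction toward vertex e_i
        dMd = d @ M @ d
        if dMd <= 0.0:
            break
        t = float(np.clip(-(lam @ M @ d) / dMd, 0.0, 1.0))
        lam += t * d
    g_N = G.T @ lam
    return -np.linalg.solve(B, g_N)     # d_N = -B^{-1} g_N
```

With B = I this reduces to the steepest-descent-type direction of the previous subsection.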
A heuristic approach to define efficient directions

At the base of the proposed heuristic is the hypothesis that the nonsmoothness of the objective function is due to a finite-max structure. Such a hypothesis appears realistic, as a wide range of nonsmooth optimization problems coming from practical applications are of the min-max type. Drawing inspiration from the paper [13] (see also [1]), given points y_j ∈ R^n, j = 1, 2, …, p, sufficiently close to x, the (possibly) nonconvex and nonsmooth function f(x) is approximated by the following piecewise-quadratic model function

    f̂(x) = max_{j=1,…,p} { f(y_j) + g_jᵀ(x − y_j) + (1/2)(x − y_j)ᵀ H_j (x − y_j) },    (12)

where g_j ∈ ∂f(y_j) and H_j = H(y_j), j = 1, …, p. We remark that, while we assume that the model structure of f is a max of a finite number of functions, the number p of such functions is unknown and has to be estimated via a trial-and-error process. Furthermore, by assuming that f̂(x) ≈ f(x), the subdifferential of f at x can be approximated by a polyhedral set C(x). In the actual case, C(x) is the convex hull of a given number of generator vectors v_j, j = 1, …, p. We can try to estimate those generators by using the quantities computed by the algorithm.
More in particular, let x_k be the current iterate of the algorithm, and let d_i ∈ R^n and α_i > 0, i = 1, …, r, be the directions sampled by the algorithm along with their respective stepsizes, and define the difference quotients

    s_i = ( f(x_k + α_i d_i) − f(x_k) ) / α_i,  i = 1, …, r.    (13)

By using (12), for i = 1, …, r, each quotient s_i approximately matches v_jᵀ d_i for (at least) one of the generators v_j of C(x_k). It is then possible to compute estimates of the generators v_j, j = 1, …, p, by solving the problem

    min_{v_1,…,v_p}  Σ_{i=1}^{r}  min_{j=1,…,p} ( v_jᵀ d_i − s_i )².    (14)

The above problem is a hard, nonsmooth, nonconvex problem of the clustering type. It can however be put in DC (Difference of Convex) form as in [9]. Since it has to be solved many times during the proposed algorithm, in our implementation we prefer to resort to a greedy heuristic of the k-means type [7,12,16].
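A k-means-type heuristic for this clustering problem alternates an assignment step (each quotient s_i is assigned to the generator that currently fits it best) with an update step (each generator is refit by least squares on its assigned samples). The sketch below illustrates the idea under our own choices of initialization and naming; it is not the authors' exact implementation.

```python
import numpy as np

def estimate_generators(D, s, p, iters=50, seed=0):
    """Greedy k-means-type heuristic for

        min_{v_1..v_p} sum_i min_j (v_j^T d_i - s_i)^2

    D: (r, n) array of sampled directions d_i (rows);
    s: (r,) array of difference quotients s_i;
    p: number of generators to estimate.
    Both alternating steps are monotone, so the objective never increases.
    """
    rng = np.random.default_rng(seed)
    r, n = D.shape
    V = 0.1 * rng.standard_normal((p, n))       # random initial generators
    for _ in range(iters):
        # assignment: each sample i goes to its best-fitting generator j
        resid = (D @ V.T) - s[:, None]          # (r, p) residual matrix
        labels = np.argmin(resid ** 2, axis=1)
        # update: least-squares refit of each generator on its cluster
        for j in range(p):
            idx = labels == j
            if idx.any():
                V[j] = np.linalg.lstsq(D[idx], s[idx], rcond=None)[0]
    return V
```

Since each step is non-increasing in the objective of (14), the heuristic is cheap and safe to call at every iteration, at the price of possibly returning a local minimizer only.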
Then, we can compute an estimate of direction d_k^N by solving problem (4) (or (6)), where ∂f(x_k) is approximated by conv{v_1, …, v_p}. More precisely, we define the following algorithm that computes a search direction.

[Algorithms 1 and 2: clustering heuristic and computation of the search direction]
In the following we give an example of how the heuristic works.
Example 1 Consider the (convex) nonsmooth function maxl [13], defined as f(x) = max_{1≤i≤n} |x_i|. Take the point x with x_i = 1, i = 1, …, n, where f exhibits a kink and f(x) = 1. Observe that none of the 2n (signed) coordinate directions ±e_i is a descent direction at x (in fact f°(x; −e_i) = 0 and f°(x; e_i) = 1, i = 1, …, n). Calculation of the 2n ratios s_i as in (13) along the directions e_i and −e_i gives s_i = 1 and s_i = 0, respectively, for i = 1, …, n. It is easy to verify that, letting p = n in Algorithm 1, an optimal solution to problem (14) is v_j = e_j, j = 1, …, n. The resulting direction is then d = −(1/n) Σ_{j=1}^{n} e_j = −e/n, which is indeed a descent direction at x.
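The example is easy to check numerically. The snippet below (with n = 3, and assuming the definition f(x) = max_i |x_i| for maxl) computes the difference quotients (13) along ±e_i and verifies that the clustered direction −e/n decreases f while no signed coordinate direction does.

```python
import numpy as np

def f(x):
    # the maxl test function: f(x) = max_i |x_i| (assumed definition)
    return float(np.max(np.abs(x)))

n = 3
x = np.ones(n)              # the kink point of Example 1, f(x) = 1
alpha = 1e-3                # sampling stepsize for the quotients (13)

# difference quotients along +e_i and -e_i
s_plus = [(f(x + alpha * e) - f(x)) / alpha for e in np.eye(n)]   # ~ 1 each
s_minus = [(f(x - alpha * e) - f(x)) / alpha for e in np.eye(n)]  # 0 each

# the clustered direction d = -e/n is a descent direction at x,
# even though none of the +-e_i is
d = -np.ones(n) / n
```

Indeed f(x + 0.3 d) = 0.9 < 1 = f(x), while moving along any −e_i leaves the max unchanged.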

The improved CS-DFN algorithm
This section is devoted to the definition of the improved version of algorithm CS-DFN, which we call Fast-CS-DFN. The method is basically the CS-DFN algorithm introduced in [6], a derivative-free linesearch-type algorithm for the minimization of black-box (possibly) nonsmooth functions. It works by performing derivative-free linesearches along the coordinate directions and resorting to a further search direction when the stepsizes used to explore the coordinate directions become sufficiently small. The rationale behind this choice is that the coordinate directions might not be descent directions near a non-stationary point of nonsmoothness. In such situations, a richer set of directions must be used to (at least asymptotically) be able to improve the non-stationary point. The convergence analysis of CS-DFN carried out in [6] hinges on the use of asymptotically dense sequences of search directions, so that at non-stationary points a direction of descent is used for sufficiently large k.
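The linesearch-with-extrapolation idea can be sketched as follows. This is a minimal illustration: the sufficient-decrease test and the parameters gamma and delta are our own illustrative choices, not necessarily those used by CS-DFN.

```python
import numpy as np

def linesearch_extrapolation(f, x, d, alpha0, gamma=1e-6, delta=0.5):
    """Derivative-free linesearch with extrapolation: accept the initial
    step alpha0 if it yields sufficient decrease

        f(x + alpha d) <= f(x) - gamma * alpha**2,

    then keep expanding the step (alpha -> alpha/delta) while the test
    holds.  Returns 0.0 on failure (no acceptable step along d).
    Assumes f is bounded below along d, otherwise the loop may not stop."""
    fx = f(x)
    alpha = alpha0
    if f(x + alpha * d) > fx - gamma * alpha ** 2:
        return 0.0
    while True:
        trial = alpha / delta           # extrapolation (step expansion)
        if f(x + trial * d) <= fx - gamma * trial ** 2:
            alpha = trial
        else:
            return alpha

# usage: minimize f(z) = ||z||^2 from (1, 0) along d = (-1, 0)
step = linesearch_extrapolation(lambda z: float(z @ z),
                                np.array([1.0, 0.0]),
                                np.array([-1.0, 0.0]), 0.1)
```

Starting from alpha0 = 0.1 the step doubles while the decrease condition holds and the last accepted expansion is returned.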
The algorithm that we propose, namely Fast-CS-DFN, is a modification of CS-DFN. The relevant differences between the two methods are:
1. for the sake of simplicity, problem (1) is unconstrained; hence in Fast-CS-DFN no control to enforce feasibility with respect to bound constraints is needed;
2. after the deployment of the direction d_k, Fast-CS-DFN uses Algorithm 2 to compute a direction that tries to exploit the information gathered during the optimization process to heuristically improve the last produced point.
The Fast-CS-DFN algorithm is reported in Algorithm 3.

[Algorithm 3: the Fast-CS-DFN algorithm]

Some comments about Algorithm Fast-CS-DFN are in order.
1. Except for steps 14-18 and for the mechanism used to produce G_{k+1} starting from G_k, Fast-CS-DFN is exactly the CS-DFN method described in [6];
2. the new direction d_k^N is used when the stepsizes α_k^i and α̃_k^i, i = 1, …, n, are sufficiently small and after the deployment of the direction d_k;
3. the computation of the new direction d_k^N performed at step 15 hinges (a) on the matrix B_k and (b) on the set of couples G_k^{n+2}. (a) To build B_k, we maintain a set of points Y_k which is managed exactly as described in [6]; (b) as for the set G_k^{n+2}, it stores information on the consecutive failures encountered up to the current point, i.e. in the deployment of the coordinate directions and of the direction d_k. This set is emptied every time a non-null step is computed by the algorithm along any direction;
4. the asymptotic convergence properties of Fast-CS-DFN are analogous to those of CS-DFN. The theoretical analysis follows quite easily from the results proved for CS-DFN in [6], considering that the new iterate produced by the heuristic step is accepted only if it improves the last produced point.

Numerical results
The proposed Fast-CS-DFN algorithm has been implemented in Python 3.9 and compared with CS-DFN [6] (available through the DFL library).The comparison has been carried out on a set of 47 nonsmooth problems.In the following subsections we briefly describe the test problems collection, the metrics adopted in the comparison and, finally, the obtained results.

Test problems collection
In Table 1 a description of the test problems is reported. In particular, each table entry gives the problem name, the number n of variables and the reference where the problem definition can be found.

Metrics
To compare our derivative-free algorithms we resort to the use of the well-known performance and data profiles (proposed in [5] and [15], respectively).
In particular, let P be a set of problems and S a set of solvers used to tackle the problems in P. Let τ > 0 be a required precision level and denote by t_ps the performance index, that is, the number of function evaluations required by solver s ∈ S to solve problem p ∈ P. Problem p is claimed to be solved when a point x has been obtained such that the following criterion is satisfied:

    f(x) ≤ f_L + τ ( f(x_0) − f_L ),

where f(x_0) is the initial function value and f_L denotes the best function value found by any solver on problem p itself. Then, the performance ratio r_ps is

    r_ps = t_ps / min_{i ∈ S} { t_pi }.
Finally, the performance and data profiles of solver s are defined as

    ρ_s(α) = (1/|P|) |{ p ∈ P : r_ps ≤ α }|,    d_s(κ) = (1/|P|) |{ p ∈ P : t_ps ≤ κ (n_p + 1) }|,

where n_p is the number of variables of problem p. In particular, the performance profile ρ_s(α) tells us the fraction of problems that solver s solves with a number of function evaluations at most α times the number required by the best-performing solver on that problem. On the other hand, the data profile d_s(κ) indicates the fraction of problems solved by s with a number of function evaluations at most equal to κ(n_p + 1), that is, the budget needed to compute κ simplex gradients.
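Both profiles are straightforward to compute from the matrix of performance indices; the utility below is our own sketch (names and signature are assumptions), evaluating ρ_s and d_s on given grids of α and κ values.

```python
import numpy as np

def profiles(T, n_vars, alphas, kappas):
    """Performance and data profiles.

    T[p, s]  : function evaluations needed by solver s on problem p
               (np.inf if the precision test was never satisfied);
    n_vars[p]: dimension n_p of problem p;
    alphas   : grid of alpha values for the performance profile;
    kappas   : grid of kappa values for the data profile.
    Returns (rho, d) with rho[s, a] and d[s, k]."""
    T = np.asarray(T, dtype=float)
    n_vars = np.asarray(n_vars)
    best = T.min(axis=1)                  # best solver on each problem
    R = T / best[:, None]                 # performance ratios r_ps
    S = T.shape[1]
    rho = np.array([[np.mean(R[:, s] <= a) for a in alphas]
                    for s in range(S)])
    d = np.array([[np.mean(T[:, s] <= k * (n_vars + 1)) for k in kappas]
                  for s in range(S)])
    return rho, d
```

Note that ρ_s(1) is the fraction of problems on which solver s is the (or a joint) best performer, which is the efficiency measure discussed in the results below.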
When using performance and data profiles for benchmarking derivative-free algorithms, it is quite usual to consider (at least) three different levels of precision (low, medium and high), corresponding to τ = 10^{-1}, 10^{-3}, 10^{-5}, respectively.

Results
Figure 1 reports the results of the comparison, by means of performance and data profiles, between Fast-CS-DFN and CS-DFN.
As we can see, the new algorithm Fast-CS-DFN is always more robust, namely it is able to solve the largest portion of problems within any given amount of computational effort. More in particular, from the performance profiles we can also say that the new method is invariably more efficient than the original one, since its profile curves always have higher values for α = 1.

Conclusions

In this paper we proposed a strategy to compute (possibly) good descent directions that can be heuristically exploited within derivative-free algorithms for nonsmooth optimization. In fact, we showed that the use of the proposed direction within the CS-DFN algorithm [6] improves the performance of the method. Numerical results on a set of nonsmooth optimization problems from the literature show the efficiency of the proposed direction computation strategy.
As a final remark, we point out that the proposed strategy could be embedded in virtually any optimization algorithm as a heuristic to try and produce improving points.