A parameter-free unconstrained reformulation for nonsmooth problems with convex constraints

In this paper we propose to rewrite a nonsmooth problem subject to convex constraints as an unconstrained problem. We show that this novel formulation shares the same global and local minima with the original constrained problem. Moreover, the reformulation can be solved with standard nonsmooth optimization methods, provided that projections onto the feasible set can be computed. Numerical evidence shows that the proposed formulation compares favorably against state-of-the-art approaches. Code can be found at https://github.com/jth3galv/dfppm.


Introduction
In this paper, we consider the optimization of a nonsmooth function $f : \mathbb{R}^n \to \mathbb{R}$ over a closed convex set $X \subseteq \mathbb{R}^n$, namely
$$\min_{x \in X} f(x). \tag{1}$$
We assume that f is locally Lipschitz continuous and that first order information is unavailable or impractical to obtain.
The aim of the optimization, for nonsmooth problems, is to find Clarke-stationary points [6,9].
The literature on derivative-free methods for smooth constrained optimization (i.e. when f is differentiable even though derivatives are not available) is wide. Several approaches, based on the pattern search methods dating back to [13], have been developed for bound and linearly constrained problems in [16] and [17] and for more general types of constraints in [18]. Other works stem from the research presented in [11,21], whose line-search approach has been extended to box and linearly constrained problems in [20] and [22], while more general constraints are covered in [19]. An interesting work in the field of global optimization is [8]. We refer the reader interested in derivative-based methods, which are not considered in this work, to [7].
The nonsmooth case has seen less development. One of the two major approaches that have emerged is represented by Mesh adaptive direct search (MADS) that dates back to [3,4] and that has been later modified in [1,5]. This method combines a dense search with an extreme barrier to deal with the constraints. A second main approach, proposed in [10], is instead based on an exact penalty function.
In this case, the feasible set is expressed by a possibly nonsmooth set of inequalities $g(x) \le 0$, with $g : \mathbb{R}^n \to \mathbb{R}^m$, and the original problem is replaced by the penalized version, for a given $\varepsilon > 0$,
$$\min_{x \in \mathbb{R}^n} f(x) + \frac{1}{\varepsilon} \sum_{i=1}^{m} \max\{0, g_i(x)\}. \tag{2}$$
It can be shown (see Proposition 3.6 in [10]) that, under suitable assumptions, a value $\varepsilon^*$ exists such that, for every $\varepsilon \in (0, \varepsilon^*]$, every Clarke-stationary point of (2) is also a stationary point of the original problem. Thus, any algorithm for nonsmooth unconstrained optimization can be applied. In [10], however, a line-search-based algorithm (CS-DFN) that employs a dense set of directions alongside the 2n coordinate directions is proposed. This reformulation combined with CS-DFN is shown to compare favorably against state-of-the-art MADS-based software such as NOMAD [15].
The value of $\varepsilon^*$ is, however, in general unknown and choosing a proper value of $\varepsilon$ can be a difficult task. A poorly chosen value can severely harm the performance of the algorithm. For example, if f is unbounded below outside the feasible set, setting too high a value of $\varepsilon$ can drive the algorithm towards minus infinity. On the other hand, too small a value ($\varepsilon < \varepsilon^*$), even if theoretical convergence is assured, can yield extremely poor performance because the algorithm may be forced to take very small steps near the boundary of the feasible set.
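The sensitivity to the penalty parameter can be seen on a one-dimensional toy problem. The following is a minimal Python sketch, not the implementation of [10]: the helper `exact_penalty` and the toy data (minimize −x subject to x ≤ 1, for which the threshold value of the parameter is 1) are assumptions introduced here purely for illustration.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def exact_penalty(
    f: Callable[[Vector], float],
    constraints: Sequence[Callable[[Vector], float]],
    eps: float,
) -> Callable[[Vector], float]:
    """Build x -> f(x) + (1/eps) * sum_i max(0, g_i(x)) for constraints g_i(x) <= 0."""

    def penalized(x: Vector) -> float:
        violation = sum(max(0.0, g(x)) for g in constraints)
        return f(x) + violation / eps

    return penalized


# Hypothetical toy problem: minimize -x subject to x <= 1 (solution x = 1).
def f(x: Vector) -> float:
    return -x[0]


def g(x: Vector) -> float:
    return x[0] - 1.0


# eps = 2: the penalty slope 1/2 is smaller than the slope of f, so the
# penalized function keeps decreasing outside the feasible set.
too_large = exact_penalty(f, [g], eps=2.0)
# eps = 0.5: the penalty slope 2 dominates, so x = 1 is the minimizer.
small_enough = exact_penalty(f, [g], eps=0.5)

for x in ([1.0], [10.0], [100.0]):
    print(x[0], too_large(x), small_enough(x))
```

With the larger parameter the printed values keep decreasing as x moves away from the feasible set, while with the smaller one they increase, which is the behavior the paragraph above describes.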
In this work we propose a novel way of treating convex constraints that is not based on penalty functions. We assume that the feasible set X is a closed convex set and that a projection operator onto the feasible set is available. We do not require X to have an analytical expression nor make any other regularity assumptions. However, we make the assumption that the computational effort needed to compute the projection is negligible compared to the evaluation of the objective function. Indeed, the only computational cost we consider is the number of evaluations of the objective function.
The paper is organized as follows. In Sect. 2 we recall some necessary definitions and known results before introducing the proposed reformulation in Sect. 3. Then, in Sect. 4, we prove the equivalence between the novel formulation and the original problem. Some numerical results are given in Sect. 5. In particular, we propose a comparison between the exact penalty approach of [10] and the proposed reformulation. We make use of the CS-DFN algorithm, used in [10], for both formulations to make a fair comparison. Finally, we give some concluding remarks in Sect. 6.

Preliminary background
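We recall that optimality conditions for nonsmooth problems can be given in terms of the Clarke generalized directional derivative [9]. In particular, in the unconstrained case, a point $x^*$ is Clarke-stationary if
$$f^\circ(x^*; d) \ge 0 \quad \forall d \in \mathbb{R}^n,$$
where
$$f^\circ(x; d) = \limsup_{y \to x,\ t \downarrow 0} \frac{f(y + t d) - f(y)}{t}$$
is the Clarke generalized directional derivative.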
We follow [6] for the treatment of Clarke stationarity for constrained problems. First, we define the cone of the hyper-tangent directions.
Definition 2 (Hyper-Tangent Cone) A vector $d \in \mathbb{R}^n$ is said to be a hyper-tangent vector to the set X at $x \in X$ if there exists $\delta > 0$ such that
$$y + t w \in X \quad \forall y \in X \cap B(x, \delta),\ w \in B(d, \delta),\ 0 < t < \delta.$$
The set of all hyper-tangent vectors is called the hyper-tangent cone to X at x and is denoted $T^H_X(x)$. For a more detailed treatment we refer the reader to [6]. Figure 6.5 of [6] offers a graphical illustration of the hyper-tangent cone.
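Then we can give a definition of Clarke-stationary points for the constrained problem, as expressed by the following.
Definition 3 (Clarke stationarity) A point $x^* \in X$ is Clarke-stationary for problem (1) if $f^\circ(x^*; d) \ge 0$ for every $d \in T^H_X(x^*)$.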
We will make use of the following result, which relates the Clarke derivative to the classical directional derivative in the case of convex functions. For the proof we refer the reader to Theorem 3.42 in [14].
Theorem 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex functional which is Lipschitz continuous at some $x \in X$. Then the Clarke derivative of f at x coincides with the directional derivative of f at x, that is
$$f^\circ(x; d) = f'(x; d) \quad \forall d \in \mathbb{R}^n.$$

A novel formulation
Broadly speaking, all the approaches proposed in the literature to deal with general constraints try to steer the search towards the feasible set by adding to the objective function (possibly in a sequential manner) some kind of penalty $\phi$ which, in its most general form, can be described by
$$\phi(x) = 0 \ \text{if } x \in X, \qquad \phi(x) > 0 \ \text{otherwise}.$$
Such a function can be a smooth quadratic penalty, an exact nonsmooth penalty or a hard barrier that takes the value $+\infty$ outside the feasible set. The problem is thus rewritten as
$$\min_{x \in \mathbb{R}^n} f(x) + \mu\,\phi(x),$$
where $\mu > 0$ is a parameter that must be set.
Consider a local minimum $x^*$ of the original problem. We have that, for some neighborhood $B(x^*, \rho)$,
$$f(x^*) \le f(x) \quad \forall x \in B(x^*, \rho) \cap X.$$
The strategy of penalty-based approaches is to make the penalty large enough (either by making $\mu$ large or by using a hard barrier) so that
$$f(x^*) \le f(x) + \mu\,\phi(x) \quad \forall x \in B(x^*, \rho)$$
and, hence, $x^*$ is also a local minimum of the penalized problem.
The idea behind the proposed reformulation is, instead, to avoid penalties by "assigning" to a point x outside the feasible set the value of the objective function computed at its projection $\pi(x)$. In this way, we never evaluate f outside the feasible set and we do not need to "correct" f with a penalty for points outside the feasible set.
Let $\pi$ be the projection operator onto X, defined as
$$\pi(x) = \arg\min_{y \in X} \|x - y\|.$$
Notice that, since X is closed and convex, the projection problem has a unique solution. A proof of the uniqueness of the projection for convex sets, alongside other properties of the projection, can be found in Proposition 2.1.3 of [7]. One property that will be extensively used in the following is the nonexpansiveness of the projection, i.e. $\|\pi(x) - \pi(y)\| \le \|x - y\|$ for all $x, y \in \mathbb{R}^n$. Dealing with nonconvex sets would require a more complex treatment since the projection may not be unique. We leave the study of a possible extension to future work.
We can thus define the problem
$$\min_{x \in \mathbb{R}^n} f(\pi(x)),$$
where each point outside the feasible set assumes the value of its projection. In this formulation every point outside the feasible set takes the value of f at some point of X, so it cannot attain a value lower than the infimum of f over X. However, all the points that share the same projection (consider a ray perpendicular to the boundary of the feasible set) share the same function value. To overcome this issue it is sufficient to add to the previous formulation a term that penalizes the distance of a point from its projection. We thus propose to replace the original problem by
$$\min_{x \in \mathbb{R}^n} \tilde f(x) = f(\pi(x)) + d_X(x), \tag{3}$$
where $d_X(x) = \|x - \pi(x)\|$ is the distance from x to the feasible set X. We note also that, since the projection operator is continuous, $\tilde f$ is continuous. Moreover, if f is bounded from below on the feasible set X then $\tilde f$ is bounded from below on $\mathbb{R}^n$. On the contrary, in the penalty approach $f(x) + \frac{1}{\varepsilon} \sum_{i=1}^{m} \max\{0, g_i(x)\}$ can be unbounded. Consider, as a simple example, the problem HS224 from the test suite [26].
The level curves of the original objective function f and of the modified function $\tilde f$ are shown in Fig. 1.
Notice how the solution of the problem, $x^* = (4, 4)$, becomes an unconstrained global minimum in the proposed formulation.
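The construction of the modified function can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the code released with the paper: the names `project_box` and `make_projected_objective`, the box-shaped feasible set and the linear toy objective are all hypothetical choices made here because the projection onto a box is available in closed form.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def project_box(x: Vector, lower: Vector, upper: Vector) -> List[float]:
    """Euclidean projection onto the box {y : lower <= y <= upper}."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def make_projected_objective(
    f: Callable[[Vector], float],
    project: Callable[[Vector], List[float]],
) -> Callable[[Vector], float]:
    """Return x -> f(project(x)) + ||x - project(x)||."""

    def f_tilde(x: Vector) -> float:
        p = project(x)
        # f is only ever evaluated at the feasible point p.
        return f(p) + math.dist(x, p)

    return f_tilde


# Hypothetical toy problem: minimize -x0 - x1 over the unit box [0, 1]^2,
# whose constrained minimizer is (1, 1).
def f(x: Vector) -> float:
    return -x[0] - x[1]


f_tilde = make_projected_objective(f, lambda x: project_box(x, [0.0, 0.0], [1.0, 1.0]))

print(f_tilde([1.0, 1.0]))  # value at the constrained minimizer
print(f_tilde([3.0, 1.0]))  # infeasible point: f at its projection plus the distance
print(f_tilde([0.5, 0.5]))  # feasible point: coincides with f
```

On feasible points the wrapper returns f unchanged, while the infeasible point (3, 1), which f alone would rate better than (1, 1), receives a strictly larger value.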

Equivalence of the formulations
In this section, we prove the equivalence between the original constrained problem (1) and the proposed formulation (3) in terms of both local/global minima and stationary points. We also show that by modifying the objective function we do not lose Lipschitz continuity, so that if f is (locally) Lipschitz then $\tilde f$ is (locally) Lipschitz too. We start with the latter.

Lemma 1 Let f be locally Lipschitz continuous. Then the modified function $\tilde f(x) = f(\pi(x)) + d_X(x)$ is locally Lipschitz continuous.

Proof Let $x_0 \in \mathbb{R}^n$. Since f is locally Lipschitz around $\pi(x_0)$ there exist $L_0$ and $\rho_0$ such that
$$|f(\pi(x)) - f(\pi(y))| \le L_0 \|\pi(x) - \pi(y)\| \le L_0 \|x - y\| \quad \forall x, y \in B(x_0, \rho_0),$$
where we have used the local Lipschitz continuity of f and the nonexpansiveness of the projection operator. Now, by the definition of the projection and the triangle inequality we have
$$d_X(x) - d_X(y) = \|x - \pi(x)\| - \|y - \pi(y)\| \le \|x - \pi(y)\| - \|y - \pi(y)\| \le \|x - y\|.$$
The same reasoning applies to the opposite sign, so that $|\tilde f(x) - \tilde f(y)| \le (L_0 + 1)\|x - y\|$ for all $x, y \in B(x_0, \rho_0)$. ◻

We now consider the relationship between the global and local minima of the two formulations. We first prove, in Propositions 1 and 2, that each global (local) minimum of the original problem is also a global (local) minimum of the proposed formulation.

Proposition 1 Every global minimum of problem (1) is also a global minimum for problem (3).

Proof Let $x^* \in X$ be a global minimum for problem (1). Suppose by contradiction that there exists $\bar x \in \mathbb{R}^n$ such that $\tilde f(\bar x) < \tilde f(x^*)$. Then
$$f(\pi(\bar x)) \le f(\pi(\bar x)) + d_X(\bar x) = \tilde f(\bar x) < \tilde f(x^*) = f(x^*),$$
which is a contradiction since $\pi(\bar x) \in X$. ◻

Proposition 2 Every local minimum of problem (1) is also a local minimum for problem (3).

Proof Let $x^* \in X$ be a local minimum for problem (1). Then there exists a ball $B(x^*, \rho)$ with $\rho > 0$ such that
$$f(x^*) \le f(x) \quad \forall x \in B(x^*, \rho) \cap X.$$
Let $\bar x \in B(x^*, \rho)$ and suppose by contradiction that $\tilde f(\bar x) < \tilde f(x^*)$. Thus we have
$$f(\pi(\bar x)) + d_X(\bar x) < f(x^*).$$
Let $y = \pi(\bar x) \in X$. We have, by the properties of the projection operator, that
$$\|y - x^*\| = \|\pi(\bar x) - \pi(x^*)\| \le \|\bar x - x^*\|,$$
so that $y \in B(x^*, \rho) \cap X$. Moreover, it holds that
$$f(y) \le f(y) + d_X(\bar x) < f(x^*),$$
which is a contradiction. ◻

Furthermore, since f and $\tilde f$ take the same values on X, any global (local) minimum of the modified problem which belongs to the feasible set is also a global (local) minimum of the original problem.

Proposition 3 Every global (local) minimum $\bar x \in X$ for problem (3) is also a global (local) minimum of problem (1).

We now show, in Lemma 2, that no minimum point exists outside the feasible region X, so that we have a perfect equivalence between global and local minima in the two formulations, as remarked in Corollary 1.

Lemma 2 Suppose $\bar x \in \mathbb{R}^n \setminus X$. Then $\bar x$ is not a global or local minimum for problem (3). In particular, $d = \pi(\bar x) - \bar x$ is a descent direction for $\tilde f$ at $\bar x$.
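The claim of Lemma 2 can be checked numerically on a small example. The sketch below is only an illustration under assumptions made here: the feasible set is taken to be the Euclidean unit ball (whose projection has a closed form), the objective is a hypothetical nonsmooth toy function, and the helper names are not taken from the paper's code. It verifies that moving from an infeasible point towards its projection by a fraction t of the way decreases the modified function by exactly t times the distance from the feasible set.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def project_unit_ball(x: Vector) -> List[float]:
    """Euclidean projection onto the closed unit ball."""
    norm = math.hypot(*x)
    return list(x) if norm <= 1.0 else [xi / norm for xi in x]


def f(x: Vector) -> float:
    """Hypothetical nonsmooth toy objective."""
    return abs(x[0]) + abs(x[1])


def f_tilde(x: Vector) -> float:
    """Modified function: f at the projection plus the distance from the set."""
    p = project_unit_ball(x)
    return f(p) + math.dist(x, p)


x_bar = [3.0, 4.0]                          # infeasible point
x_hat = project_unit_ball(x_bar)            # its projection
d = [h - b for h, b in zip(x_hat, x_bar)]   # direction towards the projection
norm_d = math.hypot(*d)


def decrease(t: float) -> float:
    """Reduction of f_tilde obtained by moving from x_bar to x_bar + t * d."""
    moved = [b + t * di for b, di in zip(x_bar, d)]
    return f_tilde(x_bar) - f_tilde(moved)


for t in (0.25, 0.5, 1.0):
    # The two printed quantities should agree up to rounding errors.
    print(t, decrease(t), t * norm_d)
```

The reduction does not depend on the objective f at all, because every point on the segment shares the same projection.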

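Proof
We will prove the thesis by showing that there exists a descent direction at $\bar x$ and hence $\bar x$ cannot be a minimum.
Consider the projection of $\bar x$ onto the feasible set, $\hat x = \pi(\bar x)$. Let $d = \hat x - \bar x$ and consider a point $\bar x + t d$ with $t \in (0, 1]$. For every $y \in X$ we have
$$\|\bar x + t d - \hat x\| = (1 - t)\|\bar x - \hat x\| = \|\bar x - \hat x\| - t\|d\| \le \|\bar x - y\| - t\|d\| \le \|\bar x + t d - y\|, \tag{4}$$
where the first inequality holds because $\hat x = \pi(\bar x)$ and the second follows from the triangle inequality. Since the projection has a unique solution, from (4) we conclude that $\hat x = \pi(\bar x) = \pi(\bar x + t d)$ for all $t \in [0, 1]$. Thus we have that
$$\tilde f(\bar x + t d) = f(\hat x) + (1 - t)\|\bar x - \hat x\| < f(\hat x) + \|\bar x - \hat x\| = \tilde f(\bar x) \quad \forall t \in (0, 1],$$
so that d is a descent direction at $\bar x$. ◻

By putting together Propositions 1, 2, 3 and Lemma 2 we establish the perfect equivalence of the two formulations in terms of local and global minima, as expressed by the following.

Corollary 1 A point $x^*$ is a global (local) minimum of problem (1) if and only if it is a global (local) minimum of problem (3).

To conclude the discussion we investigate the relationship between stationary points. In particular, it is interesting to check whether Clarke-stationary points of the modified problem are also Clarke-stationary for the original problem.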
We are able to prove the latter under the following assumption.

Assumption 1 We assume that X is such that for every hyper-tangent direction $d \in T^H_X(x)$ and every feasible point $x \in X$.
We start by showing that any Clarke-stationary point of the modified problem must belong to the feasible set.

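Lemma 3 Let $\bar x$ be a Clarke-stationary point of problem (3). Then $\bar x \in X$.
Proof Since $\bar x$ is Clarke-stationary we have, by definition, that
$$\tilde f^\circ(\bar x; d) \ge 0 \quad \forall d \in \mathbb{R}^n. \tag{5}$$
In particular, the latter must hold also for the direction $d = \pi(\bar x) - \bar x$.
Now, let us suppose by contradiction that $\bar x \notin X$. Letting $d_y = \pi(y) - y$ we can write, for y close to $\bar x$ and $t \in (0, 1]$,
$$\frac{\tilde f(y + t d) - \tilde f(y)}{t} = \frac{\tilde f(y + t d_y) - \tilde f(y)}{t} + \frac{\tilde f(y + t d) - \tilde f(y + t d_y)}{t} \le -\|d_y\| + L\|d - d_y\|,$$
where we have used that $d_y$ is a descent direction for $\tilde f$ at y (Lemma 2) for the first term and that $\tilde f$ is Lipschitz continuous around $\bar x$ with some constant L (Lemma 1) for the second one. Now, for every $y \to \bar x$ we have that $d_y \to d$, by the continuity of the projection. Thus we have that
$$\tilde f^\circ(\bar x; d) \le -\|d\| < 0,$$
which contradicts (5). ◻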

Proposition 4 Let $\bar x$ be a Clarke-stationary point of problem (3). Then, under Assumption 1, $\bar x$ is also a Clarke-stationary point of problem (1).
Proof Let $\bar x$ be a Clarke-stationary point for problem (3). Since $d_X$ is a convex function we have, by Theorem 1, that $d_X^\circ = d_X'$. Thus Hence Thus, because of Assumption 1, we conclude that ◻

Numerical experiments
In the following, we report some numerical experiments to investigate the advantages of the proposed formulation, comparing it against the exact penalty approach of [10].
To make a fair comparison we used the same algorithm to solve both formulations.
In particular, we used the CS-DFN algorithm proposed in [10]. In the following, we call solver an algorithm applied to a particular formulation of a given problem. So we compare the exact penalty solver, i.e. the CS-DFN algorithm applied to the penalized formulation, and the Projection-based Penalty Method (PPM) solver, i.e. the CS-DFN algorithm applied to the proposed formulation.
Test problems We set up a benchmark composed of 28 problems belonging to different classes: general nonlinear functions subject to (1) non-degenerate linear constraints from the collection [12,26]; (2) degenerate linear constraints from [2]; (3) general convex constraints, again from [12,26]; and minmax programs under linear constraints from [23]. The problems are listed in Table 1.
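Performance metric To compare the results we employ data profiles, which were proposed in [24] for benchmarking derivative-free algorithms. They take into account the number of function evaluations, scaled by the problem dimension, that a solver needs to satisfy the convergence criterion
$$f(x_0) - f(x) \ge (1 - \tau)\,(f(x_0) - f^*), \tag{6}$$
where $x_0$ is the starting point, $f^*$ is the reference optimal value and $\tau > 0$ is the required precision. Given a set of test problems P, the data profile of a solver is
$$d(\alpha) = \frac{1}{|P|}\,\left|\left\{p \in P : \frac{nf(\tau)}{n_p + 1} \le \alpha\right\}\right|,$$
where $n_p$ is the number of variables of problem p and $nf(\tau)$ is the number of function evaluations needed on problem p to satisfy the convergence criterion (6). Data profiles are extracted for different values of $\tau$ to compare the solvers against different balances of speed versus accuracy.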
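As an illustration of how a data profile in the sense of [24] is computed, the following Python sketch evaluates the fraction of problems solved within a budget of alpha simplex gradients, i.e. alpha times (n_p + 1) function evaluations. The function name `data_profile`, the use of `None` to mark a problem on which the convergence test was never satisfied, and the numbers in the example are assumptions made here and are not taken from the paper's experiments.

```python
from typing import List, Optional, Sequence


def data_profile(
    evals: Sequence[Optional[int]],
    dims: Sequence[int],
    alpha: float,
) -> float:
    """Fraction of problems solved within alpha * (n_p + 1) function evaluations.

    evals[p] is the number of evaluations a solver needed on problem p to pass
    the convergence test, or None if the test was never passed.
    dims[p] is the number of variables n_p of problem p.
    """
    if len(evals) != len(dims):
        raise ValueError("evals and dims must have the same length")
    solved = sum(
        1
        for nf, n in zip(evals, dims)
        if nf is not None and nf / (n + 1) <= alpha
    )
    return solved / len(dims)


# Made-up data for two hypothetical solvers on three problems.
dims: List[int] = [4, 9, 2]
solver_a: List[Optional[int]] = [10, None, 30]
solver_b: List[Optional[int]] = [5, 40, None]

for alpha in (3, 10):
    print(alpha, data_profile(solver_a, dims, alpha), data_profile(solver_b, dims, alpha))
```

Plotting the returned fraction against alpha for each solver gives one curve per solver; repeating the procedure for several precisions yields one panel per precision.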
In the constrained case, however, the proposed scheme is not readily applicable. We propose to modify condition (6) so that it also accounts for the constraint violation, by introducing a new parameter in (0, 1) which balances function value and constraint violation.
Note that, by violating the constraints, the function values can be lower than $f^*$. Naturally, by choosing a high value for this parameter this situation can be arbitrarily penalized as long as f does not go to $-\infty$. These cases must be removed before computing the profiles.
We extract different data profiles for different values of both the precision and the balance parameter. In particular, we extract the curves for precisions $\tau = 10^{-k}$ with $k \in \{1, 3, 5, 7\}$ and for values of the balance parameter in $\{0.9, 0.99\}$.
Solvers details As already mentioned, for both solvers we employ the CS-DFN algorithm proposed in [10]. We report the pseudo-code of the method in Algorithm 1.
For the exact penalty solver we employ the adaptive strategy for tuning the penalty parameter $\varepsilon$ which is proposed in [10] and is deemed to be a better choice.
The implementation of the algorithms, alongside the code needed to reproduce all the following experiments, is available as Python code at https://github.com/jth3galv/dfppm.

A first comparison
We start by comparing the PPM solver against the exact penalty solver. We let both solvers run for up to $10^4$ function evaluations and then we extract the data profiles, which are reported in Fig. 2. From the plots we can see that the PPM solver generally enjoys better performance both in terms of speed and robustness (number of problems eventually solved). We note, however, that neither solver manages to solve more than 65% of the test problems within the budget of function evaluations when a relatively high precision ($\tau = 10^{-7}$) is required.

A parametrization
In this section, we introduce a scale factor $\lambda > 0$ that controls how much to penalize points outside the feasible region. Namely, we modify our formulation as
$$\min_{x \in \mathbb{R}^n} f(\pi(x)) + \lambda\, d_X(x).$$
To understand the effect of $\lambda$ we start with a qualitative analysis. In Fig. 3 we show the iterates of the algorithm when run on the same problem with different values of $\lambda$. From Fig. 3 we can see that the iterates for greater values of $\lambda$ stay closer to the feasible set, while for small values of $\lambda$ the algorithm is allowed to move far from it.
To understand whether staying closer to the feasible set has a good or bad effect on the overall optimization process we measure the performance of the solver for different values of $\lambda$. Namely, we try $\lambda \in \{0.1, 1, 10, 100\}$. We use the same setup of the previous experiments. In Fig. 4 we give the data profiles for the different solvers (we include, for later reference, another configuration, with initial value $\lambda_0 = 10$ and a second parameter equal to 2, which is explained later in the manuscript).
From Fig. 4 we can see that choosing a high value of $\lambda$ yields good performance when low precision is required, although it quickly degrades as higher precision is required. For instance, for precision $\tau = 10^{-7}$ and balance parameter 0.99, the algorithm with $\lambda = 100$ manages to satisfy the convergence criterion only in roughly 40% of the test problems. The opposite is true for small values of $\lambda$: the algorithm is generally slower but can achieve very good solutions if a greater number of function evaluations is allowed.
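The scaled formulation can be tried out with any derivative-free method. The sketch below is a hedged illustration only: a plain compass (coordinate) search stands in for CS-DFN, which is not reproduced here, and the toy objective, the box-shaped feasible set and all function names are assumptions introduced for this example. On this toy problem the search reaches the same constrained minimizer (1, 1) for every value of the scale factor that is tried.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def project_box(x: Vector, lower: Vector, upper: Vector) -> List[float]:
    """Euclidean projection onto the box {y : lower <= y <= upper}."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def scaled_objective(
    f: Callable[[Vector], float],
    project: Callable[[Vector], List[float]],
    lam: float,
) -> Callable[[Vector], float]:
    """Return x -> f(project(x)) + lam * ||x - project(x)||."""

    def f_lam(x: Vector) -> float:
        p = project(x)
        return f(p) + lam * math.dist(x, p)

    return f_lam


def compass_search(
    fun: Callable[[Vector], float],
    x0: Vector,
    step: float = 1.0,
    tol: float = 1e-6,
    max_evals: int = 10_000,
) -> Tuple[List[float], float, int]:
    """Plain compass search (a stand-in here, NOT the CS-DFN algorithm)."""
    x = list(x0)
    fx = fun(x)
    evals = 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                ft = fun(trial)
                evals += 1
                if ft < fx:
                    # Accept the first improving move along coordinate i.
                    x, fx, improved = trial, ft, True
                    break
        if not improved:
            step *= 0.5  # no coordinate move helped: shrink the step
    return x, fx, evals


# Hypothetical toy problem: minimize |x0 - 2| + |x1 - 2| over the box [0, 1]^2.
def f(x: Vector) -> float:
    return abs(x[0] - 2.0) + abs(x[1] - 2.0)


def project(x: Vector) -> List[float]:
    return project_box(x, [0.0, 0.0], [1.0, 1.0])


for lam in (0.1, 1.0, 10.0):
    x, fx, evals = compass_search(scaled_objective(f, project, lam), [3.0, -2.0])
    print(lam, x, fx, evals)
```

A separable objective is used on purpose, since a coordinate search is not guaranteed to handle general nonsmooth functions; this is one reason why methods such as CS-DFN add a dense set of directions.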
It is thus natural to ask whether employing an adaptive strategy for $\lambda$ may be advantageous. For example, one could start with $\lambda$ set to a large number and gradually decrease it to reach accurate solutions.
Coming up with a good schedule for $\lambda$ that works well for all problems can be a hard task. However, the CS-DFN algorithm offers a good way to understand in what regime the algorithm is working by looking at the length of the steps that the algorithm takes at each iteration. We thus propose to set the value of $\lambda$ as a function of the step length at iteration k. In this way, at the beginning of the algorithm we can start with a large value $\lambda_0$ and then let it decrease as the algorithm takes smaller steps. We found, by manual tuning, that good performance can be obtained by setting $\lambda_0 = 10$ and the second parameter of the schedule to 2, although other configurations perform similarly well. As we can see, again in Fig. 4, this configuration performs almost equally well when low or high precision is required. Moreover, notice that when high precision is required we go from less than 70% of solved problems to more than 90%.

Final comparison
To end the discussion we propose a final comparison of the PPM solver, in its parameterized version equipped with the adaptive strategy for tuning the scale factor, against the exact penalty approach. The results are reported in Fig. 5.
We can see that the PPM solver enjoys better performance for every threshold of accuracy, although the exact penalty can be faster for some problems when a relatively low precision is required. We note, furthermore, that the exact penalty fails to reach accurate solutions for a large portion of the test problems whereas, as already noticed, the PPM solver manages to solve more than 90% of the problems even when high precision is required. We also report, for completeness, in Table 2 the distance of the objective function from the optimum value and the constraint violation after the total budget of function evaluations has been used.

Conclusion
In this work we proposed to rewrite a nonsmooth optimization problem subject to convex constraints as an unconstrained parameter-free problem. This formulation is proven to be equivalent to the original problem in terms of global and local minima. Furthermore, we were able to prove, under suitable assumptions, that any Clarke-stationary point of the proposed formulation is also a Clarke-stationary point of the original problem. The formulation can be solved by any algorithm for nonsmooth optimization. We compared the proposed formulation against a state-of-the-art approach for constrained nonsmooth optimization, namely the exact penalty method. To solve the proposed formulation we used the same solver that is shown to deliver state-of-the-art performance on the penalized problem. The results clearly show the advantages of the proposed formulation.
Future work will be devoted to (1) handling a mix of convex and nonconvex constraints by combining the proposed formulation with an exact penalty to deal with the nonconvex part of the constraints and (2) studying cases where the projection operation is expensive, so that a truncated projection has to be employed.
Funding Open access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.