On the relation between the extended supporting hyperplane algorithm and Kelley’s cutting plane algorithm

Recently, Kronqvist et al. (J Global Optim 64(2):249–272, 2016) rediscovered the supporting hyperplane algorithm of Veinott (Oper Res 15(1):147–152, 1967) and demonstrated its computational benefits for solving convex mixed integer nonlinear programs. In this paper we derive the algorithm from a geometric point of view. This enables us to show that the supporting hyperplane algorithm is equivalent to Kelley’s cutting plane algorithm (J Soc Ind Appl Math 8(4):703–712, 1960) applied to a particular reformulation of the problem. As a result, we extend the applicability of the supporting hyperplane algorithm to convex problems represented by a class of general, not necessarily convex nor differentiable, functions.


Introduction
A mixed-integer convex program (MICP) is a problem of the form
$$\min \{ c^T x : x \in C \cap (\mathbb{Z}^p \times \mathbb{R}^{n-p}) \}, \tag{1}$$
where $C$ is a closed convex set, $c \in \mathbb{R}^n$, and $p$ denotes the number of variables with integrality requirement. The use of a linear objective function is without loss of generality, since one can always transform a problem with a convex objective function into a problem of the form (1). We can represent the set $C$ in different ways, one of the most common being as the intersection of sublevel sets of convex differentiable functions, that is,
$$C = \{ x \in \mathbb{R}^n : g_j(x) \le 0, \ j \in J \}. \tag{2}$$
Here, $J$ is a finite index set and each $g_j$ is convex and differentiable.

* Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany, serrano@zib.de
† Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany, schwarz@zib.de
‡ Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany, gleixner@zib.de

Several methods have been proposed for solving MICP. When the problem is continuous and represented as (2), one of the first proposed methods was Kelley's cutting plane algorithm [8]. This algorithm exploits the convexity of a constraint function $g$ in the following way. The convexity and differentiability of $g$ imply that $g(y) + \nabla g(y)(x - y) \le g(x)$ for every $x, y \in \mathbb{R}^n$. Since every feasible point $x$ must satisfy $g(x) \le 0$, it follows that $g(y) + \nabla g(y)(x - y) \le 0$, for a fixed $y$, is a valid linear inequality. If $\bar x \in \mathbb{R}^n$ does not satisfy the constraint $g(x) \le 0$, that is, if $g(\bar x) > 0$, then
$$g(\bar x) + \nabla g(\bar x)(x - \bar x) \le 0 \tag{3}$$
separates $\bar x$ from the feasible region. In the non-differentiable case,
$$g(\bar x) + v^T (x - \bar x) \le 0, \quad v \in \partial g(\bar x), \tag{4}$$
is also a separating valid inequality. We will call both inequalities (3) and (4) a gradient cut of $g$ at $\bar x$.
The idea of Kelley's cutting plane algorithm is to approximate the feasible region with a polytope, solve the resulting linear program (LP) and, if the LP solution is not feasible, separate it using gradient cuts to obtain a new polytope which is a better approximation of the feasible region and repeat, see Algorithm 1.
Algorithm 1: Kelley's cutting plane algorithm
    $\bar x \leftarrow \arg\min_{x \in LP} c^T x$
    while $\max_{j \in J} g_j(\bar x) > \epsilon$ do
        forall $j$ such that $g_j(\bar x) > 0$ do
            $LP \leftarrow LP \cap \{x : g_j(\bar x) + \nabla g_j(\bar x)(x - \bar x) \le 0\}$
        $\bar x \leftarrow \arg\min_{x \in LP} c^T x$
    return $\bar x$

Kelley shows that the algorithm converges to the optimum and that it reaches a point close to the optimum in finite time. By solving integer programs (IP) using Gomory's cutting plane [6] instead of LP relaxations, Kelley shows that his cutting plane algorithm solves purely integer convex programs in finite time.
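The loop of Algorithm 1 can be illustrated on a one-dimensional instance, where the LP relaxation is an interval and its optimum is an endpoint, so no LP solver is needed. The following is a minimal sketch under that simplifying assumption; the function name `kelley_1d`, its parameters, and the test instance (maximize $x$ subject to $x^2 - 1 \le 0$ on $[-2, 2]$) are hypothetical and chosen only for illustration.

```python
def kelley_1d(c, g, dg, lo, hi, eps=1e-8, max_iter=100):
    """Minimize c*x subject to g(x) <= 0 and lo <= x <= hi with gradient cuts.

    g is assumed convex and differentiable with derivative dg.
    Returns the final point and the list of relaxation solutions visited.
    """
    iterates = []
    for _ in range(max_iter):
        # In one dimension the LP relaxation is the interval [lo, hi],
        # so its optimum is one of the two endpoints.
        x_bar = lo if c >= 0 else hi
        iterates.append(x_bar)
        if g(x_bar) <= eps:
            return x_bar, iterates
        slope = dg(x_bar)
        if slope == 0:
            raise ValueError("g has a positive minimum: the problem is infeasible")
        # Gradient cut g(x_bar) + slope * (x - x_bar) <= 0, solved for x.
        bound = x_bar - g(x_bar) / slope
        if slope > 0:
            hi = min(hi, bound)
        else:
            lo = max(lo, bound)
        if lo > hi:
            raise ValueError("the relaxation became empty: the problem is infeasible")
    raise RuntimeError("no eps-feasible point found within max_iter iterations")


# Hypothetical instance: maximize x (c = -1) subject to x^2 - 1 <= 0 on [-2, 2].
x_opt, visited = kelley_1d(-1.0, lambda x: x * x - 1.0, lambda x: 2.0 * x, -2.0, 2.0)
```

On this instance the first two relaxation solutions are 2 and 1.25, and every cut leaves a gap to the feasible interval $[-1, 1]$, which is the non-supporting behavior discussed in the text.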
The same algorithm works just as well for MICP. However, Kelley did not have access to a finite algorithm for solving mixed integer linear programs (MILP).
In an attempt to speed up Kelley's algorithm, Veinott [15] proposes the supporting hyperplane algorithm (SH). A possible issue with Kelley's algorithm is that, in general, gradient cuts do not support the feasible region, see Figure 1. Therefore, it is expected that better relaxations can be achieved by using supporting cutting planes.
In order to construct supporting hyperplanes, Veinott suggests building gradient cuts at boundary points of $C$. He uses an interior point of $C$ to find the boundary point $\hat x$ on the segment joining the interior point and the solution of the current relaxation. Of course, these cuts are automatically supporting hyperplanes of $C$. However, since the cut is computed at $\hat x$, which lies in $C$, it might happen that the gradients of the constraints active at $\hat x$ vanish.
For this reason, Veinott requires as a further hypothesis that the functions representing $C$ have non-vanishing gradients at the boundary. This is immediately implied by, e.g., Slater's condition. Veinott also identifies that one can use his algorithm to solve (1) when representing $C$ by quasi-convex functions, that is, functions whose sublevel sets are convex.
Recently, Kronqvist et al. [9] rediscovered and implemented Veinott's algorithm [15]. They call their algorithm the extended supporting hyperplane algorithm (ESH). They discuss the practical importance of choosing a good interior point and propose some improvements over the original method, such as solving LP relaxations during the first iterations instead of the more expensive MILP relaxation. As a result, they present a computationally competitive solver implementation for MICPs defined by convex differentiable constraint functions.
In this paper, we would like to understand when, given a convex differentiable function $g$, gradient cuts of $g$ are supporting to the convex set $S = \{x \in \mathbb{R}^n : g(x) \le 0\}$. This question is motivated by the fact that in this case Kelley's algorithm automatically becomes a supporting hyperplane algorithm. In Theorem 1 we give a necessary and sufficient condition for a gradient cut of $g$ at a given point to be a supporting hyperplane of $S$. In particular, this condition suggests looking at sublinear functions, i.e., convex and positively homogeneous functions. As it turns out, this naturally leads to Veinott's algorithm.
Sublinear functions and convex sets are deeply related. When the origin is in the interior of a convex set $S$, we can represent $S$ via its gauge function $\varphi_S$, which is sublinear [13]. We give the formal definition of the gauge function in Section 4, but for now it suffices to know that we can represent $S$ as $S = \{x \in \mathbb{R}^n : \varphi_S(x) \le 1\}$ and that, in particular, for every $\bar x \ne 0$ a gradient cut of $\varphi_S$ at $\bar x$ supports all of its sublevel sets. The following example illustrates this.
Example 1. Consider the convex feasible region given by $S = \{(x, y) \in \mathbb{R}^2 : g(x, y) \le 0\}$, where $g(x, y) = x^2 + y^2 - 1$. We show through an example that gradient cuts of $g$ are not necessarily supporting to $S$, explain why this happens, and show that changing the representation of $S$ to use its gauge function solves the issue.
Separating the infeasible point $\bar x = (\tfrac{3}{2}, \tfrac{3}{2})$ by a gradient cut of $g$ at $\bar x$ gives $x + y \le \tfrac{11}{6}$. This cut does not support the circle $S$, see Figure 1. Alternatively, the gauge function of the circle $S$ is given by $\varphi_S(x, y) = \sqrt{x^2 + y^2}$ and $S = \{(x, y) : \sqrt{x^2 + y^2} \le 1\}$. The gradient cut of $\varphi_S$ at $\bar x$ is $x + y \le \sqrt{2}$, which is supporting.
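The two cuts of Example 1 can be checked numerically. The sketch below relies on the fact that the maximum of $a^T x$ over the unit disk is $\|a\|$, so a valid cut $a^T x \le b$ supports the disk exactly when $b - \|a\| = 0$. The helper names `gradient_cut` and `support_gap_unit_disk` are hypothetical and introduced only for this illustration.

```python
import math


def gradient_cut(value, grad, point, rhs=0.0):
    """Rewrite value + grad^T (x - point) <= rhs as a^T x <= b and return (a, b)."""
    b = rhs - value + sum(gi * pi for gi, pi in zip(grad, point))
    return list(grad), b


def support_gap_unit_disk(a, b):
    """Gap between the cut a^T x <= b and the unit disk (0 means supporting)."""
    return b - math.hypot(a[0], a[1])


x_bar = (1.5, 1.5)

# Cut from g(x, y) = x^2 + y^2 - 1 with the constraint g <= 0.
g_val = x_bar[0] ** 2 + x_bar[1] ** 2 - 1.0
g_grad = (2.0 * x_bar[0], 2.0 * x_bar[1])
a_g, b_g = gradient_cut(g_val, g_grad, x_bar)
gap_g = support_gap_unit_disk(a_g, b_g)

# Cut from the gauge phi(x, y) = sqrt(x^2 + y^2) with the constraint phi <= 1.
phi_val = math.hypot(x_bar[0], x_bar[1])
phi_grad = (x_bar[0] / phi_val, x_bar[1] / phi_val)
a_phi, b_phi = gradient_cut(phi_val, phi_grad, x_bar, rhs=1.0)
gap_phi = support_gap_unit_disk(a_phi, b_phi)
```

Dividing the first cut by 3 gives $x + y \le 11/6$ with a strictly positive gap, while the gauge cut has gap zero up to rounding, in agreement with the example.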
From the previous discussion it is a natural idea to represent $C$ via its gauge function, namely, $C = \{x \in \mathbb{R}^n : \varphi_C(x) \le 1\}$. However, as mentioned before, $C$ is usually given by (2). Our main contribution is to show that reformulating (2) to the gauge representation will naturally lead to the ESH algorithm, see Section 4.2. As a consequence, the convergence proofs of Veinott [15] and Kronqvist et al. [9] follow directly from the convergence proof of Kelley's cutting plane algorithm [8,7], see Section 5. In other words, we show that the ESH algorithm is Kelley's cutting plane algorithm applied to a different representation of the problem.

Figure 1: The feasible region $S$ and the infeasible point $\bar x$ to separate. On the left we see that the separating hyperplane is not supporting to $S$. On the right we see why this happens: the linearization of $g$ at $\bar x$ is tangent to the epigraph of $g$ (shown upside-down for clarity) at $(\bar x, g(\bar x))$. However, when this hyperplane intersects the x-y-plane, it is already far away from the epigraph and, in consequence, from the sublevel set. The intersection of the hyperplane with the x-y-plane is the gradient cut.
Motivated by this approach of representing $C$ by its gauge function, we are able to show that the ESH algorithm applied to (1) converges even when $C$ is not represented by convex functions. This is related to recent work of Lasserre [10] that tries to understand how different techniques behave when the convex set $C$ is not represented via (2). Lasserre considers sets $C = \{x : g_j(x) \le 0, \ j \in J\}$ where the $g_j$ are only differentiable, but not necessarily convex. Under the assumption
$$\nabla g_j(x) \ne 0 \quad \text{for every } x \in C \text{ and } j \in J \text{ with } g_j(x) = 0, \tag{5}$$
that is, if the gradients of active constraints do not vanish at the boundary of $C$, Lasserre shows that the KKT conditions are not only necessary but also sufficient for global optimality. In other words, every minimizer is a KKT point and every KKT point is a minimizer. Later, Lasserre [11] proposes an algorithm to find the KKT point via log-barrier functions. He shows that the algorithm converges to the KKT point if (5) holds.
Dutta and Lalitha [3] generalized the previous result to the case when $C$ is represented by locally Lipschitz functions, not necessarily differentiable nor convex. We show that the ESH also converges to the global optimum in the setting when $C$ is described by differentiable functions, under (5). This result extends the applicability of the SH algorithm of Veinott.
Finally, we provide a characterization of convex functions whose linearizations are supporting to their sublevel sets. Although elementary, the authors are not aware of its presence in the literature. In particular, this result allows us to identify some families of functions for which gradient cuts are never supporting (see Example 4) and some for which they are always supporting (see Examples 2 and 3).
Overview of the paper. In the remainder of this section we introduce the notation that will be used throughout the paper. Section 2 provides a literature review on cutting plane approaches and efforts on obtaining supporting valid inequalities. In Section 3, we characterize functions whose linearizations are supporting hyperplanes to their 0-sublevel sets. Section 4 introduces the gauge function and shows how to use evaluation of the gauge function for building supporting hyperplanes. We note that evaluating the gauge function is equivalent to the line search step of the ESH algorithm [15,9]. This equivalence provides the link between the ESH and Kelley's cutting plane algorithm. In Section 5, we show that the cutting planes generated by the ESH algorithm can also be generated by Kelley's algorithm when applied to a reformulation of the problem. This implies that the convergence of the ESH algorithm follows from Kelley's.
In Section 6, we show that we can apply the ESH algorithm to problem (1) when the convex set $C$ is represented via arbitrary differentiable functions as long as their gradients do not vanish at the boundary of $C$. Finally, Section 7 presents our concluding remarks.
Notation and definitions. The boundary and the interior of a set $S$ are denoted by $\partial S$ and $\operatorname{int} S$, respectively. The epigraph of a function $g$ is denoted by $\operatorname{epi} g$. The subdifferential of a convex function $g$ at $x$ is denoted by $\partial g(x)$.
Recall that the subdifferential is the set of all subgradients of $g$ at $x$,
$$\partial g(x) = \{ v \in \mathbb{R}^n : g(x) + v^T (y - x) \le g(y) \ \text{for all } y \in \mathbb{R}^n \}.$$
We say that an inequality $\alpha^T x \le \beta$ is valid for a set $S$ if every $x \in S$ satisfies $\alpha^T x \le \beta$. Furthermore, we say that it is a supporting hyperplane of $S$, or that it supports $S$, if there is an $x \in \partial S$ such that $\alpha^T x = \beta$.

Literature review
We can think of the algorithms of Kelley [8] and Veinott [15] as a mixture of two ingredients: which relaxation to solve and where to compute the cutting plane. Indeed, at each iteration, we have a point $x_k$ we would like to separate with a linear inequality $\beta + \alpha^T (x - x_0) \le 0$. For Kelley's algorithm, $x_0 = x_k$, while for Veinott's algorithm, $x_0 \in \partial C$, and for both $\alpha \in \partial g(x_0)$ and $\beta = g(x_0)$. Choosing different relaxations and different points where to compute the cutting planes yields different algorithms. This framework is developed in Horst and Tuy [7].
Following the previous framework, Duran and Grossmann [2] propose the so-called outer-approximation algorithm for MICP. The idea is to solve an MILP relaxation but, instead of computing a cutting plane at the MILP optimum or at the boundary point on the segment between the MILP optimum and some interior point, they suggest computing cutting planes at a solution of the nonlinear program (NLP) obtained after fixing the integer variables to the integer values given by the MILP optimal solution. This is a much more expensive algorithm but has the advantage of finite convergence. Of course, this does not work in complete generality and we need some assumptions, for example, requiring some constraint qualifications. Moreover, we must take care when obtaining an infeasible NLP after fixing the integer variables in order to prevent the same integer assignment in future iterations. To handle such a case, Duran and Grossmann propose the use of integer cuts. However, Fletcher and Leyffer [5] point out that this is not necessary and that we can use the solution of a slack NLP to build a "continuous" cut that separates the integer assignment.
Westerlund and Pettersson [16] proposed the so-called extended cutting plane algorithm. This algorithm is the extension of Kelley's cutting plane to MICP and they show that the algorithm converges. Further extensions and convergence proofs of cutting plane and outer approximation algorithms for nonsmooth problems are given in [4].
Yet another technique for producing tight cuts is to build gradient cuts at the projection of the point to be separated onto $C$ [7]. In the same reference, Horst and Tuy show that this algorithm converges.
Finally, there have been attempts at building tighter relaxations by ensuring that gradient cuts are supporting, in a more general context than convex mixed-integer nonlinear programming. Belotti et al. [1] consider bivariate convex constraints of the form $f(x) - y \le 0$, where $f$ is a univariate convex function.
They propose projecting the point to be separated onto the curve $y = f(x)$ and building a gradient cut at the projection. However, their motivation is not to find supporting hyperplanes, but to find the most violated cut. Indeed, as we will see, gradient cuts for this type of constraint are always supporting (Example 3). Other work along these lines includes [12], where the authors derive an efficient procedure to project onto a two-dimensional constraint derived from a Gaussian linear chance constraint, thus building supporting valid inequalities.

Characterization of functions with supporting linearizations
We now give necessary and sufficient conditions for the linearization of a convex, not necessarily differentiable, function $g$ at a point $\bar x$ to support the region $S = \{x \in \mathbb{R}^n : g(x) \le 0\}$. In order for this to happen, the supporting hyperplane has to support the epigraph on the whole segment joining the point of $S$ where it supports and $(\bar x, g(\bar x))$. In other words, the function must be affine on the segment. This is due to the convexity of $g$.
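Theorem 1. Let $g : \mathbb{R}^n \to \mathbb{R}$ be a convex function, $S = \{x \in \mathbb{R}^n : g(x) \le 0\} \ne \emptyset$, and $\bar x \notin S$. There exists a subgradient $v \in \partial g(\bar x)$ such that the valid inequality
$$g(\bar x) + v^T (x - \bar x) \le 0 \tag{6}$$
supports $S$ if and only if there exists $x_0 \in S$ such that $\lambda \mapsto g(x_0 + \lambda(\bar x - x_0))$ is affine on $[0, 1]$.

Proof. ($\Rightarrow$) Let $x_0 \in \partial S$ be the point where (6) supports $S$. The idea is to show that the affine function $x \mapsto g(\bar x) + v^T (x - \bar x)$ coincides with $g$ at two points, $\bar x$ and $x_0$. Then, by the convexity of $g$, it coincides with $g$ on the segment joining both points. Indeed, the affine function takes the value $g(\bar x)$ at $\bar x$ and the value $0 = g(x_0)$ at $x_0$. On the segment it is bounded above by $g$, because $v$ is a subgradient, and bounded below by $g$, because $g$ is convex and the two functions agree at the endpoints.

($\Leftarrow$) Let $A = \{(x_0 + \lambda(\bar x - x_0), \ g(x_0 + \lambda(\bar x - x_0))) : \lambda \in [0, 1]\}$ be the graph of $g$ over the segment. The set $A$ is a convex nonempty subset of $\operatorname{epi} g$ that does not intersect the relative interior of $\operatorname{epi} g$. Hence, there exists a supporting hyperplane $H = \{(x, z) : v^T x + a z = b\}$ to $\operatorname{epi} g$ containing $A$ ([13, Theorem 11.6]).
Since $g(x_0) \le 0$ and $g(\bar x) > 0$, it follows that $A$ is not parallel to the x-space. Therefore, $H$ is also not parallel to the x-space and so $v \ne 0$. Since $A$ is not parallel to the z-axis, it follows that $a \ne 0$. We assume, without loss of generality, that $a = -1$.

The point $(\bar x, g(\bar x))$ belongs to $A \subseteq H$, so $b = v^T \bar x - g(\bar x)$ and $H = \{(x, z) : z = g(\bar x) + v^T (x - \bar x)\}$.
Given that $H$ supports the epigraph, $v$ is a subgradient of $g$ at $\bar x$; in particular, $g(\bar x) + v^T (x - \bar x) \le g(x)$ for every $x \in \mathbb{R}^n$, so (6) is valid for $S$. Let $z(x)$ be the affine function whose graph is $H$, that is, $z(x) = g(\bar x) + v^T (x - \bar x)$. We now need to show that $g(\bar x) + v^T (x - \bar x) \le 0$ supports $S$ by exhibiting an $\hat x \in S$ such that $g(\bar x) + v^T (\hat x - \bar x) = 0$. By construction, $z(x_0 + \lambda(\bar x - x_0)) = g(x_0 + \lambda(\bar x - x_0))$ for $\lambda \in [0, 1]$. Since $z(x_0 + \lambda(\bar x - x_0))$ is non-positive for $\lambda = 0$ and positive for $\lambda = 1$, it has to be zero for some $\lambda_0$. Let $\hat x = x_0 + \lambda_0(\bar x - x_0)$. Therefore, $g(\hat x) = z(\hat x) = 0$ and we conclude that $\hat x \in S$ and $g(\bar x) + v^T (\hat x - \bar x) = 0$.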
Specializing the theorem to differentiable functions directly leads to the following:

Corollary 2. Let $g : \mathbb{R}^n \to \mathbb{R}$ be a convex differentiable function, $S = \{x \in \mathbb{R}^n : g(x) \le 0\}$, and $\bar x \notin S$. Then the valid inequality $g(\bar x) + \nabla g(\bar x)(x - \bar x) \le 0$ supports $S$ if and only if there exists $x_0 \in S$ such that $\lambda \mapsto g(x_0 + \lambda(\bar x - x_0))$ is affine on $[0, 1]$.

Proof. Since $g$ is differentiable, the subdifferential of $g$ at $\bar x$ consists only of the gradient of $g$ at $\bar x$.

Natural candidates for functions with supporting gradient cuts at every point are functions whose epigraph is a translation of a convex cone.
Example 2 (Sublinear functions). Let $h$ be a sublinear function, that is, a convex and positively homogeneous function, i.e., $h(\lambda x) = \lambda h(x)$ for any $\lambda \ge 0$. For this type of function, gradient cuts always support $S = \{x : h(x) \le c\}$, for any $c \ge 0$. This follows directly from Theorem 1 applied to $h - c$, since $0 \in S$ and $\lambda \mapsto h(\lambda \bar x)$ is affine on $[0, 1]$ for any $\bar x$.
However, these are not the only functions that satisfy the conditions of Theorem 1 for every point. The previous theorem implies that linearizations always support the constraint set if a convex constraint $g(x) \le 0$ is linear in one of its arguments.
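Example 3 (Functions with linear variables). Let $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ be a convex function of the form $f(x, y) = g(x) + a^T y + c$, with $a \ne 0$ and $g : \mathbb{R}^m \to \mathbb{R}$ convex. Given $(\bar x, \bar y)$ with $f(\bar x, \bar y) > 0$, the assumption $a \ne 0$ yields a $y_0$ with $f(\bar x, y_0) \le 0$, and $\lambda \mapsto f(\bar x, y_0 + \lambda(\bar y - y_0))$ is affine. By Theorem 1, $f$ therefore has a supporting gradient cut at every point outside $\{(x, y) : f(x, y) \le 0\}$.
As an illustration, consider separating a point $(x_0, z_0)$ from a constraint of the form $z = g(x)$ with $g : \mathbb{R} \to \mathbb{R}$ convex and $z_0 < g(x_0)$ (that is, separating on the convex constraint $g(x) \le z$). As mentioned earlier, in [1] the authors suggest projecting $(x_0, z_0)$ to the graph $z = g(x)$ and computing a gradient cut there. This example shows that this step is unnecessary when the sole purpose is to obtain a cut that is supporting to the graph.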
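The claim of Example 3 can be verified on a concrete instance. The sketch below assumes the hypothetical choice $g(x) = x^2$ and the point $(x_0, z_0) = (2, 1)$, and builds the gradient cut of $f(x, z) = g(x) - z$ directly at that point, without any projection. The helper name `gradient_cut_epigraph` is introduced only for this illustration.

```python
def gradient_cut_epigraph(g, dg, x0, z0):
    """Gradient cut of f(x, z) = g(x) - z at (x0, z0), returned as a*x + b*z <= rhs."""
    a, b = dg(x0), -1.0
    f0 = g(x0) - z0
    # f0 + a * (x - x0) + b * (z - z0) <= 0, with the constants moved to the right.
    rhs = a * x0 + b * z0 - f0
    return a, b, rhs


def g(x):
    return x * x


def dg(x):
    return 2.0 * x


x0, z0 = 2.0, 1.0                      # z0 < g(x0), so the point violates g(x) <= z
a, b, rhs = gradient_cut_epigraph(g, dg, x0, z0)

# The cut is tight at the graph point directly above (x0, z0) ...
touch = a * x0 + b * g(x0)
# ... and it is valid at sampled points of the graph z = g(x).
samples = [k / 10.0 for k in range(-50, 51)]
valid_on_graph = all(a * x + b * g(x) <= rhs + 1e-12 for x in samples)
```

Here the cut reads $4x - z \le 4$ and touches the graph at $(2, 4)$. The right-hand side simplifies to $g'(x_0) x_0 - g(x_0)$, so it does not depend on $z_0$.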
In contrast, if $g$ is strictly convex, linearizations at points $x$ such that $g(x) \ne 0$ are never supporting to $g(x) \le 0$. This follows directly from Theorem 1, since $\lambda \mapsto g(x + \lambda v)$ is not affine for any $v \ne 0$. We can also characterize convex quadratic functions with supporting linearizations.
Example 4 (Convex quadratic functions). Let $g(x) = x^T A x + b^T x + c$ be a convex quadratic function, i.e., $A$ is an $n \times n$ symmetric and positive semidefinite matrix. We show that gradient cuts support $S = \{x \in \mathbb{R}^n : g(x) \le 0\}$ if and only if $b \notin R(A)$, where $R(A)$ denotes the range of $A$. First notice that $l_v(\lambda) := g(\bar x + \lambda v)$ is affine if and only if $v \in \ker(A)$.
Let $v \in \ker(A)$ and $\bar x \notin S$. Clearly, there is a $\lambda \in \mathbb{R}$ such that $\bar x + \lambda v \in S$ if and only if $l_v$ is not constant. Thus, gradient cuts are not supporting if and only if $l_v$ is constant for every $v \in \ker(A)$. For $v \in \ker(A)$ we have $l_v(\lambda) = g(\bar x) + \lambda b^T v$, so $l_v$ is constant for every $v \in \ker(A)$ if and only if $b \in \ker(A)^\perp = R(A)$. In particular, if $b = 0$, i.e., there are no linear terms in the quadratic function, then gradient cuts are never supporting hyperplanes. Also, if $A$ is invertible, then $b \in R(A)$ and gradient cuts are not supporting. This is to be expected since in this case $g$ is strictly convex.

The gauge function
Given an MICP like (1), we can reformulate it to an equivalent MICP with a single constraint for which every linearization supports the continuous relaxation of the feasible region. For this, we can use any sublinear function whose 1-sublevel set is $C$. Each convex set $C$ containing the origin in its interior has at least one sublinear function that represents it, namely, the gauge function [13] of $C$.
The following basic properties of gauge functions make them appealing for generating supporting hyperplanes.
Example 2 tells us that sublinear functions always generate supporting hyperplanes.
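For reference, the standard convex-analysis definition of the gauge (see [13]) and the properties used in the sequel can be summarized as follows. This is a recap under the assumption that $C$ is closed, convex, and contains the origin in its interior. It is not meant to reproduce the exact statements or numbering of the definitions and propositions of this paper.

```latex
% Gauge of a closed convex set C with 0 in int C (standard definition).
\varphi_C(x) = \inf\{\, t > 0 : x \in tC \,\}

% Sublinearity: positive homogeneity and subadditivity.
\varphi_C(\lambda x) = \lambda\,\varphi_C(x) \quad (\lambda \ge 0), \qquad
\varphi_C(x + y) \le \varphi_C(x) + \varphi_C(y)

% Representation of C, its interior, and its boundary.
C = \{x : \varphi_C(x) \le 1\}, \qquad
\operatorname{int} C = \{x : \varphi_C(x) < 1\}, \qquad
\partial C = \{x : \varphi_C(x) = 1\}
```

In particular, for $\bar x \notin C$ the point $\bar x / \varphi_C(\bar x)$ lies on $\partial C$, which is the fact used when building supporting cuts below.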

Using the gauge function for separation
Even though the gauge function is exactly what we need to ensure supporting gradient cuts, in general, there is no closed-form formula for it. Therefore, it is not always possible to explicitly reformulate a constraint $g(x) \le 0$ as $\varphi(x) \le 1$.
Furthermore, if one is interested in solving mathematical programs with a numerical solver, performing such a reformulation might introduce some numerical issues one would have to take care of. Solvers usually solve up to a given tolerance, that is, they solve $g_j(x) \le \varepsilon$ for some $\varepsilon > 0$. Then, even though $\{x : g_j(x) \le 0\} = \{x : \varphi(x) \le 1\}$, the relaxed sets $\{x : g_j(x) \le \varepsilon\}$ and $\{x : \varphi(x) \le 1 + \varepsilon\}$ may differ. In fact, even simple constraints show this behavior. Consider $C = \{x : x^2 - 1 \le 0\}$. In this case, $\varphi_C(x) = |x|$ and for $x_0 = 1 + \varepsilon$, we have $\varphi_C(x_0) = 1 + \varepsilon$. Then, $x_0$ would be $\varepsilon$-feasible for $\varphi_C(x) \le 1$, although it would be infeasible for $x^2 - 1 \le \varepsilon$, since $x_0^2 - 1 = 2\varepsilon + \varepsilon^2 > \varepsilon$.

Luckily, one does not need to reformulate in order to take advantage of the gauge function for tighter separation. The next propositions show how to use the gauge function and a point $\bar x \notin C$ to obtain a boundary point of $C$, and that linearizing at that boundary point gives a supporting valid inequality that actually separates $\bar x$. For ensuring the existence of a supporting hyperplane we need the following condition:
$$\forall j \in J, \ \forall x \in \partial C, \quad \nabla g_j(x) \ne 0. \tag{8}$$
For example, this condition is satisfied whenever Slater's condition is satisfied for (1) with $C$ represented by (2), that is, when there exists $x_0$ such that $g_j(x_0) < 0$ for every $j \in J$.
Before we state the propositions we start with a simple lemma.
Lemma 5. Let $C \subseteq \mathbb{R}^n$ be a closed convex set such that $0 \in \operatorname{int} C$, let $\hat x \in \partial C$ and $\bar x \notin C$. Let $\alpha^T x \le \beta$ be a valid inequality for $C$ that supports $C$ at $\hat x$. If the segment joining $0$ and $\bar x$ contains $\hat x$, then the inequality separates $\bar x$ from $C$.
Let J 0 (x) be the set of indices of the active constraints at x, i.e., J 0 (x) = {j ∈ J : g j (x) = 0}.
Proposition 7. Let $C = \{x : g_j(x) \le 0, \ j \in J\}$ be such that $0 \in \operatorname{int} C$ and let $\varphi_C$ be its gauge function. Assume that (8) holds. Given $\bar x \notin C$, define $\hat x = \bar x / \varphi_C(\bar x)$. Then, for any $j \in J_0(\hat x)$, the gradient cut of $g_j$ at $\hat x$ yields a valid supporting inequality for $C$ that separates $\bar x$.
Proof. By the previous proposition, we have that $\hat x \in \partial C$. Let $j \in J_0(\hat x)$. Clearly, the gradient cut of $g_j$ at $\hat x$ yields a valid supporting inequality. The fact that it separates follows from Lemma 5.
Hence, we can get supporting valid inequalities separating a given point $\bar x \notin C$ by using the gauge function to find the point $\hat x = \bar x / \varphi_C(\bar x) \in \partial C$. Then, Proposition 7 ensures that the gradient cut of any active constraint at $\hat x$ will separate $\bar x$ from $C$. But how do we compute $\varphi_C(\bar x)$?

Evaluating the gauge
Let $C = \{x : g_j(x) \le 0, \ j \in J\}$ be a closed convex set such that $0 \in \operatorname{int} C$ and consider
$$f(x) = \max_{j \in J} g_j(x). \tag{9}$$
As in Remark 11, evaluating $\varphi_C$ at a point $\bar x \notin C$ amounts to finding the smallest $t > 0$ such that
$$f(\bar x / t) = 0. \tag{10}$$
One can solve such an equation using a line search. Note that the line search is looking for a point $\hat x \in \partial C$ on the segment between $0$ and $\bar x$. This is exactly what the (extended) supporting hyperplane algorithm performs when it uses $0$ as its interior point.
We would also like to remark that a closed-form expression for the gauge function of $C$ is equivalent to a closed-form formula for the solution of (10). It is possible to find such a formula for some functions, e.g., when $f$ is a convex quadratic function.
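The line search described above can be sketched as a bisection on the segment between the origin and the point to separate. The sketch assumes that $0 \in \operatorname{int} C$ with $f(0) < 0$ and that the point lies outside $C$. The function name `gauge`, the tolerance, and the two-constraint test set (an ellipse intersected with a half-space) are hypothetical choices made for illustration.

```python
def gauge(f, x, tol=1e-12):
    """Evaluate the gauge of C = {y : f(y) <= 0} at a point x outside C.

    Bisection on lam in [0, 1] for the largest lam with lam * x in C;
    the boundary point is lam * x and the gauge value is 1 / lam.
    """
    if f(list(x)) <= 0:
        raise ValueError("x must lie outside C")
    if f([0.0] * len(x)) >= 0:
        raise ValueError("the origin must satisfy f(0) < 0")
    lo, hi = 0.0, 1.0          # lo * x is in C, hi * x is outside C
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f([mid * xi for xi in x]) > 0:
            hi = mid
        else:
            lo = mid
    return 2.0 / (lo + hi)


def f(x):
    # Hypothetical set: ellipse x1^2/4 + x2^2 <= 1 intersected with x1 <= 1.5.
    return max(x[0] ** 2 / 4.0 + x[1] ** 2 - 1.0, x[0] - 1.5)


x_bar = (3.0, 1.0)
phi = gauge(f, x_bar)
x_hat = [xi / phi for xi in x_bar]     # boundary point on the segment from 0 to x_bar
```

For this instance the segment leaves $C$ through the half-space constraint at $(1.5, 0.5)$, so the gauge value can be checked by hand to be 2. Because $C$ is convex and contains the origin, the points of the segment that lie in $C$ form an initial piece of it, which is what makes bisection applicable.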
Next, we briefly discuss what happens when $0$ is not in the interior of $C$ and when $C$ has no interior. In the next section we discuss the implications of the fact that evaluating the gauge function is equivalent to the line search step of the supporting hyperplane algorithm.

The case int C = ∅ and using a nonzero interior point
When $\operatorname{int} C = \emptyset$, we can still use the methods discussed above using a trick from [9].
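Theorem 9. Consider a MICP given by (1) with $C$ represented as (2) such that $0 \in \operatorname{int} C$ and (8) holds. Let $f$ be defined as in (9) and let $\bar x \notin C$ be the current relaxation solution to separate. Let
$$f\!\left(\frac{\bar x}{\varphi_C(\bar x)}\right) + v^T \left(x - \frac{\bar x}{\varphi_C(\bar x)}\right) \le 0, \quad v \in \partial f\!\left(\frac{\bar x}{\varphi_C(\bar x)}\right),$$
be the inequality generated by the ESH algorithm using $0$ as the interior point. Then Kelley's cutting plane (KCP) algorithm applied to $\min\{c^T x : \varphi_C(x) \le 1\}$ can generate the same inequality.
Proof. Let us manipulate the inequality obtained by the ESH algorithm. First, notice that $f(\bar x / \varphi_C(\bar x)) = 0$ and so the inequality reads as
$$v^T x \le \frac{v^T \bar x}{\varphi_C(\bar x)}.$$
Since (8) holds, $v \ne 0$. Furthermore, by Lemma 5, $\bar x$ is cut off by the inequality, i.e., $v^T \bar x > v^T \bar x / \varphi_C(\bar x)$. This, together with the fact that $\varphi_C(\bar x) > 1$, implies that $v^T \bar x > 0$. Summarizing, the inequality obtained by the ESH algorithm can be rewritten as
$$\frac{\varphi_C(\bar x)}{v^T \bar x} \, v^T x \le 1.$$
The vector $w = \varphi_C(\bar x) \, v / (v^T \bar x)$ is a subgradient of $\varphi_C$ at $\bar x$: since $w^T y \le 1$ is valid for $C$, positive homogeneity gives $w^T y \le \varphi_C(y)$ for every $y$, and $w^T \bar x = \varphi_C(\bar x)$. Consequently, the gradient cut $\varphi_C(\bar x) + w^T (x - \bar x) \le 1$ of $\varphi_C$ at $\bar x$ is exactly $w^T x \le 1$. Hence, the same cut can be generated by the KCP algorithm applied to $\min\{c^T x : \varphi_C(x) \le 1\}$ when separating $\bar x$.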
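The statement of Theorem 9 can be checked numerically on a set whose gauge is known in closed form. The sketch below assumes the hypothetical ellipse $C = \{x : x_1^2/4 + x_2^2 \le 1\}$, for which $\varphi_C(x) = \sqrt{x_1^2/4 + x_2^2}$, and compares the ESH cut with the gradient cut of $\varphi_C$ after scaling both to right-hand side 1. All function names are introduced only for this illustration.

```python
import math


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def f(x):
    return x[0] ** 2 / 4.0 + x[1] ** 2 - 1.0


def grad_f(x):
    return (x[0] / 2.0, 2.0 * x[1])


def phi(x):
    return math.sqrt(x[0] ** 2 / 4.0 + x[1] ** 2)


def grad_phi(x):
    p = phi(x)
    return (x[0] / (4.0 * p), x[1] / p)


def esh_cut(x_bar):
    """Gradient cut of f at the boundary point x_bar / phi(x_bar), as a^T x <= b."""
    t = phi(x_bar)
    x_hat = (x_bar[0] / t, x_bar[1] / t)
    v = grad_f(x_hat)
    return v, dot(v, x_hat) - f(x_hat)


def kelley_gauge_cut(x_bar):
    """Gradient cut of phi at x_bar for the constraint phi(x) <= 1, as a^T x <= b."""
    w = grad_phi(x_bar)
    return w, 1.0 - phi(x_bar) + dot(w, x_bar)


def scaled(a, b):
    """Scale a^T x <= b (with b > 0) to right-hand side 1."""
    return tuple(ai / b for ai in a)


x_bar = (3.0, 2.0)
esh = scaled(*esh_cut(x_bar))
kel = scaled(*kelley_gauge_cut(x_bar))
```

For $\bar x = (3, 2)$ both cuts scale to $0.3\,x_1 + 0.8\,x_2 \le 1$, which touches the ellipse at $\hat x = (1.2, 0.8)$.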

Convex programs represented by non-convex functions
In this section we consider problem (1) with $C$ represented as
$$C = \{x \in \mathbb{R}^n : g_j(x) \le 0, \ j \in J\},$$
where the functions $g_j$ are differentiable, but not necessarily convex. As mentioned in the introduction, convex problems represented by non-convex functions have been considered in [3,10,11].
The next proposition shows that, under (8), the ESH algorithm works without modification in this context. Therefore, its convergence is guaranteed by the convergence of the KCP algorithm. Essentially, we show that with the given representation of $C$ it is possible to evaluate its gauge function and its subgradients.
Proposition 10. Let $C = \{x : g_j(x) \le 0, \ j \in J\}$ be such that $0 \in \operatorname{int} C$ and the functions $g_j$ are differentiable. Let $\varphi_C$ be the gauge function of $C$. For $\bar x \notin C$, define $\hat x = \bar x / \varphi_C(\bar x)$ and assume that (8) holds. Then, any gradient cut of $g_j$ at $\hat x$ for any $j \in J_0(\hat x)$ yields a valid supporting inequality for $C$ that separates $\bar x$.
Proof. By Proposition 6 we have that $\hat x \in \partial C$. Let $j \in J_0(\hat x)$. The gradient cut of $g_j$ at $\hat x$ is $\nabla g_j(\hat x)(x - \hat x) \le 0$.
We first show it is valid, that is, $\nabla g_j(\hat x)(y - \hat x) \le 0$ for all $y \in C$. If this is not the case, then there is $y_0 \in C$ for which $\nabla g_j(\hat x)(y_0 - \hat x) > 0$, i.e., the directional derivative of $g_j$ at $\hat x$ in the direction $y_0 - \hat x$ is positive. Then, there is a small enough $\lambda > 0$ such that $g_j(\hat x + \lambda(y_0 - \hat x)) > 0$. However, the convexity of $C$ implies that $\hat x + \lambda(y_0 - \hat x) \in C$ for $\lambda \in [0, 1]$. This contradicts the fact that $g_j(\hat x + \lambda(y_0 - \hat x)) > 0$.
The fact that it separates follows from Lemma 5. This result extends the algorithm of Veinott [15] to further representations of the set $C$. The proof of the validity of the cut is the same as the 'only if' part of [10, Lemma 2.2].
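The difference between cutting at the relaxation solution and cutting at the boundary becomes visible once the representing function is non-convex. The sketch below assumes the hypothetical representation $g(x) = \arctan(x_1^2 + x_2^2) - \pi/4$ of the unit disk, which is differentiable, non-convex, and has non-vanishing gradient on the boundary. The gauge of the disk is the Euclidean norm, used here in closed form in place of a line search. All names are chosen only for illustration.

```python
import math


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def g(x):
    # Non-convex function whose 0-sublevel set is the unit disk.
    return math.atan(x[0] ** 2 + x[1] ** 2) - math.pi / 4.0


def grad_g(x):
    r2 = x[0] ** 2 + x[1] ** 2
    s = 2.0 / (1.0 + r2 * r2)
    return (s * x[0], s * x[1])


def cut_at(point):
    """Linearization g(point) + grad^T (x - point) <= 0, returned as a^T x <= b."""
    a = grad_g(point)
    return a, dot(a, point) - g(point)


x_bar = (1.5, 1.5)

# Linearizing at the infeasible point itself: not valid for a non-convex g.
a_k, b_k = cut_at(x_bar)
origin_cut_off = 0.0 > b_k             # the interior point 0 violates a_k^T x <= b_k

# Linearizing at the boundary point x_bar / phi(x_bar), as in Proposition 10.
phi = math.hypot(x_bar[0], x_bar[1])
x_hat = (x_bar[0] / phi, x_bar[1] / phi)
a_e, b_e = cut_at(x_hat)
support_gap = b_e - math.hypot(a_e[0], a_e[1])   # 0 means the cut supports the disk
separates = dot(a_e, x_bar) > b_e
```

In this instance the linearization at $\bar x$ cuts off the origin, whereas the cut at $\hat x$ reduces to $x_1 + x_2 \le \sqrt{2}$ up to scaling and rounding, so it is valid, supporting, and separating.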
Remark 11. Any representation of a convex set $C$ as $\{x \in \mathbb{R}^n : g_j(x) \le 0, \ j \in J\}$ yields a way to evaluate its gauge function, namely,
$$\varphi_C(x) = \inf\left\{ t > 0 : \max_{j \in J} g_j\!\left(\frac{x}{t}\right) = 0 \right\}.$$
This can be solved using a line search. However, what is more important is to be able to compute subgradients. Given any method to compute subgradients of the gauge function, we can apply the KCP algorithm using the implicitly defined gauge function. This allows us, for example, to drop the requirement that the gradients of the active constraints do not vanish at the boundary for solving the problem considered in this section. This algorithm is more general than the one proposed by Lasserre [11], but it will not necessarily converge to a KKT point of the original problem.

Concluding remarks
In this paper, we have shown that the extended supporting hyperplane algorithm studied by Veinott [15] and Kronqvist et al. [9] is identical to Kelley's classic cutting plane algorithm applied to a suitable reformulation of the problem. We used this new perspective to prove the convergence of the method for the larger class of problems with convex feasible regions represented by nonconvex differentiable constraints. More generally, the algorithm extends to any representation of a convex set that allows computing subgradients of its gauge function. These theoretical results bear relevance in practice, as the experimental results in [9] have already demonstrated the computational benefits of the supporting hyperplane algorithm in comparison to alternative state-of-the-art solving methods.