A Simple Method for Convex Optimization in the Oracle Model

We give a simple and natural method for computing approximately optimal solutions for minimizing a convex function $f$ over a convex set $K$ given by a separation oracle. Our method utilizes the Frank--Wolfe algorithm over the cone of valid inequalities of $K$ and subgradients of $f$. Under the assumption that $f$ is $L$-Lipschitz and that $K$ contains a ball of radius $r$ and is contained inside the origin centered ball of radius $R$, using $O(\frac{(RL)^2}{\varepsilon^2} \cdot \frac{R^2}{r^2})$ iterations and calls to the oracle, our main method outputs a point $x \in K$ satisfying $f(x) \leq \varepsilon + \min_{z \in K} f(z)$. Our algorithm is easy to implement, and we believe it can serve as a useful alternative to existing cutting plane methods. As evidence towards this, we show that it compares favorably in terms of iteration counts to the standard LP based cutting plane method and the analytic center cutting plane method, on a testbed of combinatorial, semidefinite and machine learning instances.


Introduction
We consider the problem of minimizing a convex function f : R n → R over a compact convex set K ⊆ R n .We assume that K contains an (unknown) Euclidean ball of radius r > 0 and is contained inside the origin centered ball of radius R > 0, and that f is L-Lipschitz.We have first-order access to f that yields f (x) and a subgradient of f at x for any given x.Moreover, we only have access to K through a separation oracle (SO), which, given a point x ∈ R n , either asserts that x ∈ K or returns a linear constraint valid for K but violated by x.
Convex optimization in the SO model is one of the fundamental settings in optimization.The model is relevant for a wide variety of implicit optimization problems, where an explicit description of the defining inequalities for K is either too large to store or not fully known.The SO model was first introduced in [29] where it was shown that an additive ε-approximate solution can be obtained using O(n log(LR/(εr))) queries via the center of gravity method and O(n 2 log(LR/(εr))) queries via the ellipsoid method.This latter result was used by Khachiyan [27] to give the first polynomial time method for linear programming.The study of oracle-type models was greatly extended in the classic book of Grötschel, Lovász, and Schrijver [23], where many applications to combinatorial optimization were provided.Further progress on the SO model was given by Vaidya [36], who showed that the O(n log(LR/(εr))) oracle complexity can be efficiently achieved using the so-called volumetric barrier as a potential function, where the best current running time for such methods was given very recently [28,25].
From the practical perspective, two of the most popular methods in the SO model are the standard linear programming (LP) based cutting plane method, independently discovered by Kelley [26], Goldstein-Cheney [9] as well as Gomory [22] (in the integer programming context), and the analytic center cutting plane method [34] (ACCPM).
The LP based cutting plane method, which we henceforth dub the standard cut loop, proceeds as follows: starting with finitely many linear underestimators of f and linear constraints valid for K, in each iteration it solves a linear program that minimizes the lower envelope of f subject to the current linear relaxation of K.The resulting point x is then used to query f and the SO to obtain a new underestimator for f and a new constraint valid for K.Note that if f is a linear function, it repeatedly minimizes f over linear relaxations of K.While it is typically fast in practice, it can be unstable, and no general quantitative convergence guarantees are known for the standard cut loop.
To link to integer programming, in that context K is the convex hull of integer points of some polytope P and the objective is often linear, and the method is initialized with a linear description of P .A crucial difference there is that the separator SO is generally only efficient when queried at vertices of the current relaxation.
ACCPM is a barrier based method, in which the next query point is the minimizer of the barrier for the current inequalities in the system.ACCPM is in general a more stable method with provable complexity guarantees.Interestingly, while variants of ACCPM with O(n log(RL/(rε)) 2 ) convergence exist, achieved by judiciously dropping constraints [1], the more practical variants have worse guarantees.For instance, if K is the ball of radius R, the standard variant of ACCPM is only shown to achieve O(n(RL/ε) 2 log(RL/ε)) convergence [30].
In this paper, we describe a new method for convex optimization in the SO model that computes an additive ε-approximate solution within O( R 4 L 2 /r 2 ε 2 ) iterations.Our algorithm is easy to implement, and we believe it can serve as a useful alternative to existing methods.In our experimental results, we show that it compares favorably in terms of iteration counts to the standard cut loop and the analytic center cutting plane method, on a testbed of combinatorial, semidefinite and machine learning instances.
Before explaining our approach, we review the relevant work in related models.To begin, there has been a tremendous amount of work in the context of first-order methods [5,3], where the goal is to minimize a possibly complicated function, given by a gradient oracle, over a simple domain K (e.g., the simplex, cube, 2 ball).These methods tend to have cheap iterations and to achieve poly(1/ε) convergence rates.They are often superior in practice when the requisite accuracy is low or moderate, e.g., within 1% of optimal.For these methods, often variants of (sub-)gradient descent, it is generally assumed that computing (Euclidean) projections onto K as well as linear optimization over K are easy.If one only assumes access to a linear optimization (LO) oracle on K, K can become more interesting (e.g., the shortest-path or spanning-tree polytope).In this context, one of the most popular methods is the so-called Frank-Wolfe algorithm [19] (see [24] for a modern treatment), which iteratively computes a convex combination of vertices of K to obtain an approximate minimizer of a smooth convex function.
In the context of combinatorial optimization, there has been a considerable line of work on solving (implicit) packing and covering problems using the so-called multiplicative weights update (MWU) framework [33,31,20].In this framework, one must be able to implement an MWU oracle, which in essence computes optimal solutions for the target problem after the "difficult" constraints have been aggregated according to the current weights.This framework has been applied for getting fast (1 ± ε)-approximate solutions to multicommodity flow [33,20], packing spanning trees [8], the Held-Karp approximation for TSP [7], and more, where the MWU oracle computes shortest paths, minimum cost spanning trees, minimum cuts respectively in a sequence of weighted graphs.The MWU oracle is in general just a special type of LO oracle, which can often be interpreted as a SO that returns a maximally violated constraint.While certainly related to the SO model, it is not entirely clear how to adapt MWU to work with a general SO, in particular in settings unrelated to packing and covering.
A final line of work, which directly inspires our work, has examined simple iterative methods for computing a point in the interior of a cone Σ that directly apply in the SO model.The application of simple iterative methods for solving conic feasibility problems can be traced to Von Neumann in 1948 (see [15]), and a variant of this method, the perceptron algorithm [32] is still very popular today.Von Neumann's algorithm computes a convex combination of the defining inequalities of the cone, scaled to be of unit length, of nearly minimal Euclidean norm.The separation oracle is called to find an inequality violated by the current convex combination, and this inequality is then used to make the current convex combination shorter, in an analogous way to Frank-Wolfe.This method is guaranteed to find a point in the cone in O(1/ρ 2 ) iterations, where ρ is the so-called width of Σ (the radius of the largest ball contained in Σ centered at a point of norm 1).Starting in 2004, polynomial time variants of this and related methods (i.e., achieving log(1/ρ) dependence) have been found [6,17,10], which iteratively "rescale" the norm to speed up the convergence.These rescaled variants can also be applied in the oracle setting [4,11,14] with appropriate adaptations.The main shortcoming of existing conic approaches is that they are currently not well-adapted for solving optimization problems rather than feasibility problems.
Our approach.In this work, we build upon von Neumann's approach and utilize the Frank-Wolfe algorithm over the cone of valid inequalities of K as well as the subgradients of f in a way that yields a clean, simple, and flexible framework for solving general convex optimization problems in the SO model.For simpler explanation, let us assume that f (x) = c, x is a linear function and that we know an upper bound UB on the minimum of f over K. Given some linear inequalities a i , x ≤ b i that are valid for all x ∈ K, our goal is to find convex combinations p of the homogenized points (c, UB) and (a i , b i ) that are "close" to the origin.Note that if p = 0, the fact that K is full-dimensional implies that (c, UB) appears with a nonzero coefficient and hence (−c, −UB) is a nonnegative combination of the points (a i , b i ), which in turn shows that UB is equal to the minimum of f over K.In view of this, we will consider a potential Φ : R n+1 → R + with the property that if Φ(p) is sufficiently small, then the convex combination will yield an explicit certificate that UB is close to the minimum of f over K.
Given a certain convex combination p, note that the gradient of Φ at p provides information about whether moving towards one of the known points will (significantly) decrease Φ(p).However, if no such known point exists, it turns out that the "dehomogenization" of the gradient (a scaling of its projection onto the first n coordinates) is a natural point x ∈ R n to query the SO with.In fact, if x ∈ K, it will have improved objective value with respect to f .Otherwise, the SO will provide a linear inequality such that moving towards its homogenization decreases Φ(p).
In this work, we will show that the above paradigm immediately yields a rigorous algorithm for various natural choices of Φ and scalings of inequalities.We will also see that general convex functions can be directly handled in the same manner by simply replacing (c, UB) with all subgradient cuts of f learned throughout the iterations.The same applies to pure feasibility problems for which we set f = 0.The convergence analysis of our algorithm is simple and based on standard estimates for the Frank-Wolfe algorithm.
Besides its conceptual simplicity and distinction to existing methods for convex optimization in the SO model, we also regard it as a practical alternative.In fact, in terms of iterations, our vanilla implementation in Julia4 performs similarly and often even better than the standard cut loop and the analytic center cutting plane method evaluated on a testbed of oracle-based linear optimization problems for matching problems, semidefinite relaxations of the maximum cut problem, and LPBoost.Moreover, the flexibility of our framework leaves several degrees of freedom to obtain optimized implementations that outperform our naive implementation.

Algorithm
Recall that we are given first-order access to a convex function f : R n → R that we want to minimize over a convex body K ⊆ R n .In the case where f is not differentiable, with a slight abuse of notation we interpret ∇f (x) to be any subgradient of f at x.We can access K by a separation oracle that, given a point x ∈ R n , either asserts that x ∈ K or returns a point (a, b) ∈ A ⊆ R n+1 with a, x > b such that a, y ≤ b holds for all y ∈ K. Here, •, • denotes the standard scalar product and we assume that all points in A correspond to linear constraints valid for K.To state our algorithm, let • denote any norm on R n+1 and • * its dual norm.Moreover, let Φ : R n+1 → R + be any strictly convex and differentiable function with min x∈R n+1 Φ(x) = Φ(0) = 0. Our method is given in Algorithm 1, in which we denote the number of iterations by T for later reference.However, T does not need to be specified in advance, and the algorithm may be stopped at any time, e.g., when a solution or bound of desired accuracy has been found.
In Line 5, ∇Φ(p t )[1 : n] denotes the first n components of ∇Φ(p t ), and ∇Φ(p t )[n + 1] denotes the last component of ∇Φ(p t ).The sets A t and G t denote the already known/separated inequalities and objective gradients during iteration t.
Proof.Since p t minimizes Φ over conv(A t ∪ G t ), for every q ∈ conv(A t ∪ G t ) we have ∇Φ(p t ), q − p t ≥ 0. If p t = 0 then from strict convexity of Φ and min x∈R n+1 Φ(x) = Φ(0) = 0 we get ∇Φ(p t ), q ≥ ∇Φ(p t ), p t > 0. (1) First, apply (1) to q = (0, 1)/ (0, 1) * ∈ A t and conclude ∇Φ(p t )[n + 1] > 0. This makes sure that x t can be computed.Second, we apply Inequality (1) to Note that, for the sake of presentation, in Line 3 we require p t to be the convex combination of minimum Φ-value.However, it is usually not necessary to compute such a minimum.The same convergence rates can be obtained if, in every iteration, p t is a suitable convex combination of p t−1 and some (c, d) ∈ A t ∪ G t with ∇Φ(p t−1 ), (c, d) < 0. If the last coordinate of p t−1 , as discussed in the above proof, is not positive, then such an update can be made towards (0, 1)/ (0, 1) * ∈ A t .Any such update will significantly decrease Φ(p t ), and the computation in Line 3 is guaranteed to make at least that much progress.This shows that simple updates of p t , which may be more preferable in practice, still suffice to achieve the claimed convergence rates.
Proof.Let x * ∈ K minimize f (x) over x ∈ K and let F ⊂ [T − 1] be the set of iterations (except the last one) in which x t ∈ K. Now write the point p T as a convex combination where λ ≥ 0, γ ≥ 0 and (λ, γ) 1 = 1.Then we have Here, the inequalities respectively arise from convexity of f , that x * ∈ K satisfies (a, b), (−x * , 1) ≥ 0 for every (a, b) ∈ A T , and the Cauchy-Schwarz inequality.In particular, we find that min t∈F f (x t ) − f (x * ) ≤ 2 p T * t∈F γt whenever t∈F γ t > 0. To lower bound this latter quantity, we use the assumptions on z to derive the inequalities Now observe that (∇f (x t ), ∇f (x t ), x t ) * ≤ 1 for every t ∈ F and divide through by (−z, 1) to find α(1 − t∈F γ t ) ≤ p T * + t∈F γ t .Hence, if p T * ≤ α 2 then α/2 ≤ (α + 1) t∈F γ t .This lower bound on t∈F γ t suffices to prove the lemma.
Combining the previous two lemmas, we obtain the following convergence rate of our algorithm: * for all x ∈ R n+1 .Under the assumptions of Lemmas 2 and 3, Algorithm 1 computes, for every Proof.After T iterations, we have .
Let us now apply the previous findings to a concrete setting, in which we assume that the objective function given by a separation oracle A, and let f : R n → R be an L-Lipschitz convex function given by a subgradient oracle.
We now claim that our choice of input satisfies the conditions of Theorem 1 with β = 1/2 and α = r/4.Given the claim, Theorem 1 directly proves the result.To prove the claim, apart from verifying that the bounds on β and α hold, we must verify smoothness of Φ with respect to the dual norm, a bound of 2 on the norm of (−x, 1) for x ∈ K, as well as a dual norm bound of 1 on (∇f (x), ∇f (x), x ) for x ∈ K.

Computational experiments
In this section, we provide a computational comparison of our method with the standard cut loop, the ellipsoid method, and the analytic center cutting plane method on a testbed of linear optimization instances.For comparison purposes, all four methods are embedded into a common cutting plane framework such that the same termination criteria apply.
Framework.Each method has access to a separation oracle that is equipped with a set of initial linear inequalities valid for K (such as bounds on variables), which are incorporated within each method in a straightforward way.For instance, we initialize our algorithm by adding these constraints to the set A 1 .Moreover, for each instance, we will be given a finite upper bound UB and incorporate the linear inequality f (x) ≤ UB in a similar way.This upper bound gets updated whenever a feasible solution of better objective value was found.Our framework collects all inequalities queried by the current method and computes the resulting lower bound on the optimum value in every iteration.Each method is stopped whenever the difference of upper and lower bound is below 10 −3 .
We will also inspect the possibility of a smart oracle that, regardless of whether a given point x is feasible, may still provide a valid inequality as well as a feasible solution (for instance, by modifying x in a simple way so that it becomes feasible).Such an oracle is often automatically available and can have a positive impact on the performance of the considered algorithms.For the problems we consider, the actual implementation of a smart oracle will be specified below.
Implementation.The framework has been implemented in julia 1.6.2using JuMP and Gurobi 9.1.1.To guarantee a fair comparison, all four methods have been implemented in a straightforward fashion.We use the textbook implementation of the ellipsoid method, and Badenbroek's implementation of the analytic center cutting plane method [2].Our method is implemented5 in the spirit of Theorem 2, where p t is computed using Gurobi.
Test sets.We use three problem classes in our experiments: linear programming formulations of the maximum-cardinality matching problem, semidefinite relaxations of the maximum cut problem, and LPBoost instances for classification problems.
For the maximum-cardinality matching problem, we consider the linear program for all U ⊆ V with |U | odd , due to Edmonds [18], where G = (V, E) is a given undirected graph, δ(v) is the set of all edges incident to v, and E[U ] is the set of all edges with both endpoints in U .The latter constraints are handled within an oracle that computes an inequality minimizing (|U | − 1)/2 − e∈E[U ] x e , whereas the other inequalities are provided as initial constraints.For the above problem, the smart version of the oracle does not provide a feasible point since there is no obvious way of transforming a given point into a feasible one.However, the smart version always provides the minimizing inequality.
We consider 16 random instances with 500 nodes, generated as follows.For each r ∈ {30, 33, . . ., 75} we build an instance by sampling r triples of nodes {u, v, w} and adding the edges of the induced triangles to the graph, forming the test set matching.We believe that these instances are interesting because the r triangles give rise to many constraints to be added by the oracle.Moreover, we selected all 13 instances from the Color02 symposium [12] with less than 300 edges, yielding the test set matching02.
Our second set of instances is based on the semidefinite relaxation of Goemans and Williamson [21] for the maximum cut problem max {v,w}∈E where c are edge weights on the edges of (V, E).We add the box constraints X ∈ [−1, 1] V ×V to the initial constraints and handle the semidefiniteness constraint by a separation oracle that, given X, computes an eigenvector h of X of minimum eigenvalue and returns the inequality hh , X ≥ 0.
Within the smart version of the oracle, this constraint is returned regardless of the feasibility of X.If X is not feasible, the semidefinite matrix 1 λ−1 X − λ λ−1 I is returned, where λ denotes the minimum eigenvalue and I the identity matrix.We generated 10 complete graphs on 10 nodes with edge weights chosen uniformly at random in [0, 1].Our third set of instances arises from LPBoost [16], a classifier algorithm based on column generation.To solve the pricing problem in column generation, the following linear program is solved: where Ω is a set of parameters, for i ∈ [m], x i is a data point labeled as y i = ±1, h(•, ω) is a classifier parameterized by ω ∈ Ω that predicts the label of x i as h(x i , ω) ∈ {−1, +1}, and D > 0 is a parameter.In our experiments, we restrict h(•, ω) to be a decision tree of height 1, so-called tree stumps, and choose D = 5 n .To separate a point (γ , λ ), we use julia's DecisionTree module to compute a decision stump with score function λ that weights the data points, whose corresponding inequality classifies (γ , λ ) as feasible or not.A smart oracle always returns the computed inequality and decreases γ until (γ , λ ) becomes feasible according to the found decision stump.
We extracted all data sets from the UC Irvine Machine Learning Repository [35] that are labeled as multivariate, classification, ten-to-hundred attributes, hundred-to-thousand instances.Data sets with alpha-numeric values or too many missing values have been discarded.
Results.In what follows, we report on the number of iterations, i.e., oracle calls, each method needs to obtain a gap (upper bound minus lower bound) below 10 −3 .We impose a limit of 500 iterations per instance.Since we are testing naive implementations of each method, we do not report on running time.
To get more insights on the primal and dual performance of the tested methods, we also report on their primal and dual integrals.Note that we are solving maximization problems in this section, as opposed to minimization problems in Section 2. That is, primal (dual) solutions provide lower (upper) bounds on OPT.If i is the lower bound on the optimal objective value OPT in iteration i, the primal integral is 500 i=1 OPT− i OPT− 1 .The dual integral is computed analogously.If an integral is small, this indicates quick progress in finding the correct value of the corresponding bound.
Table 1 summarizes our results without smart oracles, where all numbers are average values.Here, "matching" refers to the random instances and "match-ing02" to the instances from the Color02 symposium.The standard cut loop is referred to as "LP", the ellipsoid method as "ellipsoid", the analytic center method as "analytic", and Algorithm 1 as "our".Note that Table 1 does not report on the primal integral of "LP" since the standard cut loop is a dual method.
We see that the ellipsoid and analytic center methods are struggling with solving any instance within 500 iterations independent from the problem class.Our algorithm solves the instances of the matching and max-cut problem much faster than the standard cut loop.Only for LPBoost, the standard cut loop  clearly dominates our algorithm.To better understand this behavior, the integrals reveal that our algorithm is better in improving the primal bound than the dual bound, with the only exception being LPBoost.The analytic center method, however, performs significantly worse than our algorithm in improving the primal bound.Regarding the dual bound, it performs better than our algorithm (with the exception of matching02).The ellipsoid method is much worse in improving the primal bound in comparison with the analytic center method and our algorithm.Regarding the dual bound, a similar trend can be observed with LPBoost being an exception.In summary, the analytic center cutting plane method improves the dual bound more quickly than our algorithm.It can find a good primal solution early as the primal integral is small, however it fails to close the remaining gap within the iteration limit.Our algorithm is able to close the primal gap faster, with the trade-off of a slightly slower dual convergence.A typical plot of the of the relative primal and dual gaps is given in Figure 1.
In a second experiment, we investigate the effect of smart oracles.As Table 2 shows, the algorithms mostly benefit from having access to a smart oracle in the case of LPBoost.A reason might be in the particular structure of these instances: the objective just consists of γ and every truncated convex combination λ is feasible.However, there is no impact of smart oracles on the matching and maxcut instances, respectively.

Table 1 :
Comparison of iterations and dual/primal integral without smart oracles.

Table 2 :
Comparison of iterations and dual/primal integral with smart oracles.