A Preconditioned Iterative Interior Point Approach to the Conic Bundle Subproblem

The conic bundle implementation of the spectral bundle method for large scale semidefinite programming solves a semidefinite quadratic subproblem by an interior point approach in each iteration. For larger cutting model sizes, the limiting operation is collecting and factorizing a Schur complement of the primal-dual KKT system. We explore possibilities to improve on this by an iterative approach that exploits structural low rank properties. Two preconditioning approaches are proposed and analyzed; both may be of interest for rank structured positive definite systems in general. The first employs projections onto random subspaces, the second projects onto a subspace that is chosen deterministically based on structural interior point properties. For both approaches theoretical bounds on the associated condition number are derived. In the instances tested, the deterministic preconditioner provides surprisingly efficient control on the actual condition number. The results suggest that for large scale instances the iterative solver is usually the better choice if precision requirements are moderate or if the size of the Schur complemented system clearly exceeds the active dimension within the subspace giving rise to the cutting model of the bundle method.


Introduction
In semidefinite programming the ever increasing number of applications [3,39,7,26] leads to a corresponding increase in demand for reliable and efficient solvers for linear programs over symmetric cones. In general, interior point methods are the method of choice. Yet, if the order of some semidefinite matrix variables gets large and the affine matrix functions involved do not admit decomposition or factorization approaches such as those proposed in [30,5,8], general interior point methods are no longer applicable. The limiting factors are typically the memory requirements and computation times connected with forming and factorizing a Schur complemented system matrix of the interior point KKT system. Large scale second order cone variables do not cause such problems; this is indeed specific to semidefinite settings. In such cases, the spectral bundle method of [25] offers a viable alternative.
The spectral bundle method reformulates the semidefiniteness condition via a penalty term on the extremal eigenvalues of a corresponding affine matrix function and assumes these eigenvalues to be efficiently computable by iterative methods. In each step it selects a subspace close to the current active eigenspace. Then the next candidate point is determined as the proximal point with respect to the extremal eigenvalues of the affine matrix function projected onto this subspace. The proximal point is the optimal solution to a quadratic semidefinite subproblem whose matrix variable is of the order of the dimension of the approximating subspace. If the subspace is kept small, this allows one to find approximately optimal solutions in reasonable time. In order to reach solutions of higher precision it seems unavoidable to go beyond the full active eigenspace [22,10]. In the current implementation within the callable library ConicBundle [19], which also supports second order cone and nonnegative variables, the quadratic subproblem is solved by an interior point approach. Again, for each of its KKT systems the limiting work consists in collecting and factorizing a Schur complement matrix whose order is typically the square of the dimension of the active eigenspace. The main question addressed here is whether it is possible to stretch these limits by developing a suitably preconditioned iterative solver that circumvents the collection and factorization of this Schur complement. The focus is thus not on the spectral bundle method itself but on solving KKT systems of related quadratic semidefinite and, more generally, quadratic conic programs by iterative methods. While the motivating and most general semidefinite case dominates in this work, natural extensions to second order and nonnegative cones will also be mentioned, because future applications may well expect and require support for arbitrary combinations of conic variables. Even though the methodology will be developed and discussed for low rank properties that arise in the ConicBundle setting, some of the considerations and ideas should be transferable to general conic quadratic optimization problems whose quadratic term consists of a positive diagonal plus a low rank Gram matrix, or maybe even to general positive definite systems of this form.
Here is an outline of the paper and its main contributions. Section 2 provides the necessary background on the bundle philosophy underlying ConicBundle and derives the KKT system of the bundle subproblem. The core of the work is presented in Section 3 on low rank preconditioning for a Gram-matrix plus positive diagonal. For slightly greater generality, denote the cone of positive (semi)definite matrices of order $m$ by $S^m_{++}$ ($S^m_+$) and let the system matrix be given in the form
\[ H = D + VV^\top \quad\text{with } D \in S^m_{++},\ V \in \mathbb{R}^{m\times n}, \]
where it is tacitly assumed that $D^{-1}$ times vector and $V$ times vector are efficiently computable. Typically $n \le m$, but whenever $n$ is sizable one would like to approximate $V$ by a matrix $\hat V \in \mathbb{R}^{m\times k}$ with significantly smaller $k < n$ to obtain a preconditioner $\hat H = D + \hat V\hat V^\top$ whose inverse, by a low rank update, reads $\hat H^{-1} = D^{-1} - D^{-1}\hat V(I_k + \hat V^\top D^{-1}\hat V)^{-1}\hat V^\top D^{-1}$. Comparing this to the inverse of $H$, the goal is to capture the large eigenvalues of $V^\top D^{-1}V$, more precisely the directions belonging to large singular values of $D^{-\frac12}V$. By the singular value decomposition (SVD) this can be achieved by the projection onto a subspace, say $D^{-\frac12}VP$ for a suitably chosen orthogonal $P \in \mathbb{R}^{n\times k}$. Because the full SVD is computationally too expensive, two other approaches will be developed and analyzed here. In the first, Section 3.1, the orthogonal $P$ is generated by a Gaussian matrix $\Omega \in \mathbb{R}^{n\times k}$. In the second, Section 3.2, some knowledge about the interior point method leading to $V$ will be exploited in order to form $P$ deterministically.
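As an illustration of this low rank update, the following minimal sketch (plain NumPy with hypothetical names, assuming the diagonal $D$ is stored as a vector) applies $\hat H^{-1}$ to a vector via the Woodbury identity without ever forming $\hat H$ explicitly:

```python
import numpy as np

def make_low_rank_preconditioner(d, V_hat):
    """Return a routine applying (D + V_hat V_hat^T)^{-1} to a vector,
    using the Woodbury identity
      H_hat^{-1} = D^{-1} - D^{-1} V_hat (I_k + V_hat^T D^{-1} V_hat)^{-1} V_hat^T D^{-1},
    where d holds the positive diagonal of D and V_hat is m x k."""
    Dinv_V = V_hat / d[:, None]                          # D^{-1} V_hat
    small = np.eye(V_hat.shape[1]) + V_hat.T @ Dinv_V    # I_k + V_hat^T D^{-1} V_hat
    L = np.linalg.cholesky(small)                        # factorize the k x k block once

    def apply(v):
        u = v / d                                        # D^{-1} v
        w = np.linalg.solve(L.T, np.linalg.solve(L, V_hat.T @ u))
        return u - Dinv_V @ w

    return apply
```

After the $O(mk^2)$ setup of the small factorization, each application costs only $O(mk)$ operations, which is the reason why a small $k$ is essential.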
The projection onto a random subspace may be motivated geometrically by interpreting the Gram matrix $VV^\top$ as the inner products of the row vectors of $V$. The result of Johnson-Lindenstrauss, cf. [1,9], allows one to approximate these with low distortion by a projection onto a low dimensional subspace. In matrix approximations this idea seems to have first appeared in [34]. In connection with preconditioning, a recent probabilistic approach is described in [27] in the context of controlling the error of an LU preconditioner. [17] gives an excellent introduction to probabilistic algorithms for constructing approximate matrix decompositions and provides useful bounds. Building directly on their techniques we provide deterministic and probabilistic bounds on the condition number of the random subspace preconditioned system in Theorems 6 and 7. In comparison to the moment analysis of the Ritz values of the preconditioned matrix presented in Theorem 4, the bounds seem to fall below expectation and may still be improvable. Random projections do not require any problem specific structural insights, but it remains open how to choose the subspace dimension in order to obtain an efficient preconditioner.
In contrast, identifying the correct subspace seems to work well for the deterministic preconditioning routine. It exploits structural properties of the KKT system's origin in interior point methods. Within interior point methods, iterative approaches have been investigated in quite a number of works; in conjunction with semidefinite optimization see e.g. [38,31]. These methods were mostly designed for exploiting sparsity rather than low rank structure. During the last months of this work an approach closely related to ours appeared in [16]. It significantly extends ideas of [40] for a deterministic preconditioning variant. It assumes the rank of the optimal solution to be known in advance and provides a detailed analysis for this case. Their ideas and arguments heavily influenced the condition number analysis of our approach presented in Theorems 2 and 9. In contrast to [16], our algorithmic approach does not require any a priori knowledge of the rank of the optimal solution. Rather, Theorem 9 and Lemma 12 motivate an estimate on the singular value induced by certain directions associated with active interior point variables, which seems to offer a good indicator for the relevance of the corresponding subspace.
In Section 4 the performance of the preconditioning approaches is illustrated relative to the direct solver on sequences of KKT systems that arise in solving three large scale instances within ConicBundle. The deterministic approach turns out to be surprisingly effective in identifying a suitable subspace. It provides good control on the condition number and reduces the number of matrix vector multiplications significantly. The selected instances are also intended to demonstrate the differences in the potential of the methods depending on the problem characteristics. Roughly, the direct solver is mainly attractive if the model is tiny, if significant parts of the Schur complement can be precomputed for all KKT systems of the same subproblem, or if precision requirements get exceedingly high with the entire bundle model being strongly active. In general, however, the iterative approach with deterministic preconditioner can be expected to lead to significant savings in computation time in large scale applications. In order to demonstrate that this KKT systems based analysis suitably reflects the performance of the solvers within the bundle method, the section closes by reporting preliminary experiments on randomly generated Max-Cut instances where ConicBundle is run for each solver separately with exactly the same parameter settings that were developed and tuned for the direct solver. The paper ends with some concluding remarks in Section 5.
Notation. For matrices or vectors $A, B \in \mathbb{R}^{m\times n}$ the (trace) inner product is denoted by $\langle A, B\rangle = \operatorname{tr} B^\top A = \sum_{ij} A_{ij}B_{ij}$. $A \bullet B = (A_{ij}B_{ij})$ denotes the elementwise or Hadamard product. The eigenvalues of $A$ are denoted by $\lambda_{\max}(A) = \lambda_1(A) \ge \cdots \ge \lambda_n(A) = \lambda_{\min}(A)$. The norm $\|\cdot\|$ refers to the Euclidean norm for vectors and to the spectral norm for matrices. $I_n$ ($I$) denotes the identity matrix of order $n$ (or of appropriate size); the canonical unit vectors $e_i$ refer to the $i$-th column of $I$. Unless stated explicitly otherwise, $\mathbf{1}$ denotes the vector of all ones of appropriate size. $E$ refers to the expected value of a random variable, $\operatorname{Var}$ to its variance and $N(\mu, \sigma^2)$ to the normal or Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.

The KKT system of the ConicBundle Subproblem
The general setting of bundle methods deals with minimizing a (typically closed) convex function $f : \mathbb{R}^m \to \overline{\mathbb{R}} := \mathbb{R}\cup\{\infty\}$ over a closed convex ground set $C \subseteq \operatorname{dom} f$ of simple structure like $\mathbb{R}^m$, a box or a polyhedron,
\[ \text{minimize } f(y) \quad\text{subject to } y \in C. \]
Typically, $f$ is given by a first order oracle, i.e., a routine that returns for a given $\bar y \in C$ the function value $f(\bar y)$ and an arbitrary subgradient $g \in \partial f(\bar y)$ from the subdifferential of $f$ in $\bar y$. Value $f(\bar y)$ and subgradient $g$ give rise to a supporting hyperplane to the epigraph of $f$ in $(\bar y, f(\bar y))$. The algorithm collects these affine minorants in the bundle to form a cutting model of $f$. It will be convenient to arrange the value at zero and the gradient in a pair $\omega = (\gamma = f(\bar y) - \langle g, \bar y\rangle, g)$ and to denote, for $y \in \mathbb{R}^m$, the minorant's value in $y$ by $\omega(y) := \gamma + \langle g, y\rangle$.
Let $W_f = \{\omega = (\gamma, g) \in \mathbb{R}^{1+m} : \gamma + \langle g, y\rangle \le f(y),\ y \in \mathbb{R}^m\}$ denote the set of all affine minorants of $f$. For closed $f$ we have $f(y) = \sup_{\omega\in W_f}\omega(y)$. Any compact subset $W \subseteq W_f$ gives rise to a minorizing cutting model $f_W(y) := \sup_{\omega\in W}\omega(y) \le f(y)$ of $f$. At the beginning of iteration $k = 0, 1, \dots$ the bundle method's state is described by a current stability center $\hat y^k \in \mathbb{R}^m$, a compact cutting model $W^k \subseteq W_f$, and a proximity term, here the square of a norm $\|\cdot\|^2_{\mathfrak{H}_k} := \langle\cdot, \mathfrak{H}_k\,\cdot\rangle$ with positive definite $\mathfrak{H}_k$ (this Fraktur $\mathfrak{H}$ will form the core of the final system matrix $H$). The method determines the next candidate $y^{k+1} \in \mathbb{R}^m$ as minimizer of the augmented model or bundle subproblem
\[ \text{minimize } f_{W^k}(y) + \tfrac12\|y - \hat y^k\|^2_{\mathfrak{H}_k} \quad\text{subject to } y \in C. \tag{1} \]
Solving this bundle subproblem may be viewed as determining a saddle point $(y^{k+1}, \bar\omega^{k+1} = (\bar\gamma^{k+1}, \bar g^{k+1})) \in C \times \operatorname{conv} W^k$, which exists for any closed convex $C$ by [36], Theorems 37.3 and 37.6, due to the strong convexity in $y$, the compactness of $W^k$, and their interplay. While most bundle implementations employ polyhedral cutting models combined with a suitable active set QP approach, the ConicBundle callable library [19] is primarily designed for (nonnegative combinations of) conic cutting models built from symmetric cones. In particular the cone of positive semidefinite matrices and the second order cone lead to nonpolyhedral models that change significantly in each step. In solving (1) for these models, interior point methods are currently the best option available, but with these methods the cost of assembling the coefficients and solving the subproblem dominates the work per iteration for most applications. This paper explores possibilities to replace the classical Schur complement approach for computing the Newton step by an iterative approach in order to improve applicability to large scale problems.
As the main focus of this work is on solving problem (1) for a single iteration, we will refrain from giving the iteration index $k$ in the following. In order to describe the main structure of the primal dual KKT system for (1), let us briefly sketch the conic cutting models employed. These build on combinations of the cone of nonnegative vectors, the second order cone and the cone of positive semidefinite matrices, each with a specific trace vector for measuring the "size" of its elements. Cartesian products of these are described by a triple $t = (n, q, s)$ with $n \in \mathbb{N}_0$, $q \in \mathbb{N}^{n_q}$, $s \in \mathbb{N}^{n_s}$ for some $n_q \in \mathbb{N}_0$, $n_s \in \mathbb{N}_0$, specifying the cone
\[ S^t_+ = \mathbb{R}^n_+ \times Q^{q_1} \times \cdots \times Q^{q_{n_q}} \times S^{s_1}_+ \times \cdots \times S^{s_{n_s}}_+. \tag{2} \]
The cone $S^t_+$ will be regarded as a full dimensional cone in $\mathbb{R}^t := \mathbb{R}^{n(t)}$ with $n(t) = n + \mathbf{1}^\top q + \sum_{i=1}^{n_s}\binom{s_i+1}{2}$. Whenever convenient, an $x \in S^t_+$ or $\mathbb{R}^t$ will be split into $x = (\xi^\top, (x^{(1)})^\top, \dots, (x^{(n_q)})^\top, \operatorname{svec}(X^{(1)})^\top, \dots, \operatorname{svec}(X^{(n_s)})^\top)^\top$ in the natural way. Indices or elements may be omitted if the corresponding counters $n$, $n_q$, $n_s$ happen to be 0 or 1. The trace $\operatorname{tr}(\cdot)$ of an element $x \in S^t$ is defined to be $\operatorname{tr}(x) = \langle\mathbf{1}_t, x\rangle$ for the concatenated trace vector $\mathbf{1}_t$ of the participating cones. In ConicBundle each cutting model may be considered to be specified by a tuple $M = (t, \tau, K, B, \omega)$, where $t = (n, q, s)$ specifies the cone as above, $\tau > 0$ gives the trace value or trace upper bound (with $K \in \{\{0\}, \mathbb{R}_+\}$ distinguishing the two cases), $y \mapsto B(y) = B_0 + By$ represents the bundle as affine function, and $\omega = (\gamma, g)$ provides a constant offset subgradient.
For example, for the standard polyhedral model built from $h$ subgradients $\omega_i = (\gamma_i, g_i)$, $i = 1, \dots, h$, maximizing $\langle B(y), x\rangle$ over $\xi \in \mathbb{R}^h_+$ with $\mathbf{1}^\top\xi = 1$ finds the best convex combination of the subgradients at $y$. In polyhedral models $B$ may well be sparse (consider e.g. Lagrangean relaxations of multi-commodity flow problems like in the train time-tabling application described in [13,14]), but in combination with positive semidefinite models large parts of $B$ will be dense (see [25]; for concreteness, consider $f(y) = \lambda_{\max}(\sum_{i=1}^m A_iy_i)$ with $A_i \in S^n$ and let the spectral bundle be described by $h$ orthonormal columns $P \in \mathbb{R}^{n\times h}$ that approximately span the eigenspace to large eigenvalues; then the model dimension is $\binom{h+1}{2}$ and $B_{\bullet,i} = \operatorname{svec}(P^\top A_iP)$ for $i = 1, \dots, m$). Therefore we will not assume any specific structure in the bundle $B$.
Depending on the variable metric heuristic in use, see [22,23] for typical choices in ConicBundle, the proximal term may either be a positive multiple of $I_m$ or it may be of the form $\mathfrak{H} = D_H + V_HV_H^\top$, where $D_H$ is a diagonal matrix with strictly positive diagonal entries and $V_H \in \mathbb{R}^{m\times h_H}$ specifies a rank $h_H$ contribution. If $V_H$ is present, it typically consists of dense orthogonal columns.
The final ingredient is the basic optimization set $C$, which may have a polyhedral description of the form $C = \{y \in \mathbb{R}^m : \underline a \le Ay \le \bar a,\ \underline y \le y \le \bar y\}$ for given data $A \in \mathbb{R}^{h_A\times m}$, $\underline a, \bar a \in \mathbb{R}^{h_A}$ and $\underline y, \bar y \in \mathbb{R}^m$. If $A$ is employed in applications, we expect the number of rows $h_A$ of $A$ to be small in comparison to $m$. The set $C$ is tested for feasibility in advance, but no preprocessing is applied. In order to reduce the indexing load in this presentation, we consider problem (1) for a single model $(t, \tau, K, B, \omega = (\gamma, g))$. Putting $b = -\mathfrak{H}\hat y + g$ and $\delta = \frac12\langle\mathfrak{H}\hat y, \hat y\rangle + \gamma$, the bundle problem may be written in the form
\[ \min_{y\in C}\ \max_{x\in S^t_+,\ \tau-\operatorname{tr} x\in K}\ \tfrac12\langle\mathfrak{H}y, y\rangle + \langle b, y\rangle + \langle B_0, x\rangle + \langle By, x\rangle + \delta. \tag{4} \]
The existence of saddle points is guaranteed by compactness of the $x$-set and by strong convexity in $y$ due to $\mathfrak{H} \succ 0$, see e.g. [28]. For the purpose of presentation, the primal dual KKT system for solving the saddle point problem (4) is built for $\underline a < \bar a$, $\underline y < \bar y$ and $K = \mathbb{R}_+$. The extension to the equality cases follows quite naturally and will be commented on at appropriate places. Throughout we will assume that $A$, $A^\top$ and $B$, $B^\top$ are given by matrix-vector multiplication oracles. $A$ is assumed to have few rows, $B$ may actually have a large number of columns, $\mathfrak{H}$ is a positive diagonal plus low rank, but no further structural information is assumed to be available. The original spectral bundle approach of [25] was designed for the unconstrained case $C = \mathbb{R}^m$, which allows direct elimination of $y$ by convex optimality. Setting up the maximization problem for $x$ then requires forming the typically dense Schur complement $B\mathfrak{H}^{-1}B^\top$. For increasing bundle sizes this is in fact the limiting operation within each bundle iteration. The aim of developing an iterative approach for (4) is therefore not only to allow for general $C$ but also to circumvent the explicit computation of this Schur complement.
For setting up a primal-dual interior point approach for solving (4), the dual variables to the constraints on the minimizing side will be denoted by $s \in \mathbb{R}^{h_A}$, $\underline s_a, \bar s_a \in \mathbb{R}^{h_A}_+$, $\underline s_y, \bar s_y \in \mathbb{R}^m_+$, and the dual variables to the constraints on the maximizing side by $z \in S^t_+$ and $\zeta \in K^* = \mathbb{R}_+$. In the perturbed complementarity conditions, "$\bullet$" denotes the componentwise Hadamard product and "$\bullet_t$" a canonical generalization to the cone $S^t_+$, employing the arrow operator for second order cone parts and (typically symmetrized) matrix products for semidefinite parts.
In solving this by Newton's method, the linearization of the first perturbed complementarity line yields expressions for $\Delta\underline s_a$ and $\Delta\bar s_a$. For writing the difference $\Delta\bar s_a - \Delta\underline s_a$ compactly it is advantageous to introduce suitable vectors $d_a$ and $c_a$. Likewise, for the second perturbed complementarity line, introduce $d_y$ and $c_y$ to obtain $\Delta\bar s_y - \Delta\underline s_y = \Delta y \bullet d_y + \bar s_y - \underline s_y - \mu c_y$.
For dealing with the conic complementarity "$\bullet_t$" we employ the symmetrization operators $E_t$ and $F_t$ of [37] in diagonal block form corresponding to $S^t_+$, which give rise to a symmetric positive definite $X_t = E_t^{-1}F_t \succ 0$ with diagonal block structure according to $S^t_{++}$. With this the last perturbed complementarity line can be linearized as well. Employing the linearization of the defining equation for $s$, $\Delta\bar s_a - \Delta\underline s_a = \Delta s + s + \underline s_a - \bar s_a$, the variable $\Delta w$ may now be eliminated. Put $D_y := \operatorname{Diag}(d_y) > 0$ and $D_w := \operatorname{Diag}(d_w) > 0$; then the Newton step is obtained by solving the symmetric indefinite system (6) with the corresponding right hand side. The right hand side may be modified as usual to obtain predictor and corrector right hand sides, but this will not be elaborated on here. Note, for $K = \{0\}$ the same system works with $\sigma = 0$ and without the centering terms associated with $\zeta$ in $r_\zeta$. Likewise, whenever line $i$ of $A$ corresponds to an equation, i.e., $\underline a_i = \bar a_i$, the respective entry of $d_w^{-1}$ has to be replaced by zero. Equation (6) is a symmetric indefinite system that could be solved by appropriate iterative methods directly. So far, however, we were not able to conceive suitable general preconditioning approaches for exploiting the given structural properties in the full system. Surprisingly, a viable path seems to be offered by the traditional Schur complement approach after all. The resulting system allows matrix vector multiplications to be performed at minimal additional cost and is frequently positive definite.
To see this, first take the Schur complement with respect to the $X_t^{-1}$ block. This results in a system matrix of the form $H = D + VV^\top$ as given in (8), with $D$ a positive diagonal and the columns of $V$ collecting a low rank term for each constraint and model block. Note that the multiplication of $V$ with a vector requires only a little more than half the number of operations of multiplying $H$ (or the full KKT matrix) with a vector. This suggests exploring possibilities of finding low rank approximations of $V$ for preconditioning.
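To make the oracle viewpoint concrete, the following sketch (illustrative names only, not the ConicBundle interface) assembles the product with $H = D + VV^\top$ from the two oracles and wraps it for an iterative solver such as MINRES:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def make_H_operator(d, V_mul, Vt_mul, m):
    """H = D + V V^T given only multiplication oracles for V and V^T.
    One product with H needs one call to each oracle plus the cheap
    diagonal part, so a product with V alone costs roughly half."""
    return LinearOperator((m, m), matvec=lambda v: d * v + V_mul(Vt_mul(v)))

# usage sketch: solve H x = r, optionally with a preconditioner M
# H_op = make_H_operator(d, V_mul, Vt_mul, m)
# x, info = minres(H_op, r, M=M_operator)
```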
Low rank preconditioning a Gram-matrix plus positive diagonal

Consider the system matrix (8), $H = D + VV^\top$, with a positive definite matrix $D \in S^m_{++}$ and $V \in \mathbb{R}^{m\times n}$. In our application $D$ is diagonal, but the results apply for general $D \succ 0$; this remains practical as long as $D^{-1}$ can be applied efficiently to vectors. Matrix $V$ is assumed to be given by a matrix-vector multiplication oracle, i.e., $V$ and $V^\top$ may be multiplied by vectors but the matrix does not have to be available explicitly.
For motivating the following preconditioning approaches, first consider (without actually computing it) the singular value decomposition $D^{-\frac12}V = Q_H\Sigma_0P_H^\top$ with orthogonal $Q_H \in \mathbb{R}^{m\times m}$, $\Sigma_0^\top = [\Sigma\ 0]$ for $\Sigma = \operatorname{Diag}(\sigma_1, \dots, \sigma_n)$ ordered nonincreasingly by $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$, and orthogonal $P_H \in \mathbb{R}^{n\times n}$ (for convenience, it is assumed that $n \le m$). Then $H = D^{\frac12}Q_H(I_m + \Sigma_0\Sigma_0^\top)Q_H^\top D^{\frac12}$. When $\Sigma$ is restricted to the $k$ largest singular values, this gives rise to a good "low rank" preconditioner, see Theorem 1 below. Computing the full matrix $D^{-\frac12}V$ and its singular value decomposition will in general be too costly or even impossible. Instead the general idea is to work with $D^{-\frac12}V\Omega$ for some random or deterministic choice of $\Omega \in \mathbb{R}^{n\times k}$. Multiplying by a random $\Omega$ may be thought of as giving rise to a subspace approximation in the style of Johnson-Lindenstrauss, cf. [1,9], and this formed the starting point of this investigation. The actual randomized approach and analysis, however, mainly builds on [17] and the bounding techniques presented there. For the deterministic preconditioning variant the recent work [16] provided strong guidance for analyzing the condition number.
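Before turning to these, the effect of the ideal choice, projecting onto the leading right singular vectors, is easy to check numerically. The following small self-contained experiment (random data, illustrative sizes) reproduces the condition number $1 + \sigma_{k+1}^2$ asserted in Theorem 1 below:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 50, 10
d = rng.uniform(1.0, 2.0, m)                 # diagonal of D
V = rng.standard_normal((m, n))

# SVD of D^{-1/2} V; rows of PHt are the right singular vectors
_, sig, PHt = np.linalg.svd(V / np.sqrt(d)[:, None], full_matrices=False)
V_hat = V @ PHt.T[:, :k]                     # V Omega for the k leading directions

H = np.diag(d) + V @ V.T
H_hat = np.diag(d) + V_hat @ V_hat.T
L = np.linalg.cholesky(H_hat)
ev = np.linalg.eigvalsh(np.linalg.solve(L, np.linalg.solve(L, H).T))
print(ev[-1] / ev[0], 1 + sig[k] ** 2)       # both values agree
```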
Here, $\Omega$ will mostly consist of orthonormal columns. Yet it is instructive to consider more general cases as well. An arbitrary $\Omega \in \mathbb{R}^{n\times k}$ gives rise to the preconditioner
\[ \hat H(\Omega) := D + V\Omega\Omega^\top V^\top. \tag{9} \]
Putting $\hat G(\Omega) := D^{-\frac12}V\Omega$, the spectrum of the preconditioned system can be related to that of matrices of order $n$ built from $\Sigma$ and $P_H^\top\Omega$; in this, the corresponding equations follow from $BB^\top$ and $B^\top B$ having the same eigenvalues for $B \in \mathbb{R}^{n\times n}$.
Theorem 1 Let $H = D + VV^\top \in S^m_{++}$ with positive definite $D \in S^m_{++}$ and $V \in \mathbb{R}^{m\times n}$ with $n < m$ and singular value decomposition $D^{-\frac12}V = Q_H\Sigma_0P_H^\top$ as above. For $\Omega \in \mathbb{R}^{n\times k}$ the preconditioner $\hat H(\Omega)$ of (9) results in a condition number governed by the extremal eigenvalues of $(I_n + \Sigma\Omega'\Omega'^\top\Sigma)^{-\frac12}(I_n + \Sigma^2)(I_n + \Sigma\Omega'\Omega'^\top\Sigma)^{-\frac12}$ for $\Omega' = P_H^\top\Omega$, together with the trivial eigenvalue 1. In particular, for $0 \le k < n$ and $\Omega = (P_H)_{\bullet,[1,\dots,k]}$ the condition number's value is $1 + \sigma^2_{k+1}$.
Proof. For $\hat G(\Omega)$ as above, direct computation yields the stated form of the preconditioned system. The eigenvalues of $(I_n + \Sigma\Omega\Omega^\top\Sigma)^{-\frac12}(I_n + \Sigma^2)(I_n + \Sigma\Omega\Omega^\top\Sigma)^{-\frac12}$ coincide with those of $(I_n + \Sigma^2)^{\frac12}(I_n + \Sigma\Omega\Omega^\top\Sigma)^{-1}(I_n + \Sigma^2)^{\frac12}$, and for $\Omega = (P_H)_{\bullet,[1,\dots,k]}$ these evaluate to 1 for the first $k$ directions and to $1 + \sigma_i^2$ for $i > k$.

Consider now a subspace spanned by $k$ orthonormal columns collected in some matrix $P \in \mathbb{R}^{n\times k}$ which hopefully generates most of the large directions of $D^{-\frac12}V$. In this orthonormal case a simpler bound on the condition number may be obtained by following the argument of Th. 5.1 in [16].
The second summand is positive semidefinite with minimum eigenvalue 0 if and only if $\operatorname{rank}(VP) < m$. Thus, the result is proved.
Building on these two theorems we first analyze randomized approaches that do not make any assumptions on structural properties of $D^{-\frac12}V$ but only require a multiplication oracle. Afterwards we present a deterministic approach that exploits some knowledge of the bundle subproblem and the interior point algorithm. The corresponding routines supply a $\hat V = V\Omega$. The actual preconditioning routine, Alg. 3 below, does not use $\hat H(\Omega)$ directly, but a truncated preconditioner $\hat H(\Omega\bar P)$ that drops all singular values of $D^{-\frac12}V\Omega$ that are less than one. The inverse is then formed via the Woodbury formula, see [29, §0.7.4]. Note, depending on the expected number of calls to the routine and the structure preserved in $\hat V$, it may or may not pay off to also precompute $\hat V\bar P$. For diagonal $D$ and dense $\hat V$ the cost of applying this preconditioner is $O(m + mk + k\bar k)$.

Algorithm 3 (Preconditioning by truncated $\hat H(\Omega) = D + V\Omega(V\Omega)^\top$)
Input: $v \in \mathbb{R}^m$, $D \in S^m_{++}$, precomputed $\hat V = V\Omega \in \mathbb{R}^{m\times k}$ and, for $\hat V^\top D^{-1}\hat V = P\operatorname{Diag}(\lambda_1 \ge \cdots \ge \lambda_k)P^\top$, $\bar k = \max\{0, i : \lambda_i \ge 1\}$, $\bar\Lambda = \operatorname{Diag}(\lambda_1, \dots, \lambda_{\bar k})$, $\bar P = P_{\bullet,[1,\dots,\bar k]}$.
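A minimal sketch of how this truncated preconditioner can be realized is given below (illustrative names; the eigenvalue decomposition of $\hat V^\top D^{-1}\hat V$ is formed once, every application then follows the Woodbury formula, whose middle block reduces to the diagonal $I + \bar\Lambda$):

```python
import numpy as np

def truncated_preconditioner(d, V_hat):
    """Sketch of Alg. 3: apply the inverse of the truncated preconditioner
    H_hat = D + (V_hat Pbar)(V_hat Pbar)^T to vectors. The eigenvalues of
    V_hat^T D^{-1} V_hat are the squared singular values of D^{-1/2} V_hat;
    directions with value below one are dropped."""
    G = V_hat.T @ (V_hat / d[:, None])       # V_hat^T D^{-1} V_hat
    lam, P = np.linalg.eigh(G)               # ascending eigenvalues
    keep = lam >= 1.0                        # truncate singular values < 1
    Pbar, lam_bar = P[:, keep], lam[keep]
    W = (V_hat @ Pbar) / d[:, None]          # D^{-1} V_hat Pbar, precomputed

    def apply(v):
        u = v / d
        return u - W @ ((W.T @ v) / (1.0 + lam_bar))

    return apply
```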

Preconditioning by Random Subspaces
For the random subspace approach fix some $k \in \mathbb{N}$ with $2 \le k < n$. At first consider $\Omega$ to be an $n\times k$ random matrix whose elements are independently identically distributed by the normal distribution $N(0, \frac1k)$. For this $\Omega$ consider the low rank approximation $\hat H(\Omega) = D + V\Omega\Omega^\top V^\top$. Because the normal distribution is invariant under orthogonal transformations, we may assume $P_H = I$ and analyze the setting $D^{-\frac12}V\Omega = Q_H\Sigma_0\Omega$, giving rise to the low rank approximation by the random matrix $\hat H(\Omega)$. In view of Theorem 1 such a preconditioner is good if $(I + \Sigma^2)^{-\frac12}(I + \Sigma\Omega\Omega^\top\Sigma)(I + \Sigma^2)^{-\frac12}$ is close to the identity. Based on the Johnson-Lindenstrauss interpretation, it seems likely that large portions of the spectrum will be close to one. This can be justified to some extent by studying the moments of the Ritz values.
The expected value of the quadratic form evaluates to the first claim. For determining the variance, the second moment may be computed as follows. In the case $h = h'$ only terms with $i = i'$ and $j = j'$ are nonzero. Summing up these three expressions yields the second moment, and the result now follows from the usual $\operatorname{Var}X = E(X^2) - (EX)^2$ for any random variable $X$.
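The concentration suggested by this moment analysis is easily visualized by sampling; the following toy experiment (made-up singular values, not from an actual KKT system) typically shows the bulk of the spectrum close to one:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 60
sigma = np.sort(rng.uniform(0, 10, n))[::-1]       # assumed singular values
Sigma = np.diag(sigma)
Omega = rng.standard_normal((n, k)) / np.sqrt(k)   # entries i.i.d. N(0, 1/k)

S = np.diag(1.0 / np.sqrt(1.0 + sigma**2))         # (I + Sigma^2)^{-1/2}
M = S @ (np.eye(n) + Sigma @ Omega @ Omega.T @ Sigma) @ S
ev = np.linalg.eigvalsh(M)
print(ev.min(), np.median(ev), ev.max())           # bulk of the Ritz values near 1
```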
This suggests that even for relatively small $k$ the behavior of the preconditioned system may be expected to be reasonably close to the identity for a large portion of the directions. The result, however, does not seem to open a path towards good bounds on the condition number. A first possibility is offered by Theorem 2. Recall that for an arbitrary matrix $A \in \mathbb{R}^{m\times n}$ the projector $P_A := A(A^\top A)^\dagger A^\top$ projects any vector of $\mathbb{R}^m$ onto the range space of $A$, and $P_A$ depends only on this range space.
Computationally it may be determined by a QR decomposition of $A$. The formula allows one to verify $P_A = P_A^\top$, $P_AP_A = P_A$ and $P_AA = A$ by direct computation. Furthermore, for any $B \in \mathbb{R}^{n\times h}$ there holds $P_{AB} \preceq P_A \preceq I_m$ because of the containment relations between the ranges.
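For completeness, a few lines verifying these projector properties numerically (assuming full column rank, since a QR decomposition without pivoting does not reveal rank reliably):

```python
import numpy as np

def range_projector(A):
    """P_A = A (A^T A)^dagger A^T via a thin QR factorization;
    here A is assumed to have full column rank."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

A = np.random.default_rng(2).standard_normal((8, 3))
P = range_projector(A)
assert np.allclose(P, P.T)       # P_A is symmetric
assert np.allclose(P @ P, P)     # P_A is idempotent
assert np.allclose(P @ A, A)     # P_A acts as identity on range(A)
```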
In the following, $\Omega\Omega^\top$ in $\hat H(\Omega)$ will be replaced by the projector $P_\Omega$. The random low rank approximation to be considered reads $\hat H(P_\Omega) := D + VP_\Omega V^\top$. The following deterministic result holds for any matrix $\Omega \in \mathbb{R}^{n\times k}$.

Proof. Let $P_\Omega = \tilde P\tilde P^\top$ with $\tilde P \in \mathbb{R}^{n\times\tilde k}$ for some $\tilde k \le k$ and $\tilde P^\top\tilde P = I_{\tilde k}$. Add orthonormal columns $\tilde P'$ so that $P = [\tilde P, \tilde P']$ satisfies $P^\top P = I_n$. Note, $I - P_\Omega = \tilde P'\tilde P'^\top$ is the projector onto the orthogonal complement. Using this choice in Theorem 2 bounds the condition number of the system preconditioned by $\hat H(P_\Omega)$ as claimed. While this bound is rather straightforward to derive, it does not seem strong enough to capture a reduced influence of the largest singular values of $D^{-\frac12}V$. Indeed, in its derivation only the diagonal of $\hat H(P_\Omega)$ was considered and the influence of $V\Omega$ is lost.
In order to obtain stronger bounds, the rather involved techniques laid out in [17] seem to be required. The next steps and results follow their arguments closely. This time the $\hat H(P_\Omega)$ part is kept inverted in the analysis of the condition number.
Because $I + \Sigma P_\Omega\Sigma \preceq I + \Sigma^2$, by Theorem 1 the condition number is bounded by $1 + \sigma_1^2$ and will attain this value in the worst case whenever $n < m$. In terms of $\Omega$, the best possible outcome is an event resulting in $P_\Omega = \left[\begin{smallmatrix} I_k & 0\\ 0 & 0\end{smallmatrix}\right]$ (see, e.g., [29, 7.4.52]). It corresponds to the truncated SVD and condition number $1 + \sigma^2_{k+1}$. Aiming for something more realistic, one hopes for a good coverage of the first $k$ singular values when oversampling with $p$ additional columns.
The first step in the analysis is to obtain a deterministic bound for a fixed Ω ∈ R n×(k+p) as outlined in [17, §9.2].
Theorem 6 Given $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$ and a matrix $\Omega \in \mathbb{R}^{n\times(k+p)}$ with $k \le n$ so that the first $k$ rows of $\Omega$ are linearly independent, split $\Sigma = \left[\begin{smallmatrix}\Sigma_1 & 0\\ 0 & \Sigma_2\end{smallmatrix}\right]$ into the blocks $\Sigma_1 = \operatorname{Diag}(\sigma_1, \dots, \sigma_k)$ and $\Sigma_2 = \operatorname{Diag}(\sigma_{k+1}, \dots, \sigma_n)$, and split $\Omega = \left[\begin{smallmatrix}\Omega_1\\ \Omega_2\end{smallmatrix}\right]$ into the first $k$ rows $\Omega_1 \in \mathbb{R}^{k\times(k+p)}$ and the last $n-k$ rows $\Omega_2 \in \mathbb{R}^{(n-k)\times(k+p)}$. Then the condition number $\kappa_{P_\Omega}$ is bounded in terms of $\|(I + \Sigma_2^2)^{\frac12}\Omega_2\Omega_1^\dagger\|^2$ and $1 + \sigma^2_{k+1}$.

Proof. By assumption $\Omega_1$ has full row rank, and the range space of the matrix $Z := \Omega\Omega_1^\dagger$ is contained in the range space of $\Omega$. Hence $P_Z \preceq P_\Omega$, and the condition number obtained for $P_Z$ bounds the one for $P_\Omega$. The projector $P_Z$ computes to $Z(Z^\top Z)^{-1}Z^\top$. Use this in the Woodbury formula for inverses of rank adjustments [29, §0.7.4] for $(I + \Sigma P_Z\Sigma)^{-1}$; the second diagonal block of the resulting representation (11) asserts, for $\Lambda = I + \Sigma_1^2 + \bar\lambda I$, that $\Sigma_2F\Lambda^{-1}F^\top\Sigma_2 \preceq I$, the relation being implied by semidefinite scaling. Employing [17, Prop. 8.3] now bounds the condition number by a sum of terms. The last term evaluates to $\lambda_{\max}(I + \Sigma_2^2) = 1 + \sigma^2_{k+1}$; for the second last term, substituting in the definitions of $\Lambda$ and $\bar\lambda$ yields the term governed by $\|(I + \Sigma_2^2)^{\frac12}\Omega_2\Omega_1^\dagger\|^2$.

The current bound falls somewhat short of expectation because of the identity in $(I + \Sigma_2^2)^{\frac12}\Omega_2\Omega_1^\dagger$. By Theorem 1 and $I_n \preceq I_n + \Sigma P_\Omega\Sigma \preceq I_n + \Sigma^2$, the use of projectors will never result in condition numbers larger than $1 + \sigma_1^2$, so the influence of the dimension seems to be too dominant in this bound. Maybe a better bound is achievable by a more sophisticated argument. The deterministic bound also allows one to make use of the probabilistic bounds on $\|(I + \Sigma_2^2)^{\frac12}\Omega_2\Omega_1^\dagger\|$ for standard Gaussian $n\times(k+p)$ matrices $\Omega$ (i.e., matrix elements are independently $N(0,1)$ distributed) given in [17]. These shed some light on the advantage of employing oversampling by $p$ additional random vectors in $\Omega$. In our application, oversampling corresponds to computing the singular values of $D^{-\frac12}VP_\Omega$ for $k+p$ columns in order to get better control on the $k$ largest singular values of $D^{-\frac12}V$ by the preconditioner $\hat H(P_\Omega)$.
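The benefit of oversampling is easy to observe empirically. In the following toy experiment ($D = I_m$ and a synthetic spectrum with a sharp drop after the $k$-th singular value, all names illustrative) the condition number of the $\hat H(P_\Omega)$-preconditioned system tends toward $1 + \sigma_{k+1}^2 = 1.25$ as $p$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 300, 120, 15
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
sigma = np.concatenate([np.full(k, 30.0), np.full(n - k, 0.5)])
V = U * sigma            # with D = I these are the singular values of D^{-1/2} V

H = np.eye(m) + V @ V.T
for p in (0, 2, 4, 8, 16):
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(Omega)               # P_Omega = Q Q^T
    Vh = V @ Q
    L = np.linalg.cholesky(np.eye(m) + Vh @ Vh.T)
    ev = np.linalg.eigvalsh(np.linalg.solve(L, np.linalg.solve(L, H).T))
    print(p, ev[-1] / ev[0])                 # condition number for oversampling p
```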

Theorem 7
In the setting of Theorem 6 let $\Omega$ be drawn as a standard Gaussian $n\times(k+p)$ matrix. Then the expected value of $\|(I + \Sigma_2^2)^{\frac12}\Omega_2\Omega_1^\dagger\|$ obeys an explicit bound and, if $p \ge 4$, for all $u, t \ge 1$ the probability of exceeding a corresponding threshold decays rapidly in $t$ and $u$. The same bounds hold for the condition number $\kappa_{P_\Omega}$.

Proof. A central and complex step in [17, proof of Th. 10.2] is to establish a bound on the expected norm of the Gaussian matrix products involved, which directly yields the bound on the expected value via Theorem 6. Likewise, in [17, proof of Th. 10.8] the authors derive the corresponding deviation bounds for $p \ge 4$ and $u, t \ge 1$.

Again, the presence of the identity in the deterministic bound of Theorem 6 has a major impact in these probabilistic bounds as well. Indeed, one would hope that a better deterministic bound helps to prove stronger decay. Without some a priori knowledge of the singular values of $D^{-\frac12}V \in \mathbb{R}^{m\times n}$ it is difficult to determine a suitable number of columns for $\Omega$, i.e., a suitable dimension of the random subspace. For huge $m$ the Johnson-Lindenstrauss result as presented in [9] suggests that for $k$ at most $4(\varepsilon^2/2 - \varepsilon^3/3)^{-1}\ln m$ a suitably chosen random $\Omega \in \mathbb{R}^{n\times k}$ results in a distortion of $1 \pm \varepsilon$ with sufficiently high probability; roughly, each pairwise inner product collected in the Gram matrix is then reproduced up to this factor. When considering the sizes of $m$ aimed for here (the dimension of the design space will be a few thousand to a few hundred thousand), this number is still too large for efficient computations even for a moderate $\varepsilon = 0.1$; see the quick computation below. Indeed, the burden of forming the preconditioner and of applying it would exceed the gain by far. [17] proposes an algorithmic variant for identifying a significant drop in singular values, but this requires successive matrix vector multiplications and these are quite costly in practice. In preliminary experiments with a number of tentative randomized variants, those relating the number of columns to the number of matrix-vector multiplications of the previous solve seemed reasonable. It will turn out, however, that even the cost of this is too high and the gain too small in comparison to the deterministic approach of the next section. The latter appears to capture the important directions quite well and offers better possibilities to exploit structural properties of the data. Due to the rather clear superiority of the deterministic approach, the numerical experiments of Section 4 will only present results for one particular randomized variant that performed best among the tentative versions. It attempts to identify the most relevant subspace by storing and extending the important directions generated in the previous round. For completeness and reproducibility, its details are given in Alg. 8, but in view of the rather discouraging results we refrain from further discussion.
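To quantify the remark on the Johnson-Lindenstrauss dimension, a quick computation of $4(\varepsilon^2/2 - \varepsilon^3/3)^{-1}\ln m$ for design space dimensions in the range considered here confirms that this number of columns is far too large:

```python
import numpy as np

eps = 0.1
for m in (2_000, 20_000, 200_000):
    k = 4.0 / (eps**2 / 2 - eps**3 / 3) * np.log(m)
    print(m, int(np.ceil(k)))   # about 6500, 8500 and 10500 columns
```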
Algorithm 8 (Randomized subspace selection forming $\hat V = V\hat P$)
Input: $V \in \mathbb{R}^{m\times n}$, $D \in S^m_{++}$, previous relevant subspace $P_{\mathrm{old}} \in \mathbb{R}^{n\times k}$ (initially $k = 0$), previous number of multiplications $n_{\mathrm{mult}}$, previous $\bar k$ of Alg. 3. Output: $\hat V$ (and stores $P_{\mathrm{old}}$).

A Deterministic Subspace Selection Approach
In the conic bundle method, $H = D + VV^\top$ of Theorem 2 is of the form described in (8). An inspection of the column blocks of this $V$ suggests concretizing the bound of Theorem 2 for interior point related applications to Theorem 9 below. In this, each column block of $V$ is interpreted as a term $B^\top X^{\frac12}$ with suitably adapted $B$ and $X$, and note that in the resulting bound each diagonal block of $X$ is added separately.
Theorem 9 For $H = D + VV^\top$ and preconditioner $\hat H(\hat P) = D + V\hat P\hat P^\top V^\top$ the condition number is bounded by an expression that adds, for each eigenpair $(\lambda_i, p_i)$ of each diagonal block of $X$ not covered by $\hat P$, a contribution governed by $\lambda_i\|B^\top p_i\|^2$.

Note, the proof weakens $\hat H(\hat P)$ to $D$, so the bound cannot be expected to be strong. Yet, it provides a good rule of thumb on which columns of $V$ to include. In interior point methods the spectral decomposition and the size of the eigenvalues of the $X$ of Theorem 9 strongly depend on the current iteration, in particular on the value of the barrier parameter $\mu$. Therefore it is worth setting up a new preconditioner for each new KKT system. In order to do so in a computationally efficient way, the following dynamic selection heuristic for $\hat P$ with respect to $V$ of (8) tries to either pick columns of $V$ directly by including unit vectors in $\hat P$ or to at least take linear combinations of few columns of $V$, in order to reduce the cost of matrix-vector multiplications and to preserve potential structural properties. So instead of forming $\hat P$, the heuristic builds $\hat V = V\hat P$ directly by appending (linear combinations of selected) columns of $V$ to $\hat V$. Also, it will often only employ approximations of the eigenvalues $\lambda_i$ together with approximations $p_i$ of the eigenvectors of the $X$ described in Theorem 9. Generally, it will include those directions in $\hat V$ for which an estimate of $\lambda_i\|B^\top p_i\|^2$ is sufficiently large. Concerning the third group, it will become clear in the discussion of the semidefinite part that in practice it is advantageous to replace the square root $X_t^{\frac12}$ in the factorization of $X_t$ by a more general, possibly nonsymmetric factorization $X_t = F_tF_t^\top$. The matrix $F_t$ will have the same block structure and leads to a similar rank one correction by the transformed trace vector $F_t^\top\mathbf{1}_t$.

Algorithm 10 (Deterministic column selection heuristic forming $\hat V$)
Input: $D_H$, $V_H$, $D_y$, $A$, $D_w$, $B$, $X_t$, $\zeta$, $\sigma$ specifying $D$ and $V \in \mathbb{R}^{m\times n}$ of (8). Output: $\hat V \in \mathbb{R}^{m\times\hat n}$ for some $\hat n \le n$ with $\hat V = V\hat P$, $\hat P^\top\hat P = I_{\hat n}$.
Compute $B^\top F_t\mathbf{1}_t$ and for each conic diagonal block of $X_t$ call append "cone" columns($\hat V$) with corresponding parameters. 5. Return $\hat V$.
The first group of columns $V_H \in \mathbb{R}^{m\times h_H}$ matches, in the notation of Theorem 9, (a subblock of) $B^\top = V_H$ and (a diagonal block) $X = I_{h_H}$. The heuristic appends those columns $j$ to $\hat V$ for which the estimate $\|(V_H)_{\bullet,j}\|^2$ is sufficiently large. For the second group of columns $A^\top D_w^{\frac12}$, Theorem 9 applies to $B = A$ and $X = D_w$. With the comment above regarding $F_t$, the third column group is formed by a term $B^\top F_t$ for each cutting model (we assume just one here). With respect to Theorem 9, $B$ is just right and $X$ is the positive definite $X_t$. Recall that $X_t$ is a block diagonal matrix with the structure of the diagonal blocks governed by the linearization of the perturbed complementarity conditions of the various cones. The overarching rank one modification by $F_t^\top\mathbf{1}_t$ couples the blocks within the same cutting model and reappears in some form in each block together with $\eta = \zeta^{-1}\sigma$. For each column $p$ of $\hat P$, computing the corresponding column of $\hat V$ only involves the respective parts of $B^\top F_t$. Thus, by keeping the support of $p$ restricted to single blocks, the proper column computations can be kept restricted to the respective block. This also holds for the coefficient $\langle F_t^\top\mathbf{1}_t, p\rangle$. The overarching vector $B^\top X_t\mathbf{1}_t$ needs to be evaluated only once and can be added to the columns afterwards. The latter step only requires the respective coefficients but not the vectors of $\hat P$. This allows the process of forming $\hat V$ to be sped up considerably. Therefore, when forming the conceptual $\hat P$ in the heuristic, the influence of $F_t^\top\mathbf{1}_t$ on eigenvalues and eigenvectors of the blocks will mostly be considered as restricted to each single block. Next the actual selection procedure is described for $X_t$ blocks corresponding to cones $\mathbb{R}^h_+$ (Alg. 11) and $S^h_+$ (Alg. 13) with Nesterov-Todd scaling [32,37].
For Alg. 11 consider, within the cone specified by $t$, a block with indices $J \in \mathbb{N}^h$ representing a nonnegative cone $\mathbb{R}^h_+$ with primal dual pair $(x, z)$. The corresponding "diagonal block" in $X_t$ is of the form $\operatorname{Diag}(x \bullet z^{-1})$ and for $F_t$ it is $\operatorname{Diag}(\sqrt{x \bullet z^{-1}})$. Considering the influence of the trace vector as restricted to this block alone gives $\operatorname{Diag}(x \bullet z^{-1})$ augmented by the corresponding rank one trace term as eigenvalue estimate. Note that in interior point methods $x_iz_i \approx \mu$ for barrier parameter $\mu \to 0$ with $x_i \to x^{\mathrm{opt}}_i$, $z_i \to z^{\mathrm{opt}}_i$. Due to $\eta \ge \frac{x_i}{z_i}$ with $\eta$ mostly much larger, the estimated value is dominated accordingly and, indeed, by experience it seems that columns are almost exclusively included only for active $x^{\mathrm{opt}}_i > 0$ and only once $\mu$ gets small enough. When computing high precision solutions with small $\mu$, the rank of the preconditioner can thus be expected to match the number of active subgradients in the cutting model. Theorem 9 suggests that in iterative methods these columns have to be included in some form in order to obtain reliable convergence behavior; a sketch of such a selection rule follows below.
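A hedged sketch of such a selection rule for an $\mathbb{R}^h_+$ block is given below (names, threshold and the precise estimate are illustrative stand-ins, not the exact rule of Alg. 11):

```python
import numpy as np

def append_nonneg_columns(V_hat_cols, B_block, x, z, d_min, thresh=0.1):
    """Selection sketch for an R^h_+ block: the block of X_t is Diag(x/z),
    so unit vector e_i contributes roughly (x_i/z_i) * ||B_block[:, i]||^2
    to the bound of Theorem 9. Columns whose estimate is non-negligible
    relative to the smallest diagonal entry d_min of D are appended,
    scaled as in the corresponding column of V."""
    lam = x / z                                   # eigenvalue estimates of the block
    scores = lam * np.sum(B_block**2, axis=0)     # lam_i * ||column i||^2
    for i in np.where(scores >= thresh * d_min)[0]:
        V_hat_cols.append(np.sqrt(lam[i]) * B_block[:, i])
    return V_hat_cols
```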
Alg. 13 below deals with a positive semidefinite cone $S^h_+$ with Nesterov-Todd scaling. For the current purposes it suffices to know that the diagonal block of $X_t$ indexed by appropriate $J \in \mathbb{N}^{\binom{h+1}{2}}$ is of the form $W \otimes_s W$ for a positive definite $W \in S^h_{++}$; see [37] for its efficient computation and for an appendix of convenient rules for computing with symmetric Kronecker products. The next result, Lemma 12, derives the eigenvectors and eigenvalues when considering the rank one correction restricted to this block; this can be seen, e.g., by direct verification via these computation rules.
For semidefinite blocks, numerical experience indicates that it is indeed worth determining the eigenvalue decomposition of $U$ as in Lemma 12. Finding the eigenvalues and eigenvectors roughly requires the same amount of work as forming $W$ and is of no concern. With $J \in \mathbb{N}^{\binom{h+1}{2}}$ denoting the column indices of this block within $B^\top$, columns to corresponding eigenvectors are computed from the eigenvalue and eigenvector data of Lemma 12. In order to also account for the possibly overarching contribution of $F_t^\top\mathbf{1}_t$ it is advantageous to find a representation equivalent to $B^\top X^{\frac12}\hat P$ with orthonormal columns in $\hat P$ as in Theorem 9 for a suitable factorization of $X$ other than its square root. For this, the notation of Lemma 12 and its proof allows one to rephrase the semidefinite block of $F_t$ via a factor $F$. This suggests putting $V = B^\top(P \otimes_s F)$ and deriving the columns corresponding to $\hat P$ via the singular value decomposition $F = Q_F\Sigma_FP_F^\top$. Lemma 12 provides the squared singular values $\Sigma_F^2$ and its eigenvectors give the left-singular vectors in $Q_F$. For $e_{ij} := \frac{1}{\sqrt2}\operatorname{svec}(e_ie_j^\top + e_je_i^\top)$ there holds $\langle\operatorname{svec}(\Lambda_W), e_{ij}\rangle = 0$, so the right-singular vectors corresponding to $\lambda_i\lambda_j$ read $(P_F)_{\bullet,ij} = e_{ij}$. The remaining right-singular vectors of $P_F$ may be computed via $P_F = F^\top Q_F\Sigma_F^{-1}$. In this it is sufficient and convenient to consider only the $U$ block, i.e., the support restricted to the $ii$-coordinates. Denote the columns of $P_U = [u_1, \dots, u_h]$ in Lemma 12 by $u_j$ for $j = 1, \dots, h$; then the corresponding right-singular vectors $u^F_j \in \mathbb{R}^h$ can be read off for $\Lambda_W = \operatorname{Diag}(\lambda^W)$. By expanding the $U$ block to the correct positions, the right-singular vector to the singular value $\lambda^U_j$ is $(P_F)_{\bullet,jj} = \operatorname{svec}(\operatorname{Diag}(u^F_j))$ for $j = 1, \dots, h$. With these preparations the selected semidefinite columns are appended to $\hat V$ as follows. First note that the semidefinite block with coordinates $J$ of the factor $F_t$ builds on $(V_W \otimes_s V_W)$, which is nonsymmetric in general. If a mixed pair $ij$ with $i < j$ is selected for $\hat P$ by the heuristic, the vector $p^F_{ij}$ is just $e_{ij} = \frac{1}{\sqrt2}\operatorname{svec}(e_ie_j^\top + e_je_i^\top)$, and by $\operatorname{svec}(w_iw_j^\top + w_jw_i^\top) = \lambda^W_i\lambda^W_j w_{ij}$ and $\langle\operatorname{svec}\Lambda_W, e_{ij}\rangle = 0$ the computation of the column to be appended to $\hat V$ simplifies considerably. Typically, several mixed eigenvectors $w_{ij}$ share the same index $i$ corresponding to a large value $\lambda^W_i$, so it quickly pays off to precompute $w_i^\top\operatorname{svec}^{-1}([B^\top]_{k,J})$ and to use these $h$-vectors for each $w_j$. For ease of presentation this implementational detail is not described in Alg. 13. Also, this is not helpful for the non-mixed vectors $p^F_{jj} = \operatorname{svec}(\operatorname{Diag}(u^F_j))$, because each consists of a linear combination over all $w_{ii}$. Fortunately, throughout our experiments, only few of the non-mixed vectors are among those selected for preconditioning. A possible explanation for this might be that with respect to the selected bundle subspace the large non-mixed terms reflect the rank of the currently strongly active eigenspace, while large mixed terms reflect its ongoing interaction with the eigenspace of moderately active or inactive eigenvalues. The transformed trace vector coefficient for $p^F_{jj}$ evaluates to $\langle\Lambda_W, \operatorname{Diag}(u^F_j)\rangle = \langle\lambda^W, u^F_j\rangle$. With this, the algorithm for appending semidefinite columns reads as follows.
Algorithm 13 (append $S^h_+$ columns($\hat V$))
Input: column indices $J \in \mathbb{N}^{\binom{h+1}{2}}$ and Nesterov-Todd scaling matrix $W \succ 0$ of this block in $X_t$, together with the quantities of (13).

Return $\hat V$.
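To fix ideas, the following sketch illustrates the column computation for a selected mixed pair (with an assumed svec convention and illustrative names; the trace vector correction and the precomputation trick mentioned above are omitted):

```python
import numpy as np

def svec(M):
    """Stack the lower triangle of a symmetric matrix, off-diagonal
    entries scaled by sqrt(2), so that <svec(A), svec(B)> = <A, B>."""
    i, j = np.tril_indices(M.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * M[i, j]

def mixed_column(B_rows_J, w_i, w_j, lam_i, lam_j):
    """Column of V_hat for the eigenvector of W (x)_s W belonging to the
    eigenvalue lam_i * lam_j of the Nesterov-Todd block; B_rows_J holds
    the rows of B^T restricted to this block's coordinates, each row an
    svec of a symmetric h x h matrix."""
    w_ij = svec((np.outer(w_i, w_j) + np.outer(w_j, w_i)) / np.sqrt(2.0))
    return np.sqrt(lam_i * lam_j) * (B_rows_J @ w_ij)
```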
As for the linear case it can be argued that for small barrier parameter $\mu$ the number of selected columns corresponds at least to the order of the active submatrix in the cutting model. Thus if $\hat h \le h$ eigenvalues of $X \in S^h_+$ converge to positive values in the optimum, the heuristic will end up selecting at least $\binom{\hat h+1}{2}$ columns once $\mu$ gets small. For second order cones $Q^h$ the structural properties of the arrow operator and the Nesterov-Todd direction allow restricting considerations to just two directions per cone for preconditioning, but as the computational experiments do not involve second order cones this will not be discussed here.

Numerical Experiments
The purpose of the numerical experiments is to explore and compare the behavior and performance of the pure and preconditioned iterative variants to the original direct solver on KKT instances that arise in the course of solving large scale instances by the conic bundle method.
It has to be emphasized that the experiments are by no means designed and intended to investigate the efficiency of the conic bundle method with internal iterative solver. Indeed, many aspects of the ConicBundle code [19], such as the cutting model selection routines, the path following predictor-corrector approach and the internal termination criteria, have been tuned to work reasonably well with the direct solver. As the theory suggests and the results support, the performance of iterative methods depends more on the size of the active set than on the size of the model. Thus somewhat larger models might be better in connection with iterative solvers. Also, the predictor-corrector approach is particularly efficient if setting up the KKT system is expensive. For iterative methods with deterministic preconditioning this hinges on the cost of forming the preconditioner, which gets expensive once the barrier parameter gets small. Furthermore, iterative methods might actually profit from staying in a rather narrow neighborhood of the central path. Therefore many implementational decisions need to be reevaluated for iterative solvers. This is out of scope for this paper. Hence, the experiments only aim to highlight the relative performance of the solvers on sequences of KKT systems that currently arise in ConicBundle. For the sole purpose of demonstrating the relevance of this KKT system based analysis, Section 4.4 will present a comparison of the performance of ConicBundle when employing the KKT solver variants without any further adaptations of parameters.
The KKT system oriented experiments will report on the performance for three different instances: the first, denoted MC, is a classical semidefinite relaxation of Max-Cut on a graph with 20000 nodes as described in [15,25]; the second, BIS, is a semidefinite Minimum-Bisection relaxation improved by dynamic separation of odd cycle cutting planes on the support of the Boeing instance KKT traj33 giving a graph on 20006 nodes, explained in [18]; and the third, MMBIS, refers to a min-max-bisection problem shifting the edge weights so as to minimize a restricted maximum cut on a graph of 12600 nodes. All three have a single semidefinite cutting model which consists of a semidefinite cone with up to one nonnegative variable, so the model cone $S^t_+$ of (2) typically has $t = (1, [\,], [h])$ for some $h \in \mathbb{N}$. In the Max-Cut instance the design variables are unconstrained, in the Bisection instance the design variables corresponding to the cutting planes are sign constrained ($D_y$ is needed), and in the min-max-bisection problem some design variables have bounds and there are linear equality and inequality constraints ($D_y$, $D_w$ and $A$ appear). Throughout, the proximal term is a multiple of the identity for a dynamic weight, i.e., $\mathfrak{H}_k = u_kI$ with $u_k > 0$ controlled as in [20].
In each case ConicBundle is run with default settings for the internal constrained QP solver with direct KKT solver for the bundle subproblems. Whenever a new KKT system arises, it is solved consecutively but independently on the same machine by
• (DS) the original direct solver,
• (IT) MINRES without preconditioning (the implementation follows [12]),
• (RP) MINRES with randomized preconditioning (Alg. 3 with Alg. 8),
• (DP) MINRES with deterministic preconditioning (Alg. 3 with Alg. 10).
Only the results of the direct solver are then used to continue the algorithm. Note, for nonsmooth optimization problems tiny deviations in the solution of the subproblem may lead to huge differences in the subsequent path of the algorithm. Therefore running the bundle method with different solvers would quickly lead to incomparable KKT systems. That the chosen approach does not impair the validity of the conclusions regarding the performance of the solvers within the bundle method will be demonstrated in Section 4.4.
The details of the direct solver DS are of little relevance at this point. Suffice it to say that its main work consists in Schur complementing the $\mathfrak{H}$ and $\zeta^{-1}\sigma$ blocks of the KKT system (6) into the joined $\operatorname{Diag}(D_w^{-1}, X_t^{-1})$ block and factorizing this. In the Max-Cut setting (no $D_y$), the $\mathfrak{H}$ block is constant throughout each bundle subproblem. In this case the Schur complement is precomputed once for each bundle subproblem, thus for several KKT systems, and this makes the approach extremely efficient as long as the order $h$ of the semidefinite model is small. Precomputation is no longer possible if $D_y$ is needed, which is the case in the two other instances. Finally, if $A$ is also present, the system to be factorized in every iteration gets significantly larger. These differences motivated the choice of the instances and explain part of the strong differences in the performance of the solvers.
For Max-Cut and Bisection the iterative solver could exploit the positive definiteness of the system by employing conjugate gradients instead of MINRES. The min-max-bisection problem comprises equality constraints in $A$, so the system is no longer positive definite and conjugate gradients are not applicable. Employing MINRES for all three facilitates the comparison, in particular as MINRES seemed to perform numerically better on the other instances as well. MINRES computes the residual norm with respect to the inverse of the preconditioner and the implementation uses this norm for termination. To safeguard against effects due to the changes in this norm, the relative precision requirement $\min\{10^{-6}, 10^{-2}\mu\}$ of ConicBundle is multiplied, in the notation of Alg. 3, by a safeguard factor derived from the truncated spectrum. The results on the three instances will be presented in eight plots per instance. The first four compare all four solvers; the last four plots are devoted to information that is only relevant for iterative solvers, so DS will not appear in these. 1. Plot "time per subproblem (seconds)" gives for each of the four methods a box plot of the seconds (in logarithmic scale) required to solve the subproblems. For each subproblem this is the sum of the time required for initializing/forming and solving all KKT systems of this subproblem. This is needed because, in the case of the Max-Cut instance MC, the direct solver DS forms the Schur complement of the $\mathfrak{H}$-block only once per subproblem and this is also accounted for here.

Max-Cut (Instance MC, Figure 1)
The graph was randomly generated ( [35], call rudy -rnd graph 20000 1 1 for 20000 nodes, edge density one percent, seed value 1). The semidefinite relaxation gives rise to an unconstrained problem with 20000 variables. Each variable influences one of the diagonal elements of the Laplace matrix of the graph with cost one and the task is to minimize the maximum eigenvalue of the Laplacian times the number of nodes, see [25] for the general problem description. For graphs of this type but smaller size like 5000 or 10000 nodes the direct solver DS still seemed to perform better, so rather large sizes are needed to see some advantage of iterative methods. Other than that the relative behavior of the solvers was similar also for the smaller sizes. The jaggies within subproblem time in the second plot are due to the reduction of the model to its active part after each descent step while the model typically increases in size during null steps. During the very first iterations the bundle is tiny and DS is the best choice. Once the bundle size increases sufficiently, the iterative methods dominate. Over time, as precision requirements get higher and the choice of the bundle subspace converges, the advantage of iterative methods decreases. In the final phase of high precision the direct solver may well be more attractive again.
The plots also show that for this instance (and presumably for most instances of this random type) the performance of IT (MINRES without preconditioning) is almost as good as DP (deterministic preconditioning) while RP (randomized preconditioning) is not competitive. Note that the condition number does not grow excessively for IT in this instance. Deterministic preconditioning succeeds in keeping the condition number almost exactly at the intended value 10. For smaller values of µ, so for higher precision requirements, DP requires distinctly fewer matrix-vector multiplications, but it then also selects a large number of columns. In comparison to no preconditioning DP helps to improve stability but does not lead to significantly better computation times except maybe for the very last phase of the algorithm with high precision requirements.

Minimum Bisection (Instance BIS, Figure 2)
The semidefinite relaxation of minimum bisection is similar in nature to max-cut, but in addition to the single diagonal elements there is a variable with coefficient matrix of all ones. Furthermore, variables with sparse coefficient matrices corresponding to odd cycles in the underlying graph are added dynamically in rounds, see [18] for the general framework and also for the origin of the instance KKT traj33 with 20006 nodes and roughly 260000 edges.
Again, after the very first iterations the iterative methods turn out to perform distinctly better in the initial phase of the algorithm. Iterative methods get less attractive as precision requirements increase. The model size is often rather small (a bit larger than the active set of about 150 columns) which is favorable for DS. Indeed, additional output information of the log file indicates that the performance of DS drops off whenever the cutting model is significantly larger than that.
While for this instance RP is better than IT, the advantage of DP over the other iterative variants is quite apparent, and its superiority also increases with precision requirements and smaller µ. In fact, for DP the condition number and the number of matrix-vector multiplications decrease again for smaller µ. A possible cause might be that the active set is easier to identify correctly. Due to the reduction in matrix-vector multiplications, computation time does not increase for DP in spite of a growing number of columns in the preconditioner.

A Min-Max-Bisection Problem (Instance MMBIS, Figure 3)
This problem arose in the context of an unpublished attempt to optimize vaccination rates for five population groups $N_1\mathbin{\dot\cup}\cdots\mathbin{\dot\cup}N_5 = N$ in a virtual town of $n = |N|$ inhabitants. Briefly, within the town $k$ anonymous people are assumed to be infectious. There is vaccine for at most $\bar n$ people. The aim is to reduce the spreading rate of the disease by vaccinating each person with the respective group's probability. The task of determining these vaccination rates motivated the following ad hoc model, which would be hard to justify rigorously. In a graph $G = (N, E)$ each edge $ij = \{i, j\} \in E$ with $i \in N_{\hat\imath}$, $j \in N_{\hat\jmath}$ has a weight $\hat w_{\hat\imath\hat\jmath}$ representing the infectiousness of the typical contact for these two persons of the respective groups. It will be convenient to define the weighted Laplacians $L_{\hat\imath\hat\jmath} = \hat w_{\hat\imath\hat\jmath}\sum_{ij\in E,\,i\in N_{\hat\imath},\,j\in N_{\hat\jmath}}(e_i - e_j)(e_i - e_j)^\top$. In this simplified approach, vaccination rates $v_{\hat\imath}, v_{\hat\jmath}$ of the node groups reduce a nominal infectiousness $\hat w_{\hat\imath\hat\jmath}$ between these groups by the factor $y_{\hat\imath\hat\jmath} \ge \max\{0, 1 - v_{\hat\imath} - v_{\hat\jmath}\}$. The spreading probability to be minimized is considered proportional to the restricted max-cut value $\max_{I\subset N, |I|=k}$ of the correspondingly weighted graph. For determining the vaccination rates the combinatorial problem is replaced by the usual (dual) semidefinite relaxation. In this case the resulting KKT system also has an equality and several inequality constraints in the block $A$. Preconditioning results are presented for the KKT systems of an instance with $n = 12600$ inhabitants splitting into groups of sizes 5770, 6000, 600, 30, 200, with $k = 126$ infectious persons and $\bar n = 1260$ available vaccinations.
In the actual computations the bundle size grows surprisingly fast. This not only entails enormous memory requirements but also excessive computation times for DS; indeed, computation times of DS may exceed those of DP by a factor of 70. In consequence, comparative results can only be reported for a very limited number of subproblem evaluations. In particular, the precision requirements remain rather moderate throughout these iterations. Still, the same initial behavior can be observed as for the previous two instances. For very small bundle sizes DS is best. Once the bundle size grows, the iterative methods take over. Among the iterative solvers RP is better than IT, but DP is the method of choice. It succeeds in tightly controlling the condition number by selecting rather few columns. With this, DP requires the fewest matrix vector multiplications, which seems to pay off quickly on this instance.

Performance within the Bundle Method for Max-Cut
The purpose of this section is to provide evidence for the reliability of the KKT oriented evaluations when the iterative solvers are employed within the bundle method directly. As explained in the introductory remarks to Section 4, a full assessment of the use of iterative solvers within conic bundle methods is out of scope and beyond the possibilities of this work. Therefore the results will only compare, without any further adaptations, the direct replacement of DS with the solvers IT, RP and DP within the current ConicBundle implementation that was developed and tuned for DS. Note, however, that the evaluation of bundle methods requires a statistical approach.
In oracle based nonsmooth optimization it is typical that even slight numerical changes in the computation of candidates bring along significant differences in the actual trajectories. Indeed, candidates are generically close to ridges. Which subgradient is returned depends on which side of the ridge the candidate ends up. In particular, the use of different KKT solvers quickly leads to considerable differences in the models and subproblems and therefore also in the sequence of KKT problems. This erratic behavior is intrinsic at any level of precision, therefore it may be expected that the average number of bundle iterations (descent and null steps) does not depend too much on the actual KKT solver in use. Yet, due to this incomparability of trajectories, any attempt to assess the scope of the iterative solvers in comparison to the direct solver needs to be based on a reasonable collection of comparable instances. Their choice should help to illustrate the effects of parameters, that can be expected to be influential in the current context,  In order to highlight the dependence on the barrier parameter µ with corresponding precision requirements the results for KKT systems are grouped into µ-value ranges (∞, 100], (100, 1], (1, 0.01], (0.01, 0).
• the cost of matrix-vector multiplications,
• the size of the model,
• precision requirements,
• the use or non-use of a predictor corrector approach,
• the number of KKT instances and solves per subproblem.

In order to highlight the dependence on the barrier parameter µ and the corresponding precision requirements, results for KKT systems are grouped into the µ-value ranges (∞, 100], (100, 1], (1, 0.01], (0.01, 0). In order to cover these aspects with manageable effort, results will be presented for eight methods and four groups of 25 randomly generated Max-Cut instances. The methods without predictor corrector approach are denoted by DS, IT, RP, DP and those with predictor corrector approach by DSp, ITp, RPp, DPp; the names refer to the respective direct or iterative solver used for the KKT systems of the internal interior point method of ConicBundle when solving the subproblems. The four instance classes arise by generating five instances per number of nodes n ∈ {10000, 20000} and per density out of two edge density groups, one with smaller densities d ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and one with higher densities d ∈ {1, 2, 3, 4, 5} (see [35]; call rudy -rnd_graph n d s for seed s ∈ {1, 2, 3, 4, 5}). The instances were solved with ConicBundle [19] on computers with quad-core Intel Core i7-4770 processors (4 × 3400 MHz, 8 MB cache, 32 GB RAM) running Ubuntu 18.04. The code was run in sequential mode with each instance solved en suite for all methods on the same machine; all time measurements refer to user time. Owing to limited resources, two instances were run on each machine at the same time, which, together with occasional further jobs, may have caused some volatility in the timings. As instances and methods were affected by this at random, the influence on the conclusions should be marginal in view of the number of examples.
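A minimal sketch of the instance generation described above; it assumes the rudy generator of [35] is on the PATH and is invoked as "rudy -rnd_graph n d s" (nodes, density, seed), and the output file names are hypothetical.

import subprocess

for n in (10000, 20000):
    for densities in ((0.1, 0.2, 0.3, 0.4, 0.5), (1, 2, 3, 4, 5)):
        for d in densities:
            for s in (1, 2, 3, 4, 5):
                fname = f"mc_n{n}_d{d}_s{s}.txt"
                with open(fname, "w") as out:
                    # rudy writes the generated graph to stdout
                    subprocess.run(
                        ["rudy", "-rnd_graph", str(n), str(d), str(s)],
                        stdout=out, check=True)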
The Max-Cut instances serve this purpose well for the following reasons. First, as explained before, the direct solver DS is particularly efficient for Max-Cut instances, because the Schur complement needs to be computed only once at the beginning of each bundle step for all interior point iterations / KKT systems associated with this subproblem. Thus, if iterative solvers are competitive for Max-Cut this should also hold for more general cases. Likewise, the iterative solver IT without preconditioning performed better on the KKT instances for Max-Cut than on those of the two other examples, so the limits of preconditioning are best discussed for Max-Cut. Second, for Max-Cut even large scale instances can be solved to reasonably high precision in manageable time, which allows to compare the performance on several precision levels. To make this comparison reasonably efficient and fair in view of the weakness of the lack-of-progress stopping criterion of bundle methods, the comparisons use, for each relative precision level $10^{-3}$, $10^{-4}$, $10^{-5}$ and $10^{-6}$, the first descent step that produces a value below an instance dependent common relative reference value. For each instance this reference value is obtained by taking the minimum objective value obtained over all methods when running ConicBundle with termination precision $10^{-6}$. Third, random Max-Cut instances having the same number of nodes and similar edge density can be expected to have similar parameters and properties in terms of model size, cost of matrix-vector multiplications, and precision requirements. Note that a higher edge density increases the cost of matrix-vector multiplications but also causes a larger offset in objective value, which entails somewhat reduced precision requirements for the KKT systems to reach the same relative precision. As the experiments will show, the model size (it is selected by ConicBundle on the basis of the active rank, starts at roughly twice this rank after descent steps for reasons investigated in [10], and increases further over null steps) seems to be less dependent on the edge densities but grows markedly with the number of nodes.
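The comparison rule just described can be summarized in a short sketch; the data layout is hypothetical, and the threshold ref + tol·|ref| for a minimization problem is an assumption about how "relative precision" is measured here.

def reference_value(best_value_by_method):
    """best_value_by_method: dict mapping method name to the minimum
    objective value it obtained on this instance (termination
    precision 1e-6); the reference is the best value over all methods."""
    return min(best_value_by_method.values())

def first_hit_time(descent_steps, ref, tol):
    """descent_steps: chronological list of (elapsed_time, value)
    pairs at descent steps; returns the time of the first descent step
    whose value lies within relative precision tol of ref, else None."""
    for t, v in descent_steps:
        if v <= ref + tol * abs(ref):
            return t
    return None

# hypothetical usage for one instance and one method:
# times = {tol: first_hit_time(steps, ref, tol)
#          for tol in (1e-3, 1e-4, 1e-5, 1e-6)}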
The first aspect to address is the dependence of the bundle method on the solvers. For this, figures 4 and 5 display for each group and solver the average development of the model sizes, the number of KKT systems solved per subproblem together with the last µ-value occurring there (it reflects the precision requirement of the final KKT system within the subproblem), and also the precision requirements on the subproblems themselves. This development is recorded in averages over groups of 10 steps and over all 25 instances of each of the four instance groups. This rather detailed view will also help to explain differences in the computation times of the solvers at the various precision levels.
The plots of figures 4 and 5 exhibit a natural separation between the methods with and without predictor corrector approach. In comparison, the differences between the solvers are almost negligible except for a few final iterations, which also suffer from stronger volatility due to a reduced number of samples. It is worth noting that the same holds for the overall number of KKT systems and bundle steps throughout all relative precision levels. A helpful visualization for such comparisons are performance profiles [11] of the total number of steps and KKT systems for each precision level. These turn out to have a shape similar to that displayed in Figure 6, which presents the two profiles for bundle steps and KKT systems for the sparser case on 20000 nodes and relative precision $10^{-6}$. While the smaller number of KKT systems (i.e., interior point iterations) of the predictor corrector variants is an expected outcome, it is rather surprising that the variants without predictor corrector seem to need a few bundle steps fewer on average to reach the required precision. A comparison with the model size plots of figures 4 and 5 suggests that there is a distinct difference in the nature of these interior point solutions that also affects the bundle selection mechanism, but so far this lacks a mathematically sound explanation. Still, a first conclusion might read that the behavior of the bundle method itself is, on average, independent of the choice of the four solvers.

Figure 8: The profiles are taken with respect to the time needed to reach a relative precision of $10^{-3}$, $10^{-4}$, $10^{-5}$, and $10^{-6}$ relative to the best value obtained over all methods.

Figures 7 and 8 display computation time performance profiles of the eight methods on the four classes of 25 instances for relative precision levels $10^{-3}$, $10^{-4}$, $10^{-5}$ and $10^{-6}$. These largely match the results on individual KKT systems. First consider the direct solvers DS and DSp. Again, a good explanation is lacking for the fact that DS dominates DSp in many cases with a higher number of edges and higher precision. Whether DS and DSp are attractive compared to iterative methods depends on the ratio of the time invested into forming the Schur complement to the number of KKT steps required for solving the subproblem, i.e., they are preferable if the model size is small or if the number of interior point iterations becomes large enough due to increasing precision requirements. In cases of strong initial growth of the model size (see the model size plots of figures 4 and 5) iterative solvers are quickly better. The seemingly good performance of the direct solvers on the denser instances for precision level $10^{-3}$ is mostly due to the large constant offset that causes the methods to reach this precision often within ten steps (compare this to the asymptotic analysis in [21]); at this point model sizes are still small. Iterative solvers dominate at precision levels $10^{-4}$ and $10^{-5}$, where rather low accuracy and few interior point iterations suffice. The influence of the cost of a matrix-vector multiplication is visible in the difference of the initial head start over the direct solvers between sparser and denser instances at these precision levels; the difference might also be caused in part by the constant offset, which is larger for random Max-Cut instances with a larger number of edges.
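For reference, the performance profiles of [11] admit a very compact computation; the following sketch (with hypothetical array names) shows the rule: for each method, plot the fraction of instances on which its cost is within a factor tau of the best method's cost on that instance.

import numpy as np

def performance_profile(costs, taus):
    """costs: array of shape (n_instances, n_methods), e.g. times to
    reach a given relative precision; returns an array of shape
    (len(taus), n_methods) with the fraction of instances on which
    each method is within factor tau of the per-instance best."""
    ratios = costs / costs.min(axis=1, keepdims=True)
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

# hypothetical usage, e.g. times with shape (25, 8):
# taus = np.linspace(1.0, 5.0, 100)
# prof = performance_profile(times, taus)
# for m, name in enumerate(method_names):
#     plt.step(taus, prof[:, m], label=name)   # matplotlib step plot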
Indeed, an inspection of the last-µ plots and the KKT systems per subproblem in figures 4 and 5 suggests that for Max-Cut instances with a larger number of edges the subproblem solutions require less absolute accuracy, which favors iterative solvers and somewhat compensates for the higher cost of the matrix-vector multiplications. Note that the relative precision requirements for the solution of the subproblems (see figures 4 and 5) are almost identical.
For the iterative methods the predictor corrector variant seems faster at lower precision levels, but again the methods without predictor corrector catch up or may even dominate at higher precision levels. For IT (no preconditioning) this should largely be due to the fact that predictor corrector requires two solves per KKT system. Thus, taking twice the number of KKT systems per subproblem for the predictor corrector variants in figures 4 and 5 as the number of required solves provides a satisfactory explanation for IT. For RP and DP the situation is less clear cut, because the preconditioner is formed only once per KKT system, but the line of argument is similar. For the moderate accuracy levels $10^{-3}$ and $10^{-4}$ the Max-Cut instances could do without preconditioning, but the preconditioned variants do a good job. For $10^{-5}$ the advantage begins to show, and for $10^{-6}$ the DP variants are almost consistently better than the other iterative methods.
Based on this analysis, a hybrid approach seems advisable that switches dynamically between the solvers depending on precision, model size and number of interior point iterations; a possible rule is sketched below. In implementing these ideas a number of further design aspects would have to be reconsidered as outlined before. The true advantage of iterative solvers, however, is that dynamic model adaptations become feasible during the solution of the subproblem, because there is no need to recompute the Schur complement each time. This allows for entirely new strategies such as combining the ideas of [33,4] and [24] in order to cut down on the number of null steps at an early stage. This remains to be addressed in future work.
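A purely illustrative sketch of such a switching rule, not part of ConicBundle; all threshold names and values are hypothetical tuning parameters, chosen only to mirror the qualitative findings above.

def choose_kkt_solver(model_size, mu, expected_ip_iters,
                      small_model=50, tight_mu=1e-2, many_iters=30):
    if model_size <= small_model:
        # small model: forming and factorizing the Schur complement is cheap
        return "DS"
    if mu <= tight_mu and expected_ip_iters >= many_iters:
        # high precision with many interior point iterations: the
        # one-time Schur complement cost amortizes over many solves
        return "DS"
    # otherwise avoid collecting the Schur complement altogether and
    # use the deterministically preconditioned iterative solver
    return "DP"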

Conclusions
In the search for efficient low rank preconditioning techniques for the iterative solution of the internal KKT system of the quadratic bundle subproblem, two subspace selection heuristics were proposed, a randomized and a deterministic variant. For the randomized approach the results are ambivalent in theory and in practice; obtaining a good subspace this way seems to be difficult, and the cost of exploratory matrix-vector multiplications quickly dominates. In contrast, the deterministic subspace selection approach allows to control the condition number (and with it the number of matrix-vector multiplications) at a desired level without the need to tune any parameters, in theory as well as on the test instances. On these instances, for low precision requirements (large barrier parameter) the selected subspace is negligibly small; for high precision requirements (small barrier parameter) the subspace grows towards the active model subspace. If the bundle size is close to this active dimension, the work in forming the preconditioner may be comparable to forming the Schur complement for the direct solver. Still, for large scale instances the deterministically preconditioned iterative approach seems to be preferable.
Conceivably it is possible to profit in ConicBundle from the advantages of both the deterministic iterative and the direct solver by switching dynamically between them. The current experiments relied on a predictor corrector approach that was tuned for the direct solver. In view of the properties of the iterative approach it may well be worth devising a different path following strategy for it, in particular for the initial phase of the interior point method when the barrier parameter is still comparatively large and the work invested in forming the preconditioner is still negligible. Similar ideas should be applicable to interior point solvers for convex quadratic problems with low rank structure.

A Tables
For each box plot of figures 1-3 for the three instances MC, BIS and MMBIS, the following tables list the number of instances together with the minimum, lower quartile ($Q_1$), median, upper quartile ($Q_3$), and maximum. For each of the three instances an additional table gives the statistics on the Euclidean norm of the residual of (6) achieved by the respective solver for the KKT systems, grouped by the usual value ranges of the barrier parameter.