Computing minimum-volume enclosing ellipsoids

For a given multidimensional data set (point cloud), we investigate methods for computing the minimum-volume enclosing ellipsoid (MVEE), which provides an efficient representation of the data that is useful in many applications, including data analysis, optimal design, and computational geometry. Contrary to conventional wisdom, we demonstrate that careful exploitation of problem structure can enable high-order (Newton and Newton-like) methods with superlinear convergence rates to scale to very large MVEE problems. We also introduce a hybrid method that combines the benefits of both high-order and low-order methods, along with new initialization schemes that further enhance performance. Observing that computational cost depends significantly on the particular distribution of the data, we demonstrate that kurtosis serves as an excellent indicator of problem difficulty and provides useful guidance in choosing an appropriate solution algorithm and initialization.

An ellipsoid in R^n centered at a point c can be written as

E(H, c) := {x ∈ R^n : (x − c)^T H (x − c) ≤ 1},     (1)

where H ∈ R^{n×n} is a symmetric positive definite matrix. The volume of the ellipsoid is given by

vol(E(H, c)) = κ_n / sqrt(det(H)),     (2)

where κ_n is a constant (the volume of the n-dimensional unit sphere) for a given value of n. Given m data points x_i ∈ R^n, i = 1, ..., m, we seek the ellipsoid E(H, c) of minimum volume that encloses all of the data points, i.e., x_i ∈ E(H, c), i = 1, ..., m. This minimum-volume enclosing ellipsoid (MVEE) is also known as a Löwner ellipsoid or Löwner-John ellipsoid [11]. Published proofs of its existence and uniqueness include [6] and [23].
As shown in [19], any MVEE problem in n dimensions can be reduced to the problem of finding an ellipsoid in n + 1 dimensions centered about the origin, and thus for simplicity we will henceforth make the latter assumption. As such, the ellipsoids we deal with will be of the form

E(H) := {x : x^T H x ≤ 1}.     (4)

For notational convenience, we will take the given data points x_i, i = 1, ..., m, to form the columns of an n × m matrix X. In typical applications the number of data points far exceeds the dimension of the space in which they reside, i.e., m ≫ n.

Primal and dual forms
To minimize the volume given by Eq. (2), we must maximize det(H). Rather than maximize the determinant directly, however, we follow [17] and define the log-determinant objective

f(H) := −logdet(H).     (5)

The −logdet function is strictly convex on the space of symmetric matrices, and it takes the value ∞ for any non-positive-definite matrix, which suits our purpose because H must be positive definite to satisfy the MVEE problem. With this new objective function, the primal form of the MVEE optimization problem becomes

minimize_H  f(H) = −logdet(H)  subject to  x_i^T H x_i ≤ 1, i = 1, ..., m.     (6)

The corresponding dual form of the MVEE problem, as derived in [20], is

maximize_u  g(u) := logdet(XUX^T)  subject to  e^T u = 1, u ≥ 0,     (7)

where u ∈ R^m is a vector of Lagrange multipliers, the matrix U = diag(u), e is a vector with each component equal to 1, and u ≥ 0 holds componentwise. The m dual variables u_i correspond to the m data points x_i, and as we will see, at optimality a given data point x_i either lies in the interior of the ellipsoid if u_i = 0, or on the surface of the ellipsoid if u_i > 0. Typically, the set of dual variables is very sparse in that relatively few of them have nonzero values (i.e., most of the data points lie in the interior), and exploiting that sparsity will be key to deriving efficient algorithms for the MVEE problem.
If ū solves the dual problem, then the matrix H̄ that solves the primal problem can be recovered using the relationship

H̄ = (X Ū X^T)^{-1},     (8)

so henceforth we define

H(u) := (XUX^T)^{-1}.     (9)
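As a concrete illustration of the dual objective in Eq. (7) and the primal recovery in Eq. (9), the following minimal NumPy sketch evaluates g(u) and H(u) for a feasible u; the function names are ours, not from the paper.

import numpy as np

def dual_objective(X, u):
    # g(u) = logdet(X U X^T) for feasible u (sum(u) = 1, u >= 0);
    # X is n x m with the data points as columns
    Hinv = (X * u) @ X.T                    # X U X^T without forming diag(u)
    sign, logdet = np.linalg.slogdet(Hinv)
    return logdet                           # sign is +1 whenever X U X^T is positive definite

def primal_from_dual(X, u):
    # H(u) = (X U X^T)^{-1}, the ellipsoid matrix of Eq. (9)
    return np.linalg.inv((X * u) @ X.T)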

Optimality conditions
Following [1], we have the following necessary and sufficient optimality conditions for a feasible solution u to the dual problem:
1. H(u) := (XUX^T)^{-1} must be feasible for the primal problem, i.e., x_i^T H(u) x_i ≤ 1, i = 1, ..., m, and
2. complementary slackness must hold, i.e., u_i > 0 only if x_i^T H(u) x_i = 1, i = 1, ..., m.     (10)
The algorithms presented later will make use of two analogous approximate optimality conditions, introduced in [1] and [13], to determine the quality of an approximate solution:
1. H(u) := (XUX^T)^{-1} satisfies x_i^T H(u) x_i ≤ 1 + ε, i = 1, ..., m, and
2. u_i > 0 only if x_i^T H(u) x_i ≥ 1 − ε, i = 1, ..., m.     (11)
A feasible u is ε-primal feasible if it satisfies the first condition above. If u also satisfies the second condition, then it is ε-approximately optimal. If u is ε-primal feasible, then the dual objective function g(u) is within nε of the optimal value. Because of this guaranteed quality of fit, we will use ε-approximate optimality as the stopping criterion for MVEE algorithms.
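A direct (if unoptimized) check of these two conditions can be written as follows; it is a sketch under the normalization used in this section, the function name is ours, and H(u) is recomputed from scratch rather than using the structure exploited later.

import numpy as np

def is_eps_optimal(X, u, eps=1e-6):
    # Condition 1: x_i^T H(u) x_i <= 1 + eps for all i
    # Condition 2: u_i > 0 only if x_i^T H(u) x_i >= 1 - eps
    Hinv = (X * u) @ X.T                   # X U X^T (X is n x m, points as columns)
    W = np.linalg.solve(Hinv, X)           # H(u) x_i for every point
    vals = np.einsum('ij,ij->j', X, W)     # x_i^T H(u) x_i
    return np.all(vals <= 1.0 + eps) and np.all(vals[u > 0] >= 1.0 - eps)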

Core set
It is intuitively clear that the only data points x_i that affect the optimal solution H̄ are those that lie on the surface of the ellipsoid. To see this more rigorously, note that Eq. (9) can be rewritten as a sum of outer products,

H(u)^{-1} = XUX^T = Σ_{i=1}^{m} u_i x_i x_i^T,     (12)

from which we see that only the points x_i corresponding to nonzero u_i are involved in determining H̄, and from the second of the dual optimality conditions in Eq. (10), such points must have x_i^T H̄ x_i = 1, i.e., they lie on the surface. Such a set of points on the surface that determine the entire ellipsoid is called a core set [14].
How large is the core set for a given MVEE problem likely to be? At least n of the rank-one matrices summed in Eq. (12) must be nonzero in order for H̄ to be nonsingular, so the core set must contain at least n points. The space of symmetric n × n matrices is of dimension n(n + 1)/2, so at most n(n + 1)/2 points are needed to determine H̄. Thus, the core set size will be between n and n(n + 1)/2, independent of the value of m. As we will soon see, identifying the correct core set is a crucial stage in computing the MVEE for a given data set.

Prior state of the art
The prevailing state of the art for computing MVEEs is nicely summarized in the recent book by Todd [20]. The scope of the book is rather narrow, however, in that
- only one formulation of the problem is pursued, the dual;
- only one type of algorithm is developed, based on coordinate ascent;
- only one nontrivial scheme is employed for initializing the algorithm;
- only one data distribution is used in generating random test problems.
Although these choices seemed well founded for the purposes of the book and led to effective results, nevertheless we felt that they left open a number of potentially promising avenues for further investigation. In particular, like most prior work on the MVEE problem, the book dismisses high-order methods out of hand as being too expensive for very large problems, despite their faster convergence rates. We will show, however, that by carefully exploiting problem structure, high-order methods, including Newton's method, can be made highly competitive even for very large problems having many thousands or millions of data points. In addition, we will also develop and test a hybrid algorithm that leverages the respective strengths of both low-order and high-order methods, as well as new and more broadly effective initialization schemes and a simple characterization of data distributions that aids in choosing an effective initialization and solution method for a given data set.
The results that we present here are based largely on the unpublished PhD thesis [4], which should be consulted for additional background and details, including additional pseudocode for many of the algorithms and far more computational results than we have space to display here.

Algorithms
For a typical MVEE problem with m ≫ n, the number of dual variables, m, greatly exceeds the number of primal variables, n(n + 1)/2, which might seem to suggest that the primal problem would yield more efficient solution algorithms. This is not the case in practice, however, because the vast majority of the dual variables are zero, namely those corresponding to data points interior to the ellipsoid, while the relatively few nonzero dual variables correspond to the relative handful of points lying on its surface (the core set). This sparsity in the dual variables, if aggressively exploited, enables highly efficient dual algorithms that effectively operate in far fewer than m dimensions. To confirm these assertions, we have implemented a variety of both primal and dual algorithms and have found the dual algorithms to be consistently more effective, and thus we will focus on the latter henceforth.

Coordinate ascent algorithm
As it will serve as our benchmark for comparisons, we first briefly describe the current state-of-the-art algorithm, a variant of the Frank-Wolfe method [8], proposed for the MVEE problem in [2] (in the equivalent context of D-optimal designs), analyzed in [1], and implemented in MATLAB in [20, Appendix B] (which our Python implementation [3] closely mimics). The algorithm is based on coordinate ascent, meaning that only one solution component is changed in any given iteration. At each iteration, the most strongly violated optimality condition is determined and an optimal step is taken along that coordinate, normalizing to maintain feasibility. Three features make this algorithm especially well suited to the MVEE problem:
- the gradient required at each iteration (to determine the most strongly violated constraint) can be updated in O(mn) operations rather than the O(mn^2) cost of computing a new gradient from scratch;
- the Cholesky factorization required at each iteration (to implement Eq. (9)) can be updated in O(n^2) operations rather than the O(n^3) cost of computing a new factorization from scratch;
- there is a closed form for the optimal step length, so no line search is necessary.
Though these features keep the cost per iteration low, the algorithm's linear convergence rate can be exceedingly slow, often taking thousands of iterations to converge for typical MVEE problems.
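For reference, one iteration of this coordinate ascent scheme can be sketched as follows in the classical Frank-Wolfe form for the dual problem (7). For clarity the gradient and factorization are recomputed from scratch here, whereas an efficient implementation would use the O(mn) and O(n^2) updates just described; the closed-form step length shown is the standard one for this formulation rather than a detail taken from the paper, and the function name is ours.

import numpy as np

def coordinate_ascent_step(X, u):
    # One Frank-Wolfe / coordinate-ascent step for max logdet(X U X^T), sum(u) = 1, u >= 0
    n, m = X.shape
    Hinv = (X * u) @ X.T                           # X U X^T
    W = np.linalg.solve(Hinv, X)                   # H x_i for every point
    omega = np.einsum('ij,ij->j', X, W)            # gradient components x_i^T H x_i
    j = int(np.argmax(omega))                      # most strongly violated optimality condition
    lam = (omega[j] - n) / (n * (omega[j] - 1.0))  # closed-form optimal step length
    u_new = (1.0 - lam) * u                        # shift weight toward point j, preserving sum(u) = 1
    u_new[j] += lam
    return u_new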

Higher-order algorithms
We next investigate a family of algorithms having very different tradeoffs from the coordinate ascent algorithm. In particular, we consider higher-order (Newton and Newton-like) methods with significantly greater cost per iteration but significantly faster asymptotic convergence rates. Given the large-scale, highly sparse, linear constraints of the dual form of the MVEE problem given in Eq. (7), a natural framework is the approach of Murtagh and Saunders [15]. We chose this approach specifically because it requires only projected versions of the gradient and Hessian, which we will show can be computed at a cost that is independent of the number of data points m for the MVEE problem.
The algorithm of Murtagh and Saunders [15] begins with a feasible initial guess for the solution, which is then revised step-by-step through successive iterations, while maintaining feasibility, until the optimal solution is reached. An active set strategy is employed, meaning that at each iteration the inequality constraints are divided into those that are binding (the current active set) and those that are not binding (and therefore do not affect the computation of the direction for the next step). This division of the constraints is updated as iterations proceed until eventually the constraints that are binding at the solution are identified, at which point we effectively have an equality-constrained problem.
For the dual form of the MVEE problem given in Eq. (7), the sum-to-one equality constraint is always active, and the remaining active constraints correspond to the data points that lie in the interior of the current approximate ellipsoid, i.e., the complement of the current approximate core set. Thus, we can alternatively think of this process as beginning with an initial guess for the core set (see Sect. 4) and then iteratively revising it (see Sect. 2.2.5) until the correct optimal core set is eventually identified.

Basis for feasible steps
To ensure feasibility at each iteration, we need a basis matrix Z such that any step along a direction in the column space of Z will remain feasible. Following [15], we let the number of dual variables u_i not fixed at their bounds (i.e., those corresponding to data points in the approximate core set) at a given iteration be r + 1. We then label the dual variables as follows:
- N: The m − r − 1 variables fixed at their bounds are called nonbasic variables.
- B: One of the nonfixed variables must be selected to enable the sum constraint in Eq. (7) to be satisfied; that variable is called the basic variable.
- S: The remaining r variables, which can be modified freely, are termed superbasic variables.
As in [15], we rearrange the variables so that the basic variable is the first entry of the vector, the nonbasic variables are the last entries of the vector, and the superbasic variables are in the middle. [We do not actually rearrange the variables in practice, but it makes the exposition easier to follow.] With this arrangement of the variables, we can write the constraint system from Eq. (7) in matrix form as

[ 1  e^T  e^T ; 0  0  I ] [ u_B ; u_S ; u_N ] = [ 1 ; 0 ],     (13)

and define the m × r matrix

Z = [ −e^T ; I ; 0 ]  (first row −e^T, then the r × r identity, then m − r − 1 rows of zeros),     (14)

where the r-vector e of all ones comes from the sum-to-one constraint and the r × r identity matrix comes from the nonnegativity constraints involving the superbasic variables. Then the columns of Z are orthogonal to the rows of the constraint matrix, and thus if any computed step is projected onto the span of Z, then the resulting movement will be parallel to the constraint surface, thereby maintaining feasibility.
The simple structure of Z shown in Eq. (14) enables cheap matrix-matrix multiplication. When computing Z^T A for any m × n matrix A, each row of the resulting matrix is calculated by subtracting the first row of A from one other row. Since there are r rows in the product, the operation takes only nr scalar subtractions, a substantial savings over the O(mnr) cost of dense matrix-matrix multiplication. Avoiding dependence on m is especially beneficial, given that m ≫ n in typical MVEE problems. Computing Z B for an r × p matrix B is similarly cheap. The first row of the output is simply the negated sum of each column of B, and the remaining rows are either copied from the corresponding rows of B or are rows of zeros. The cost of computing the product Z B is therefore about rp scalar operations, far less than the O(mrp) cost to multiply dense matrices.
Cheap multiplication by Z and Z^T allows for cheap projection of an arbitrary vector v onto the span of Z by computing Z(Z^T v). One downside is that ZZ^T is not an orthogonal projector. However, the orthogonal projector P_Z = Z(Z^T Z)^{-1} Z^T onto the span of Z is just a small constant factor more expensive to apply than the non-orthogonal projector ZZ^T. In the same way that we do not actually form Z, we can compute the effect of the matrix-vector multiplication P_Z v with an arbitrary vector v without constructing P_Z.
Because we would rather work with smaller r-vectors than the full m-vectors whenever possible, the projection procedure stops short of actually computing P_Z v for an input vector v, instead computing v_z = (Z^T Z)^{-1} Z^T v. If the full projection is needed, it can be computed as Z v_z. Algorithm 1 (Orthogonal Projection) demonstrates how to compute v_z. For a single vector v of dimension m, the cost of this projection is about 3r operations. Because of the large number of active dual bounds, projection of an m-vector requires far fewer than O(m) operations for most problems.
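A minimal sketch of this projection, exploiting the structure of Z via the Sherman-Morrison identity (Z^T Z)^{-1} = I − ee^T/(r + 1), is shown below; the index bookkeeping (basic, superbasic) and function names are our own convention, not the paper's.

import numpy as np

def project_vz(v, basic, superbasic):
    # v_z = (Z^T Z)^{-1} Z^T v computed in O(r) without forming Z
    r = len(superbasic)
    w = v[superbasic] - v[basic]     # Z^T v: subtract the basic entry from each superbasic entry
    return w - w.sum() / (r + 1)     # apply (I - e e^T / (r + 1)) via Sherman-Morrison

def full_projection(v, basic, superbasic, m):
    # If needed, expand back to the m-vector Z v_z (the orthogonal projection of v)
    vz = project_vz(v, basic, superbasic)
    p = np.zeros(m)
    p[superbasic] = vz
    p[basic] = -vz.sum()
    return p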

Projected gradient
All of the algorithms we will implement require the projected gradient, which we next show how to compute efficiently using the projectors developed in Sect. 2.2.1.
The gradient ω of the dual objective function evaluated at u is given by

ω_i = x_i^T H x_i, i = 1, ..., m,     (15)

where H = (XUX^T)^{-1}. Evaluating the full gradient would require computing and factorizing XUX^T and performing m forward- and back-substitutions and dot products, at a total cost of O(mn^2). Fortunately, the methods we will employ do not need the full gradient because they are restricted to a subspace of R^m by the active constraints. These methods rely instead on the projected gradient,

Z^T ω for a non-orthogonal projection, or (Z^T Z)^{-1} Z^T ω for an orthogonal projection,     (16)

to determine the next step. It might appear that computing H^{-1} costs O(mn^2) operations, but there are only r + 1 nonzero diagonal elements of U, which means H^{-1} is a sum of only r + 1 outer products of columns of X. Thus, computing H^{-1} = XUX^T costs only O(rn^2) operations. The cost to compute a Cholesky factorization of H^{-1} is an additional O(n^3). Those same r + 1 columns are the only columns necessary for computing the projected gradient. First, the elements of ω corresponding to the superbasic variables are computed and stored in a vector, ω_s. Next, the single element of ω corresponding to the basic variable, ω_b, is computed. If we are using the non-orthogonal projector, the projected gradient is then computed as

Z^T ω = ω_s − ω_b e.     (17)

Alternatively, the orthogonally projected gradient is computed as

(Z^T Z)^{-1} Z^T ω = (I − ee^T/(r + 1)) (ω_s − ω_b e).     (18)

Thus, computing either of these projections of the gradient costs only O(rn^2) operations, including computing and factorizing H^{-1}, as opposed to the O(mn^2) cost of computing the full gradient, thereby enabling highly scalable implementations of high-order methods for the MVEE problem, as we will show.
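The following sketch shows how the orthogonally projected gradient can be assembled from only the r + 1 core-set columns; the helper name and index conventions (basic, superbasic) are ours, and the code favors clarity over reuse of previously computed factorizations.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def projected_gradient(X, u, basic, superbasic):
    core = np.concatenate(([basic], superbasic))
    Xc = X[:, core]                       # only the r + 1 core-set columns are needed
    Hinv = (Xc * u[core]) @ Xc.T          # X U X^T built from r + 1 outer products: O(r n^2)
    c, low = cho_factor(Hinv)             # Cholesky factorization: O(n^3)
    W = cho_solve((c, low), Xc)           # H x_i for the core-set points: O(r n^2)
    omega = np.einsum('ij,ij->j', Xc, W)  # omega_i = x_i^T H x_i
    omega_b, omega_s = omega[0], omega[1:]
    w = omega_s - omega_b                 # non-orthogonal projection Z^T omega, Eq. (17)
    return w - w.sum() / (len(w) + 1)     # orthogonal projection (Z^T Z)^{-1} Z^T omega, Eq. (18)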

Projected Hessian
In addition to the projected gradient, implementation of Newton's method will require the projected Hessian matrix as well, which we next show how to compute efficiently using the projectors developed in Sect. 2.2.1.
The Hessian matrix G of the dual objective function is given by

G_ij = −(x_i^T H x_j)^2, i, j = 1, ..., m,     (19)

where H = (XUX^T)^{-1} = (LL^T)^{-1} and the lower triangular matrix L is the Cholesky factor of H^{-1}. Let

S := X^T H X = (L^{-1} X)^T (L^{-1} X),     (20)

which is an element-wise square root of the negative Hessian. The first step toward calculating the projected Hessian Z^T G Z is to calculate the projected square-root Hessian,

Z^T S Z = (L^{-1} X Z)^T (L^{-1} X Z).     (21)

Thus, Z^T S Z can be computed without an m term in its cost even though X is of size n × m. Computing Z^T X^T costs O(nr). A forward substitution with L costs an additional O(n^2 r) operations. Finally, multiplying the resulting matrix with its transpose costs O(r^2 n). Thus, the total cost to compute the projected square-root Hessian is O(r^2 n).
It will be useful later to understand the structure of the projected matrix in terms of the entries of S. We first split S into blocks, ordering the basic variable first and the superbasic variables next, so that the entries

[ s_00  s_10^T ; s_10  S_11 ]     (23)

form the leading (r + 1) × (r + 1) submatrix of S. The rest of the entries are unimportant, as they will be removed by the projections. We now refer to the projected square-root Hessian as M_ZZ = Z^T S Z. Using the block form of S above and simplifying the matrix-matrix multiplication, M_ZZ can be written as

M_ZZ = S_11 − s_10 e^T − e s_10^T + s_00 ee^T.     (24)

Now that we can compute the projected square-root Hessian efficiently, we will use it to solve for the projected Hessian. The projected Hessian takes a form similar to M_ZZ, namely

Z^T G Z = −(S_11^{∘2} − (s_10^{∘2}) e^T − e (s_10^{∘2})^T + s_00^2 ee^T),     (25)

where the superscript ∘2 denotes element-wise squaring. However, we will need more information to get from M_ZZ to Z^T G Z. Note that we cannot simply square the entries of the matrix M_ZZ, and we do not actually have the matrix S. Also, we cannot use the matrices Z^T S or S Z in the solution: because they lack a second projection, they cannot be computed as cheaply as Z^T S Z, and both involve an m term in their cost. To get around this, we introduce a new matrix, Ẑ, with a simple sparse structure similar to that of Z. Just as with Z, Ẑ can be used to perform multiply operations cheaply. Using a procedure similar to Eq. (21), we can combine Z and Ẑ in various ways to obtain matrices with known relationships; in particular, we form the three additional projected matrices Ẑ^T S Ẑ, Z^T S Ẑ, and Ẑ^T S Z. Just as with M_ZZ in Eq. (24), each of these matrices can be written in terms of the blocks s_00, s_10, and S_11. With these additional equations, we can solve for s_00, s_10, and S_11 individually, and hence for their element-wise squares, and then use Eq. (25) to construct the projected Hessian. Note that at no point in the construction of the projected Hessian did we perform an operation with cost proportional to m, the number of data points. This independence of m will enable us to develop a highly scalable implementation of Newton's method for the MVEE problem, as we will see.
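Once the blocks s_00, s_10, and S_11 have been recovered, assembling the projected Hessian via Eq. (25) is straightforward; the following sketch (with a function name of our own) assumes those blocks are already available as NumPy arrays.

import numpy as np

def projected_hessian_from_blocks(s00, s10, S11):
    # Z^T G Z = -( S11 squared element-wise - (s10^2) e^T - e (s10^2)^T + s00^2 e e^T )
    r = len(s10)
    e = np.ones(r)
    sq10 = s10 ** 2
    return -(S11 ** 2
             - np.outer(sq10, e)
             - np.outer(e, sq10)
             + (s00 ** 2) * np.outer(e, e))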

Algorithm template
We now have all the tools we need to implement the algorithms we will consider, all of which fit the template given in Algorithm 2. Most of the algorithms differ only in the computation of the step direction, represented in Algorithm 2 by the function compute_direction(). Some methods require additional auxiliary work or storage that, if necessary, would take place in the optional code block at the end of the function. The project() routine is shown in Algorithm 1. Unless stated otherwise for a particular method, this routine is used to perform an orthogonal projection. If a non-orthogonal projection is needed, the vector is simply multiplied by ZZ^T. The matrix returned by compute_projector() is described in Sect. 2.2.1. For efficiency, an actual implementation would not store the full matrix Z, but would instead compute the effect of Z on a given vector when needed. An overview of modify_constraints() is given in Sect. 2.2.5.

[Algorithm 2 (algorithm template): choose a feasible initial guess u; while not converged, compute the gradient and test optimality to the specified tolerance, compute the projected gradient and a step direction via compute_direction(), take the step, modify the active constraint set, and perform any method-specific auxiliary work.]

We see in Algorithm 2 that the entire gradient is computed in order to determine convergence and to update the active constraint set, but only the projected gradient is necessary to compute the direction of the next step for all of the methods. We saw in Sect. 2.2.2 that computing the projected gradient directly is possible and leads to a significant savings in operations. This suggests that an alternative stopping criterion or some cheaper way of determining which constraints are most strongly violated could speed up the constrained algorithms considerably. (Recall that the coordinate ascent algorithm also computes the gradient for this same purpose.) Although we currently have no such alternative criterion or approximation, this is a possible direction for future research.

Active set
To determine valid directions for stepping, the algorithms need a way to determine when constraints are active, i.e., when they affect the choice of step direction, and when they are inactive. The sum constraint in Eq. (7) is always active, but the ith bound is active only when u_i is 0, or, for practical purposes, when u_i is very small. The active set refers to the set of all active bounds. Conceptually, we can use the matrix Z to determine implicitly which bounds are active, which is how the methods are stated here for expository purposes. In practice, however, it is much more efficient to maintain a separate representation of the active set.
As the current step and corresponding gradient are modified, points will be added to or removed from the active set. When determining whether to add or remove constraints, we follow the procedure in [15], which relies on approximations to the Lagrange multipliers.
Before removing a constraint from the active set, we want to be reasonably sure that doing so will be helpful in the next step. A large Lagrange multiplier is an indication that the corresponding constraint is preventing a step that would improve the objective function. However, the size of the current projected gradient must be taken into account as well, because the sizes of the Lagrange multipliers are more important when the current iterate is nearly optimal in the current active set, which is indicated by a small ||ω_z||_2. Thus, if the largest Lagrange multiplier is λ_i, we remove the ith constraint from the active set only when λ_i is sufficiently large relative to ||ω_z||_2. Adding constraints to the active set is straightforward: a constraint is added to the active set when it has restricted the length of a step. Once the step direction is computed in Algorithm 2, the maximum step length in that direction that can be taken without violating a constraint is computed. A line search is then performed to optimize the objective function along the step direction. If the line search determines that it should take the maximum allowed step size, then the limiting constraint is added to the active set.

Dropping points
As mentioned in Sect. 1.3, the MVEE of the core set is the same as the MVEE of the entire set of points, so the other points are unnecessary for computing the MVEE. This leads to an important implementation detail that applies to all of the algorithms, including coordinate ascent. Under certain conditions, we can be sure that a point will not be in the core set, and we can drop such points entirely. A procedure to do so, first described in [9], is given next.

Let

δ(u) := g(u*) − g(u)     (30)

be the difference between the unknown optimal objective value and the objective value at the current iterate. In [9], it is shown that any point x_i for which x_i^T H(u) x_i falls below a threshold depending on δ (Eq. (31)), where u is feasible and δ = δ(u), is interior to the optimal ellipsoid and thus can be removed. Although we cannot calculate δ(u) directly, we noted in Sect. 1.2 that ε-primal feasibility gives us a bound on δ(u). We can use this bound and Eq. (31) to determine which points are no longer necessary. This check is made and the relevant points removed once every fixed number of iterations, usually chosen in this work to be 50 or 100. Performing this pruning is especially important for coordinate ascent, because that algorithm takes many iterations and thus benefits strongly from speeding up each iteration. Indeed, dropping points in this manner is reported in [20, page 47] to have resulted in speedups by a factor of four to seven, depending on problem size.

Specific methods
Using the template given in Algorithm 2, we have implemented the following specific optimization methods for the MVEE problem (see [10] or [16] for an overview of these methods):
- Newton's method
- Truncated Newton
- Truncated Newton FD (finite difference)
- BFGS
- L-BFGS (limited memory)
- Conjugate gradient
- Gradient ascent

Because it computes the projected Hessian at each iteration and solves the linear system for the step by factorization, Newton's method is the most costly of these methods per iteration, but it has the fastest asymptotic convergence rate, namely quadratic. The other methods (listed roughly in decreasing order of convergence rate) try to reduce one or both of these per-iteration costs, but do so at some sacrifice in the convergence rate (superlinear for the first few, down to linear for gradient ascent).
It should also be noted, however, that the first three methods (Newton and its closest variants) can use non-orthogonal projections (because it makes no difference in the computed step), and typically they do not need a line search. The remaining methods all use orthogonal projections and also employ a line search, significantly offsetting their other advantages in cost per iteration. See [4] for further details on all of these methods, including pseudocode for each method.
Although we showed in Sects. 2.2.2 and 2.2.3 how to avoid any dependence on m in computing the projected gradient and projected Hessian, respectively, these algorithms are still significantly more costly per iteration than the coordinate ascent algorithm. So one question to be resolved in our computational experiments is whether their faster convergence rates can offset this disadvantage. Because they represent the opposite extremes of the tradeoff between convergence rate and cost per iteration, we will focus primarily on the Newton and coordinate ascent algorithms, which also turn out to be the best performers, as we will see.

Related work
A previously published high-order algorithm for the MVEE problem is due to Sun and Freund [18], based specifically on Newton's method. The authors noted that Newton's method is seemingly too costly to be practical for a typical full MVEE problem with m ≫ n, so they instead select a relatively small subset of the data points and apply Newton's method to compute the MVEE for this subset. After iterating to convergence for the subset, the optimality conditions are checked for all points in the original problem, and a subset of points that most strongly violate the optimality conditions are added to the previous subset of points. Newton's method is then repeated with this new set of points, continuing until ultimately converging to a solution of the full problem.
At a conceptual level, the method of Sun and Freund contrasts sharply with the direct exploitation of sparsity for the full MVEE problem in our active set method, and there are also important practical differences. In particular, the active set potentially changes at each iteration in our approach, whereas the subset of points is modified only after convergence for each subproblem in [18]. Assuming that the subset of points chosen for a given subproblem in [18] is the same as the active set represented implicitly by the basis matrix Z in our active set approach, one step of Newton's method will be asymptotically equivalent in cost for the two methods. The convergence histories of the two methods differ markedly, however, in that the algorithm of [18] must iterate to convergence repeatedly, once for each subproblem, whereas our approach converges only once, at the final optimal solution. Unfortunately, we were unable to make any empirical performance comparisons between the two methods, as we know of no publicly available code implementing the method of [18].

Categorizing MVEE problems
What feature of a given data set makes its MVEE relatively easy, or relatively difficult, to compute? And how can we quantify this difference? Recall from Sect. 1.3 that the MVEE is determined entirely by the core set, i.e., the points lying on the surface of the ellipsoid. In this sense, the ellipsoid is determined entirely by the outliers in the data. It seems natural, then, to conjecture that the MVEE would be easier to compute for data sets whose outliers are more extreme, i.e., more isolated from other data points.
Computationally, what matters is the density of data points near the surface of the fitted ellipsoid. Points near the surface of the fitted ellipsoid are nearly in the core set, and the more viable candidates there are for membership in the core set, the more difficult it is to choose among them.

Kurtosis
A standard measure of the "tailedness" of a given distribution, i.e., its propensity for outliers, is called kurtosis, with a higher kurtosis value indicating a greater extremity of outliers. In one dimension, the kurtosis of a data set X with m elements x_i is defined as

kurt(X) = [ (1/m) Σ_{i=1}^{m} (x_i − X̄)^4 ] / [ (1/m) Σ_{i=1}^{m} (x_i − X̄)^2 ]^2,     (32)

where X̄ is the mean.
There apparently being no standard definition of kurtosis for multidimensional data, we have found it effective to compute the mean of the logarithms of the one-dimensional kurtosis values, taking each row of the data set (i.e., each coordinate dimension) separately. Taking logarithms of the kurtosis values before averaging accommodates data sets in which the kurtosis values along differing coordinate dimensions vary widely in magnitude, which would otherwise yield an average dominated by any outliers. We refer to this mean as the log(kurtosis) of the entire data set. If desired, a kurtosis on the same scale as one-dimensional kurtosis can be recovered by raising 10 to the power of the logarithmic value, though we often find it preferable to work with the logarithmic value directly.
Another technicality is that the MVEE algorithms considered here first transform the data set into one higher dimension and then find an ellipsoid in that dimension centered on the origin. The centered ellipsoid about a set of m points {x_i} is the same as the MVEE about the set of points {x_i} ∪ {−x_i}. Using this relationship, we find that more accurate results are obtained by calculating the kurtosis of {x_i} ∪ {−x_i} instead of just {x_i}.
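A small sketch of this log(kurtosis) computation is shown below; it assumes SciPy's kurtosis with fisher=False (i.e., the standard, non-excess kurtosis) and a data matrix X whose columns are the points, and the function name is ours.

import numpy as np
from scipy.stats import kurtosis

def log_kurtosis(X):
    # Mean of log10 of the per-coordinate kurtosis values, computed on
    # the symmetrized point set {x_i} U {-x_i} as described above
    Xs = np.hstack([X, -X])
    k = kurtosis(Xs, axis=1, fisher=False)   # one kurtosis value per coordinate dimension
    return np.mean(np.log10(k))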

Kurtosis and core set size
To examine the relationship between kurtosis and core set size, we perform computational experiments using various data sets described in Appendix A, in which we compute the MVEE of each data set to determine the size of its core set. Figure 1 shows the observed relationship between kurtosis and core set size for the indicated synthetic data sets. Each point on the plots corresponds to a single data set for which the MVEE was computed. Overall, there is a clear correlation, with core set size decreasing as kurtosis increases.
Based on their kurtosis values, the data points can be divided into three regions displaying different behaviors: low kurtosis points in the upper left that tend to have large core sets, very high kurtosis points in the lower right with core sets near the lower bound of n, and points in a middle region where the core set size is decreasing. The dashed lines in Fig. 1 demonstrate that the observed core set sizes indeed fall between the known upper and lower bounds. For problems with very high kurtosis values, the sizes of the core sets are near the lower bound of n elements, whereas for low kurtosis problems, on the other hand, the core set sizes tend not quite to reach the upper bound of n(n + 1)/2 elements. The median core set size appears to be roughly n√n.

The horizontal breakpoints between the regions appear to occur at kurtosis values of
To determine the limits of the core set sizes that occur in practice, and to observe how the behavior varies with n, we repeat the process used to generate Fig. 1 with several different values of n and find the maximum and minimum core set size for each group of data sets for each value of n. The results are shown on the left in Fig. 2. Dashed lines indicating various functions of n are also included in the plot. Clearly, the largest observed core set size does not follow the upper bound of n(n + 1)/2, but instead grows as O(n√n). The lower bound, on the other hand, does grow linearly with n.
The median core set sizes are also of interest because they are more representative of the typical behavior in the low kurtosis and very high kurtosis regions. We see on the right in Fig. 2 that the median core set sizes for the low kurtosis and very high kurtosis regions grow as O(n√n) and O(n), respectively. The dashed lines in the plots demonstrate that by choosing simple scaling constants for these classes of distributions, the relationship between core set size and n for a particular kurtosis region can be fit quite well, thereby enabling us to predict the approximate core set size for a given problem based on its computed kurtosis value. To ensure that these conclusions based on synthetic data also apply to real data, we repeat the study using two families of empirical data sets described in Appendix A and combine all the results (with n = 50 for the synthetic data sets) in Fig. 3, where we see that the previously observed trends indeed hold, across an even wider range of kurtosis values.

Kurtosis and iteration count
With an understanding of how kurtosis affects core set size, it is not surprising that kurtosis is also a good predictor of iteration count. To begin demonstrating this, we first confirm our intuition that a larger core set size results in a more difficult MVEE problem. In Fig. 4, each data point indicates the number of iterations to convergence for an MVEE problem having the indicated core set size, with the dimension n indicated by the color code. For both Newton's method and coordinate ascent, a larger core set size means an increased iteration count. There are a few differences of note, however. The relationship between core set size and iteration count for Newton's method is almost linear, whereas for coordinate ascent it is highly nonlinear, depends more strongly on n, and iteration count increases much more sharply with core set size than is the case for Newton's method.
Finally, we look directly at the relationship between kurtosis and iteration count in Fig. 5, which includes both empirical and synthetic data sets (with n = 50 for the latter). We see a clear and fairly consistent relationship between kurtosis and iteration count, with increasing kurtosis being associated with decreasing iteration count.

Fig. 4 Iterations versus core set size for sinh-arcsinh transformed data using Newton (left) and coordinate ascent (right)

Fig. 5 Iteration count versus kurtosis for both simulated and empirical data for Newton's method (top) and coordinate ascent (bottom). Legend is as for Fig. 3

There are two distinct but related reasons for this close relationship among kurtosis, core set size, and iteration count:
- Since constraints tend to be added or dropped one at a time, the larger the core set, the more iterations are needed to identify it.
- The density of points near the surface of the ellipsoid determines the number of plausible candidates for the core set, which affects the number of iterations required to determine which of those candidates are actually in the core set. For high kurtosis data, there are not only relatively few points in the core set, but also fewer candidates to choose from, whereas for low kurtosis data, there are not only more points in the core set, but even more candidates from which to choose.
In summary, we have identified a single metric, kurtosis, that is effective in capturing problem difficulty for MVEE data sets, which has at least two important practical implications:
- It enables categorization of test problems to ensure that proposed MVEE algorithms are tested over a wide range of problem difficulties, in contrast to some previous studies that were severely limited in the range of problems tested, leading to conclusions of correspondingly limited validity.
- It helps inform our choice of algorithm and initialization in computing the MVEE for a given data set, as we will see in detail throughout the remainder of this paper.

Initialization
The first step in the algorithms presented in Sect. 2 is to choose a feasible initial guess û, which must satisfy the constraints given in Eq. (7). In making this choice, we first note that it must reflect the sparsity that accounts for the efficiency, or even the viability, of the algorithms from Sect. 2. For example, equal weighting of all the variables, i.e., taking û_i = 1/m, i = 1, ..., m, as in [13], would be a disastrous choice computationally when m ≫ n. Thus, there are two distinct issues in choosing an initial guess:
- Which variables û_i should be nonzero?
- What numerical values should be assigned to those nonzero variables?

As we demonstrated in [4], however, it turns out that the second question is essentially moot, in that even a small perturbation to the true solution values for the nonzero variables renders it no better as an initial guess than random values for those variables. Thus, since finding "good" initial values is hopeless, we adopt the usual practice of evenly weighting the initial nonzero values, and focus our attention on the first question, finding a good initial approximation to the core set. As we observed in Sect. 3, computing the kurtosis of the given data set will give us a good indication of the likely size of the core set, so that we will know about how many points to seek. We next consider various schemes for selecting those points. Here we will give only high-level descriptions; see [4] for further details, including pseudocode for each scheme.

Norm-based schemes
Since the ellipsoid is determined by the core set of points lying on its surface, the motivation behind all initialization schemes is to identify relative outliers in the data set that are more likely to end up in the core set, and therefore correspond to dual variables that have nonzero values. A simple way of identifying outliers is to compute some norm, say the 2-norm, of each data point and choose the k largest values, where k is the number of points we seek. A more sophisticated (but more expensive) version of this scheme, called SCI, uses the sample covariance matrix of the data to create a weighted norm that better reflects the overall shape of the data set [18]. The simple norm-based initialization schemes are relatively inexpensive to compute, and they can generate any desired number of points, but they may select "redundant" points that lie near each other.
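A sketch of this simple norm-based selection is shown below, with an optional covariance-weighted variant in the spirit of SCI (the exact weighting used by SCI may differ); the function name is ours.

import numpy as np

def norm_init(X, k, use_covariance=False):
    # Return indices of the k points (columns of X) with the largest (possibly weighted) norms
    if use_covariance:
        C = np.cov(X)                          # n x n sample covariance (columns of X are points)
        W = np.linalg.solve(C, X)
        scores = np.einsum('ij,ij->j', X, W)   # weighted norms x_i^T C^{-1} x_i
    else:
        scores = np.einsum('ij,ij->j', X, X)   # squared 2-norms
    return np.argsort(scores)[-k:]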

Orthogonalization-based schemes
To avoid such redundancy, the remaining data points can be orthogonalized against each previously chosen point before choosing the next one. One way to implement this idea, while still maximizing 2-norms, is QR factorization with column pivoting, originally designed to select a set of linearly independent columns in potentially rank-deficient least squares problems [5]. A similarly motivated alternative specifically for MVEE problems is KY initialization [14], in which a random direction is chosen and orthogonalized against previous directions, if any, and then the maximum data point in that direction, determined by taking the absolute value of the dot product of each point with the direction vector, is chosen to be added to the core set. These orthogonalization methods can be further customized to a given data set by using the sample covariance matrix to redefine the norm used, combining, for example, KY and SCI initialization.
Although they avoid potential redundancy, orthogonalization methods are more expensive to compute (O(mn^2) instead of O(mn) for the simple norm-based schemes) and are limited to generating only n points, since there are only n mutually orthogonal directions. If more than n points are desired, an orthogonalization method could simply be repeated on the remaining data set after removing the previously chosen points, although this would reintroduce the possibility of redundancy.
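For QR initialization, the pivot order produced by QR factorization with column pivoting already ranks the columns by orthogonalized norm, so a minimal sketch (function name ours) is simply:

from scipy.linalg import qr

def qr_init(X, k=None):
    # Return up to min(k, n) column indices chosen by QR with column pivoting
    n = X.shape[0]
    k = n if k is None else min(k, n)          # only n mutually orthogonal directions exist
    _, _, piv = qr(X, mode='economic', pivoting=True)
    return piv[:k]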

Comparing initialization schemes
Extensive computational testing of all of these initialization schemes is reported in detail in [4], the main conclusions of which we will summarize here. Not surprisingly, the results depend on the distribution of the data. In terms of the number of correct points identified (relative to the true core set for the fitted ellipsoid), for problems of varying size in both m and n, the simple norm-based initializations are the most accurate for low kurtosis problems, but they are the least accurate for high kurtosis problems. The opposite holds for the orthogonalization-based initialization schemes, among which QR initialization has a substantial edge in accuracy for low kurtosis problems and a slight edge for high kurtosis problems. Overall, QR initialization is the most accurate initialization scheme across a range of problem types.
The initialization schemes are also compared in [4] with respect to how much difference they make in the performance of specific optimization methods, measured in terms of both number of iterations to convergence and total time to solution (including initialization). For Newton's method, the simple norm-based initialization schemes again perform well for low kurtosis problems but poorly for high kurtosis problems, while QR initialization is the strongest performer across a range of kurtosis values. The performance of the coordinate ascent algorithm is much less sensitive to the initialization scheme, especially for low kurtosis problems, for which even the best schemes yield performance only marginally better than random initialization.
To make a fair comparison across all available initialization schemes, the foregoing tests limited each scheme to selecting only n points for the approximate core set, the lower bound of the true core set size. Since the true core set may contain significantly more than n points, computational experiments are also reported in [4] to assess the effectiveness of continuing the initialization process to select more than n points, using both the simple norm-based schemes (which extend naturally for this purpose) and repeated QR initialization.
First considering accuracy, for all of the initialization schemes and for both low- and high-kurtosis problems, the number of correct points selected continues to grow, at least initially, but the fraction that are correct declines, meaning that more incorrect points are also selected, which will have to be eliminated by the optimization algorithm at significant cost. Because of this tradeoff, additional points should be added with caution. In particular, although the kurtosis of a given data set enables us to predict the size of its core set reasonably well, the declining accuracy of the initialization schemes suggests that further points beyond the minimum of n should be added more sparingly than such an estimate would indicate. Once again, the simple norm-based initializations are most accurate for low kurtosis problems, while QR initialization is most accurate for high kurtosis problems.
Similar results also hold for the performance of the optimization algorithms when initialized using the extended schemes with more than n points. For a more concrete perspective on the issues involved, consider Fig. 6, in which the size of the evolving approximate core set is plotted as iterations proceed, from initialization to convergence, for both Newton's method and coordinate ascent for a low kurtosis problem using each of the indicated initializations. In these plots, once the correct core set has been identified, the curve becomes essentially horizontal thereafter.
We observe that the behavior of the two algorithms could hardly be more different: Newton's method spends virtually all of its time identifying the correct core set and then converges almost immediately, whereas coordinate ascent identifies the correct core set relatively quickly and then expends an immense number of further iterations converging to the optimal solution. As a particularly revealing benchmark, when both algorithms are initialized "perfectly" (i.e., started with the true core set), Newton's method converges almost instantly, whereas for coordinate ascent perfect initialization yields only slightly better performance than a random initialization (!), and actually performs worse than when using some of the approximate initial core sets. We also note from these plots that an initial core set with n√n points yields better performance than using a somewhat larger initial core set that might be predicted from the kurtosis value of the data (see Fig. 2). The reason for this is the declining accuracy of the initialization schemes as more points are added, as indicated by the downward movement of the curves as erroneous points are removed from the approximate core set.

Initialization guidelines
Major conclusions from the initialization testing include:
- Newton's method is much more sensitive to the choice of initialization than is coordinate ascent;
- Despite being the cheapest option, simple norm-based initialization is highly effective for low kurtosis problems but not for high kurtosis problems;
- QR initialization is the single most broadly effective initialization across a wide range of problem types;
- Although the kurtosis of a given data set enables us to predict the size of its core set reasonably well, the declining accuracy of the initialization schemes as more points are added suggests that further points beyond the minimum of n should be added more sparingly than such an estimate would indicate.
These conclusions are reflected in Table 1, which summarizes what can be considered best practices for various combinations of initialization scheme, initial core set size, and optimization method. We will follow these guidelines for the performance testing results given in the remainder of this paper.

Hybrid algorithm
When solving a constrained optimization problem using an active set strategy, there tends to be a sharp change in the rate of progress once the correct set of constraints (i.e., those that are binding at the optimal solution) is identified. Prior to that point the algorithm primarily adds and deletes constraints in search of the correct active set, typically making relatively slow progress that bears little relation to the asymptotic convergence rate of the underlying method. Once the correct active set has been identified, however, we then effectively have an equality-constrained problem for which the asymptotic convergence rate of the underlying method can be expected to hold. These two distinct phases are evident in Fig. 7 for Newton's method (left), where the initial phase proceeds exceedingly slowly, followed by extremely rapid convergence once the correct core set has been identified. Coordinate ascent (right) does not seem to exhibit the same two-phase behavior in Fig. 7, but this is because its asymptotic convergence rate is no better (indeed, perhaps worse) than that of the initial phase. The behavior of both methods is consistent with Fig. 6, where we see that Newton's method spends virtually all of its time finding the correct core set, after which it converges almost instantly, whereas coordinate ascent spends the vast majority of its time converging to the optimal solution after having found the correct core set relatively early.
In general, such behavior suggests a hybrid algorithm, using a method with low overhead per iteration (typically with slow but at this point irrelevant asymptotic convergence) for the initial phase, followed by an algorithm with faster asymptotic convergence (but typically with higher overhead per iteration) for the final phase. Applying this idea to the MVEE problem, combining Newton's method with coordinate ascent is potentially appealing, but it introduces a new issue without an obvious answer: how to know when to switch algorithms.
We address this issue by performing computational experiments combining the two methods to see how sensitive the resulting performance is to various transition points. The results are shown in Fig. 8, where we see that (1) the hybrid algorithm is indeed capable of significantly outperforming both of its constituent algorithms, and (2) its performance is not highly sensitive to the transition point between them. Note that these conclusions are independent of whether Newton or coordinate ascent performs better for a given data set (we deliberately chose one of each to illustrate this point). Thus, such a hybrid algorithm appears promising, and choosing a good transition point seems manageable.
In automating the choice of transition point for the hybrid algorithm, it is helpful to think of coordinate ascent as being used, in effect, as an initialization scheme for Newton's method. If run long enough, coordinate ascent is likely to identify a rather accurate approximate core set, but it need not be exact to still be highly effective as an initialization for Newton's method. A simple way to accomplish this in the hybrid algorithm is to monitor the approximate core set for coordinate ascent and transition when it stops changing. It is not clear, however, how long to wait after a change before further changes are deemed unlikely to occur. There is a tradeoff between making the approximate core set as accurate as possible and benefitting from the faster convergence rate of Newton's method as early as possible. Computational experience suggests that a longer wait tends to be more worthwhile for low kurtosis problems than for high kurtosis problems, presumably because of the larger core sets of the former. We have found that transitioning after the core set remains unchanged for 30 iterations of coordinate ascent strikes a reasonable balance between the two, and in our experience it is effective for a wide range of problems.
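A sketch of this switching rule is shown below; coordinate_ascent_step, newton_solve, and core_set are hypothetical helpers standing in for the implementations discussed in Sect. 2, and the 30-iteration patience is the value suggested above.

def run_hybrid(problem, u0, patience=30, tol=1e-6):
    # Run coordinate ascent until the approximate core set (the nonzero u_i)
    # has not changed for `patience` iterations, then hand off to Newton's method.
    # core_set, coordinate_ascent_step, and newton_solve are hypothetical helpers.
    u = u0
    prev_core, unchanged = core_set(u), 0
    while unchanged < patience:
        u = coordinate_ascent_step(problem, u)
        cur_core = core_set(u)
        unchanged = unchanged + 1 if cur_core == prev_core else 0
        prev_core = cur_core
    return newton_solve(problem, u, tol=tol)   # fast final convergence from a good core set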

Performance testing
Comprehensive testing of the performance of all of the algorithms listed in Sect. 2 is reported in [4] (note, however, that the timings reported there are for a different computer from those reported here). With eight algorithms, three problem regimes (low, high, and very high kurtosis), and varying problem sizes (both m and n), the results are voluminous. Here we will merely summarize the preliminary results, with first-order and Newton-like methods grouped separately, and then present in greater detail the final results pitting the best performers against each other.
Implementations of all methods and initialization schemes were written in Python 3.8, with the NumPy [21] and SciPy [22] libraries used extensively to maximize the efficiency of the code [3]. For all problems, ε = 10^{-6} is used as the error tolerance in the convergence test (see Eq. (11) and Algorithm 2), which yields a reasonable level of accuracy for double precision computation. If higher accuracy were needed, a tighter tolerance would tend to favor higher-order methods, especially Newton's method, whereas for applications requiring less accuracy, a looser tolerance would make lower-order methods more viable.

First-order methods
The three first-order methods (coordinate ascent, gradient ascent, and conjugate gradient) all compute the gradient in some form at each iteration, but they neither compute nor approximate the Hessian matrix. The advantages noted in Sect. 2.1 enabled coordinate ascent to maintain a substantial and consistent superiority over the other first-order methods in the preliminary performance results reported in [4] across all test problems, except for low kurtosis problems at the highest values of m and n tested, for which the performance of all three first-order methods was about the same. Overall, coordinate ascent was the clear winner among the first-order methods.

Newton-like methods
The Newton-like methods (Newton's method, truncated Newton, and BFGS) compute both the projected gradient and the projected Hessian matrix (or an approximation to it) in some form at each iteration. Newton's method both computes and factors the projected Hessian at each iteration. Truncated Newton methods avoid the factorization by using an iterative method to solve the linear system for the computed step at each iteration. BFGS methods update a factored form of the approximate projected Hessian at each iteration. For these tests, all versions of BFGS were started with a factored form of the computed projected Hessian matrix at the initial point. Typically, Newton's method (as well as its truncated versions) does not require a line search, whereas all versions of BFGS do.
In the preliminary tests of Newton-like methods reported in [4], the Newton and truncated Newton methods consistently outperformed the BFGS methods by a significant margin across all problem types, probably due mainly to the line search required with BFGS. Although Newton's method generally outperformed the truncated versions, the margin was fairly narrow, especially for larger values of n, for which the cost of the matrix factorization in Newton's method becomes more substantial. Although Newton's method was the overall winner among Newton-like methods, the truncated versions should be kept in mind for problems with even larger values of n.

Best methods comparison
We next compare the performance of the two best performing algorithms, coordinate ascent and Newton's method, along with that of the hybrid algorithm from Sect. 5. For all test problems and algorithms, the initializations used are as given in Table 1. The results are shown in Fig. 9, where the solution times for all three algorithms are plotted for a series of test problems of varying size and kurtosis values. For the plots on the left, the dimension n is fixed and the number of points m varies, while for the plots on the right, m is fixed and n varies.
From the plots we see that for low kurtosis data, the hybrid algorithm often outperforms Newton's method, especially for larger problems, and both substantially outperform coordinate ascent. For high kurtosis data, Newton's method and the hybrid algorithm are closely competitive with each other, and both continue to outperform coordinate ascent by a significant margin, especially for larger problems. Finally, for very high kurtosis data, all three algorithms perform similarly, and there is little to choose between them.
Overall, these test results confirm that with a suitably optimized implementation, Newton's method can indeed be made competitive with the previous state-of-the-art coordinate ascent algorithm, even for very large MVEE problems, and that the hybrid algorithm is a highly competitive choice for a wide range of MVEE problems.

Concluding summary
The main contributions of this paper are as follows.
- By carefully exploiting sparsity in the dual formulation of the MVEE problem, we showed that projected versions of both the gradient and the Hessian can be computed at a cost independent of the number of data points, m.
- By employing a constrained optimization framework that exploits these new capabilities, we implemented a suite of high-order algorithms, up to and including Newton's method, that are vastly more efficient than previously thought possible, scaling to MVEE problems with many thousands or millions of data points.
- In particular, we demonstrated that the performance of our implementation of Newton's method is competitive with the previous state-of-the-art coordinate ascent algorithm, even for very large problems.
- Combining the respective strengths of Newton's method and coordinate ascent, we developed a new hybrid algorithm for MVEE problems that is highly competitive for a wide range of MVEE problems.
- We showed that the computational difficulty of MVEE problems can be usefully characterized by the kurtosis of the corresponding data set, enabling a good estimate of the size of the core set and providing valuable guidance on the best choice of solution algorithm and initialization for a given problem.
- We explored a number of new initialization schemes and showed that simple norm-based initializations (for low kurtosis MVEE problems) and QR initialization (for a wide range of MVEE problems) often outperform previously proposed initializations.
These results nicely complement those in [20], significantly expanding the available capabilities for computing MVEEs for a much wider variety of data sets.
Author Contributions Both authors contributed substantially to the conception, execution, and reporting of this work.
Funding Both authors were supported in part by the endowment for the Fulton Watson Copp Chair in Computer Science at the University of Illinois at Urbana-Champaign.
-Very high kurtosis: Gaussian distribution with Cauchy weighting [20]; kurtosis can be hundreds, thousands, or higher, depending on problem size.

A.2 Collective distributions
For tests involving a range of kurtosis values, we use the following sets of distributions:
- Sinh-arcsinh transformed data: the sinh-arcsinh transform [12], when applied to Gaussian data, has a parameter that allows for control over the kurtosis of the transformed data, which we use to generate simulated data sets that have a wide variety of kurtosis values but otherwise have similar properties, thereby isolating kurtosis as a single factor (see the sketch after this list). We sometimes refer to such data sets as "specific," as they arise from a single distribution but with varying kurtosis values.
- Various distributions: random data generated from a variety of distributions and various parameter values within a distribution, including Gaussian, beta, binomial, Laplace, gamma, and several others. We sometimes refer to such data sets as "general" because they include many different distributions with varying kurtosis values.
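As an illustration of how such data can be generated, the following sketch applies one common parameterization of the sinh-arcsinh transform to standard Gaussian samples; the direction of the tail-weight parameter (smaller delta giving heavier tails and hence higher kurtosis) is our assumption about the convention, not a detail taken from the paper, and the function name is ours.

import numpy as np

def sinh_arcsinh_data(n, m, delta, rng=None):
    # n x m data matrix: each entry is sinh(arcsinh(z)/delta) for z ~ N(0,1);
    # delta < 1 gives heavier tails (higher kurtosis), delta > 1 lighter tails
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, m))
    return np.sinh(np.arcsinh(Z) / delta)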

A.3 Empirical data sets
- MNIST: Modified National Institute of Standards and Technology database of handwritten digits, commonly used for training image processing systems and other machine learning applications [7]. The database contains thousands of images, each of which consists of an array of grayscale pixel values. In terms of the MVEE problem, m is the number of images and n is the number of pixels in each image. In our adaptation, n = 49.
- BoW: Bag-of-Words data set, commonly used for training text classifiers based on the frequency of occurrence of words from a given dictionary in a given set of documents [7]. The data set contains many thousands of documents from five distinct sources. In terms of the MVEE problem, m is the number of documents and n is the number of words in the dictionary. In our adaptation, n = 50.

Fig. 1
Fig. 1 Core set size versus kurtosis for two families of synthetic data sets

Fig. 2
Fig. 2 Maximum and minimum (left) and median (right) observed core set sizes as a function of n for indicated synthetic data sets

Fig. 3
Fig. 3 Core set size versus kurtosis for both synthetic and empirical data sets

Fig. 6
Fig. 6 Size of evolving approximate core set as iterations proceed for Newton's method (left) and coordinate ascent (right) for low kurtosis problem using each indicated initialization

Fig. 8
Fig. 8 Solution times (blue bullets) for hybrid algorithm using transition iteration indicated on horizontal axis for low kurtosis data (left) and very high kurtosis data (right)

Fig. 9
Fig. 9 Solution times for Newton's method, coordinate ascent, and hybrid algorithm for low kurtosis data (top row), high kurtosis data (middle row), and very high kurtosis data (bottom row)

Table 1
Recommended initialization scheme and initial core set size for various situations

Fig. 7
Constraint changes during Newton's method (left) and coordinate ascent (right) for an MVEE problem with low kurtosis. Error reported is 2-norm of difference between approximate and exact values for u