An active-set algorithm for norm constrained quadratic problems

We present an algorithm for the minimization of a nonconvex quadratic function subject to linear inequality constraints and a two-sided bound on the 2-norm of its solution. The algorithm minimizes the objective using an active-set method by solving a series of Trust-Region Subproblems (TRS). Underpinning the efficiency of this approach is that the global solution of the TRS has been widely studied in the literature, resulting in remarkably efficient algorithms and software. We extend these results by proving that nonglobal minimizers of the TRS, or a certificate of their absence, can also be calculated efficiently by computing the two rightmost eigenpairs of an eigenproblem. We demonstrate the usefulness and scalability of the algorithm in a series of experiments that often outperform state-of-the-art approaches; these include calculation of high-quality search directions arising in Sequential Quadratic Programming on problems of the CUTEst collection, and Sparse Principal Component Analysis on a large text corpus problem (70 million nonzeros) that can help organize documents in a user interpretable way.

eigenvalue problems [29], dimensionality reduction [21] and compressed sensing [5], are naturally posed on or in the unit sphere, while norm constraints are often useful for regularization, e.g. in general nonlinear [7] or robust optimization [20]. In this paper we consider the solution of norm constrained problems in the form minimize f (x) := 1 2 where x ∈ R n is the decision variable, P ∈ S n , i.e. an n × n symmetric matrix, A = [a T 1 . . . a T m ] ∈ R m×n , q ∈ R n , b = [b 1 . . . b m ] T ∈ R m and r min , r max nonnegative scalars.
It is generally difficult to solve problems of the form (P) due to the nonconvexity of the lower bounding norm constraint and the potential indefiniteness of the matrix P , which renders many, but certainly not all [20], of these problems intractable. Even finding a feasible point for (P) or testing if a first-order critical point of (P) is a local minimizer can be intractable (see Proposition 4 and [16]). As a result, we restrict our attention to the search for first order critical points and we assume that a feasible point is given or can be computed.
A specific tractable variation of (P) is the famous Trust-Region Subproblem 1 : Perhaps surprisingly, the TRS can be solved to global optimality despite its nonconvexity, and global solution of the TRS has been widely studied in the literature. Specific approaches to its solution include exact Semidefinite Reformulations, Gradient Steps, Truncated Conjugate Gradient Steps, Truncated Lanczos Steps and Newton root-finding, with associated software packages for its solution. The reader can find details in [7] and [27,Chapter 4].
We suggest an active-set algorithm for the solution of (P) that, as we proceed to describe, takes advantage of the existing efficient global solution methods for the TRS. Using such an approach, (P) is optimized by solving a series of equality constrained subproblems. Each subproblem is a modification of (P) where a subset of its inequalities (called the working-set) is replaced by equalities while the rest are ignored. If a subproblem does not include a norm constraint then it is called an Equality-constrained Quadratic Problem (EQP); otherwise it is a Trust Region Subproblem (TRS). As described earlier, the global solution for the TRS, and also for the EQP [14], has been widely studied in the literature. Thus we can efficiently address the global solution of each subproblem. However, a complication exists for the TRS because, unlike EQP, it can exhibit local-nonglobal minimizers i.e. local minimizers that are not globally optimal, which complicate the analysis. Indeed, if the optimal working-set includes a norm constraint and the global solutions of the respective TRS subproblem are infeasible for (P), then the solution of (P) must be obtained from a local-nonglobal optimizer of the TRS subproblem.
Thus, an algorithm for computing local-nonglobal minimizers of the TRS is required for our active-set algorithm. Unfortunately, algorithms for detecting the presence of/computing local-nonglobal solutions of (TRS) are significantly less mature than their global counterparts. A notable exception is the work of Martinez [25], which proved that there exists at most one local-nonglobal minimizer and gave conditions for its existence. Moreover, [25] presented a root-finding algorithm for the computation of the Lagrange Multiplier associated with the local-nonglobal minimizer that entailed the computation of the two smallest eigenvalues of the Hessian and the solution of a series of indefinite linear systems.
More recent work is based on a result from [1], which shows that each KKT point of the TRS, in the absence of linear equality constraints Ax = b (i.e. with a norm constraint on x only), can be extracted from an eigenpair of This result was used in [30] to calculate the local-nonglobal minimizer under the assumption that it exists. We extend these results by proving that both checking if the local-nonglobal minimizer exists and computing it, in the case where it exists, can be performed at the same time by simply calculating the two rightmost eigenpairs of M . Crucially, and similarly to the results of [1] for the global minimizer, this allows the use of efficient factorization-free methods such as the Arnoldi method for the detection and computation of the localnonglobal minimizer. Furthermore, we show that this approach can efficiently handle equality constraints Ax = b in the Trust Region Subproblem, without the need for variable reduction, by means of a projected Arnoldi method. The paper is organised as follows. In Section 2 we present an introduction to the TRS and present novel results for the detection/computation of its localnonglobal minimizer and for the incorporation of linear equality constraints. In Section 3.1 we introduce an active-set algorithm for solving (P) starting from the special case where r min = r max , and then generalizing to any r min , r max . Finally, in Section 4, we conclude with a series of numerical results.
Notation used: S n denotes the set of symmetric n × n real matrices. Given a vector (or matrix) x, x T denotes its transpose and x H its conjugate transpose.

The Trust-Region Subproblem
This section concerns the computation of global as well as local-nonglobal minimizers for (TRS). Although this section only considers solutions on the boundary x = r, the results can be trivially extended to the interior x ≤ r. Indeed, if an interior solution exists then the TRS is essentially convex (P is positive semidefinite in the nullspace of A [15, §1]), thus no local-nonglobal solution exists and the interior solution is obtained by solving an equality constrained quadratic problem. The presence of a (necessarily global) interior solution can be detected by checking the sign of the Lagrange Multiplier of the global boundary solution(s) [25,Lemma 2.2].
For clarity of exposition, we will first assume that there are no linear equality constraints in (TRS), resulting in the following problem: where x ∈ R n is the decision variable, and P ∈ S n , q ∈ R n , r ∈ R + are the problem data. For the rest of the section we will assume that n > 1, excluding the trivial one dimensional case where the feasible set consists of two points, and that r > 0. We will then show in §2.1 how to extend the results of this section to allow for the inclusion of linear equality constraints. Finally, in §2.A we prove the main result of this section, which is presented in a separate subsection to facilitate the presentation. In the sequel we will make frequent use of the eigendecomposition of P := W ΛW T defined by an orthonormal matrix W := [w 1 . . . w n ] and Λ := diag(λ 1 , . . . , λ n ) where λ 1 ≤ · · · ≤ λ n .
Every KKT point of (T) satisfies where µ is a Lagrange multiplier associated with the norm constraint 2 . Equation (1) is well defined when µ = −λ i , i = 1, . . . , n. For the purposes of clarity of the subsequent introduction, which follows [27, §4], we will assume that this holds. However, the conclusions of this section are independent of this assumption.
Using the feasibility of x, we have Then, noting that we can express the rightmost condition in (2) as 2 Technically, µ is the Lagrange multiplier of the equivalent constraint 1 2 x 2 2 = 1 2 r 2 . We define (T) with x 2 = r for simplicity and to match the notation of [1]. Determining the KKT points of (T) is therefore equivalent to finding the roots of s, which is often referred to as the secular equation [27]. We depict s for a particular choice of P, q, r in Figure 1. As might be expected from the tight connection between polynomial root-finding problems and eigenproblems, solving s(µ) = 0 is equivalent to the following eigenproblem [1,Equation (22)]: This elegant result was first noted more than 50 years ago [9,Equation (2.21)], only to be later disregarded as inefficient by some of the same authors [10], and then rediscovered by [1] who highlighted its great applicability owing to the remarkable efficacy of modern eigensolvers [24]. The relation between the spectrum of M and the Lagrange multipliers of (T) is formally stated below: The spectrum of M includes the Lagrange multiplier of every KKT point of (T).
Proof This is a consequence of Proposition 2 of subsection 2.A and [9,Theorem 4.1].
Note that the eigenvalues of M are complex in general since M is asymmetric. Furthermore, Lemma 1 and all the subsequent results allow for potential "degenerate" KKT points with µ = −λ i .
We will focus on the KKT points that are minimizers of (T) rather than maximizers or saddle points. The characterization of global minimizers is widely known in the literature. The following theorem shows that the Lagrange multiplier µ g of the global minimizer corresponds to the rightmost eigenvalue of M in C, which is always real: Theorem 1 A KKT point of (T) with Lagrange multiplier µ g is a global minimizer if and only if µ g is the rightmost eigenvalue of M . Furthermore, we necessarily have that µ g ∈ [−λ 1 , ∞).
In general, (T) is guaranteed to possess a unique global optimizer, except perhaps in a special case that has deservedly been given a special name: Definition (Hard case) Problem (T) belongs to the hard case when µ g = −λ 1 .
We will see that in the hard case (T) is still guaranteed to have a global minimizer, but it is not necessarily unique. Furthermore, the computation of the minimizer(s) can be more challenging than in the "standard" case.
Moreover, due to the nonconvexity of (T) there can exist a local minimizer that is not global. We derive an analogous result for this minimizer that is called "local-nonglobal" [25]: Theorem 2 Problem (T) has a second-order sufficient local-nonglobal minimizer if and only if it does not belong to the hard case and the second-rightmost eigenvalue of M is real and simple. Furthermore, if such a minimizer exists, then its Lagrange multiplier µ is equal to this second-rightmost eigenvalue.
The Lagrange multipliers for the global and local-nonglobal minimizers of (T) can therefore be identified by calculating the two rightmost eigenvalues 3 of M , which can be done efficiently e.g. with the Arnoldi method.
Furthermore, for each of the respective multipliers a corresponding minimizer can be extracted immediately by the respective eigenvectors [z * 1 ; z * 2 ] of M using as unless z * 1 = 0. Indeed, this follows from the next Proposition: Proof This proof is based on [1, §3.3] that shows the result only for the rightmost eigenpair of M . We will first show the result for y * = −r 2 q T z * 2 z * 1 and then show that y * = x * . Note that using x * is preferred over y * because it can avoid unnecessary numerical errors according to [1, §3.3]. We also emphasize that all of the proof assumes that z * 1 is nonzero.
For y * , we have to show that it is well defined, feasible and stationary. Regarding the first two points, note that (4) (first row) gives or since (P +µ * I)z * 2 = z * 1 due to (4) (second row), we have z * which implies that q T z * 2 = 0 because z * 1 = 0 by assumption. Thus: To show stationarity, note that the first row of (4) also gives: Finally, we show that y * = x * : Dealing with the hard case: Extracting a solution x * via (5) is not possible when z * 1 = 0, since (5) is not well-defined in that case. In this case, (µ * , z * 2 ) is an eigenpair of −P due to (4). Note that, according to [25,Lemma 3.3], this can never happen for the local-nonglobal minimizer; thus we can always extract the local-nonglobal minimizer with (5). However, it is possible for global minimizers in the hard case, i.e. the case where µ * = −λ 1 . In the hard case, the necessary and sufficient conditions for global optimality are, due to the KKT conditions of (T) and Theorem (1), the following: Note that P − λ 1 I is singular, and the above system of equations represents the intersection of an affine subspace with a sphere. This intersection must be non-empty, since (T) necessarily has a global minimizer as it arises from the minimization of a smooth function over a compact subset of R n . One can then solve (6)-(7), by computing the minimum length solution y min of (P − λ 1 I)x + q = 0 and return x * = y min + αυ where α is a scalar such that x * 2 = y min + αυ 2 = r and υ is any null-vector of P − λ 1 I. Interestingly, when z * 1 = 0, z * 2 is a null vector of P − λ 1 I due to (4), thus the only additional computation for extracting a solution when z * 1 = 0 is finding the minimumlength solution of a symmetric linear system. This can be achieved e.g. with MINRES-QLP [6], or with the Conjugate Gradient method [1, Theorem 4.3].

Equality-constrained Trust-Region Subproblems
We will now extend the preceding results of this section to also account for the presence of linear equality constraints, Ax = b, in the TRS. We will show that, given an operator that projects an n−dimensional vector to the nullspace of A, we can calculate global as well as local-nonlocal minima of (TRS) by applying a projected Arnoldi method to M .
For simplicity, and without loss of generality 4 , we will assume that b = 0, thus solving the following problem: where x ∈ R n is the decision variable, P ∈ S n , i.e. an n × n symmetric matrix, A is a m×n matrix of full row rank and q is an n−dimensional vector. Similarly to the preceding results for (T), we assume that n − m > 1 and r > 0.
In principle, (8) can be solved with the "eigenproblem approach" described in this section if we reduce (8) into a smaller TRS subproblem via a nullspace elimination procedure [27, §15.3], obtaining minimize 1 2 y TP y +q T y subject to y 2 = r where we define Z to be a n × (n − m) orthonormal matrix with R(Z) = N (A) andP According to the preceding results of this section, global and local-nonglobal minimizer(s) y * of (9) can be computed by calculating the two rightmost eigenvectors ofM The respective solution(s) x * of (P) can then be recovered as Some disadvantages of this approach are that a nullspace basis is required and that the structure or sparsity of the problem might be destroyed. We will present an alternative method that avoids these issues, mirroring standard approaches for solving equality constrained quadratic programs [14]. To this end, note that Since Z is an orthonormal matrix, every eigenvalue-eigenvector pair {µ, [z 1 ;z 2 ]} ofM corresponds to an eigenvalue-eigenvector pair {µ, [z 1 ; z 2 ]} of the matrix with Z[z 1 ;z 2 ] = [z 1 ; z 2 ]. Note that (11) also has an extra eigenvalue at zero with multiplicity 2m.
Recalling that Z is an orthonormal matrix spanning N (A), we note that multiplication with ZZ T is equivalent to projection onto the nullspace of A. In practice, eigenvalues of (11) can therefore be calculated using the Arnoldi method on M where we project the starting vector and every requested matrixvector product with M onto the nullspace of A 0 0 A .
Note that the two rightmost eigenvalues ofM correspond to the two rightmost eigenvalues of (11) if at least two eigenvalues ofM have positive real parts. We will show that we can always transform (8) so that this holds. Indeed if we shift P to P − αI with an appropriate α ∈ R so thatP become negative definite 5 , thenM has at least two eigenvalues with positive real part: Lemma 2 IfP ∈ S n−m is negative definite thenM has at least two eigenvalues with positive real part, unless we trivially have n − m = 1.
Proof To avoid redefining the notation used in the introduction of §2 we will show the equivalent result for (P, M ), i.e. that M has at least two eigenvalues with positive real part when P is negative definite and n > 1.
To do so, it suffices to show that s has at most one root with non-positive real part. This is because M has 2n ≥ 4 eigenvalues and the spectrum of M matches the roots of s on the left-hand complex plane (shown in Proposition 2 that we will prove on §2.A, given that P is negative definite).
To show this, note that s has at most one (simple) real root with nonpositive real part since is positive for any α ≤ 0 since λ 1 , . . . , λ n < 0 by assumption. To conclude the proof, it suffices to show that all the roots of s with nonpositive real part are real. Indeed, note that for any a, b ∈ R we have Such an α can be obtained by applying e.g., the Lanczos method, or the Gershgorin circle Theorem on P .
Such a shift would not affect the (local/global) optimizers of (P) as or, since x 2 = r, where α ∈ R. Thus, we can always transform (8) so that the two rightmost eigenvalues ofM are equal to the two rightmost eigenvalues of (11).
Having computed the two rightmost eigenvalues/vectors with a projected Arnoldi method applied to M , global/local solutions of (P) can be obtained from the respective eigenvectors. Indeed, recall that according to (5) and (10), an appropriate (rightmost/second-rightmost) eigenvector [z * 1 ;z * 2 ] of (11) gives a solution x * to (P) as follows where [z * 1 ; z * 2 ] is an eigenvector of M . Thus, x * can be extracted solely by the eigenvector of M unless z * 1 = 0.
Dealing with the hard case: In the case where z * 1 = 0, which can appear in the hard case, one would have to calculate the minimum length solution of (P + µ * I)y +q = 0 (15) and then construct a global solution x * as x * = Z(y + αz * 2 ) = Zy + αz * 2 , where α is a scalar such that x * 2 = r. However, the minimum length solution of (15) can be obtained by the minimum length solution of (P + µ * I)x + q = 0 (16) which can be calculated via projected MINRES-QLP or a projected CG algorithm [1, Theorem 4.3] [14].
Algorithm 1: Calculation of global and local-nonglobal minimizers of the equality constrained trust-region subproblem (8) 1 Given the problem data P ∈ S n , q ∈ R n , r ∈ R + , A ∈ R m×n where A is full row rank with m < n − 1 and Π : C n → C n , a projector into the nullspace of A; 2 Shift P by a multiple of identity so that it becomes negative definite; 3 Run an Arnoldi method on for any x 1 , x 2 ∈ C n , starting from an initial point Π(z) ∈ C 2n , to calculate a rightmost eigenvalue/vector pair (µ g , z g ); Resume the projected Arnoldi method to compute a second-rightmost eigenvalue/vector (µ , z ); 7 if µ is real and simple then Return x g and x as the unique global and local-nonglobal minimizers; Solve a quadratic equation to find α 1 , α 2 such that x min + α 1/2 z g 2 2 = r; 16 Return x min + α 1 z g 2 , x min + α 2 z g 2 as global solutions, state that no local-nonglobal minimizer exists; 17 end 2.2 An algorithm for computing global and local solutions of the TRS Using the ideas presented in this section, we now present Algorithm 1, a practical algorithm for computing global and second-order sufficient local-nonglobal solutions of equality constrained trust-region subproblems. In the remainder of this section, we establish the correctness of Algorithm 1.
Algorithm 1 uses the results of this section (in particular Theorems 1 and 2, Equation (13) and the paragraph "dealing with the hard case" of §2.1) to compute the minimizers of (8). The only novel point is the way we detect the presence of the local-nonglobal minimisers, which we proceed to show to be valid.
If the TRS is in the hard case, then the algorithm should only return global minimiser(s). Indeed, this is the case as 1. If z g 1 = 0, then the algorithm terminates at line 16, according to the remarks of the paragraph "Dealing with the hard case" of §2.1. 2. If z g 1 = 0, then there must exist another eigenvector associated with µ g besides [z g 1 ; z g 2 ]. This is because µ g = −λ 1 thus according to Proposition 2 of §2.A there exists a vector u ∈ N (P − λ 1 I) that is orthogonal to q which makes (µ g , [0; u]) an eigenpair of M due to (4). Thus µ g = µ l , which implies that µ l is not simple. As a result the algorithm will only return a global minimizer using (5) at line 11.
If the TRS is not in the hard case, then we must have z g 1 = 0 (see remarks of paragraph "Dealing with the hard case" of §2). Thus the algorithm will return either in lines 9 or 11, detecting correctly the presence of the local-nonglobal minimiser according to Theorem 2.
Finally, we emphasize that in the hard case the TRS can have an infinite number of solutions, but only one or two of them are returned by Algorithm 1.

2.A Proof of Theorem 2
In this subsection we provide a proof of our main theoretical result, Theorem 2, which shows that (T) has a second-order sufficient local-nonglobal minimizer if and only if it does not belong to the hard case and the second-rightmost eigenvalue of the matrix with counted algebraic multiplicities, is real and simple. Furthermore, we show that if such a minimizer exists, its Lagrange multiplier µ is equal to this second-rightmost eigenvalue. The proof is based on the following Lemma, which follows from [25]: Before we use Lemma 3 we will need the following two ancillary results: Lemma 4 For any complex number a + bi ∈ C where a, b are real numbers such that a is in the domain of s and b = 0, we have: Proof Recall the definition of s: and define, for convenience, κ j = w T j q, j = 1, . . . , n. Thus or, equivalently, Re(s(a + bi)) < s(a). Proof We will prove this result with the argument principle. We propose encircling the half-space with the contour of Figure 2 and show that it maps to a contour that lies strictly in the left half plane. The suggested closed contour C intersects the real axis at x and at ∞. It consists of two segments: i.e. a segment that is parallel to the imaginary axis, and C 2 , a half-circle with infinite radius.
By the argument principle [2] we have 1 where Z, P are the number of roots and poles of s in the domain enclosed by C. It follows that the number of poles in S is equal to the number of zeros in S if C s (z) s(z) dz = 0, which holds if there are no poles or zeros of s on C and s(C) does not enclose the origin. By examining (3) we conclude that s has only real, finite poles. Thus, the contour C does not intersect any pole of s. It remains to show that there exist no zeros of s on C and s(C) does not enclose the origin. We show both by proving that every point in s(C) = s(C 1 )∪s(C 2 ) has negative real part. For s(C 2 ), it is easy to see that, since s( Regarding s(C 1 ) note that s(x) < 0, and thus Lemma 4 gives Re(s(x + βi)) ≤ s(x) < 0 ∀β ∈ R, i.e., any point in s(C 1 ) has negative real part. We will now connect the second-rightmost root of the secular equation s with the results of Lemma 3. Note that, due to Lemma 4, if s has a real root µ then it cannot also have an unreal root with real part µ and vice versa. Thus, the concept of "the second-rightmost root of s" is well defined as long as we assume (or are prepared to check) that this root is real.
Proof Note that in both cases q ⊥ w 1 , thus (T) does not belong to the hard case [1]. Hence s has a real root in In the subsequent analysis, we will frequently refer to the poles of s which are −λ i i = 1, . . . , n such that q T w i = 0 , and have double multiplicity, and the following parametric set We will begin with claim (i). Since µ < −λ 1 < µ g we conclude that µ is not the rightmost root of s. In order to show that it is the second rightmost, it suffices to show that there are at most two roots of s in S(µ).
We conclude that there should be extra roots for s in S(µ), besides the simple roots µ g and µ. Moreover, these extra roots cannot have real part equal to µ as according to Lemma 4 Re(s(µ + bi)) < s(µ) = 0 for any real b = 0. Thus their real parts must be in (µ, ∞). This means that µ is not the second-rightmost root of s, which is a contradiction.
Finally, if s (µ) < 0, then, using the same arguments as before, we conclude that there exist four roots in S(µ + ) for any sufficiently small > 0. Thus, there exist extra roots (besides the simple root µ g ) with real part in (µ, ∞), which, again, contradicts the assumption that µ is the second-rightmost root of s. Lemma 6, in combination with Lemma 3, provides an "almost" iff relationship between the second-rightmost root of s and the local-nonglobal minimizer of (T). However, the quantity of primary interest is the second-rightmost eigenvalue of M instead of the second-rightmost root of s. The following result characterizes the spectrum of M and its tight connection with the roots of s: The spectrum of M consists of: the roots of s with matched algebraic multiplicity; and all of the eigenvalues −λ i of −P that are non-simple for −P or have w i (the i−th eigenvector of P ) orthogonal to q.
Furthermore, every −λ i that is in the spectrum of M is non-simple for M .
Proof We will prove the result by deriving a closed form expression for the characteristic polynomial of M . Since −I and P + µI commute, we have [31, Theorem 3]: or, using the matrix determinant lemma, for all µ = −λ 1 , · · · − λ n . Thus, (20) and the characteristic polynomial of M coincide in all C except perhaps on 2n points. It follows from continuity arguments that (20) is in fact the characteristic polynomial of M . By examining (19) (2nd leftmost) and (20), we conclude that the eigenvalues of M include all the roots of s with matched algebraic multiplicity. The secular equation s has 2n roots whenever the eigenvalues of P are distinct and w T i q = 0 for all i = 1, . . . , n. Otherwise, i.e. for every i with λ i non-simple or w T i q = 0, M ∈ R 2n×2n has extra non-simple eigenvalues at these −λ i , which constitute the only differences between the spectrum of M and the roots of s.
The following result follows directly from the above Proposition and will be useful for the rest of this section.
We are now ready to prove the main result of this section, Theorem 2. The proof will occupy the rest of this subsection. Note, again, that, due to Lemma 4 and Proposition 2, if M has a real eigenvalue µ then it cannot also have an unreal eigenvalue with real part µ and vice versa. Thus, the concept of "the second-rightmost eigenvalue of M " is well defined as long as we assume (or are prepared to check) that this eigenvalue is real.
"only-if" In this case we know that µ is the Lagrange multiplier of a second-order sufficient local-nonglobal minimizer and we have to show that Points (i), and (ii) follow directly from Lemma 3 and Proposition 2 (note that q ⊥ w 1 implies that (T) is not in the hard case [1]). In order to prove (iii) we will first show that the spectrum (counted with algebraic multiplicities) of M with real part larger than −λ 2 coincides with the roots of s in the same region. This follows from Proposition 2, because λ 1 is simple (due to µ ∈ (−λ 2 , −λ 1 )) and q T w 1 = 0. Point (iii) then follows from Lemma 6(i) which shows that µ is the second-rightmost root of s, and thus the second-rightmost eigenvalue of M .
"if" In this case, we know that (T) is not in the hard case and that µ is the second-rightmost eigenvalue of M which is real and simple, and we want to show that µ is the Lagrange multiplier of the second-order sufficient local-nonglobal minimizer of (T).
Note that, by assumption, µ is in the spectrum of M but, due to Corollary 1, not in that of −P . Thus µ is a root of s due to Proposition 2. As a result, it suffices to show that µ is in (−λ 2 , −λ 1 ) and s (µ ) > 0 (Lemma 3).
We will first show that µ ∈ (−λ 2 , −λ 1 ). Since we are not in the hard case, M has a real eigenvalue in (−λ 1 , ∞) (Theorem 1) which is simple and unique in (−λ 1 , ∞) due to Proposition 2 and the fact that s (µ) = −2 n j=1 (w T j q) 2 /(λ j + µ) 3 is negative in that region. Thus, −λ 1 is not in the spectrum of M as otherwise it would be its second-rightmost eigenvalue (which by assumption is real) and it would imply that µ is in the spectrum of −P , which we have already excluded. Thus, µ ∈ (−∞, −λ 1 ), and µ = −λ 2 .
It remains to show that µ ≥ −λ 2 . If q T w 2 = 0 then, (−λ 2 , [0; w 2 ]) is an eigenpair of M , thus we must have µ > −λ 2 so as to avoid µ (the secondrightmost eigenvalue of M ) being in the spectrum of −P . Consider now the case where q T w 2 = 0. Note that λ 1 = λ 2 and q ⊥ w 1 as otherwise −λ 1 is in the spectrum of M (Proposition 2), which, according to the paragraph above is a contradiction. Thus, λ 1 = λ 2 and q ⊥ w 1 , w 2 resulting in µ ≥ −λ 2 due to Lemma 6(ii).
3 An active-set algorithm for (P) We are now in a position to present the main contribution of this paper, an active-set algorithm for solving (P). We will first present an algorithm for the special that r min = r max := r and then describe how to generalize for any r min , r max .
3.1 Solving (P) when r min = r max := r In this section we introduce a primal active-set approach for the optimization of (P) when r min = r max := r. It will be useful in the subsequent analysis to recall the Karush-Kuhn-Tucker (KKT) conditions of (P), which are As is typical from a primal active-set approach, our algorithm starts from a given feasible point of (P) and generates iterates that remain feasible for (P) and have non-increasing objective values. At every iteration we treat a subset of the inequality constraints Ax ≤ b as equalities. We refer to this subset as the working set W k and definē where [·] denotes vertical (row-wise) concatenation. To simplify the analysis, we will assume that one of the simplest and most common constraint qualification holds for every iterate of the algorithm:

Assumption 1 Linear Independence Constraint Qualification [LICQ]
The LICQ holds for every iterate of the suggested Algorithm, i.e.
Ā k x T k is full row rank.
We will first give a brief, schematic description of our algorithm. At every iteration k of our active-set based minimization procedure, we use Algorithm 1 to compute minimizers of the subproblem We then attempt to move towards those minimizers in that hope that we either (i) "hit" a new constraint or (ii) set x k+1 to a (potentially local) minimizer of (SP). Unfortunately due to the nonconvexity of the problem, we might achieve neither (i) nor (ii) by simply moving towards the minimizer of (SP), as the objective of (P) is not guaranteed to be non-increasing along these moves.
In that case, which we will show can happen only when (SP) has a unique minimizer, we perform a projected gradient descent followed (if necessary) by a move towards the global minimizer of (SP) that is guaranteed to achieve either (i) or (ii) 6 . Finally, at the end of each iteration, we check for termination and update the working set.
Our overall active-set procedure is shown in Algorithm 2. We now proceed in the following subsections that describe in detail each of the Algorithm's steps. For the rest of this section, we will say that a point is "feasible" when it is in the feasible region of (P). Also, given a point x we will call the projection of ∇f (x) to the nullspace of constraint normals of (SP), Ā k x T , as "projected gradient of f (·) at x".

Move towards the minimizers of (SP)
Notice that Algorithm 1 always returns a global minimizer x g of (SP) and possibly a second minimizer x * .
Let us consider first the case where two distinct minimizers are returned which is treated in lines 5-6 of Algorithm 2. In this case it is always possible to either set x k+1 to a minimizer of (SP) or "hit" a new constraint as follows. Consider the circle defined by the intersection of the sphere x 2 = r and the plane defined by x k and the two returned minimizers. When (SP) is constrained in this circle, then it can be reduced to a two-dimensional TRS. As such, we will show that it has at most 2 maximizers and 2 minimizers, except perhaps when x k is a global minimizer of (SP) in which case we can trivially set x k+1 = x k .
Proposition 3 A two dimensional TRS has at most 2 minimizers and 2 maximizers except when its objective is constant across its feasible region. if x k+1 is not a minimizer of (SP) and no new constraint was "hit" then 11 Run projected gradient descent starting from x k+1 until a new constraint is "hit" or until convergence, and store the resulting point in x k+1 ; 12 if PGD converged to a feasible x k+1 of indefinite projected Hessian then 13 Compute a suitable limiting direction d along which x k+1 is a local maximum; 14 x k+1 ← solution of the 2D subproblem on the plane defined by x k+1 , a global minimizer of (SP) and d; 15 if a new constraint was "hit" then 16 Obtain Proof It suffices to show the result only for the minimizers, as a negation of the TRS's objective would show the same result for its maximizers. Suppose that the TRS does not belong to the hard case. Then it has a unique global minimizer and at most one local-nonglobal minimizer [25]. If it belongs to the hard case, then only global minimizers can exist. These are intersections of the affine subspace (6) and the sphere (7). Note that this intersection is either a distinct point, two points or a circle. In the latter case, every feasible point of (T) is a global minimizer. This completes the proof.
We can identify the two minimizers of this two-dimensional TRS as x g and x * . It follows that at least in one of the circular arcs connecting x k with x g and x * the objective value is always less than f (x k ). Thus, by moving into that circular arc one will either end up with x g or x * or "hit" a new constraint.
On the other hand, if a single global minimizer x g was returned, which corresponds to lines 7-8 of Algorithm 2, then we consider the circle defined by the intersection of the sphere x 2 = r and the plane defined by x k , x g and the projected gradient of f (·) at x k . Obviously, if x g is feasible for (P) then we choose x k+1 = x g , but otherwise we are not guaranteed to "hit" a new constraint. This is because the associated two-dimensional TRS might possess a second minimizer, which is not necessarily a stationary point of (SP), but might be the best feasible solution on this circle.

Perform projected gradient descent (if necessary)
This subsection concerns the case where the procedure described in the previous subsection could neither set x k+1 to a minimizer of SP nor "hit" a new constraint. In this case (lines 10-14 of Algorithm 2), we proceed by performing projected gradient descent (PGD) on (SP) starting from the current iterate of the active-set algorithm. This is guaranteed to converge to a KKT point Theorem 4.5]. Suppose that x s is not feasible for (P). Since the feasible region is a closed set, PGD exits the feasible region in finitely many steps. By stopping the projected gradient descent just before it exits the feasible region, we can find a point x k+1 with a newly "hit" constraint.
Consider now the case where x s is feasible for (P). If the minimum eigenvalue λ min of the projected Hessian of f (·) at x s is nonnegative (lines 12-14 of Algorithm 2), then we set x k+1 = x s and we proceed to the steps outlined in the next subsection. Otherwise λ min < 0, and there exists a feasible sequence {z k } (w.r.t. to (SP)) converging to x s with an appropriate limiting direction d such that In practice, we can choose d equal to some projected eigenvector associated with λ min < 0. Consider now the circle defined by the intersection of the sphere x 2 = r and the plane defined by x s , x g and the limiting direction d.
As before, this can be reduced to a two-dimensional TRS that possesses at most 2 minimizers and 2 maximizers (Proposition 3). We can identify two of them: x g which is a global minimum and x s which, according to (26), is a local maximum. It follows that in at least one of the two circular arcs connecting x s with x g the objective value is always less than f (x s ). Thus, by moving into that circular arc, and since x s is feasible for (P) and x g infeasible, we can identify a suitable feasible point x k+1 such that f (x g ) ≤ f (x k+1 ) ≤ f (x s ) for which a new constraint, not in the current working set, becomes active.

Update the working set and check for termination
If a new constraint was identified in the steps above, then we update the working set and proceed to the next iteration (lines 15-16 of Algorithm 2).
Otherwise, x k+1 was set to a second-order necessary stationary point of (SP) and we proceed as follows (lines 17-18 of Algorithm 2). Stationarity of x k+1 gives for some Lagrange multipliers µ and κ i ∀i ∈ W k . Thus, (x k+1 , µ, κ) satisfy the first KKT condition (21) if we define κ i = 0 ∀ ∈ W k . It follows from feasibility of x k+1 and the definition of κ that the third and fourth KKT conditions (23)-(24) also hold. If we also have that κ i ≥ 0 ∀i ∈ W k , then (x k+1 , µ, κ) is a KKT pair for (P) and we terminate the algorithm (lines 23-24 of Algorithm 2). If, on the other hand, κ i < 0 for some i ∈ W k (lines 25-27 of Algorithm 2), then (22) is not satisfied and x k+1 is not a local minimizer for (P). In fact, the objective f (·) may be decreased by dropping one of the constraints corresponding to a negative Lagrange multiplier [27, Section 12.3].

Finite termination
We will show that Algorithm 2 terminates in finitely many outer iterations, assuming that x k+1 = x k for every k = 1, 2. . . . . Indeed at every iteration the algorithm either -Moves to a stationary point of the current TRS subproblem; or -Activates a new constraint.
Since there can be at most m constraints in the working set it follows that x k visits a stationary point of the k−th TRS subproblem periodically (at least every m iterations). Furthermore, note that every TRS subproblem, resulting from a particular working set, has at most 2n sets of stationary points of the same objective value [9, Theorem 4.1 and subsequent comments]. Since there are a finite number of different working set configurations, it follows that there exist finitely many such sets. On the other hand, every constraint that is deleted from the working set has an associated negative Lagrange multiplier, thus every iterate x k is not a stationary point of the next iteration. Furthermore, note that if x k is not a stationary point for the k + 1 subproblem then our algorithm generates x k+1 with f (x k+1 ) < f (x k ) unless x k = x k+1 which is excluded by assumption. This means that once the algorithm visits one of these sets of stationary points with equal objective value, it can never visit it again. Hence, the algorithm terminates after finitely many iterations. We believe that it is not possible to bound the number of iterations by a polynomial in m, n. Indeed, even finding an initial feasible point is NPcomplete, as we show in §3.2.1.

Further Remarks
Similarly to active-set algorithms for quadratic programs, we can always update the working set such thatĀ k is full row rank. However, LICQ might still fail to be satisfied when the gradient of the spherical constraint lies on R(Ā T k ). In these cases we terminate the algorithm without guarantees about local optimality.

Solving (P) for any r min , r max
We are now ready to present an active-set algorithm that solves (P) for any r min and r max . At every iteration of the suggested active-set algorithm, the spherical inequality constraint will either be in the working set or not. If it is, then we iterate as described in Algorithm 2; otherwise we proceed with a generic (nonconvex) quadratic programming active-set algorithm [27, §16.8], [12], [18]. We switch between the two algorithms when an iterate of the QP algorithm hits the spherical boundary or when the Lagrange Multiplier µ of the norm constraint in Algorithm 2 is negative and less than any other Lagrange Multiplier κ i .
In the special the case where r min = 0, the norm constraint reduces to x 2 ≤ r max . In this case, when a switch from Algorithm 2 to the QP Algorithm happens, the projected Hessian of the Lagrangian of f (·) at x k is positive semidefinite, i.e.
Q T (P + µI)Q 0 (27) where Q := [Q 1 q 2 ] is defined by the "thin" QR decomposition of [Ā T k x k ] and q 2 is an appropriate vector. Recall that for a switch to happen, we need µ < 0; hence Q T P Q is positive definite. Thus, Q T 1 P Q 1 , i.e. the projected Hessian of the next iteration that is to be handled by the generic QP algorithm, has at most one nonpositive eigenvalue [29,Corollary 4.1]. This means that even the popular, but less flexible, class of "inertia controlling" QP algorithms can be used as part of the suggested active-set algorithm for solving (P) when r min = 0.

Finding an initial point
Algorithm 2 is a primal active-set algorithm and, as such, it requires an initial feasible point. Unfortunately, finding such a point is, in general, NP-complete as we formally establish at the end of this subsection. Nevertheless, we can find a feasible point for (P) with standard tools, albeit in exponential worst-case complexity, as we proceed to show. Indeed, define the following problems: i.e. a convex minimization and a convex maximization problem. Denote with x * min and x * max the solutions to the convex minimization and maximization problems respectively. Then, (P) is feasible iff x * min 2 ≤ r max and x * max 2 ≥ r min , in which case a feasible point for (P) can be found by interpolating between x * min and x * max . The convex minimization problem above can be solved in polynomial time with e.g. interior point or first order [32] methods. On the other hand, the convex maximization problem can have exponential worst-case complexity, but it can be solved to local or even global optimality with standard commercial solvers e.g. with IBM CPLEX.
Finally, we formally show that determining if (P) is feasible is NP-complete: Proof Determining if (F) has a solution is NP since we can easily check whether a candidate point x is feasible for (F). Furthermore, it can be decomposed into the following two independent problems: (i) "is there a point in the polytope Ax ≤ b such that x 2 ≤ r max ?" and (ii) "is there a point in the polytope Ax ≤ b such that x 2 ≥ r min ?" Problem (i) can be answered in polynomial time with e.g. interior point methods. Thus determining if (F) has a solution is reducible to problem (ii) which is known to be NP-complete [26,Problem 3].

Applications and Experiments
In this section we present numerical results of the suggested active-set algorithms on a range of numerical optimization problems. A Julia implementation of the algorithm, along with code for the generation of the results of this section is available at: https://github.com/oxfordcontrol/QPnorm.jl As described in the previous sections, the algorithm is based on a TRS solver, and a general (nonconvex) Quadratic Programming solver. We implemented these dependencies as separate packages that we also release publicly as listed below.
-TRS.jl: A Julia package for the computation of global and local-nonglobal minimizers of essentially implementing in Julia the theoretical results of Section 2 and [1]. Available at: https://github.com/oxfordcontrol/TRS.jl -GeneralQP.jl: A Julia implementation of [13], i.e. an "inertia controlling" active-set solver for nonconvex, dense quadratic problems, with efficient and numerically stable routines for updating QR decompositions of the working set and LDLt factorizations of the projected Hessian. Available at: https://github.com/oxfordcontrol/GeneralQP.jl This solver is useful as a part of the suggested algorithm for solving (P) according to the remarks of §3.2.
For simplicity our implementations of the active-set algorithms are based on dense linear algebra that uses a QR factorization to compute/update a nullspace basis for every working set. Thus the presented results are limited to dense problems, except for §4.3 where the special structure of the constraints of (32) results in trivial, sparse QR factorizations of the nullspace bases.
Before presenting the results we will discuss some practical considerations regarding the suggested active-set algorithm. The eigenproblems (4) associated with the trust-region subproblems of each working set are solved with ARPACK [24]. In the occasional cases where ARPACK fails, a direct eigensolver is used. Finally, before solving the subproblem (SP) of every iteration, we perform a few projected gradient steps with the hope to quickly activate a new constraint.
We now proceed with the results, starting with the simplest case of dense random problems.

Random Dense Constant Norm QPs
We compared the performance of our algorithm against the state-of-the-art non-linear solver Ipopt [33] with its default parameters. We use the Julia interface of Ipopt, Ipopt.jl, which exhibits negligible interface overhead times due to the excellent interfacing capabilities of Julia with C++.
We consider a set of these randomly generated problems with varying number of variables n. A feasible point for each problem instance is calculated for our Algorithm as described in Section 3.2.1. The time required to compute the initial feasible points is included in the subsequent results. Figure 3 (left) shows the execution times. We observe that our algorithm is significantly faster than Ipopt by a factor of up to 50. Both of the solvers achieve practically identical objectives (relative difference less than 10 −9 ) in all of the problem instances, except in two cases where there is a considerable difference due to the fact that the solvers ended up converging in different local minimizers. Finally, Figure 3 (right) shows the infeasibility of the returned solution, where we observe that our solver returns solutions of significantly smaller (i.e. better) infeasibility. Fig. 3 Timing results (left) and maximum feasibility violation (right) of the solution of random problems with varying size n generated with the following parameters P = Symmetric(randn(n, n)), q = randn(n), A = randn(1.5n, n), b = randn(n) and r = 100. Hardware used: Intel E5-2640v3, 64GB memory. The maximum feasibility violation is defined as max((Ax

Infeasibility
Active Set Ipopt

Computing search directions for Sequential Quadratic Programming
Sequential Quadratic Programming (SQP) is a powerful and popular algorithm that aims to solve the problem where g : R n → R and h : R n → R m are general (nonlinear) functions. At every iteration, SQP minimizes the nonlinear objective over some search directions. These search directions can, for example, be obtained by the optimization of a quadratic approximation of the original objective subject to some linear approximation of the original constraints. It is common to introduce a norm constraint δx 2 ≤ r that ensures that the solution will be inside a confidence or trust region of this approximation thus resulting in a problem of the form (P). This is particularly useful at points where the Hessian of the nonlinear objective is singular or indefinite as the search directions computed by the respective QPs might be unrealistically large or unbounded. Typically, SQP algorithms solve problems simpler than (P) by e.g. not including the inequality constraints (P) in explicit form for the calculation of the search directions [27, §18.2]. We will show, however, that our solver is capable of computing search directions from (P) directly, without the need for these simplifications.
We demonstrate this on all the problems of the CUTEst collection [17] that have linear constraints and number of variables less than 2000. For each of these problems we consider the quadratic approximation where x 0 is the closest feasible point (in the 2-norm) to the initial point suggested by CUTEst, which we compute with GUROBI [19], and δx := x − x 0 .
We discard problems where GUROBI fails to calculate an initial feasible point. Furthermore, we do not consider problems for which (29) is convex, since these problems can be solved to global optimality in polynomial time with standard solvers such as MOSEK or COSMO.jl [11]. Finally, some problems in the CUTEst collection are parametric; if the default parameters lead to a problem with more than 2000 variables then we choose a parameter that, if possible, leads to a number of variables close to, but not exceeding, 2000. Tables 4-6 list all the 63 problems considered, along with any non-default parameters.
We compare the performance of our algorithm against Ipopt with its default options. For some problems, Ipopt terminates without indication of failure but returns a low quality solution. We consider a problem as "solved" using the same criteria as [8]. That is, we require that the overall error of the KKT conditions is less than 10 −4 , defined as and κ, µ are the Lagrange multipliers corresponding to the linear inequality and norm constraints of (29). Figure 4 shows the performance profile for the problems considered. The performance profile suggests that our algorithm significantly outperforms Ipopt on this set of problems especially in reliability, in the sense that we consistently obtain high quality solutions. Note that unlike the dense implementation of our solver, Ipopt can exploit the sparsity of the problems efficiently. Further speedups can be brought to our algorithm with a sparse implementation which might allow its use in large scale sparse problems. Moreover, as an active-set solver, our algorithm can efficiently exploit warm starting, which might be highly desirable in SQP where repeated solution of problems of the form (29) is required as part of the SQP procedure for minimizing (28). This is in contrast to Ipopt whose Interior Point nature makes warm starting very difficult to exploit.

Sparse Principal Component Analysis
Principal Component Analysis (PCA) is a standard, widely used dimensionality reduction algorithm. It has applications in numerous fields, including statistics, machine learning, bioinformatics, genetics, meteorology and others. Given a k × n data matrix D that consists of k points of n variables, PCA suggests a few linear combinations of these variables, which we call principal vectors, that explain as much variance of the data as possible. Standard PCA amounts to the following problem where Σ is the covariance matrix of the data. We will assume for the rest of this section that the data matrix D is "centered", i.e. that it has column-wise zero mean. We can then consider Σ to be the empirical covariance matrix Σ := D T D k−1 ∈ S n + . The above problem is essentially an eigenvalue problem on Σ or a singular value problem on D and, as such, can be solved with standard linear algebra tools.
In general, each principal vector is a linear combination of all the variables, i.e. typically all components of the principal vectors are non-zero. This can pose an issue of interpretability of the reduced dimensions, as it is often desirable to express the principal vector as a combination of a few variables, especially when the variables are associated with a user interpretable meaning. To alleviate this problem, sparsity has to be enforced in the original PCA problem, resulting in a new optimization problem that aims to identify a small set of variables the linear combination of which will hopefully still be able to explain a significant proportion of the variance of the data.
Unfortunately, enforcing sparsity in PCA results in a combinatorial optimization problem. Various remarkably efficient and scalable heuristics have been suggested to avoid this intractability [23], [34]. We will focus on one of the most popular heuristics that is based on a lasso type constraint originally introduced by [22], resulting in the following optimization problem: where x ∈ R n is the decision variable and γ ≥ 1 is a parameter that controls the sparsity of the solution.
We will now show how Problem (31) can be solved with our algorithm. Note that (31) is not in the standard norm-constrained form (P) that we address. However, we show in the Appendix that it is equivalent to where [w T 1 w T 2 ] := w T ∈ R 2n , in the sense that (31) and (32) have the same optimal value and every minimizer of (32)w defines a minimizerx =w 1 −w 2 for (31). Also, the optimalw has the same cardinality as the respective optimal x =w 1 −w 2 .
The reader might think that solving (32) with an active-set method requires a solve time that is polynomial in the number of variables, thus limiting its scalability as compared to simpler gradient based methods that are scalable but might have inferior accuracy and, potentially, weaker convergence guarantees. This is not true however, since our algorithm takes advantage of the sparsity of the iterates. Indeed, excluding the lasso constraint, the working set W k of Algorithm 2 applied in Problem (32) can be interpreted as the set of variables w i that are fixed to zero. Thus every subproblem can be trivially reduced to k dimensions, where k is the number of variables w i that are not included in the current working set. This property is particularly attractive when the user is interested in a very sparse solution and has been recognized in the literature on 1 -penalized active-set algorithms [3, §6].
We will demonstrate the scalability and flexibility of our algorithm by applying it in a large "bag of words" dataset from articles of the New York Times newspaper obtained by the University of California, Irvine (UCI) Machine Learning Repository 7 . In this dataset, every row of the associated data matrix D corresponds to a word used in any of the documents and every column to a document. The entries of the matrix are the number of occurrences of a word in a document. The dataset contains m ≈ 300, 000 articles that consist of n ≈ 102, 660 words (or keywords) with ≈ 70 million non-zeros in the dataset matrix D. We seek to calculate a few sparse principal vectors, each of which will hopefully correspond to a user-interpretable category, like politics, economics, the arts etc.
The usefulness of Sparse PCA on this dataset has been already demonstrated in the literature. Both [28] and [35] have generated a set of five sparse principal vectors, each of which is a linear combination of five words. The results of [28] are listed in Table 1. The five resulting principal vectors have distinct, user interpretable associated meanings that correspond to sports, economics, politics, education, and US foreign policy. The resulting words can be used to create user interpretable categories of the documents [35]. However, to TPower: our algorithm runs in 5.22 seconds and has a resulting variance equal to 58.23. However, our algorithm is more general, in the sense that it can solve any problem of the form (P) which, as we proceed to show, allows for variants of (31) that can generate principal vectors with more meaningful interpretation. Indeed, improving interpretability of the results of Sparse PCA can be achieved by the incorporation of additional constraints in the underlying optimization problem. Unfortunately, most of the algorithms that are specific for sparse PCA do not allow for the presence of extra constraints. An exception is perhaps GPower of [23], which is a generic framework for minimizing a concave function f over a compact subset F. The iterates of GPower are generated as where f (·) is a subgradient of f (·). [23] showed that for the case of Sparse PCA of (31) the above optimization problem reduces to two matrix multiplications with D and D T . However, in the presence of general linear inequality constraints, solving (33) can be at least as hard as solving a linear program (LP), which is obviously considerably more computationally demanding than two matrix multiplications. In contrast, due to the generality of our approach, imposing any linear constraint is straightforward for our algorithm. To demonstrate this, we impose a non-negativity constraint on the elements of the principal vector to avoid cases where the principal vectors consist of elements that have opposing meanings which happens, for example, in the results of TPower listed in Table 2. Table 3 lists five sparse principal vectors obtained with this approach which take 24.37 seconds to compute on a standard Laptop with an Intel i7-5557U CPU and 8GB of memory. Note that, unlike the results of Table 2, the principal vectors of Table 3 have a clear associated meaning that can be labeled as sports, economics, politics, US foreign policy and education. These words could for example be used for organizing the documents in a user interpretable way [35].

Conclusion
This paper introduced an active-set algorithm for the solution of quadratic functions subject to linear constraints and a single norm constraint. The suggested algorithm is based on repeated solutions of the TRS, for which we derived novel theoretical results regarding its local-nonglobal minimizer. The usefulness of the suggested algorithm was demonstrated in a range of real world experiments. Table 3 Results of applying non-negative sparse PCA on the NYtimes dataset with the suggested active-set algorithm. Each column lists words that correspond to nonzero entries in the respective principal vectors. The words are sorted in decreasing magnitude of the respective weights. Given a set of computed principal vectors {x i }, the next principal vector is computed on the deflated data matrix D− (Dx i )x T i . The algorithm is initialized with the 30 most positive, or most negative, entries (depending on which one leads to higher explained variance) of the first singular vector of the respective matrix. Furthermore we perform a standard post-processing polishing step at each resulting principal vector, as described in [23, §4.2]. Note that, unlike Table 2 our algorithm results in principal vectors that have clear, user interpretable associated meanings that corresponds to sports, economics, politics, education, and US foreign policy. These words could be used for document classification [35]. Moreover, for any minimizer (w 1 ,w 2 ) of (32), choosingx =w 1 −w 2 gives f (x) = g([w 1 ;w 2 ]) withx feasible for (31) as we proceed to show. Indeed, note thatw T 1w 2 ≥ 0 and w 1 2 2 + w 2 2 2 = 1 thus and Assume thatx is not a minimizer for (31). Then, there exists anx with x −x ∞ arbitrarily small such that f (x) > f (x). Thus (w 1 ,w 2 ) := (x + , −x − ) is feasible for (32) with g([w 1 ;w 2 ]) = f (x) < f (w) = g([w 1 ;w 2 ])) and i.e. (w 1 ,w 2 ) is arbitrarily close to (w 1 ,w 2 ). This contradicts the assumption that (w 1 ,w 2 ) is a local minimizer for (32). Finally, since the optimal value of (32) is less than the one of (31) and the global optimum (w 1 ,w 2 ) of (32) defines a feasiblex for (31) with g([w 1 ;w 2 ]) = f (x) we conclude that the optimal values of (31) and (32) coincide.
Furthermore, the following complementarity condition holds which implies that the sparsity of a solutionw of (32) is equal to the sparsity of the solutionx =w 1 −w 2 for Problem (31).
Finally, we present detailed comparison results of our algorithm against Ipopt on the problems described in Section 4.2.  Table 5 Continuation of Table 4 Problem