An augmented Lagrangian method for optimization problems with structured geometric constraints

This paper is devoted to the theoretical and numerical investigation of an augmented Lagrangian method for the solution of optimization problems with geometric constraints. Specifically, we study situations where parts of the constraints are nonconvex and possibly complicated, but allow for a fast computation of projections onto this nonconvex set. Typical problem classes which satisfy this requirement are optimization problems with disjunctive constraints (like complementarity or cardinality constraints) as well as optimization problems over sets of matrices which have to satisfy additional rank constraints. The key idea behind our method is to keep these complicated constraints explicitly in the constraints and to penalize only the remaining constraints by an augmented Lagrangian function. The resulting subproblems are then solved with the aid of a problem-tailored nonmonotone projected gradient method. The corresponding convergence theory allows for an inexact solution of these subproblems. Nevertheless, the overall algorithm computes so-called Mordukhovich-stationary points of the original problem under a mild asymptotic regularity condition, which is generally weaker than most of the respective available problem-tailored constraint qualifications. Extensive numerical experiments addressing complementarity- and cardinality-constrained optimization problems as well as a semidefinite reformulation of MAXCUT problems illustrate the power of our approach.


Introduction
We consider the program where W and Y are Euclidean spaces, i.e., real and finite-dimensional Hilbert spaces, f : W → R and G : W → Y are continuously differentiable, C ⊂ Y is nonempty, closed, and convex, whereas the set D ⊂ W is only assumed to be nonempty and closed. This setting is very general and covers, amongst others, standard nonlinear programs, second-order cone and, more generally, conic optimization problems [14,24], as well as several so-called disjunctive programming problems like mathematical programs with complementarity, vanishing, switching, or cardinality constraints, see [15,16,29,55] for an overview and suitable references. Since W and Y are Euclidean spaces, our model also covers matrix optimization problems like semidefinite programs or low-rank approximation problems [52]. The aim of this paper is to apply a (structured) augmented Lagrangian technique to (P) in order to find suitable stationary points. The augmented Lagrangian or multiplier penalty method is a classical approach for the solution of nonlinear programs, see [17] as a standard reference. The more recent book [18] presents a slightly modified version of this classical augmented Lagrangian method, which uses a safeguarded update of the Lagrange multipliers and has stronger global convergence properties. In the meantime, this safeguarded augmented Lagrangian method has also been applied to a number of optimization problems with disjunctive constraints, see e.g. [5,33,42,44,60].
Since, to the best of our knowledge, augmented Lagrangian methods have not yet been applied to the general problem (P) with general nonconvex D and arbitrary convex sets C in the setting of Euclidean spaces, and in order to get a better understanding of our contributions, let us add some comments regarding the existing results for the probably most prominent non-standard optimization problem, namely the class of mathematical programs with complementarity constraints (MPCCs). Due to the particular structure of the feasible set, the usual Karush-Kuhn-Tucker (KKT for short) conditions are typically not satisfied at a local minimum. Hence, other (weaker) stationarity concepts have been proposed, like C-(abbreviating Clarke) and M-(for Mordukhovich) stationarity, with M-stationarity being the stronger concept. Most algorithms (regularization, penalty, augmented Lagrangian methods etc.) for the solution of MPCCs solve a sequence of standard nonlinear programs, and their limit points are typically C-stationary points only. Some approaches can identify M-stationary points if the underlying nonlinear programs are solved exactly, but they lose this desirable property if these programs are solved only inexactly, see the discussion in [46] for more details.
The idea of reference [9] is to solve the subproblems such that both a first- and a second-order necessary optimality condition hold inexactly at each iteration, i.e., satisfaction of the second-order condition is the central point here, which, obviously, causes some overhead for the subproblem solver and usually excludes the application of this approach to large-scale problems. The paper [60] proves convergence to M-stationary points by solving some complicated subproblems, but for the latter no method is specified. Finally, the recent approach described in [33] provides an augmented Lagrangian technique for the solution of MPCCs where the complementarity constraints are kept as constraints, whereas the standard constraints are penalized. The authors present a technique which computes a suitable stationary point of these subproblems in such a way that the entire method generates M-stationary accumulation points for the original MPCC. Let us also mention that [36] suggests to solve (a discontinuous reformulation of) the M-stationarity system associated with an MPCC by means of a semismooth Newton-type method. Naturally, this approach should be robust with respect to (w.r.t.) an inexact solution of the appearing Newton-type equations, although this issue is not discussed in [36].
The present paper generalizes the idea from [33] to the much more general problem (P). In fact, a closer look at the corresponding proofs shows that the technique from [33] can be generalized using some relatively small modifications. This allows us to concentrate on some additional new contributions. In particular, we prove convergence to an M-type stationary point of the general problem (P) under a very weak sequential constraint qualification introduced recently in [53] for the general setting from (P). We further show that this sequential constraint qualification holds under the conditions for which convergence to M-stationary points of an MPCC is shown in [33]. Note that this is also the first algorithmic application of the general sequential stationarity and regularity concepts from [53].
The global convergence result for our method holds for the abstract problem (P) with geometric constraints without any further assumptions regarding the sets C and, in particular, D. Conceptually, we are therefore able to deal with a very large class of optimization problems. On the other hand, we use a projected gradient-type method for the solution of the resulting subproblems. Since this requires projections onto the (usually nonconvex) set D, our method can be implemented efficiently only if D is simple in the sense that projections onto D are easy to compute. For this kind of "structured" geometric constraints (this explains the title of this paper), the entire method is then both an efficient tool and applicable to large-scale problems. In particular, we show that this is the case for MPCCs, optimization problems with cardinality constraints, and some rank-constrained matrix optimization problems.
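To make the notion of "simple" D concrete, consider the cardinality-constrained case D = {x ∈ R^n : ‖x‖_0 ≤ s}: one projection of a point onto D is obtained by keeping its s largest entries in absolute value and zeroing the rest. The following sketch illustrates this; the function name and the tie-breaking behavior are our choices, and the projection onto this nonconvex set is generally not unique.

```python
import numpy as np

def project_cardinality(w, s):
    """Return one Euclidean projection of w onto {x : ||x||_0 <= s}:
    keep the s largest-magnitude entries, zero out the rest.
    Ties are broken arbitrarily (the projection need not be unique)."""
    v = np.zeros_like(w)
    if s <= 0:
        return v
    # indices of the s entries with largest absolute value
    idx = np.argpartition(np.abs(w), -s)[-s:]
    v[idx] = w[idx]
    return v
```

Projections onto complementarity sets or (via a truncated SVD) onto low-rank matrix sets admit similarly cheap recipes, which is exactly the structure exploited later in Section 5.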
The paper is organized as follows. We begin with restating some basic definitions from variational analysis in Section 2. There, we also relate the general regularity concept from [53] to the constraint qualification (the so-called relaxed constant positive linear dependence condition, RCPLD for short) used in the underlying paper [33] (as well as in many other related publications in this area). We then present the spectral gradient method for optimization problems over nonconvex sets in Section 3. This method is used to solve the resulting subproblems of our augmented Lagrangian method whose details are given in Section 4. Global convergence to M-type stationary points is also shown in this section. Since, in our augmented Lagrangian approach, we penalize the seemingly easy constraints G(w) ∈ C, but keep the condition w ∈ D explicitly in the constraints, we have to compute projections onto D. Section 5 therefore considers a couple of situations where this can be done in a numerically very efficient way. Extensive computational experiments for some of these situations are documented in Section 6. This includes MPCCs, cardinality-constrained (sparse) optimization problems, and a rank-constrained reformulation of the famous MAXCUT problem. We close with some final remarks in Section 7.
Notation. The Euclidean inner product of two vectors x, y ∈ R n will be denoted by x ⊤ y. More generally, x, y is used to represent the inner product of x, y ∈ W whenever W is some abstract Euclidean space. For brevity, we exploit x + A := A + x := {x + a | a ∈ A} for arbitrary vectors x ∈ W and sets A ⊂ W. The sets cone A and span A denote the smallest cone containing the set A and the smallest subspace containing A, respectively. Whenever L : W → Y is a linear operator between Euclidean spaces W and Y, L * : Y → W denotes its adjoint. For some continuously differentiable mapping ϕ : W → Y and some point w ∈ W, we use ϕ ′ (w) : W → Y in order to denote the derivative of ϕ at w which is a continuous linear operator. In the particular case Y := R, we set ∇ϕ(w) := ϕ ′ (w) * 1 ∈ W for brevity.

Preliminaries
We first recall some basic concepts from variational analysis in Section 2.1, and then introduce and discuss general stationarity and regularity concepts for the abstract problem (P) in Section 2.2.

Fundamentals of Variational Analysis
In this section, we comment on the tools of variational analysis which will be exploited in order to describe the geometry of the closed, convex set C ⊂ Y and the closed (but not necessarily convex) set D ⊂ W which appear in the formulation of (P).
The Euclidean projection P_C : Y → Y onto the closed, convex set C is given by P_C(y) := argmin_{c ∈ C} ‖y − c‖. Given w̄ ∈ D, the closed cone N_D^lim(w̄) := limsup_{w → w̄} cone(w − P_D(w)), where P_D : W ⇒ W denotes the (possibly set-valued) projection onto D, is referred to as the limiting normal cone to D at w̄, see [58,63] for other representations and properties of this variational tool. Above, we used the notion of the outer (or upper) limit of a set-valued mapping at a certain point, see e.g. [63, Definition 4.1]. For w ∉ D, we set N_D^lim(w) := ∅. Note that the limiting normal cone depends on the inner product of W and is stable in the sense that N_D^lim(w̄) = limsup_{w → w̄} N_D^lim(w) (2.1) holds. This stability property, which might be referred to as outer semicontinuity of the set-valued operator N_D^lim : W ⇒ W, will play an essential role in our subsequent analysis. The limiting normal cone to the convex set C coincides with the standard normal cone from convex analysis, i.e., for ȳ ∈ C, we have N_C^lim(ȳ) = {η ∈ Y | ⟨η, y − ȳ⟩ ≤ 0 for all y ∈ C} =: N_C(ȳ). For points y ∉ C, we set N_C(y) := ∅ for formal completeness. Note that the stability property (2.1) is also satisfied by the set-valued operator N_C : Y ⇒ Y.
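For intuition on projections onto a nonconvex set, consider the complementarity set D = {(a, b) ∈ R² : a ≥ 0, b ≥ 0, ab = 0}. A short case distinction shows that the nearest point to (x, y) is either (max(x, 0), 0) or (0, max(y, 0)), whichever is closer. The sketch below (function name and the tie-breaking rule on the diagonal x = y are our choices) applies this componentwise to vectors of pairs:

```python
import numpy as np

def project_complementarity(x, y):
    """Componentwise projection of the pairs (x_i, y_i) onto
    {(a, b) : a >= 0, b >= 0, a*b = 0}. For each pair the candidates are
    (max(x,0), 0) and (0, max(y,0)); comparing squared distances shows that
    the first is closer exactly when x >= y. Ties (x = y > 0) are broken
    towards the first candidate; the projection is not unique there."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.where(x >= y, np.maximum(x, 0.0), 0.0)
    b = np.where(x >= y, 0.0, np.maximum(y, 0.0))
    return a, b
```

Note that the set-valuedness of P_D on the "diagonal" is precisely the phenomenon that makes N_D^lim a genuinely nonconvex cone at the origin.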

Stationarity and Regularity Concepts
Noting that the abstract set D is generally nonconvex in the exemplary settings we have in mind, the so-called concept of Mordukhovich-stationarity, which exploits limiting normals to D, is a reasonable concept of stationarity which addresses (P).
Definition 2.1. Let w̄ ∈ W be feasible for the optimization problem (P). Then w̄ is called an M-stationary point (Mordukhovich-stationary point) of (P) if there exists a multiplier λ ∈ Y such that 0 ∈ ∇f(w̄) + G′(w̄)*λ + N_D^lim(w̄) and λ ∈ N_C(G(w̄)). Note that this definition coincides with the usual KKT conditions of (P) if the set D is convex. An asymptotic counterpart of this definition is the following one, see [53].
Definition 2.2. Let w̄ ∈ W be feasible for the optimization problem (P). Then w̄ is called an AM-stationary point (asymptotically M-stationary point) of (P) if there exist sequences {w_k}, {ε_k} ⊂ W and {λ_k}, {z_k} ⊂ Y such that w_k → w̄, ε_k → 0, z_k → 0, as well as ε_k ∈ ∇f(w_k) + G′(w_k)*λ_k + N_D^lim(w_k) and λ_k ∈ N_C(G(w_k) − z_k) for all k ∈ N. The definition of an AM-stationary point is similar to the notion of an AKKT (asymptotic or approximate KKT) point in standard nonlinear programming, see [18], but requires some explanation: The meanings of the iterates w_k and the Lagrange multiplier estimates λ_k should be clear. The vector ε_k measures the inexactness by which the stationarity conditions are satisfied at w_k and λ_k. The vector z_k does not occur (at least not explicitly) in the context of standard nonlinear programs, but is required here for the following reason: The method to be considered in this paper generates a sequence {w_k} satisfying w_k ∈ D, while the constraint G(w) ∈ C gets penalized; hence, the condition G(w_k) ∈ C will typically be violated. Consequently, the corresponding normal cone N_C(G(w_k)) would be empty, which is why we cannot expect to have λ_k ∈ N_C(G(w_k)), though we hope that this holds asymptotically. In order to deal with this situation, we therefore have to introduce the sequence {z_k}. Let us note that AM-stationarity corresponds to so-called AKKT stationarity for conic optimization problems, i.e., where C is a closed, convex cone and D := W, see [3, Section 5]. The more general situation where C and D are closed, convex sets and the overall problem is stated in arbitrary Banach spaces is investigated in [20]. Asymptotic notions of stationarity addressing situations where D is a nonconvex set of special type can be found, e.g., in [5,45,60]. As shown in [53], the overall concept of asymptotic stationarity can be further generalized to feasible sets which are given as the kernel of a set-valued mapping.
Let us mention that the theory in this section is still valid in situations where C is merely closed. In this case, one may replace the normal cone to C in the sense of convex analysis by the limiting normal cone everywhere. However, for nonconvex sets C, our algorithmic approach from Section 4 is not valid anymore. Note that, for the price of a slack variable w_s ∈ Y, we can transfer the given constraint system into G(w) − w_s = 0, (w, w_s) ∈ D × C, where the right-hand side of the nonlinear constraint is trivially convex. In order to apply the algorithmic framework of this paper to this reformulation, projections onto C have to be computed efficiently. Moreover, there might be a difference between the asymptotic notions of stationarity and regularity discussed here when applied to this reformulation or the original formulation of the constraints.
Apart from the aforementioned difference, the motivation of AM-stationarity is similar to the one of AKKT-stationarity: Suppose that the sequence {λ k } is bounded and, therefore, convergent along a subsequence. Then, taking the limit on this subsequence in the definition of an AM-stationary point while using the stability property (2.1) of the limiting normal cone shows that the corresponding limit point satisfies the M-stationarity conditions from Definition 2.1. In general, however, the Lagrange multiplier estimates {λ k } in the definition of AM-stationarity might be unbounded. Though this boundedness can be guaranteed under suitable (relatively strong) assumptions, the resulting convergence theory works under significantly weaker conditions.
It is well known in optimization theory that a local minimizer of (P) is M-stationary only under validity of a suitable constraint qualification. In contrast, it has been pointed out in [53, Theorem 4.2, Section 5.1] that each local minimizer of (P) is AM-stationary. In order to infer that an AM-stationary point is already M-stationary, the presence of so-called asymptotic regularity is necessary, see [53, Definition 4.4]. Definition 2.3. A feasible point w̄ ∈ W of (P) is called AM-regular (asymptotically Mordukhovich-regular) whenever the condition limsup_{w → w̄, z → 0} M(w, z) ⊂ M(w̄, 0) holds, where M(w, z) := G′(w)* N_C(G(w) − z) + N_D^lim(w). The concept of AM-regularity has been inspired by the notion of AKKT-regularity (sometimes referred to as cone continuity property), which became popular as one of the weakest constraint qualifications for standard nonlinear programs or MPCCs, see e.g. [7,8,60], and can be generalized to a much higher level of abstraction. In this regard, we would like to point the reader's attention to the fact that AM-stationarity and -regularity from Definitions 2.2 and 2.3 are referred to as decoupled asymptotic Mordukhovich-stationarity and -regularity in [53] since these are already refinements of more general concepts. For the sake of a concise notation, however, we omit the term decoupled here.
It has been shown in [53, Section 5.1] that validity of AM-regularity at a feasible point w̄ ∈ W of (P) is implied by the condition λ ∈ N_C(G(w̄)), 0 ∈ G′(w̄)*λ + N_D^lim(w̄) ⟹ λ = 0. (2.2) The latter is known as NNAMCQ (no nonzero abnormal multiplier constraint qualification) or GMFCQ (generalized Mangasarian-Fromovitz constraint qualification) in the literature. Indeed, in the setting where we fix C := R^{m_1}_− × {0}^{m_2} and D := W, (2.2) boils down to the classical Mangasarian-Fromovitz constraint qualification from standard nonlinear programming. The latter choice for C will be of particular interest, which is why we formalize this setting below.
Setting 2.4. Given m_1, m_2 ∈ N, we set m := m_1 + m_2, Y := R^m, and C := R^{m_1}_− × {0}^{m_2}. No additional assumptions are postulated on the set D. We denote the component functions of G by G_1, …, G_m : W → R. Thus, the constraint G(w) ∈ C encodes the constraint system G_i(w) ≤ 0 (i = 1, …, m_1), G_i(w) = 0 (i = m_1 + 1, …, m). Setting 2.5. Let X be another Euclidean space, let X ⊂ X be the union of finitely many convex, polyhedral sets, and let T ⊂ R² be the union of two polyhedrons T_1, T_2 ⊂ R². For functions g : X → R^{m_1}, h : X → R^{m_2}, and p, q : X → R^{m_3}, we consider the constraint system given by g(x) ≤ 0, h(x) = 0, (p_i(x), q_i(x)) ∈ T (i = 1, …, m_3), x ∈ X. Definition 2.6. Let w̄ ∈ W be a feasible point of the optimization problem (P) in Setting 2.4. Here, I(w̄) := {i ∈ {1, …, m_1} | G_i(w̄) = 0} denotes the set of active inequality constraint indices, and J := {m_1 + 1, …, m} the set of equality constraint indices. Then w̄ is said to satisfy RCPLD whenever the following conditions hold: (i) the family (∇G_i(w̄))_{i∈J} has constant rank on a neighborhood of w̄, (ii) there exists an index set S ⊂ J such that the family (∇G_i(w̄))_{i∈S} is a basis of the subspace span{∇G_i(w̄) | i ∈ J}, and (iii) for each index set I ⊂ I(w̄), each set of multipliers λ_i ≥ 0 (i ∈ I) and λ_i ∈ R (i ∈ S), not all vanishing at the same time, and each vector η ∈ N_D^lim(w̄) which satisfy 0 = Σ_{i∈I∪S} λ_i ∇G_i(w̄) + η, we find neighborhoods U of w̄ and V of η such that for all w ∈ U and η̄ ∈ N_D^lim(w) ∩ V, the vectors from the family (∇G_i(w))_{i∈I∪S} together with η̄ are linearly dependent. RCPLD has been introduced for standard nonlinear programs (i.e., D := W = R^n in Setting 2.4) in [4]. Some extensions to complementarity-constrained programs can be found in [27,34]. A more restrictive RCPLD-type constraint qualification which is capable of handling an abstract constraint set can be found in [35, Definition 1]. Let us note that RCPLD from Definition 2.6 does not depend on the precise choice of the index set S in (ii).
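In Setting 2.4, the projection onto C = R^{m_1}_− × {0}^{m_2} is available in closed form: clip the first m_1 components at zero from above and annihilate the remaining ones. A minimal sketch (the function name is our choice):

```python
import numpy as np

def proj_C(y, m1):
    """Projection onto C = R^{m1}_- x {0}^{m2} from Setting 2.4:
    the first m1 components are projected onto the nonpositive reals,
    the remaining m2 components onto {0}."""
    out = np.minimum(y, 0.0)  # componentwise projection onto R_-
    out[m1:] = 0.0            # equality part: project onto {0}
    return out
```

This closed form is what makes the distance-to-C terms in the augmented Lagrangian of Section 4 cheap to evaluate for standard equality and inequality constraints.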
In case where D is a set of product structure, condition (iii) in Definition 2.6 can be slightly weakened in order to obtain a reasonable generalization of the classical relaxed constant positive linear dependence constraint qualification, see [67, Remark 1.1] for details. Observing the particular form which GMFCQ from (2.2) takes in Setting 2.4, it is obviously sufficient for RCPLD. The subsequently stated result generalizes related observations from [7,60].
Lemma 2.7. Letw ∈ W be a feasible point for the optimization problem (P) in Setting 2.4 where RCPLD holds. Thenw is AM-regular.
Proof. Fix ξ ∈ limsup_{w → w̄, z → 0} (G′(w)* N_C(G(w) − z) + N_D^lim(w)). Particularly, there are sequences {w_k} ⊂ W, {z_k} ⊂ R^m, and {ξ_k} ⊂ W with w_k → w̄, z_k → 0, and ξ_k → ξ such that, for each k ∈ N, ξ_k = Σ_{i=1}^m λ_i^k ∇G_i(w_k) + η_k for some multiplier λ^k ∈ N_C(G(w_k) − z_k) and some η_k ∈ N_D^lim(w_k). The structure of C and continuity of G yield λ_i^k = 0 for all i ∈ {1, …, m_1} \ I(w̄) and all sufficiently large k ∈ N, i.e., ξ_k = Σ_{i∈I(w̄)∪J} λ_i^k ∇G_i(w_k) + η_k for sufficiently large k ∈ N. Thus, we may assume without loss of generality that this representation holds for all k ∈ N. By definition of RCPLD, (∇G_i(w_k))_{i∈S} is a basis of the subspace span{∇G_i(w_k) | i ∈ J} for all sufficiently large k ∈ N. Hence, there exist scalars μ_i^k (i ∈ S) such that ξ_k = Σ_{i∈I(w̄)} λ_i^k ∇G_i(w_k) + Σ_{i∈S} μ_i^k ∇G_i(w_k) + η_k holds for all sufficiently large k ∈ N. On the other hand, [4, Lemma 1] yields the existence of an index set I_k ⊂ I(w̄) and multipliers μ̄_i^k > 0 (i ∈ I_k), μ̄_i^k ∈ R (i ∈ S), and σ_k ≥ 0 such that ξ_k = Σ_{i∈I_k∪S} μ̄_i^k ∇G_i(w_k) + σ_k η_k and σ_k > 0 ⟹ (∇G_i(w_k))_{i∈I_k∪S}, η_k linearly independent, σ_k = 0 ⟹ (∇G_i(w_k))_{i∈I_k∪S} linearly independent.
Since there are only finitely many subsets of I(w̄), there needs to exist I ⊂ I(w̄) such that I_k = I holds along a whole subsequence. Along such a particular subsequence (without relabeling), we furthermore may assume σ_k > 0 (otherwise, the proof will be easier) and, thus, may set η̄_k := σ_k η_k ∈ N_D^lim(w_k) \ {0}. From above, we find linear independence of (∇G_i(w_k))_{i∈I∪S}, η̄_k. Furthermore, we have ξ_k = Σ_{i∈I∪S} μ̄_i^k ∇G_i(w_k) + η̄_k. (2.3) Suppose that the sequence {((μ̄_i^k)_{i∈I∪S}, η̄_k)} is not bounded. Dividing (2.3) by the norm of ((μ̄_i^k)_{i∈I∪S}, η̄_k), taking the limit k → ∞, and respecting boundedness of {ξ_k}, continuity of G′, and outer semicontinuity of the limiting normal cone yield the existence of a non-vanishing multiplier ((μ̄_i)_{i∈I∪S}, η̄) which satisfies μ̄_i ≥ 0 (i ∈ I), η̄ ∈ N_D^lim(w̄), and 0 = Σ_{i∈I∪S} μ̄_i ∇G_i(w̄) + η̄.
Obviously, the multipliers μ̄_i (i ∈ I ∪ S) do not vanish at the same time since, otherwise, η̄ = 0 would follow from above, which yields a contradiction. Now, validity of RCPLD guarantees that the vectors (∇G_i(w_k))_{i∈I∪S}, η̄_k need to be linearly dependent for sufficiently large k ∈ N. However, we have already shown above that these vectors are linearly independent, a contradiction. Thus, the sequence {((μ̄_i^k)_{i∈I∪S}, η̄_k)} is bounded and, therefore, possesses a convergent subsequence with limit ((μ̄_i)_{i∈I∪S}, η̄). Taking the limit in (2.3) while respecting ξ_k → ξ, the continuity of G′, and the outer semicontinuity of the limiting normal cone, we come up with μ̄_i ≥ 0 (i ∈ I), η̄ ∈ N_D^lim(w̄), and ξ = Σ_{i∈I∪S} μ̄_i ∇G_i(w̄) + η̄. Consequently, ξ ∈ G′(w̄)* N_C(G(w̄)) + N_D^lim(w̄), which shows that w̄ is AM-regular.
A popular situation where AM-regularity simplifies and, thus, becomes easier to verify is described in the following lemma which follows from [53, Theorems 3.10, 5.2]. Lemma 2.8. Let w̄ ∈ W be a feasible point for the optimization problem (P) where C is a polyhedron and D is the union of finitely many polyhedrons. Then w̄ is AM-regular if and only if limsup_{w → w̄} (G′(w)* N_C(G(w̄)) + N_D^lim(w̄)) ⊂ G′(w̄)* N_C(G(w̄)) + N_D^lim(w̄). Particularly, in case where G is an affine function, w̄ is AM-regular.
Let us consider the situation where (P) is given as described in Setting 2.4, and assume in addition that D := W holds, i.e., that (P) is a standard nonlinear optimization problem with finitely many equality and inequality constraints. Then Lemma 2.8 shows that AM-regularity corresponds to the cone continuity property from [7, Definition 3.1], and the latter has been shown to be weaker than most of the established constraint qualifications which can be checked in terms of initial problem data.
Let us specify these findings for MPCCs which can be stated in the form (P) via Setting 2.5. Taking Lemmas 2.8 and 2.9 into account, AM-regularity corresponds to the so-called MPCC cone continuity property from [60, Definition 3.9]. The latter has been shown to be strictly weaker than MPCC-RCPLD, see [60, Definition 4.1, Theorem 4.2, Example 4.3] for a definition and this result. A similar reasoning can be used in order to show that problem-tailored versions of RCPLD associated with other classes of disjunctive programs are sufficient for the respective AM-regularity. This, to some extent, recovers our result from Lemma 2.7, although we need to admit that, for example, RCPLD from Definition 2.6 applied to MPCCs in Setting 2.5 does not correspond to MPCC-RCPLD.
The above considerations underline that AM-regularity is a comparatively weak constraint qualification for (P). For example, for standard nonlinear problems and for MPCCs, this follows from the above comments and the considerations in [7,60]. For other types of disjunctive programs, the situation is likely to be similar, see e.g. [49, Figure 3] for the setting of switching-constrained optimization. It remains a topic of future research to find further sufficient conditions for AM-regularity which can be checked in terms of initial problem data, particularly in situations where C and D are of particular structure like in semidefinite or second-order cone programming, see e.g. [6, Section 6]. Let us mention that the provably weakest constraint qualification which guarantees that local minimizers of a geometrically constrained program are M-stationary is slightly weaker than validity of the pre-image rule for the computation of the limiting normal cone to the constraint region of (P), see [34, Section 3] for a discussion, but the latter cannot be checked in practice. Due to [53, Theorem 3.16], AM-regularity indeed implies validity of this pre-image rule.

A Spectral Gradient Method for Nonconvex Sets
In this section, we discuss a solution method for constrained optimization problems which applies whenever projections onto the feasible set are easy to find. For example, our method can be used in situations where the feasible set has a disjunctive nonconvex structure.
To motivate the method, first consider the unconstrained optimization problem min_{w ∈ R^n} ϕ(w) with a continuously differentiable objective function ϕ : R^n → R, and let w_j be a current estimate for a solution of this problem. Computing the next iterate w_{j+1} as the unique minimizer of the local quadratic model q_j(w) := ϕ(w_j) + ∇ϕ(w_j)^⊤(w − w_j) + (γ_j/2)‖w − w_j‖² for some γ_j > 0 leads to the explicit expression w_{j+1} := w_j − (1/γ_j)∇ϕ(w_j), i.e., we get a steepest descent method with stepsize t_j := 1/γ_j. Classical approaches compute t_j using a suitable stepsize rule such that ϕ(w_{j+1}) < ϕ(w_j). On the other hand, one can view the update formula as a special instance of a quasi-Newton scheme with the very simple quasi-Newton matrix B_j := γ_j I as an estimate of the (not necessarily existing) Hessian ∇²ϕ(w_j). Then the corresponding quasi-Newton equation B_{j+1}s_j = y_j with s_j := w_{j+1} − w_j and y_j := ∇ϕ(w_{j+1}) − ∇ϕ(w_j), see [28], reduces to the linear system γ_{j+1}s_j = y_j. Solving this overdetermined system in a least squares sense, we then obtain the stepsize parameter γ_{j+1} := (s_j^⊤ y_j)/(s_j^⊤ s_j) introduced by Barzilai and Borwein [10]. This stepsize often leads to very good numerical results, but may not yield a monotone decrease in the function value. A convergence proof for general nonlinear programs is therefore difficult, even if the choice of γ_j is safeguarded in the sense that it is projected onto some box [γ_min, γ_max] for suitable constants 0 < γ_min < γ_max. Raydan [61] then suggested to control this nonmonotone behavior by combining the Barzilai-Borwein stepsize with the nonmonotone linesearch strategy introduced by Grippo et al. [32]. This, in particular, leads to a global convergence theory for general unconstrained optimization problems.
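In floating-point practice, the Barzilai-Borwein parameter is computed together with the mentioned safeguard onto [γ_min, γ_max]. A minimal sketch follows; the fallback value for nonpositive curvature s_j^⊤ y_j ≤ 0 is a common heuristic of our choosing, not prescribed by the text.

```python
import numpy as np

def bb_parameter(s, y, gamma_min=1e-10, gamma_max=1e10, gamma_default=1.0):
    """Safeguarded Barzilai-Borwein parameter: the least-squares solution
    of gamma * s = y is gamma = <s, y> / <s, s>, which is then projected
    onto [gamma_min, gamma_max]. For nonpositive curvature <s, y> <= 0
    we fall back to a default value (an assumption of this sketch)."""
    sty = float(np.dot(s, y))
    sts = float(np.dot(s, s))
    if sts == 0.0 or sty <= 0.0:
        return gamma_default
    return min(max(sty / sts, gamma_min), gamma_max)
```

For a quadratic ϕ with Hessian A, the quotient ⟨s, y⟩/⟨s, s⟩ is a Rayleigh quotient of A, which explains the name "spectral" gradient method.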
This idea was then generalized by Birgin et al. [19] to constrained optimization problems min_{w ∈ W} ϕ(w) with a nonempty, closed, and convex set W ⊂ R^n and is called the nonmonotone spectral gradient method. Here, we extend their approach to the minimization problem min_w ϕ(w) subject to w ∈ D (3.1) with a continuously differentiable function ϕ : W → R and some nonempty, closed set D ⊂ W, where W is an arbitrary Euclidean space. Let us emphasize that neither ϕ nor D need to be convex in our subsequent considerations. A detailed description of the corresponding generalized spectral gradient method is given in Algorithm 3.1 (in each outer iteration j, inner iterations i generate trial points w_{j,i} by solving (Q(j, i)); once w_{j,i} is accepted, we set i_j := i, γ_j := γ_{j,i}, and w_{j+1} := w_{j,i}). Particular instances of this approach with nonconvex sets D can already be found in [13,25,26,33]. Note that all iterates belong to the set D, that the subproblems (Q(j, i)) are always solvable, and that we have to compute only one solution, although their solutions are not necessarily unique. We would like to emphasize that the gradient ∇ϕ(w_j) was used in the formulation of (Q(j, i)) in order to underline that Algorithm 3.1 is a projected gradient method. Indeed, simple calculations reveal that the global solutions of (Q(j, i)) correspond to the projections of w_j − γ_{j,i}^{-1}∇ϕ(w_j) onto D. Note also that the acceptance criterion in Line 8 is the nonmonotone Armijo rule introduced by Grippo et al. [32]. In particular, the parameter m_j := min(j, m) controls the nonmonotonicity. The choice m = 0 corresponds to the standard (monotone) method, whereas m > 0 typically allows larger stepsizes and often leads to faster convergence of the method.
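The scheme just described can be sketched as follows. Parameter names, the concrete safeguards, and the termination test are illustrative assumptions rather than the paper's exact listing, and proj_D is assumed to return one (not necessarily unique) projection onto the closed set D.

```python
import numpy as np

def spectral_gradient(phi, grad, proj_D, w0, m=10, gamma0=1.0, sigma=1e-4,
                      tau=2.0, gamma_min=1e-10, gamma_max=1e10,
                      tol=1e-6, max_iter=500):
    """Nonmonotone projected (spectral) gradient sketch in the spirit of
    Algorithm 3.1: projected gradient trial steps, a nonmonotone Armijo
    rule with memory m, and a safeguarded Barzilai-Borwein parameter."""
    w = proj_D(np.asarray(w0, float))
    values = [phi(w)]                       # memory for the nonmonotone rule
    gamma = gamma0
    for _ in range(max_iter):
        g = grad(w)
        gam = min(max(gamma, gamma_min), gamma_max)
        while True:                         # inner loop: increase gam until accepted
            w_new = proj_D(w - g / gam)     # one global solution of (Q(j, i))
            d = w_new - w
            ref = max(values[-(m + 1):])    # max over the last min(j, m)+1 values
            if phi(w_new) <= ref + sigma * float(np.dot(g, d)):
                break                       # nonmonotone Armijo rule accepted
            gam *= tau
        g_new = grad(w_new)
        # approximate M-stationarity test in the spirit of (3.4)
        if gam * np.linalg.norm(d) + np.linalg.norm(g_new - g) <= tol:
            return w_new
        s, y = d, g_new - g
        sty = float(np.dot(s, y))
        gamma = sty / float(np.dot(s, s)) if sty > 0 else gamma0  # BB update
        values = (values + [phi(w_new)])[-(m + 1):]
        w = w_new
    return w
```

With a convex D, this reduces to the classical nonmonotone spectral projected gradient method of Birgin et al.; the point of the generalization is that proj_D only has to deliver some projection, not a unique one.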
We stress that the previous generalization of existing spectral gradient methods plays a fundamental role in order to apply our subsequent augmented Lagrangian technique to several interesting and difficult optimization problems. The convergence analysis of Algorithm 3.1, however, can be carried out similarly to the one given in [33], where a more specific situation is discussed. We therefore skip the corresponding proofs in this section but, for the reader's convenience, present them in Appendix A.
The goal of Algorithm 3.1 is the computation of a point which is approximately M-stationary for (3.1). We recall that w is an M-stationary point of (3.1) if 0 ∈ ∇ϕ(w) + N_D^lim(w) holds, and that each locally optimal solution of (3.1) is M-stationary by [58, Theorem 6.1]. Similarly, since w_{j,i} solves the subproblem (Q(j, i)), it satisfies the corresponding M-stationarity condition 0 ∈ ∇ϕ(w_j) + γ_{j,i}(w_{j,i} − w_j) + N_D^lim(w_{j,i}). (3.2) Let us point the reader's attention to the fact that strong stationarity, where the limiting normal cone is replaced by the smaller regular normal cone in the stationarity system, provides a more restrictive necessary optimality condition for (3.1) and the surrogate (Q(j, i)), see [63, Definition 6.3, Theorem 6.12]. It is well known that the limiting normal cone is the outer limit of the regular normal cone. In contrast to the limiting normal cone, the regular one is not robust in the sense of (2.1), and since we are interested in taking limits later on, one either way ends up with a stationarity system in terms of limiting normals. Thus, we will rely on the limiting normal cone and the associated concept of M-stationarity. For the following theoretical results, we neglect the termination criterion in Line 5. This means that Algorithm 3.1 does not terminate and performs either infinitely many inner or infinitely many outer iterations. The first result analyzes the inner loop.
Proposition 3.1. Consider a fixed (outer) iteration j in Algorithm 3.1. Then the inner loop terminates (due to Line 8) or γ_{j,i}‖w_j − w_{j,i}‖ → 0 as i → ∞. (3.3) If the inner loop does not terminate, we get w_{j,i} → w_j and w_j is M-stationary.
We refer to Appendix A for the proof. It remains to analyze the situation where the inner loop always terminates. Let w_0 ∈ D be the starting point from Algorithm 3.1, and let S_ϕ(w_0) := {w ∈ D | ϕ(w) ≤ ϕ(w_0)} denote the corresponding (feasible) sublevel set. Then the following observation holds, see [32,66] and Appendix A for the details.
Proposition 3.2. We assume that the inner loop in Algorithm 3.1 always terminates (due to Line 8) and we denote by {w j } the infinite sequence of (outer) iterates. Assume that ϕ is bounded from below and uniformly continuous on S ϕ (w 0 ). Then we have w j+1 − w j → 0 as j → ∞.
The previous result allows us to prove the following main convergence result for Algorithm 3.1, see, again, Appendix A for a complete proof. Proposition 3.3. We assume that the inner loop in Algorithm 3.1 always terminates (due to Line 8) and we denote by {w_j} the infinite sequence of (outer) iterates. Assume that ϕ is bounded from below and uniformly continuous on S_ϕ(w_0). Suppose that w̄ is an accumulation point of {w_j}, i.e., w_j →_K w̄ along a subsequence K. Then w̄ is an M-stationary point of the optimization problem (3.1), and we have γ_j(w_{j+1} − w_j) →_K 0. From the proof of Proposition 3.2, it can easily be seen that the iterates of Algorithm 3.1 belong to the sublevel set S_ϕ(w_0), although the associated sequence of function values does not need to be monotonically decreasing. Hence, whenever this sublevel set is bounded, e.g., if ϕ is coercive or if D is bounded, the existence of an accumulation point as in Proposition 3.3 is ensured. Moreover, the boundedness of S_ϕ(w_0) implies that this set is compact. Hence, ϕ is automatically bounded from below and uniformly continuous on S_ϕ(w_0) in this situation.
By combining Propositions 3.1 and 3.3 we get the following convergence result.
Theorem 3.4. We consider Algorithm 3.1 without termination in Line 5 and assume that the sublevel set S_ϕ(w_0) is bounded. Then exactly one of the following situations occurs.
(i) The inner loop does not terminate in the outer iteration j, w_{j,i} → w_j as i → ∞, w_j is M-stationary, and (3.3) holds. (ii) The inner loop always terminates. The infinite sequence {w_j} of outer iterates possesses convergent subsequences, and every convergent subsequence satisfies w_j →_K w̄, where w̄ is M-stationary and γ_j(w_{j+1} − w_j) →_K 0. This result shows that the infinite sequence of (inner or outer) iterates of Algorithm 3.1 always converges towards M-stationary points (along subsequences). Note that the boundedness of S_ϕ(w_0) can be replaced by the assumptions on ϕ from Proposition 3.3, but then the outer iterates {w_j} might fail to possess accumulation points.
In what follows, we show that these theoretical results also give rise to a reasonable and applicable termination criterion for Line 5. To this end, we note that the optimality condition (3.2) suggests exploiting

‖γ_{j,i} (w^j − w^{j,i}) + ∇ϕ(w^{j,i}) − ∇ϕ(w^j)‖ ≤ ε_tol  (3.4)

(or a similar condition), with ε_tol > 0, as a termination criterion in Line 5. Indeed, Proposition 3.1 implies that the inner loop always terminates if (3.4) is used. Moreover, the termination criterion (3.4) directly encodes that w^{j,i} is approximately M-stationary for (3.1). This is very desirable since the goal of Algorithm 3.1 is the computation of approximately M-stationary points. Furthermore, condition (3.4) always ensures the finite termination of Algorithm 3.1 if the mild assumptions of Theorem 3.4 (or the even weaker assumptions of Proposition 3.3) are satisfied. Indeed, due to γ_j = γ_{j,i_j} and w^{j+1} = w^{j,i_j}, we have γ_{j,i_j} ‖w^j − w^{j,i_j}‖ = γ_j ‖w^j − w^{j+1}‖ →_K 0. Using w^{j+1}, w^j →_K w̄ and the continuity of ∇ϕ : W → W shows ∇ϕ(w^{j,i_j}) − ∇ϕ(w^j) = ∇ϕ(w^{j+1}) − ∇ϕ(w^j) →_K 0. Hence, the left-hand side of (3.4) with i = i_j is arbitrarily small if j ∈ K is large enough, so Algorithm 3.1 with the termination criterion (3.4) terminates in finitely many steps.
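To make the role of such a termination test concrete, the following sketch shows one trial step of a projected gradient iteration for min ϕ(w) s.t. w ∈ D together with the residual whose norm is tested. The oracle names `grad_phi` and `proj_D` are hypothetical, and the residual form used here is our reading of the criterion discussed above, not a verbatim transcription of the paper's formula.

```python
import numpy as np

def inner_step(w_j, grad_phi, proj_D, gamma, eps_tol=1e-4):
    """One trial step of a projected gradient iteration on min phi(w) s.t. w in D.

    Returns the trial point, a termination flag, and the stationarity residual
    gamma*(w_j - w_trial) + grad_phi(w_trial) - grad_phi(w_j),
    whose norm serves as an approximate M-stationarity measure.
    """
    g = grad_phi(w_j)
    w_trial = proj_D(w_j - g / gamma)  # projection onto the (possibly nonconvex) set D
    residual = gamma * (w_j - w_trial) + grad_phi(w_trial) - g
    return w_trial, np.linalg.norm(residual) <= eps_tol, residual
```

For D = W and ϕ(w) = ‖w‖²/2, a single step with γ = 1 already lands in the minimizer and the residual vanishes, which illustrates why the residual norm is a sensible stationarity measure.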
Let us mention that the above convergence theory differs from the one provided in [25,26] since no Lipschitzianity of ∇ϕ : W → W is needed. In the particular setting of complementarity-constrained optimization, related results have been obtained in [33,Section 4]. Our findings substantially generalize the theory from [33] to arbitrary set constraints.

Statement of the Algorithm
We now consider the optimization problem (P) under the given smoothness and convexity assumptions stated there (recall that D is not necessarily convex). This section presents a safeguarded augmented Lagrangian approach for the solution of (P). The method penalizes the constraints G(w) ∈ C, but leaves the possibly complicated condition w ∈ D explicitly in the constraints. Hence, the resulting subproblems that have to be solved in the augmented Lagrangian framework have exactly the structure of the (simplified) optimization problems discussed in Section 3.
To be specific, consider the (partially) augmented Lagrangian of (P),

L_ρ(w, λ) := f(w) + (ρ/2) d_C²(G(w) + λ/ρ),  (4.1)

where ρ > 0 denotes the penalty parameter. Note that the squared distance function of a nonempty, closed, and convex set is always continuously differentiable, see e.g. [11, Corollary 12.30], which yields that L_ρ(·, λ) is a continuously differentiable mapping. Using the definition of the distance, we can alternatively write this (partially) augmented Lagrangian as

L_ρ(w, λ) = f(w) + (ρ/2) ‖G(w) + λ/ρ − P_C(G(w) + λ/ρ)‖².

In order to control the update of the penalty parameter, we also introduce the auxiliary function

V_ρ(w, λ) := ‖G(w) − P_C(G(w) + λ/ρ)‖.  (4.2)

This function V_ρ can also be used to obtain a meaningful termination criterion, see the discussion after (4.4) below. The overall method is stated in Algorithm 4.1. Line 6 of Algorithm 4.1, in general, contains the main computational effort since we have to "solve" a constrained nonlinear program at each iteration. Due to the nonconvexity of this subproblem, we only require the computation of an M-stationary point of this program. In fact, we allow the computation of an approximately M-stationary point, with the vector ε^{k+1} measuring the degree of inexactness. The choice ε^{k+1} = 0 corresponds to an exact M-stationary point. Note that the subproblems arising in Line 6 have precisely the structure of the problem investigated in Section 3; hence, the spectral gradient method discussed there is a canonical candidate for the solution of these subproblems (note also that the objective function L_{ρ_k}(·, u^k) is once, but usually not twice, continuously differentiable).
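As an illustration, the following sketch evaluates an augmented Lagrangian of this type and its gradient, assuming the standard form L_ρ(w, λ) = f(w) + (ρ/2) d_C²(G(w) + λ/ρ). The callables `f`, `grad_f`, `G`, `jacG`, and `proj_C` are placeholders supplied by the user; this is a schematic sketch, not the authors' implementation.

```python
import numpy as np

def aug_lagrangian(w, lam, rho, f, G, proj_C):
    """Value of L_rho(w, lam) = f(w) + (rho/2) * dist_C(G(w) + lam/rho)^2."""
    z = G(w) + lam / rho
    return f(w) + 0.5 * rho * np.sum((z - proj_C(z)) ** 2)

def aug_lagrangian_grad(w, lam, rho, grad_f, G, jacG, proj_C):
    """Gradient of L_rho(., lam); uses d/dz (rho/2) dist_C(z)^2 = rho*(z - P_C(z))."""
    z = G(w) + lam / rho
    return grad_f(w) + jacG(w).T @ (rho * (z - proj_C(z)))
```

The smoothness of the squared distance function is what makes `aug_lagrangian_grad` well defined even though C may be an arbitrary closed convex set.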
Note that Algorithm 4.1 is called a safeguarded augmented Lagrangian method due to the appearance of the auxiliary sequence {u^k} ⊂ U, where U is a bounded set. In fact, if we replaced u^k by λ^k in Line 6 of Algorithm 4.1 (and the corresponding subsequent formulas), we would obtain the classical augmented Lagrangian method. However, the safeguarded version has superior global convergence properties, see [18] for a general discussion and [47] for an explicit (counter-)example. (Algorithm 4.1: Safeguarded Augmented Lagrangian Method for Geometric Constraints. In Line 6, one computes an approximately M-stationary point w^{k+1} of the subproblem min_{w ∈ D} L_{ρ_k}(w, u^k), i.e., for some suitable (sufficiently small) vector ε^{k+1} ∈ W, w^{k+1} needs to satisfy (4.3).) In practice, u^k is typically chosen to be equal to λ^k as long as this vector belongs to the set U; otherwise, u^k is taken as the projection of λ^k onto this set. In situations where Y is equipped with some (partial) order relation ⪯, a typical choice for U is given by the box [u_min, u_max] := {u ∈ Y | u_min ⪯ u ⪯ u_max}, where u_min, u_max ∈ Y are given bounds satisfying u_min ⪯ u_max. In order to understand the update of the Lagrange multiplier estimate in Line 7 of Algorithm 4.1, recall that the augmented Lagrangian is differentiable, with its derivative given by

∇_w L_ρ(w, λ) = ∇f(w) + ρ G′(w)* [G(w) + λ/ρ − P_C(G(w) + λ/ρ)],

see [11, Corollary 12.30] again. Hence, if we denote the usual (partial) Lagrangian of (P) by L(w, λ) := f(w) + ⟨λ, G(w)⟩, we obtain from Line 7 that ∇_w L_{ρ_k}(w^{k+1}, u^k) = ∇_w L(w^{k+1}, λ^{k+1}). This formula is actually the motivation for the precise update used in Line 7.
The particular updating rule in Lines 8 to 12 of Algorithm 4.1 is quite common, but other formulas are also possible. In particular, one can use a different norm in the definition (4.2) of V_ρ. For example, we used the maximum norm for our experiments in Section 6, where W is a space of real vectors or matrices. Let us emphasize that increasing the penalty parameter ρ_k based on a pure infeasibility measure does not work in Algorithm 4.1. One usually has to take into account both the infeasibility of the current iterate (w.r.t. the constraint G(w) ∈ C) and a kind of complementarity condition (i.e., λ ∈ N_C(G(w))).
For the discussion of a suitable termination criterion, we define s^{k+1} := P_C(G(w^{k+1}) + u^k/ρ_k). Using (4.3) and the update formula for λ^{k+1}, Algorithm 4.1 ensures

ε^{k+1} ∈ ∇f(w^{k+1}) + G′(w^{k+1})* λ^{k+1} + N_D^lim(w^{k+1}),  λ^{k+1} ∈ N_C(s^{k+1}),  (4.4)

and this corresponds to the definition of AM-stationary points, see Definition 2.2. Thus, it is reasonable to require ε^k → 0 and to use

V_{ρ_k}(w^{k+1}, u^k) ≤ ε_tol  (4.5)

for some ε_tol > 0 as a termination criterion. In practical implementations of Algorithm 4.1, a maximum number of iterations should also be incorporated into the termination criterion.
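A possible realization of the multiplier update, the safeguarding step, and a V-based penalty update is sketched below in one bookkeeping routine. The formulas follow the standard safeguarded framework described above; the names (`proj_C`, `tau`, `beta`) and the concrete decrease test are placeholders, and the maximum norm is used as an example.

```python
import numpy as np

def alm_bookkeeping(w_new, u, rho, G, proj_C, u_min, u_max,
                    V_old, tau=0.5, beta=10.0):
    """One bookkeeping step of a safeguarded ALM (sketch)."""
    z = G(w_new) + u / rho
    lam_new = rho * (z - proj_C(z))               # new multiplier estimate
    V_new = np.max(np.abs(G(w_new) - proj_C(z)))  # infeasibility/complementarity measure
    if V_new > tau * V_old:                       # insufficient decrease: raise penalty
        rho *= beta
    u_new = np.clip(lam_new, u_min, u_max)        # safeguarded multiplier
    return lam_new, u_new, rho, V_new
```

Note that the safeguard merely clips the multiplier estimate onto a bounded box, so the method behaves like the classical update as long as the estimates stay inside the box.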

Convergence
Throughout our convergence analysis, we assume implicitly that Algorithm 4.1 does not stop after finitely many iterations. Like all penalty-type methods in the setting of nonconvex programming, augmented Lagrangian methods suffer from the drawback that they generate accumulation points which are not necessarily feasible for the given optimization problem (P). The following (standard) result therefore presents some conditions under which it is guaranteed that limit points are feasible.
Proof. Let w̄ be an arbitrary accumulation point of {w^k} and, say, {w^{k+1}}_K a corresponding subsequence with w^{k+1} →_K w̄.
We start with the proof under validity of condition (a). Since {ρ_k} is bounded, Lines 8 to 12 of Algorithm 4.1 imply that V_{ρ_k}(w^{k+1}, u^k) → 0 for k → ∞. This implies d_C(G(w^{k+1})) →_K 0, and a continuity argument yields d_C(G(w̄)) = 0. Since C is a closed set, this implies G(w̄) ∈ C. Furthermore, by construction, we have w^{k+1} ∈ D for all k ∈ N, so that the closedness of D also yields w̄ ∈ D. Altogether, this shows that w̄ is feasible for the optimization problem (P).
Let us now prove the result in the presence of (b). In view of (a), it suffices to consider the situation where ρ_k → ∞. By assumption, we have (4.6). Rearranging terms, taking the limit k →_K ∞ in (4.6), and using the boundedness of {u^k}, we obtain d_C(G(w̄)) = 0 by a continuity argument. Similarly to part (a), this implies feasibility of w̄.
The two conditions (a) and (b) of Proposition 4.1 are, of course, difficult to check a priori. Nevertheless, in the situation where each iterate w^{k+1} is actually a global minimizer of the subproblem in Line 6 of Algorithm 4.1 and w denotes any feasible point of the optimization problem (P), we have

L_{ρ_k}(w^{k+1}, u^k) ≤ L_{ρ_k}(w, u^k) ≤ f(w) + B

for some suitable constant B due to the boundedness of the sequence {u^k}. The same argument also works if w^{k+1} is only an inexact global minimizer. The next result shows that, even in the case where a limit point is not necessarily feasible, it still contains some useful information in the sense that it is at least a stationary point for the constraint violation. In general, this is the best that one can expect.

Proof. In view of Proposition 4.1, if {ρ_k} is bounded, then each accumulation point is a global minimum of the feasibility problem (4.7) and, therefore, an M-stationary point of this problem. Hence, it remains to consider the case where {ρ_k} is unbounded, i.e., we have ρ_k → ∞ as k → ∞. In view of Lines 6 and 7 of Algorithm 4.1, see also (4.3), we have

ε^{k+1} ∈ ∇f(w^{k+1}) + G′(w^{k+1})* λ^{k+1} + N_D^lim(w^{k+1})

with λ^{k+1} as in Line 7. Dividing this inclusion by ρ_k and using the fact that N_D^lim(w^{k+1}) is a cone, we therefore get

ε^{k+1}/ρ_k ∈ ∇f(w^{k+1})/ρ_k + G′(w^{k+1})* (λ^{k+1}/ρ_k) + N_D^lim(w^{k+1}).

Now, let w̄ be an accumulation point and {w^{k+1}}_K be a subsequence satisfying w^{k+1} →_K w̄. Then the sequences {ε^{k+1}}_K, {u^k}_K, and {∇f(w^{k+1})}_K are bounded. Thus, taking the limit k →_K ∞ yields

0 ∈ G′(w̄)* [G(w̄) − P_C(G(w̄))] + N_D^lim(w̄)

by the outer semicontinuity of the limiting normal cone. Since we also have w̄ ∈ D and since the gradient of (1/2) d_C² at G(w̄) is precisely G(w̄) − P_C(G(w̄)), see, once more, [11, Corollary 12.30], it follows that w̄ is an M-stationary point of the feasibility problem (4.7).
We next investigate suitable properties of feasible limit points. The following may be viewed as the main observation in that respect and shows that any such accumulation point is automatically an AM-stationary point in the sense of Definition 2.2.

Proof. Let {w^{k+1}}_K denote a subsequence such that w^{k+1} →_K w̄. Define

s^{k+1} := P_C(G(w^{k+1}) + u^k/ρ_k)  and  z^{k+1} := G(w^{k+1}) − s^{k+1}

for each k ∈ N. We claim that the four (sub-)sequences {w^{k+1}}_K, {z^{k+1}}_K, {ε^{k+1}}_K, and {λ^{k+1}}_K generated by Algorithm 4.1 or defined in the above way satisfy the properties from Definition 2.2 and therefore show that w̄ is an AM-stationary point. By construction, we have w^{k+1} →_K w̄ and ε^{k+1} →_K 0. Further, from Line 6 of Algorithm 4.1 and (4.3), we obtain

ε^{k+1} ∈ ∇f(w^{k+1}) + G′(w^{k+1})* λ^{k+1} + N_D^lim(w^{k+1}).

Since N_C(s^{k+1}) is a cone, the relation between P_C and N_C together with the definitions of s^{k+1}, λ^{k+1}, and z^{k+1} yields λ^{k+1} ∈ N_C(s^{k+1}). Hence, it remains to show z^{k+1} →_K 0. To this end, we consider two cases, namely whether {ρ_k} stays bounded or is unbounded. In the bounded case, Lines 8 to 12 of Algorithm 4.1 imply that V_{ρ_k}(w^{k+1}, u^k) → 0 for k → ∞. The corresponding definitions therefore yield ‖z^{k+1}‖ = V_{ρ_k}(w^{k+1}, u^k) → 0. On the other hand, if {ρ_k} is unbounded, we have ρ_k → ∞. Since {u^k} is bounded by construction, the continuity of the projection operator together with the assumed feasibility of w̄ implies

s^{k+1} = P_C(G(w^{k+1}) + u^k/ρ_k) →_K P_C(G(w̄)) = G(w̄).

Consequently, we obtain z^{k+1} = G(w^{k+1}) − s^{k+1} →_K 0 also in this case. Altogether, this implies that w̄ is AM-stationary.
We point out that the proof of Theorem 4.3 even shows the convergence ‖z^{k+1}‖ = V_{ρ_k}(w^{k+1}, u^k) →_K 0, i.e., the stopping criterion (4.5) will be satisfied after finitely many steps.
Recalling that, by definition, each AM-stationary point of (P) which is AM-regular must already be M-stationary, we obtain the following corollary.

Realizations
Let k be a fixed iteration of Algorithm 4.1. For the (approximate) solution of the ALM subproblem in Line 6 of Algorithm 4.1, we may use Algorithm 3.1. Recall that, given an outer iteration j of Algorithm 3.1, we need to solve the corresponding subproblem with some given w^j and γ_{j,i} > 0 in the inner iteration i of Algorithm 3.1. As pointed out in Section 3, this problem possesses the same solutions as the associated projection problem, i.e., we need to be able to compute elements of the (possibly multi-valued) projection

Π_D(w^j − (1/γ_{j,i}) ∇_w L_{ρ_k}(w^j, u^k)).

Boiling this requirement down to its essentials, we have to be in a position to find projections of arbitrary points onto the set D in an efficient way. Subsequently, this will be discussed in the context of several practically relevant settings.

The Disjunctive Programming Case
We consider (P) in the special Setting 2.5 with X := [ℓ, u] ⊂ R^n, where ℓ, u ∈ R^n satisfy −∞ ≤ ℓ_i < u_i ≤ ∞ for i = 1, ..., n. Recall the structure of the set D in this situation. For given w̄ = (x̄, ȳ, z̄) ∈ R^n × R^{m_3} × R^{m_3}, we want to characterize the elements of Π_D(w̄). Therefore, we consider the associated projection problem (5.2). We observe that the latter can be decomposed into n one-dimensional optimization problems (one for each component of x) and the m_3 two-dimensional problems (5.3), i = 1, ..., m_3. Due to T = T_1 ∪ T_2, each of the latter problems on its own can be decomposed into the two two-dimensional subproblems (R(i, 1)) and (R(i, 2)). In most of the popular settings from disjunctive programming, (R(i, j)) can be solved with ease. By a simple comparison of the associated objective function values, we find the solutions of (5.3). Putting the solutions of the subproblems together, we find the solutions of (5.2), i.e., the elements of Π_D(w̄).
Particularly, it turns out that, in order to compute projections onto the set D under consideration, one basically needs to compute n + 2m_3 projections onto real intervals. In the specific setting of complementarity-constrained programming, this has already been observed in [33, Section 4].
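For the complementarity instance, where T = T_1 ∪ T_2 with T_1 = {(s, t) | s ≥ 0, t = 0} and T_2 = {(s, t) | s = 0, t ≥ 0}, the branch comparison described above can be sketched as follows; the routine is vectorized over all m_3 pairs, and the tie-break toward T_1 is an arbitrary choice (the projection is genuinely multi-valued at ties).

```python
import numpy as np

def proj_complementarity(a, b):
    """Componentwise projection of pairs (a_i, b_i) onto
    T = {(s, t) : s >= 0, t >= 0, s*t = 0} = T1 ∪ T2,
    by comparing the projections onto the two branches
    T1 = {s >= 0, t = 0} and T2 = {s = 0, t >= 0}."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    p1 = np.maximum(a, 0.0)            # branch T1 candidate: (max(a,0), 0)
    p2 = np.maximum(b, 0.0)            # branch T2 candidate: (0, max(b,0))
    d1 = (a - p1) ** 2 + b ** 2        # squared distance to T1 candidate
    d2 = a ** 2 + (b - p2) ** 2        # squared distance to T2 candidate
    use1 = d1 <= d2                    # tie-break: pick T1
    return np.where(use1, p1, 0.0), np.where(use1, 0.0, p2)
```

Each pair thus costs only two interval projections and one comparison, matching the n + 2m_3 interval projections mentioned above.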
Let us briefly mention that other popular instances of disjunctive programs, like vanishing- and or-constrained optimization problems, see e.g. [1,54], where T is chosen accordingly, can be treated in an analogous fashion. Furthermore, an analogous procedure applies to more general situations where T is the union of finitely many convex polyhedral sets.

The Sparsity-Constrained Case
We fix W := R^n and some κ ∈ N with 1 ≤ κ ≤ n − 1. Consider the set S_κ := {w ∈ R^n | ‖w‖_0 ≤ κ} with ‖w‖_0 being the number of nonzero entries of the vector w. This set plays a prominent role in sparse optimization and for problems with cardinality constraints. Since S_κ is nonempty and closed, projections of some vector w ∈ R^n (w.r.t. the Euclidean norm) onto this set exist (but may not be unique) and are known to consist of those vectors y ∈ R^n whose nonzero entries are precisely the κ largest (in absolute value) components of w (which may not be unique), see e.g.
[12, Proposition 3.6]. Hence, within our augmented Lagrangian framework, we may take D := S_κ and then get an explicit formula for the solutions of the corresponding subproblems arising within the spectral gradient method. However, typical implementations of augmented Lagrangian methods (like ALGENCAN, see [2]) do not penalize box constraints, i.e., they leave the box constraints explicitly as constraints when solving the corresponding subproblems. Hence, let us assume that we have some lower and upper bounds satisfying −∞ ≤ ℓ_i < u_i ≤ ∞ for all i = 1, ..., n. We are then forced to compute projections onto the set

D := S_κ ∩ [ℓ, u].  (5.5)

In addition, we assume that

0 ∈ [ℓ_i, u_i] for all i = 1, ..., n.  (5.6)

We mention that this assumption is not restrictive. Indeed, let us assume that, e.g., 0 ∉ [ℓ_1, u_1]. Then the first component of w ∈ D cannot be zero, and this shows

D = [ℓ_1, u_1] × (Ŝ_{κ−1} ∩ [ℓ̂, û]),  (5.7)

where Ŝ_{κ−1} := {w ∈ R^{n−1} | ‖w‖_0 ≤ κ − 1} and the vectors ℓ̂, û ∈ R^{n−1} are obtained from ℓ, u by dropping the first component, respectively. For the computation of the projection onto D, we can now exploit the product structure (5.7). Similarly, we can remove all remaining components i = 2, ..., n with 0 ∉ [ℓ_i, u_i] from D. Thus, we can assume (5.6) without loss of generality. We begin with a simple observation.
Lemma 5.2. Let w ∈ R^n be arbitrary. Then, for each y ∈ Π_D(w), where D is the set from (5.5), we have y_i ∈ {0, P_{[ℓ_i,u_i]}(w_i)} for all i = 1, ..., n.

Proof. To the contrary, assume that y_i ≠ 0 and y_i ≠ P_{[ℓ_i,u_i]}(w_i) hold for some index i ∈ {1, ..., n}. Define the vector q ∈ R^n by q_j := y_j for j ≠ i and q_i := P_{[ℓ_i,u_i]}(w_i).
Due to y_i ≠ 0, we have ‖q‖_0 ≤ ‖y‖_0 ≤ κ, i.e., q ∈ S_κ. Additionally, q ∈ [ℓ, u] is clear from y ∈ [ℓ, u] and q_i = P_{[ℓ_i,u_i]}(w_i). Thus, we find q ∈ D. Furthermore, ‖q − w‖ < ‖y − w‖ since q_i = P_{[ℓ_i,u_i]}(w_i) ≠ y_i. This contradicts the fact that y is a projection of w onto D.
Due to the above lemma, there are only two choices for the value of each component of a projection onto D from (5.5). Thus, for an arbitrary index set I ⊂ {1, ..., n} and an arbitrary vector w ∈ R^n, we define p_I(w) ∈ R^n via

p_I(w)_i := P_{[ℓ_i,u_i]}(w_i) if i ∈ I,  p_I(w)_i := 0 if i ∉ I.

It remains to characterize those index sets I which ensure that p_I(w) is a projection of w onto D. To this end, we define an auxiliary vector d(w) ∈ R^n via

d_i(w) := w_i² − (w_i − P_{[ℓ_i,u_i]}(w_i))²,  i = 1, ..., n.

Note that this definition directly yields

‖w − p_I(w)‖² = ‖w‖² − Σ_{i∈I} d_i(w).

We state the following simple observation.
Lemma 5.3. Fix w ∈ R^n and assume that (5.6) is valid. Then the following statements hold: (a) d_i(w) ≥ 0 for all i = 1, ..., n; (b) for each i ∈ {1, ..., n}, d_i(w) = 0 implies P_{[ℓ_i,u_i]}(w_i) = 0.

Observe that the second assertion of the above lemma implies that indices i with d_i(w) = 0 can be added to or removed from I without changing p_I(w). This can be used to characterize the set of projections onto the set D from (5.5).
Proposition 5.4. Let D be the set from (5.5) and assume that (5.6) holds. Then, for each w ∈ R^n, y ∈ Π_D(w) holds if and only if there exists an index set I ⊂ {1, ..., n} with |I| = κ such that

d_i(w) ≥ d_j(w)  ∀i ∈ I, ∀j ∉ I,  (5.10)

and y = p_I(w) hold.
Proof. If y ∈ Π_D(w) holds, then y = p_J(w) is valid for some index set J, see Lemma 5.2, and J solves the problem

max_I Σ_{i∈I} d_i(w)  s.t.  I ⊂ {1, ..., n}, |I| ≤ κ.  (5.11)

We note that the solutions of this problem are invariant under addition and removal of indices i with d_i(w) = 0. Due to Lemma 5.3 (b), these operations also do not alter the associated p_I(w). Thus, for each projection p_J(w), we can add or remove indices i with d_i(w) = 0 to obtain a set I with p_I(w) = p_J(w) and |I| = κ. It is also clear that (5.10) holds for such a choice of I. The converse direction follows from the same considerations.
Below, we comment on the result of Proposition 5.4.
Remark 5.5. (a) Let y = p I (w) be a projection of w ∈ R n onto D from (5.5) such that (5.6) holds. Observe that y i = 0 may also hold for some indices i ∈ I.
(b) In the unconstrained case [ℓ, u] = R^n, we find d_i(w) = w_i² for each w ∈ R^n and all i = 1, ..., n. Thus, Proposition 5.4 recovers the well-known characterization of the projection onto the set S_κ which can be found in [12, Proposition 3.6].
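Under assumption (5.6), the characterization of Proposition 5.4 can be turned into the following sketch. It computes one element of Π_D(w) by clipping componentwise and keeping the κ components with the largest gains d_i(w); the concrete formula used for d_i(w) is our reconstruction, chosen so that it reduces to w_i² in the unconstrained case, in line with Remark 5.5 (b).

```python
import numpy as np

def proj_sparse_box(w, l, u, kappa):
    """One projection of w onto {y in [l,u] : ||y||_0 <= kappa}.

    Assumes 0 in [l_i, u_i] for all i (condition (5.6)). Ties in the
    gains d_i(w) are broken by argsort order, so only one element of
    the (possibly multi-valued) projection is returned.
    """
    w = np.asarray(w, float)
    p = np.clip(w, l, u)           # componentwise interval projection
    d = w ** 2 - (w - p) ** 2      # gain of keeping component i active
    I = np.argsort(-d)[:kappa]     # indices of the kappa largest gains
    y = np.zeros_like(w)
    y[I] = p[I]
    return y
```

The cost is dominated by the sort, i.e., O(n log n), in contrast to enumerating all index sets of size κ.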
We want to close this section with some brief remarks regarding the variational geometry of D = S_κ ∩ [ℓ, u] from (5.5). Observe that the sets S_κ and [ℓ, u] are both polyhedral in the sense that they can be represented as the union of finitely many polyhedrons, so that the corresponding normal cone intersection rule applies.

General Low-Rank Approximations
For natural numbers m, n ∈ N with m, n ≥ 2, we fix W := R^{m×n}. Equipped with the standard Frobenius inner product, W is indeed a Euclidean space. Now, for fixed κ ∈ N satisfying 1 ≤ κ ≤ min(m, n) − 1, let us investigate the set

D := {W ∈ R^{m×n} | rank(W) ≤ κ}.

Constraint systems involving rank constraints of the type W ∈ D can be used to model numerous practically relevant problems in computer vision, machine learning, computer algebra, signal processing, or model order reduction, see [52, Section 1.3] for an overview. Nowadays, one of the most popular applications behind low-rank constraints is so-called low-rank matrix completion, particularly the "Netflix problem", see [22] for details.
Observe that the variational geometry of D has been explored recently in [40]; particularly, a formula for the limiting normal cone to this set can be found there.

Proposition 5.6. For a given matrix W ∈ W, let W = UΣV^⊤ be its singular value decomposition with orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} as well as a diagonal matrix Σ ∈ R^{m×n} whose diagonal entries are in non-increasing order. Let Û ∈ R^{m×κ} and V̂ ∈ R^{n×κ} be the matrices resulting from U and V by deleting the last m − κ and n − κ columns, respectively. Furthermore, let Σ̂ ∈ R^{κ×κ} be the top left κ × κ block of Σ. Then we have Û Σ̂ V̂^⊤ ∈ Π_D(W).
Note that the projection formulas from the previous sections allow a very efficient computation of the corresponding projections, which is in contrast to the projection provided by Proposition 5.6: though the formula given there is conceptually very simple, its realization requires the computation of the singular value decomposition of the given matrix.
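A minimal realization of Proposition 5.6 via a truncated singular value decomposition might look as follows:

```python
import numpy as np

def proj_low_rank(W, kappa):
    """An element of the projection of W onto {rank <= kappa},
    obtained by truncating the SVD (cf. Proposition 5.6)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # singular values descending
    return (U[:, :kappa] * s[:kappa]) @ Vt[:kappa, :]
```

For large matrices, a randomized or iterative SVD computing only the leading κ singular triplets would serve the same purpose at lower cost.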

Symmetric Low-Rank Approximation
Given n ∈ N with n ≥ 2, we consider the set of symmetric matrices W := R^{n×n}_sym, still equipped with the Frobenius inner product. Now, for fixed κ ∈ N satisfying 1 ≤ κ ≤ n, let us investigate the set

D := {W ∈ R^{n×n}_sym | W ⪰ 0, rank(W) ≤ κ}.

Above, the constraint W ⪰ 0 abbreviates that W has to be positive semidefinite. Constraint systems involving rank constraints of the type W ∈ D arise frequently in several different mathematical models of data science, see [48] for an overview and Section 6.3 for an application. Note that κ := n covers the setting of pure semidefiniteness constraints.
Exploiting the eigenvalue decomposition of a given matrix W ∈ W, one can easily construct an element of Π D ( W ).
Proposition 5.7. For a given matrix W ∈ W, we denote by W = Σ_{i=1}^n λ_i v_i v_i^⊤ its (orthonormal) eigenvalue decomposition with non-increasingly ordered eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n and associated pairwise orthonormal eigenvectors v_1, ..., v_n. Then we have

Σ_{i=1}^κ max(λ_i, 0) v_i v_i^⊤ ∈ Π_D(W).

Proof. We define the positive and negative parts W_+ := Σ_{i=1}^n max(λ_i, 0) v_i v_i^⊤ and W_− := W_+ − W. For every B ∈ D, we then have ‖W − B‖² = ‖W_+ − B‖² + 2⟨W_−, B⟩ + ‖W_−‖² ≥ ‖W_+ − B‖² + ‖W_−‖² since W_− and B are positive semidefinite. Since the singular value decomposition of W_+ coincides with its eigenvalue decomposition, the right-hand side is minimized over D by B := Σ_{i=1}^κ max(λ_i, 0) v_i v_i^⊤, see Proposition 5.6, while noting that we have B = W_+ in case κ = n. Due to ⟨W_−, B⟩ = 0 for this choice, B also minimizes the left-hand side.
It is clear that the computation of the κ largest eigenvalues of W ∈ W is sufficient to compute an element of the projection Π_D(W). This can be done particularly efficiently for small κ (note that κ = 1 holds in our application from Section 6.3).
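Analogously, Proposition 5.7 suggests the following sketch based on a full symmetric eigenvalue decomposition; for small κ, iterative eigensolvers computing only the κ largest eigenpairs would be preferable.

```python
import numpy as np

def proj_psd_low_rank(W, kappa):
    """An element of the projection of a symmetric W onto
    {B symmetric : B psd, rank(B) <= kappa}: keep the kappa
    largest eigenvalues, clipped at zero (cf. Proposition 5.7)."""
    lam, V = np.linalg.eigh(W)          # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]      # reorder non-increasingly
    lam_kept = np.maximum(lam[:kappa], 0.0)
    return (V[:, :kappa] * lam_kept) @ V[:, :kappa].T
```

Note that clipping at zero enforces positive semidefiniteness, so the result may have rank strictly smaller than κ.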

Extension to Nonsmooth Objectives
For some lower semicontinuous functional q : W → R ∪ {+∞}, we consider the optimization problem

min_w f(w) + q(w)  s.t.  G(w) ∈ C.  (5.12)

In particular, we do not assume that q is continuous. As examples, let us mention the special cases where q is the indicator function of a closed set, counts the nonzero entries of the argument vector (in case W := R^n), or encodes the rank of the argument matrix (in case W := R^{m×n}). In this regard, (5.12) can be used to model real-world applications from, e.g., image restoration or signal processing. Necessary optimality conditions and qualification conditions addressing (5.12) can be found in [35]. In [25], the authors suggest handling (5.12) numerically with the aid of an augmented Lagrangian method (without safeguarding) based on the (partially) augmented Lagrangian function (4.1), whose subproblems are solved with a nonmonotone proximal gradient method inspired by [66]. In this regard, the solution approach to (5.12) described in [25] possesses some parallels to our strategy for the numerical solution of (P). The authors in [25] were able to prove convergence of their method to reasonable stationary points of (5.12) under a variant of the basic qualification condition and RCPLD. Let us mention that the authors in [25,35] only considered standard inequality and equality constraints, but the theory in these papers can easily be extended to the more general constraints considered in (5.12) with minor adjustments. We note that (P) can be interpreted as a special instance of (5.12) where q plays the role of the indicator function of the set D. Then the nonmonotone proximal gradient method from [25] reduces to the spectral gradient method from Section 3. However, the authors in [25] did not challenge their method with discontinuous functionals q and thus excluded some of the more interesting applications behind the model (P). Furthermore, we would like to mention that (5.12) can be reformulated by means of the epigraph epi q as

min_{(w,α)} f(w) + α  s.t.  G(w) ∈ C,  (w, α) ∈ epi q,  (5.13)

which is a problem of type (P).
One can easily check that (5.12) and (5.13) are equivalent in the sense that w̄ ∈ W is a local/global minimizer of (5.12) if and only if (w̄, q(w̄)) is a local/global minimizer of (5.13). Problem (5.13) can be handled with Algorithm 4.1 as soon as the computation of projections onto D := epi q is possible in an efficient way. Our result from Corollary 4.4 shows that Algorithm 4.1 applied to (5.13) computes M-stationary points of (5.12) under AM-regularity (associated with (5.13) at (w̄, q(w̄))), i.e., we are in a position to find points satisfying the associated M-stationarity system under a very mild condition, which enhances [25, Theorem 3.1]. Here, we used the limiting subdifferential of q given by ∂q(w̄) := {ξ ∈ W | (ξ, −1) ∈ N^lim_{epi q}(w̄, q(w̄))}.

Numerical Results
In iteration k of Algorithm 4.1, we terminate Algorithm 3.1 as soon as the inner iterates w^{j,i} satisfy the criterion (3.4), where ‖·‖_∞ stands for the maximum norm for both W equal to R^n and equal to R^{n×n}_sym (other Euclidean spaces do not occur in the subsequent applications). Similarly, we use the maximum norm in the definition (4.2) of V_ρ. Algorithm 4.1 is terminated as soon as (4.5) is satisfied with ε_tol := 10^{−4}. These two termination criteria ensure that the final iterate w^k together with the multiplier λ^k is approximately M-stationary, see (4.4).
Given an arbitrary (possibly random) starting point w^0, we first project this point onto the set D and then use this projected point as the true starting point, so that all iterates w^k generated by Algorithm 4.1 belong to D. The choice of the initial penalty parameter is similar to the rule in [18, p. 153].
In all our examples, the space Y is given by R^m as in Setting 2.4. This allows us to choose the safeguarded multiplier estimate u^k as the projection of the current value λ^k onto a given box [u_min, u_max], where this box is (in a componentwise fashion) chosen to be [−10^{20}, 10^{20}] for all equality constraints and [0, 10^{20}] for all inequality constraints. In this way, we basically guarantee that the safeguarded augmented Lagrangian method from Algorithm 4.1 coincides with the classical approach as long as bounded multiplier estimates λ^k are generated.

MPCC Examples
The specification of Algorithm 4.1 to MPCCs is essentially the method discussed in [33], where extensive numerical results (including comparisons with other methods) are presented. We therefore keep this section short and consider only two particular examples in order to illustrate certain aspects of our method.
Example 6.1. Here, for w := (y, z) ∈ R², we consider a two-dimensional MPCC which is essentially the example from [64] with an additional (inactive) inequality constraint, included so that there is at least one standard constraint and Algorithm 4.1 does not automatically reduce to the spectral gradient method. The problem possesses two global minimizers at (0, 1) and (1, 0) which are M-stationary (in fact, they are even strongly stationary in the MPCC terminology). Moreover, it has a local maximizer at (0, 0) which is a point of attraction for many MPCC solvers since it can be shown to be C-stationary, see e.g. [39] for the corresponding definitions and some convergence results to C- and M-stationary points. Due to Lemma 2.8, each feasible point of the problem is AM-regular. In view of our convergence theory, Algorithm 4.1 should not converge to the origin. To verify this statement numerically, we generated 1000 random starting points (uniformly distributed) from the box [−10, 10]² and then applied Algorithm 4.1 to the above example. As expected, the method converges for all 1000 starting points to one of the two minima. Moreover, we can even start our method at the origin, and the method still converges to the point (1, 0) or (0, 1). The limit point itself depends on our choice of the projection, which is not unique for iterates (y^k, z^k) with y^k = z^k > 0.
The next example is used to illustrate a limitation of our approach which is based on the fact that we exploit the spectral gradient method as a subproblem solver.
There are examples where this spectral gradient method reduces the number of iterations, even for two-dimensional problems, from more than 100,000 to just a few. Nevertheless, in the end, the spectral gradient method is a projected gradient method: it exploits a different stepsize selection, but it eventually behaves like a standard projected gradient method if there are a number of consecutive iterations with very small progress, i.e., with almost identical function values during the last few iterations, so that the maximum term in the nonmonotone line search is almost identical to the current function value used in the monotone version. This situation typically happens for problems which are ill-conditioned, and we illustrate this observation by the following example.
Example 6.2. We consider the optimal control of a discretized obstacle problem as investigated in [36, Section 7.4], using w := (x, y, z) in our notation. Here, A is a tridiagonal matrix which arises from a discretization of the negative Laplace operator in one dimension, i.e., a_{ii} = 2 for all i and a_{ij} = −1 for all j = i ± 1.
Furthermore, e denotes the all-one vector of appropriate size. We note that w̄ := 0 is the global minimizer as well as an M-stationary point of this program. Again, Lemma 2.8 shows that each feasible point is AM-regular. Viewing the constraint x ≥ 0 as a box constraint, taking a moderate discretization with A ∈ R^{64×64}, and using the all-one vector as a starting point, we obtain the results from Table 6.1. The number of (outer) iterations is denoted by k, j is the number of inner iterations, j_cum the accumulated number of inner iterations, f-ev. provides the number of function evaluations (note that, due to the stepsize rule, we might have several function evaluations in a single inner iteration; hence, f-ev. is always an upper bound for j_cum), f(w^k) denotes the current function value, the column titled "V_k" contains V_{ρ_{k−1}}(w^k, u^{k−1}), t_j := 1/γ_j is the stepsize, and ρ_k denotes the penalty parameter at iteration k. The method terminates after 12 outer iterations, which is a reasonable number, especially taking into account that the final penalty parameter ρ_k is relatively large, so that several subproblems with different values of ρ_k have to be solved in the intermediate steps. On the other hand, the number of inner iterations j (at each outer iteration k) is very large; in the final step, the method requires more than one million inner iterations. This is a typical behavior of gradient-type methods and indicates that the underlying subproblems are ill-conditioned. This is also reflected by the fact that the stepsize t_j tends to zero.
There are two types of difficulties in Example 6.2: there are challenging constraints (the complementarity constraints), and there is ill-conditioning. The difficult constraints are treated successfully by Algorithm 4.1, but the ill-conditioning causes some problems when solving the resulting subproblems. In principle, this difficulty can be circumvented by using another subproblem solver (like a semismooth Newton method, see [36]), but then it is no longer guaranteed that we obtain M-stationary points in the limit.
Despite the fact that the ill-conditioning causes some difficulties, we stress again that each iteration of the spectral gradient method is extremely cheap. Moreover, for all test problems in the subsequent sections, we put an upper bound of 50000 inner iterations (as a safeguard), and this upper bound was not reached in any of these examples.

Cardinality-Constrained Problems
We first consider an artificial example to illustrate the convergence behavior of Algorithm 4.1 for cardinality-constrained problems.
where Q := E + I with E ∈ R^{5×5} being the all-one matrix, I ∈ R^{5×5} the identity matrix, and c := −(3, 2, 3, 12, 5)^⊤ ∈ R^5. Clearly, by Lemma 2.8, all feasible points are AM-regular. This is a minor modification of an example from [13], to which we added an (inactive) inequality constraint for the same reason as in Example 6.1. Taking into account that there are (5 choose 2) = 10 possibilities to choose two possibly nonzero components of w, an elementary calculation shows that there are exactly 10 M-stationary points w̄_1, ..., w̄_10, which are given in Table 6.2 together with the corresponding function values. It follows that w̄_6 is the global minimizer. The points w̄_3, w̄_8, and w̄_10 have function values which are not too far away from f(w̄_6), whereas all other M-stationary points have significantly larger function values. We then took 1000 random starting points from the box [−10, 10]^5 (uniformly distributed) and applied Algorithm 4.1 to this example. Surprisingly, the method converged, for all 1000 starting points, to the global minimizer w̄_6. We then changed the example by putting an upper bound w_4 ≤ 0 on the fourth component. This excludes the four most interesting points w̄_3, w̄_6, w̄_8, and w̄_10. Among the remaining points, the three vectors w̄_4, w̄_7, and w̄_9 have identical function values. Running our program again using 1000 randomly generated starting points, we obtain convergence to w̄_4 in 589 cases and convergence to w̄_7 in 350 situations, whereas in only 61 instances do we observe convergence to the non-optimal point w̄_2.

We next consider a class of cardinality-constrained problems of the form (6.1). This is a classical portfolio optimization problem, where Q and µ denote the covariance matrix and the mean of n possible assets, respectively, while ϱ is some lower bound for the expected return. Furthermore, u provides an upper bound for the individual assets within the portfolio.
The affine structure of the constraints in (6.1) implies that all feasible points are AM-regular, see Lemma 2.8. The data Q, μ, ϱ, u were randomly created by the test problem collection [30], which is available from the webpage https://commalab.di.unipi.it/datasets/MV/. Here, we used all 30 test instances of dimension n := 200 and three different values κ ∈ {5, 10, 20} for each problem. We apply three different methods: The CPLEX solver is used to (hopefully) identify the global optimum of the optimization problem (6.1); note that we put a time limit of 0.5 hours on each test problem. Method (a) applies our augmented Lagrangian method to (6.1) using the set D := {w ∈ [0, u] | ‖w‖_0 ≤ κ}. Projections onto D are computed using the analytic formula from Proposition 5.4. Finally, the boosted version (b) of Algorithm 4.1 works as follows: We first delete the cardinality constraint from the portfolio optimization problem. The resulting quadratic program is convex and can therefore be solved easily. Afterwards, we apply Algorithm 4.1 to a sequence of relaxations of (6.1) in which the cardinality bound is decreased by 10 in each step (starting with n − 10) as long as the desired value κ ∈ {5, 10, 20} has not yet been undercut. For κ = 5, a final call of Algorithm 4.1 with the correct cardinality is necessary since, otherwise, the procedure would terminate at cardinality level 10. In each outer iteration, the projection of the solution of the previous iteration onto the set D is used as the starting point.
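The analytic formula of Proposition 5.4 is not restated here, but a minimal sketch of a projection onto D = {w ∈ [0, u] : ‖w‖_0 ≤ κ} can be obtained from the separability of the squared distance: clip each component to the box and then keep the κ components whose retention reduces the squared distance most. The function name below is ours.

```python
import numpy as np

def project_cardinality_box(w, u, kappa):
    """One Euclidean projection of w onto {z in [0, u] : ||z||_0 <= kappa}.

    The problem is separable: keeping component i at its box projection
    p_i = clip(w_i, 0, u_i) reduces the squared distance by
    w_i**2 - (w_i - p_i)**2 compared to setting z_i = 0, so we keep the
    kappa components with the largest reduction (ties make the
    projection non-unique; any selection is a valid projection).
    """
    p = np.clip(w, 0.0, u)            # componentwise box projection
    gain = w**2 - (w - p)**2          # benefit of keeping component i
    keep = np.argsort(gain)[-kappa:]  # kappa most beneficial indices
    z = np.zeros_like(p)
    z[keep] = p[keep]
    return z
```

For instance, projecting w = (3, −1, 0.5, 2) with u = e and κ = 2 keeps the first and last component at their box values and zeroes the rest.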
The corresponding results are summarized in Figure 6.1 for the three different values κ ∈ {5, 10, 20}. This figure compares the optimal function values obtained by the above three methods for each of the 30 test problems. The optimal function values produced by CPLEX are used as reference values in order to judge the quality of the results obtained by the other approaches. The main observations are the following: The optimal function value computed by CPLEX is (not surprisingly) always the best one. On the other hand, the corresponding values computed by method (a) are usually not too far away from the optimal ones. Moreover, for all test problems, the boosted version (b) generates even better function values which are usually very close to the ones computed by CPLEX. Of course, if κ is taken smaller, the problems become more demanding and are therefore, in general, more difficult to solve. Nevertheless, also for κ = 5, the boosted algorithm in particular still computes rather good points. In this context, one should also note that our methods always terminate with a (numerically) feasible point; hence, the final iterate can actually be used as a (good) approximation of the global minimizer. We also would like to mention that our MATLAB implementation of Algorithm 4.1 typically requires, on an Intel Core i7-8700 processor, a CPU time of only about 0.1 seconds for each of the test problems, whereas the boosted version requires roughly two seconds of CPU time on average.

MAXCUT Problems
This section considers the famous MAXCUT problem as an application of our algorithm to problems with rank constraints. To this end, let G = (V, E) be an undirected graph with vertex set V = {1, . . . , n} and edges e_ij between vertices i, j ∈ V. We assume that the graph is weighted, with a_ij = a_ji denoting the nonnegative weight of the edge e_ij. Since we allow zero weights, we can assume without loss of generality that G is a complete graph. Now, given a subset S ⊂ V with complement S^c, the cut defined by S is the set δ(S) := {e_ij | i ∈ S, j ∈ S^c} of all edges such that one end point belongs to S and the other one to S^c. The corresponding weight of this cut is defined by

    w(δ(S)) := Σ_{e_ij ∈ δ(S)} a_ij.

The MAXCUT problem looks for a maximum cut, i.e., a cut with maximum weight. This graph-theoretical problem is known to be NP-hard, thus very difficult to solve. Encoding S by a vector x ∈ {−1, 1}^n with x_i = 1 if and only if i ∈ S, the cut weight equals (1/4) Σ_{i,j} a_ij (1 − x_i x_j), and the substitution W := x x^⊤ leads to a matrix program where the variable W is chosen from the space W := R^{n×n}_sym. Due to the linear constraint diag W = e, it follows that this problem is equivalent to

    max_W  (1/4) Σ_{i,j} a_ij (1 − W_ij)   s.t.   diag W = e,  W ⪰ 0,  rank W ≤ 1.    (6.3)

Deleting the difficult rank constraint, one gets the (convex) relaxation

    max_W  (1/4) Σ_{i,j} a_ij (1 − W_ij)   s.t.   diag W = e,  W ⪰ 0,    (6.4)

which is a famous test problem for semidefinite programs.
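To make the two equivalent ways of measuring a cut concrete, the following sketch (helper names are ours) evaluates the weight w(δ(S)) both directly from the edge weights and via the ±1 encoding x with x_i = 1 if and only if i ∈ S:

```python
import numpy as np

def cut_weight(A, S):
    """Weight of the cut delta(S): sum of a_ij over i in S, j in S^c,
    for a symmetric nonnegative weight matrix A with zero diagonal."""
    n = A.shape[0]
    in_S = np.zeros(n, dtype=bool)
    in_S[list(S)] = True
    return A[np.ix_(in_S, ~in_S)].sum()

def cut_weight_pm1(A, x):
    """Same weight via the +/-1 encoding: (1/4) * sum_ij a_ij (1 - x_i x_j);
    the double sum counts every cut edge twice, hence the factor 1/4."""
    return 0.25 * (A * (1.0 - np.outer(x, x))).sum()
```

For a triangle with weights a_01 = 1, a_02 = 2, a_12 = 3, the cut S = {0} has weight 1 + 2 = 3 under both formulas.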
Here, we directly deal with (6.3) by taking D := {W ∈ W | W ⪰ 0, rank W ≤ 1} as the complicated set. Then GMFCQ holds at all feasible matrices of (6.3), see Appendix B. In particular, AM-regularity is valid at all feasible points of (6.3). Projections onto D can be calculated via Proposition 5.7: Let W ∈ W denote an arbitrary symmetric matrix with maximum eigenvalue λ and corresponding (normalized) eigenvector v (note that λ and v are not necessarily unique); then max(λ, 0) v v^⊤ is a projection of W onto D. In particular, the computation of this projection does not require a full spectral decomposition. However, it is not clear whether a projection onto the feasible set of (6.3) can be computed efficiently. Consequently, we penalize the linear constraint diag W = e by the augmented Lagrangian approach.
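A direct implementation of this projection might look as follows. For simplicity, the sketch computes a full symmetric eigendecomposition, although, as remarked above, only the maximal eigenpair is actually needed (e.g., via a Lanczos-type routine such as scipy.sparse.linalg.eigsh with k = 1 for large sparse matrices):

```python
import numpy as np

def project_rank1_psd(W):
    """One projection of a symmetric matrix onto
    D = {W psd, rank W <= 1}: if (lam, v) is a maximal eigenpair of W,
    then max(lam, 0) * v v^T is a projection (not necessarily unique
    when the largest eigenvalue is multiple)."""
    lam, V = np.linalg.eigh(W)   # eigenvalues in ascending order
    l, v = lam[-1], V[:, -1]     # maximal eigenpair
    return max(l, 0.0) * np.outer(v, v)
```

Note that a negative semidefinite input is projected onto the zero matrix, since max(λ, 0) = 0 in that case.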
Throughout this section, we take the zero matrix as the starting point. In order to illustrate the performance of our method, we begin with the simple graph from Figure 6.2. Algorithm 4.1 applied to this example using the reformulation (6.3) (more precisely, the corresponding minimization problem) together with the previous specifications yields the iterations shown in Table 6.3. The meaning of the columns is the same as for Table 6.1. Note that the penalty parameter stays constant for this example. The feasibility measure tends to zero, and we terminate successfully at iteration k = 6 since this measure becomes less than 10^{−4}. The associated function value is (approximately) 12, which corresponds to the maximum cut S := {1, 3} for the graph from Figure 6.2, i.e., our method is able to solve the MAXCUT problem for this particular instance. We next apply our method to two test problem collections that can be downloaded from http://biqmac.aau.at/biqmaclib.html, namely the rudy and the ising collection. The first class consists of 130 instances, whereas the second one comprises 48 problems. The optimal function value f_opt of all these examples is known. The details of the corresponding results obtained by our method are given in Appendix C. Let us summarize the main observations.
All 130 + 48 test problems were solved successfully by our method in the sense that the standard termination criterion was satisfied after finitely many iterations, i.e., we stop with an iterate W_k which is feasible (within the given tolerance). Hence, the corresponding function value f_ALM is a lower bound for the optimal value f_opt. For the sake of completeness, we also solved the (convex) relaxed problem (6.4), using again our augmented Lagrangian method with D := {W ∈ W | W ⪰ 0}. The corresponding function value is denoted by f_SDP. Since the feasible set of (6.4) is larger than the one of (6.3), we have the inequalities f_ALM ≤ f_opt ≤ f_SDP. The corresponding details for the solution of the SDP relaxation are provided in Appendix C for the rudy collection.
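For the relaxation (6.4), the required projection onto D = {W ⪰ 0} is the classical projection onto the cone of symmetric positive semidefinite matrices, obtained by clipping the negative eigenvalues at zero. A minimal sketch:

```python
import numpy as np

def project_psd(W):
    """Euclidean projection of a symmetric matrix onto the PSD cone:
    keep the eigenvectors, replace negative eigenvalues by zero."""
    lam, V = np.linalg.eigh(W)                  # W assumed symmetric
    return (V * np.maximum(lam, 0.0)) @ V.T     # V diag(lam_+) V^T
```

In contrast to the rank-one projection used for (6.3), this projection genuinely needs the full spectral decomposition, since every nonnegative eigenpair is retained.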
The bar charts from Figures 6.3 and 6.4 summarize the results for the rudy and ising collections, respectively, in a very condensed way. They basically show that the function value f_ALM obtained by our method is very close to the optimal value f_opt. More precisely, the interpretation is as follows: For each test problem, we take the quotient f_ALM/f_opt ∈ [0, 1]. If this quotient is equal to, say, 0.91, we count this example as one where we reach 91% of the optimal function value. For two examples (pm1d_80.9 and pw01_100.8), we actually attain the exact global maximum. Figure 6.4 has a similar meaning for the ising collection: Though there is no example which is solved exactly, almost one half of the problems reaches an accuracy of at least 99%, and even in the worst case, we obtain a precision of 94%. Altogether, this shows that we obtain a very good lower bound for the optimal function value. Moreover, since we are always feasible (in particular, all iterates are matrices of rank one), the final matrix can be used to create a cut through the given graph, i.e., the method provides a constructive way to generate cuts which seem to be close to the optimal ones. Note that this is in contrast to the semidefinite relaxation (6.4), which gives an upper bound, but the solution associated with this upper bound is usually not feasible for the MAXCUT problem since the rank constraint is violated (the results in Appendix C show that the solutions of the relaxed programs for the rudy collection are matrices of rank between 4 and 7). In particular, these matrices can, in general, not be used to compute a cut for the graph and are therefore less constructive than the outputs of our method. Moreover, it is interesting to observe that f_ALM is usually much closer to f_opt than f_SDP. In any case, both techniques together might be useful tools in a branch-and-bound-type method for solving MAXCUT problems.
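The constructive step just described, recovering a cut from a feasible rank-one iterate W = x x^⊤ with diag W = e, can be sketched as follows (the helper name is ours). Since W is rank one with unit diagonal, the sign pattern of any column reproduces x up to a global sign flip, which merely exchanges S and S^c and hence yields the same cut.

```python
import numpy as np

def cut_from_rank_one(W):
    """Recover a vertex set S from a feasible rank-one matrix
    W = x x^T with x in {-1, +1}^n: read off the sign pattern of the
    first column (any column works, up to a global sign)."""
    x = np.sign(W[:, 0])
    x[x == 0] = 1.0   # guard against exact numerical zeros
    return set(np.flatnonzero(x > 0).tolist())
```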

Concluding Remarks
In this paper, we demonstrated how M-stationary points of optimization problems with structured geometric constraints can be computed with the aid of an augmented Lagrangian method. The fundamental idea was to keep the complicated constraints out of the augmented Lagrangian function and to treat them directly in the associated subproblems, which are solved by means of a nonmonotone projected gradient method. This way, the handling of challenging variational structures is encapsulated in the efficient computation of projections, which also marks a natural limit of the method's applicability. In contrast to several other approaches from the literature, the convergence guarantees for our method, which are valid under a comparatively weak asymptotic constraint qualification, remain true if the arising subproblems are solved inexactly. Extensive numerical experiments demonstrated the qualities of this approach.
Despite our observations in Example 6.2, it might be interesting to think about extensions of these ideas to infinite-dimensional situations. In [20], an augmented Lagrangian method for the numerical solution of (P) in the context of Banach spaces has been considered, where the set D was assumed to be convex and the subproblems of the resulting algorithm are of the same type as in our paper. Furthermore, convergence of the method to KKT points was shown under validity of a problem-tailored version of asymptotic regularity. As soon as D becomes nonconvex, one has to face some uncomfortable properties of the appearing limiting normal cone, which turns out to be comparatively large since weak-∗ convergence is used for its definition as a set limit in the dual space, see [37,57]. That is why the associated M-stationarity conditions are, in general, too weak to yield a reasonable stationarity condition. However, this issue might be surmounted by investigating the smaller strong limiting normal cone, which is based on strong convergence in the dual space but possesses very limited calculus. It remains open whether reasonable asymptotic regularity conditions w.r.t. this variational object can be formulated. Furthermore, in order to exploit the smallness of the strong limiting normal cone in the resulting algorithm, one has to ensure (amongst others) that the (primal) sequence {w^j} possesses strong accumulation points while the (dual) measures of inexactness {ε_j} need to be strongly convergent as well. This might be restrictive. Finally, it has to be clarified how the subproblems can be solved to approximate strong M-stationarity.
Consequently, we obtain from (A.1) that Together with a Taylor expansion, we therefore get for all l sufficiently large, i.e., the inner loop terminates.
In the second case, (A.2) is not satisfied, i.e., γ_{j,i} ‖w^{j,i} − w^j‖ → 0. By continuity of ∇ϕ, this yields (3.3). Together with w^{j,i} → w^j and by using the continuity of ∇ϕ as well as (2.1), we can pass to the limit i → ∞ in (3.2) and obtain that w^j is M-stationary.
Since γ j ≥ γ min > 0 for all j ∈ N, we get where, for simplicity, we set d j := w j+1 − w j for all j ∈ N. Using (A.6) and (A.7), we then obtain where the last equality takes into account the uniform continuity of ϕ. We will now prove, by induction, that We already know from (A.7) and (A.8) that (A.9) holds for r = 1. Suppose that (A.9) holds for some r ≥ 1. We need to show that it holds for r + 1. Using (A.5) with j replaced by l(j) − r − 1, we have ϕ(w l(j)−r ) ≤ ϕ(w l(l(j)−r−1) ) − γ l(j)−r−1 σ 2 d l(j)−r−1 2 (here we assume implicitly that j is large enough such that no negative indices l(j) − r − 1 occur). Rearranging this expression and using γ j ≥ γ min for all j yields Taking the limit j → ∞ while using (A.6) as well as the induction hypothesis, it follows that lim In the final step of our proof, we now show that lim j→∞ d j = 0. Suppose that this is not true. Then there is a (suitably shifted, for notational simplicity) subsequence {d j−m−1 } K and a constant ρ > 0 such that Now, for each j ∈ K, the corresponding index l(j) is one of the indices j − m, j − m + 1, . . . , j. Hence, we can write j−m−1 = l(j)−r j for some index r j ∈ {1, 2, . . . , m+1}.
Since there are only finitely many possible indices r_j, we may assume without loss of generality that r_j = r holds for some fixed index r. Then (A.9) implies d^{j−m−1} = d^{l(j)−r} → 0 for j → ∞ on this subsequence. This contradicts (A.11) and therefore completes the proof.
Due to Proposition 3.2, we also have w^{j+1} →_K w̄. Hence, taking the limit j →_K ∞ and exploiting once again the upper semicontinuity of the limiting normal cone, we obtain 0 ∈ ∇ϕ(w̄) + N_D^lim(w̄), i.e., w̄ is an M-stationary point of (3.1).

B Constraint Regularity of the MAXCUT Problem
We show that feasible points of (6.3) with D := {W ∈ R^{n×n}_sym | W ⪰ 0, rank W ≤ 1} generally satisfy GMFCQ. Note that we use G : R^{n×n}_sym → R^n given by G(W) := diag W, W ∈ R^{n×n}_sym, and C := {e} here in order to model the feasible set of (6.3) in the form given in (P).
Let us fix a feasible matrix W̄ ∈ R^{n×n}_sym of (6.3). Then we find a vector u ∈ {±n^{−1/2}}^n such that W̄ = n u u^⊤, i.e., W̄ possesses the nonzero eigenvalue n and the associated eigenvector u.
First, we will show that

    Y ∈ N_D^lim(W̄)  ⟹  Y u = 0.    (B.1)

For Y ∈ N_D^lim(W̄), we find sequences {W_k}, {Y_k} ⊂ R^{n×n}_sym and {α_k} ⊂ [0, ∞) such that W_k → W̄, Y_k → Y, and Y_k ∈ α_k (W_k − Π_D(W_k)) for all k ∈ N. For each k ∈ N, let W_k = Σ_{i=1}^n μ_i^k u_i^k (u_i^k)^⊤ be an (orthonormal) eigenvalue decomposition with non-increasingly ordered eigenvalues μ_1^k ≥ μ_2^k ≥ … ≥ μ_n^k and associated pairwise orthonormal eigenvectors u_1^k, …, u_n^k. Due to W_k → W̄, we find μ_1^k → n and μ_2^k, …, μ_n^k → 0 as well as (along a subsequence without relabeling) u_1^k → ±u. For sufficiently large k ∈ N, this implies Π_D(W_k) = {μ_1^k u_1^k (u_1^k)^⊤}, see [50, Proposition 3.4] and Proposition 5.7. Hence, for any such k ∈ N, we find Y_k = α_k Σ_{i=2}^n μ_i^k u_i^k (u_i^k)^⊤. Particularly, this gives Y_k u_1^k = 0 for large enough k ∈ N, so that Y u = 0 follows by taking the limit k → ∞.
Second, suppose that there are a vector λ ∈ N_C(G(W̄)) = R^n and a matrix Y ∈ N_D^lim(W̄) such that Diag λ + Y = 0, where Diag λ denotes the diagonal matrix with entries λ_1, …, λ_n. In order to prove validity of GMFCQ, λ = 0 has to be shown. From (B.1), we find λ • u = −Y u = 0, where • represents the entrywise product. Observing that the components of u are all different from zero, λ = 0 follows. Recall that the SDP relaxation (6.4) was also solved by Algorithm 4.1 with the semidefiniteness constraint taken as the complicated set, whereas the remaining linear equality constraint was penalized by the augmented Lagrangian approach.