Inexact penalty decomposition methods for optimization problems with geometric constraints

This paper provides a theoretical and numerical investigation of a penalty decomposition scheme for the solution of optimization problems with geometric constraints. In particular, we consider some situations where parts of the constraints are nonconvex and complicated, like cardinality constraints, disjunctive programs, or matrix problems involving rank constraints. By a variable duplication and decomposition strategy, the method presented here explicitly handles these difficult constraints, thus generating iterates which are feasible with respect to them, while the remaining (standard and supposedly simple) constraints are tackled by sequential penalization. Inexact optimization steps are proven sufficient for the resulting algorithm to work, so that it is employable even with difficult objective functions. The current work is therefore a significant generalization of existing papers on penalty decomposition methods. On the other hand, it is related to some recent publications which use an augmented Lagrangian idea to solve optimization problems with geometric constraints. Compared to these methods, the decomposition idea is shown to be numerically superior since it allows much more freedom in the choice of the subproblem solver, and since the number of certain (possibly expensive) projection steps is significantly smaller. Extensive numerical results on several highly complicated classes of optimization problems in vector and matrix spaces indicate that the current method is indeed very efficient at solving these problems.


Introduction
We consider the program

min_x f(x)  s.t.  G(x) ∈ C,  x ∈ D,    (1.1)

where f : X → R and G : X → Y are continuously differentiable mappings, X and Y are Euclidean spaces, i.e., real and finite-dimensional Hilbert spaces, C ⊆ Y is nonempty, closed, and convex, whereas D ⊆ X is only assumed to be nonempty and closed (not necessarily convex), representing a possibly complicated set. This very general setting (analyzed, for example, in [25]) covers standard nonlinear programming problems with convex constraints, but also difficult disjunctive programming problems [8,9,18,36], e.g., complementarity [42], vanishing [1], switching [37] and cardinality constrained [30,31] problems. Matrix optimization problems such as low-rank approximation [28,34] are also captured by our setting.
Problems with this structure, where the feasible set consists of the intersection of a set of constraints expressed in analytical form and another complicated set, without regularity guarantees but manageable for example by easy projections, have been deeply studied in recent years. In particular, approaches based on decomposition and sequential penalty or augmented Lagrangian methods have been proposed for the convex case [19], the cardinality constrained case [31,33] and the low-rank approximation case [43]. The recurrent idea in all these works consists of the application of the variable splitting technique [21,26]: one defines a penalty function associated with the differentiable constraints and the additional equality constraint linking the two blocks of variables, and then solves the problem by a sequential penalty method. The optimization of the penalty function is carried out by a two-block alternating minimization scheme [20], which can be run in an exact [33,43] or inexact [19,31] fashion.
The aim of this work is to extend the inexact Penalty Decomposition approach to the general setting (1.1) in such a way that it can deal with arbitrary abstract constraints D (at least theoretically; in practice, D needs to be such that projections onto this set are easy to compute) and that it allows additional (seemingly simple) constraints given by G(x) ∈ C. This setting is related to some recent work on (safeguarded) augmented Lagrangian methods, see, in particular, [25], where the resulting subproblems are solved by a projected gradient-type method, which might be inefficient especially for ill-conditioned problems. The decomposition idea used here allows a much wider choice of subproblem solvers, usually resulting in a more efficient solution of the given optimization problem (1.1).
The paper is organized as follows: Section 2 summarizes some preliminary concepts and results. In particular, we recall the definitions of an M-stationary point (the counterpart of a KKT point for the general setting from (1.1)), of an AM-stationary point (as a sequential version of M-stationarity) and of an AM-regular point (this being a suitable and relatively weak constraint qualification). Section 3 then presents the Penalty Decomposition method together with a global convergence theory, assuming that the resulting subproblems can be solved inexactly up to a certain degree. In Section 4, we then present a class of inexact alternating minimization methods which, under certain assumptions, are guaranteed to find the desired approximate solution of the subproblems arising in the outer penalty scheme. The remaining part of the paper is then devoted to the implementation of the overall method and corresponding numerical results. To this end, Section 5 first discusses several instances of the general setting (1.1) with difficult constraints D where our method can be applied quite efficiently since projections onto D are simple and/or known analytically (though the latter does not necessarily imply that these projections are easy to compute numerically). In Section 6, we then present the results of an extensive numerical testing, where we also compare our method, using different realizations, with the augmented Lagrangian method from [25]. We conclude with some final remarks in Section 7.

Preliminaries
Given the nonempty, closed, and convex set C ⊆ Y, every y ∈ Y possesses a unique projection P_C(y) onto C, and we write dist_C(y) := ||y − P_C(y)|| for the corresponding distance. Note that the distance function is nonsmooth (in general), but the squared distance function s_C(y) := (1/2) dist²_C(y) is continuously differentiable everywhere with derivative given by

∇s_C(y) = y − P_C(y),    (2.1)

see [5, Cor. 12.30]. Moreover, projections onto the nonempty and closed set D also exist, but are not necessarily unique. Therefore, we define the (usually set-valued) projection operator Π_D : X ⇒ X by

Π_D(x) := {d ∈ D : ||x − d|| = dist_D(x)}.

The corresponding distance function dist_D(·) is, of course, single-valued again. Furthermore, given a set-valued mapping S : X ⇒ X on an arbitrary Euclidean space X, we define the outer limit of S at a point x̄ by

lim sup_{x→x̄} S(x) := {y ∈ X : ∃ {x^k} → x̄, ∃ {y^k} → y with y^k ∈ S(x^k) for all k}.

The limiting normal cone to D at a point x ∈ D can then be written as

N^lim_D(x) = lim sup_{x'→x} cone(x' − Π_D(x')),

and it enjoys the robustness property

lim sup_{x'→x} N^lim_D(x') = N^lim_D(x)    (2.2)

that will be exploited heavily in our subsequent analysis, see [38, Prop. 1.3]. Note that, for D being convex, this limiting normal cone reduces to the standard normal cone from convex analysis, i.e., we have

N^lim_D(x) = N_D(x) := {v ∈ X : ⟨v, d − x⟩ ≤ 0 for all d ∈ D}

for any given x ∈ D. For points x ∉ D, we set N^lim_D(x) := N_D(x) := ∅. For the convex set C, the standard normal cone and the projection operator are related by

λ ∈ N_C(y)  ⟺  y = P_C(y + λ),    (2.3)

see [5, Prop. 6.46]. We next introduce a stationarity condition which generalizes the concept of a KKT point to constrained optimization problems with possibly difficult constraints as given by the set D in our setting (1.1).

Definition 2.1. A feasible point x̄ ∈ X of the optimization problem (1.1) is called an M-stationary point (Mordukhovich-stationary point) of (1.1) if there exists a multiplier λ ∈ Y such that

0 ∈ ∇f(x̄) + G'(x̄)* λ + N^lim_D(x̄)  and  λ ∈ N_C(G(x̄)).

Note that this definition coincides with the one of a KKT point if D is convex. The following is a sequential version of M-stationarity.
Definition 2.2. A feasible point x̄ ∈ X of the optimization problem (1.1) is called an AM-stationary point (asymptotically M-stationary point) of (1.1) if there exist sequences {x^k} ⊆ D, {ε^k} ⊆ X, {λ^k} ⊆ Y, and {z^k} ⊆ Y with x^k → x̄, ε^k → 0, and z^k → 0 such that

ε^k ∈ ∇f(x^k) + G'(x^k)* λ^k + N^lim_D(x^k)  and  λ^k ∈ N_C(G(x^k) − z^k)

for all k. Note that the previous definition generalizes the related concept of AKKT points introduced for standard nonlinear programs in [2] to our setting with the more difficult constraints. In a similar way, the subsequent regularity conditions are also motivated by related ones from [3], where they were presented for standard nonlinear programs.
Every M-stationary point is obviously also AM-stationary, whereas the opposite implication will be guaranteed to hold by a regularity condition that will now be introduced. To this end, let us write

M(x, z) := G'(x)* N_C(G(x) − z) + N^lim_D(x)

for x ∈ X and z ∈ Y. Recall that N^lim_D(x) is nonempty if and only if x ∈ D, which is therefore an implicit requirement for the set M(x, z) to be nonempty. Moreover, we consider the set

lim sup_{x→x̄, z→0} M(x, z) := {v ∈ X : ∃ x^k → x̄, z^k → 0, v^k → v with v^k ∈ M(x^k, z^k) for all k}.

Note that the auxiliary sequence {z^k} needs to be introduced since the elements G(x^k) do not necessarily belong to C, whereas x^k is supposed to be an element of D.

Definition 2.3. Let x̄ be feasible for (1.1). Then x̄ is called AM-regular for (1.1) if lim sup_{x→x̄, z→0} M(x, z) ⊆ M(x̄, 0).
Using this terminology, the following statements hold, cf. [35] for further details.

Theorem 2.4. Let x̄ ∈ X be feasible for the optimization problem (1.1). Then:
(a) If x̄ is a local minimizer of (1.1), then x̄ is an AM-stationary point.
(b) If x̄ is an AM-stationary point satisfying AM-regularity, then x̄ is an M-stationary point.
(c) Conversely, if for every continuously differentiable function f, the implication

x̄ is an AM-stationary point  =⇒  x̄ is an M-stationary point

holds for the corresponding optimization problem (1.1), then x̄ is AM-regular.
Statement (a) shows that every local minimum of (1.1) is an AM-stationary point even in the absence of any constraint qualification (CQ for short). Hence AM-stationarity is a (sequential) first-order optimality condition. In order to guarantee that an AM-stationary point is already an M-stationary point (hence a KKT point in the standard setting of a nonlinear program, say), we require a CQ, namely the AM-regularity condition, cf. Theorem 2.4 (b). The final statement (c) of that result shows that, in a certain sense, AM-regularity is the weakest CQ which implies AM-stationary points to be M-stationary. In fact, this AM-regularity condition turns out to be a fairly weak condition. For example, for standard nonlinear programs, AM-regularity is stronger than the Abadie CQ, but weaker than most of the other well-known CQs like MFCQ (Mangasarian-Fromovitz CQ), CRCQ (constant rank CQ), CPLD (constant positive linear dependence), and RCPLD (relaxed CPLD), to mention at least some of the more prominent ones. We refer the interested reader to [4,35] and references therein for further details.
The algorithm, to be described in the following section, is based on the reformulation

min_{x,y} f(x)  s.t.  G(x) ∈ C,  x − y = 0,  y ∈ D    (2.4)

of the given optimization problem (1.1). The previous notions of M- and AM-stationarity and AM-regularity can be directly translated to this program by observing that (2.4) can be written in the format of (1.1) as

min_{(x,y)} f̃(x, y)  s.t.  G̃(x, y) ∈ C̃,  (x, y) ∈ D̃

with f̃(x, y) := f(x), G̃(x, y) := (G(x), x − y), C̃ := C × {0}, and D̃ := X × D. The counterpart of Theorem 2.4 then also holds for the corresponding (A)M-stationarity and regularity concepts defined for the formulation (2.4). Note that regularity conditions and constraint qualifications depend on the explicit formulation of a constraint system, hence the corresponding concepts are not necessarily equivalent for the two formulations of our program. We stress, however, that an easy inspection shows that the important notion of an M-stationary point for (2.4) is equivalent to the notion of an M-stationary point for (1.1).
For the sake of completeness, and since this condition will be used explicitly in our convergence analysis, let us write down explicitly the resulting AM-stationarity condition for the reformulated program (2.4): with the above identifications, a feasible point x̃* = (x*, y*) of (2.4) is AM-stationary if there exist sequences {x̃^k}, {ε̃^k}, {λ̃^k}, and {z̃^k} such that x̃^k → x̃*, z̃^k → 0, ε̃^k → 0 as well as

ε̃^k ∈ ∇f̃(x̃^k) + G̃'(x̃^k)* λ̃^k + N^lim_{D̃}(x̃^k)  and  λ̃^k ∈ N_{C̃}(G̃(x̃^k) − z̃^k)    (2.5)

for all k. Using the definitions of f̃, G̃ etc., exploiting standard properties of the limiting and standard normal cones (in particular, the Cartesian product rule, cf. [40, Prop. 6.41]), and writing x̃^k =: (x^k, y^k), λ̃^k =: (λ^k, µ^k) as well as z^k for the first block of z̃^k (the second block component of z̃^k turns out to be irrelevant), we see that the two conditions from (2.5) can be rewritten as

ε̃^k ∈ ( ∇f(x^k) + G'(x^k)* λ^k + µ^k ) × ( −µ^k + N^lim_D(y^k) )  and  λ^k ∈ N_C(G(x^k) − z^k).    (2.6)

Algorithm and Convergence
The algorithm to be presented here is based on the reformulation (2.4) of the given program (1.1). The idea is to take advantage of the fact that the constraints G(x) ∈ C and y ∈ D occur in a decomposed way. This formulation allows us to develop an alternating direction-type penalty scheme for the solution of the original problem (1.1). To this end, let τ > 0 be a penalty parameter and define the partial penalty function

q_τ(x, y) := f(x) + (τ/2) dist²_C(G(x)) + (τ/2) ||x − y||².    (3.1)

Note that q_τ does not include the potentially difficult constraint y ∈ D, which we therefore have to deal with explicitly. The general algorithmic scheme that we will investigate here is the following one.
Algorithm 3.1 (Penalty Decomposition scheme).
(S.0) Choose (x^0, y^0) ∈ X × D, τ_0 > 0, σ > 1, a sequence {δ_k} of positive tolerances, and set k := 0.
(S.1) If a suitable termination criterion holds: STOP.
(S.2) Compute an approximate solution (x^{k+1}, y^{k+1}) of the penalty subproblem min_{x∈X, y∈D} q_{τ_k}(x, y) satisfying

||∇_x q_{τ_k}(x^{k+1}, y^{k+1})|| ≤ δ_k    (3.2)

and

0 ∈ τ_k (y^{k+1} − x^{k+1}) + N^lim_D(y^{k+1}).    (3.3)

(S.3) Set τ_{k+1} := σ τ_k, k ← k + 1, and go to (S.1).
Note that Algorithm 3.1 is a very general scheme for the solution of the reformulated problem (2.4). The main computational burden is in (S.2). We will see how this step can be realized by an alternating minimization-type iteration in Section 4. Here we only note that the computation of the exact minimizer y^{k+1} = argmin_{y∈D} q_{τ_k}(x^{k+1}, y) can be carried out very easily if projections onto the set D can be computed efficiently. This follows immediately from the definition of q_{τ_k}, which implies that y^{k+1} is characterized by

y^{k+1} ∈ Π_D(x^{k+1}).    (3.4)

Some examples of complicated (nonconvex) sets D, where this projection is easy to compute, will be given in the numerical section.
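For illustration, the structure of q_τ and the exact y-update just described can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used later in the paper; the quadratic f, the identity G, the nonnegative orthant C and the sparsity set D (with the corresponding projections) are hypothetical placeholder choices:

```python
import numpy as np

def penalty_q(x, y, tau, f, G, proj_C):
    """Partial penalty function q_tau(x, y) = f(x) + tau/2 dist_C^2(G(x)) + tau/2 ||x - y||^2."""
    r = G(x) - proj_C(G(x))                        # residual of the C-constraint
    return f(x) + 0.5 * tau * (r @ r) + 0.5 * tau * ((x - y) @ (x - y))

def y_update(x, proj_D):
    """Exact y-step: y^{k+1} in argmin_{y in D} q_tau(x, y), i.e. y^{k+1} in Pi_D(x)."""
    return proj_D(x)

# hypothetical instance: f quadratic, G identity, C = nonneg orthant, D = {y : ||y||_0 <= s}
A = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
G = lambda x: x
proj_C = lambda z: np.maximum(z, 0.0)              # projection onto the nonnegative orthant

def proj_D(z, s=2):                                # keep the s largest entries in modulus
    out = np.zeros_like(z)
    idx = np.argsort(-np.abs(z))[:s]
    out[idx] = z[idx]
    return out

x = np.array([0.3, -1.2, 0.1])
y_old = np.zeros(3)
tau = 10.0
y_new = y_update(x, proj_D)
# the exact y-step is a global minimizer of q_tau(x, .) over D, so it cannot increase q_tau
assert penalty_q(x, y_new, tau, f, G, proj_C) <= penalty_q(x, y_old, tau, f, G, proj_C)
```

The assertion reflects exactly the characterization (3.4): since the y-block enters q_τ only through the term (τ/2)||x − y||², its exact minimization over D reduces to a projection step.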
The remaining part of this section is devoted to the global convergence properties of the general scheme from Algorithm 3.1. The proof technique follows the pattern of the one used in [25] for an augmented Lagrangian method.
We begin with a feasibility-type result. To this end, recall that all penalty-type methods suffer from the fact that accumulation points may not be feasible for the given optimization problem. The following result shows that such an accumulation point still has a very nice property.

Proposition 3.2. Let {(x^k, y^k)} be a sequence generated by Algorithm 3.1 with {δ_k} being bounded and {τ_k} → ∞. Then every accumulation point (x̄, ȳ) of the sequence {(x^k, y^k)} is an M-stationary point of the feasibility problem

min_{x,y} (1/2) dist²_C(G(x)) + (1/2) ||x − y||²  s.t.  y ∈ D.    (3.5)

Proof. Let {(x^{k+1}, y^{k+1})}_K be a subsequence converging to (x̄, ȳ). Using the derivative formula of the distance function from (2.1) together with the chain rule, we have, by construction,

||∇f(x^{k+1}) + τ_k G'(x^{k+1})* (G(x^{k+1}) − P_C(G(x^{k+1}))) + τ_k (x^{k+1} − y^{k+1})|| ≤ δ_k  and  0 ∈ τ_k (y^{k+1} − x^{k+1}) + N^lim_D(y^{k+1})

for all k ∈ N. Dividing the first estimate by τ_k and exploiting the cone property in the second inclusion yields

||(1/τ_k) ∇f(x^{k+1}) + G'(x^{k+1})* (G(x^{k+1}) − P_C(G(x^{k+1}))) + x^{k+1} − y^{k+1}|| ≤ δ_k/τ_k  and  0 ∈ y^{k+1} − x^{k+1} + N^lim_D(y^{k+1})

for all k ∈ N, cf. [40, Thm. 6.12]. Taking the limit k →_K ∞, using the continuity of ∇f, G, G', P_C, and the robustness property (2.2) of the limiting normal cone, we obtain

G'(x̄)* (G(x̄) − P_C(G(x̄))) + x̄ − ȳ = 0  and  0 ∈ ȳ − x̄ + N^lim_D(ȳ).

This shows that (x̄, ȳ) is an M-stationary point of (3.5).
Recall that Algorithm 3.1 automatically generates iterates y^k which belong to the set D. The objective function in (3.5) therefore only measures the violation of the constraints G(x) ∈ C and x − y = 0, which is included in the penalty term of q_τ, i.e., (3.5) is a feasibility problem of the decomposed problem (2.4). If x̄ = ȳ, then x̄ turns out to be an M-stationary point of

min_x (1/2) dist²_C(G(x))  s.t.  x ∈ D,

which is the feasibility problem of the original problem (1.1). Though Proposition 3.2 obviously does not guarantee that an accumulation point is feasible (either for the original or the decomposed formulation), it guarantees at least a stationarity property, which is the best one can expect in general. Moreover, if the feasible set of (1.1) is nonempty and the function (1/2) dist²_C(G(x)) of (3.5) is convex, then every M-stationary point is a global minimum and, hence, a feasible point of (1.1) or (2.4). Note that this squared distance function is automatically convex if the constraint G(x) ∈ C satisfies standard conditions which imply that the corresponding set is convex by itself.
Moreover, one can also define an extended Robinson-type constraint qualification (which boils down to the extended MFCQ condition for standard nonlinear programs) which automatically implies that accumulation points of a sequence generated by Algorithm 3.1 are feasible, cf. [12,27] for further details.
Hence, under reasonable assumptions, we can guarantee that accumulation points are automatically feasible for (1.1) or (2.4), whereas, in general, they are at least M-stationary points of the corresponding feasibility problem. The following global convergence result therefore assumes that we have a feasible accumulation point and shows that this point is automatically AM-stationary for problem (2.4).

Theorem 3.3. Let {(x^k, y^k)} be a sequence generated by Algorithm 3.1 with {δ_k} → 0 and {τ_k} → ∞, and let (x̄, ȳ) be a feasible accumulation point of this sequence. Then (x̄, ȳ) is an AM-stationary point of the optimization problem (2.4).
Proof. Let {(x^{k+1}, y^{k+1})}_K be a subsequence converging to (x̄, ȳ). Recall that (x̄, ȳ) is feasible, hence G(x̄) ∈ C and x̄ = ȳ ∈ D. We further define the sequences

λ^{k+1} := τ_k (G(x^{k+1}) − P_C(G(x^{k+1}))),  µ^{k+1} := τ_k (x^{k+1} − y^{k+1}),
z^{k+1} := G(x^{k+1}) − P_C(G(x^{k+1})),  ε^{k+1} := ∇_x q_{τ_k}(x^{k+1}, y^{k+1}).

Then, setting

x̃^{k+1} := (x^{k+1}, y^{k+1}),  ε̃^{k+1} := (ε^{k+1}, 0),  λ̃^{k+1} := (λ^{k+1}, µ^{k+1}),  z̃^{k+1} := (z^{k+1}, 0),

we claim that the corresponding four (sub-)sequences satisfy the properties of an AM-stationary point for problem (2.4) as stated at the end of Section 2, cf. (2.5) and (2.6). First of all, (x̄, ȳ) is feasible and {(x^{k+1}, y^{k+1})}_K → (x̄, ȳ) by assumption. Furthermore, by definition of ε^{k+1} and the construction of Algorithm 3.1, we also have

||ε^{k+1}|| = ||∇_x q_{τ_k}(x^{k+1}, y^{k+1})|| ≤ δ_k → 0.

This obviously implies ε̃^{k+1} → 0. Furthermore, the definitions of λ^{k+1} and µ^{k+1} together with 0 ∈ τ_k (y^{k+1} − x^{k+1}) + N^lim_D(y^{k+1}) yield

ε^{k+1} = ∇f(x^{k+1}) + G'(x^{k+1})* λ^{k+1} + µ^{k+1}  and  0 ∈ −µ^{k+1} + N^lim_D(y^{k+1}),

hence the first inclusion in (2.6) holds. To verify the second inclusion, we only have to take a closer look at the first block. By definition of λ^{k+1} and the relation (2.3) between the projection and the normal cone, we get

λ^{k+1} ∈ N_C(P_C(G(x^{k+1}))) = N_C(G(x^{k+1}) − z^{k+1}),

where the last identity comes from the definition of z^{k+1}. Finally, we also have z̃^{k+1} →_K 0 since z^{k+1} satisfies

z^{k+1} = G(x^{k+1}) − P_C(G(x^{k+1})) →_K G(x̄) − P_C(G(x̄)) = 0

by the continuity of G and the projection operator P_C as well as the feasibility of x̄. Altogether, this shows that (x̄, ȳ) is an AM-stationary point of the program (2.4).

Using Theorem 3.3 together with the counterpart of Theorem 2.4 for (2.4) and the fact that the M-stationarity conditions for the two problems (1.1) and (2.4) are equivalent, we directly obtain the following result.

Theorem 3.4. Let {(x^k, y^k)} be a sequence generated by Algorithm 3.1 with {δ_k} → 0 and {τ_k} → ∞, and let (x̄, ȳ) be a feasible accumulation point of this sequence satisfying AM-regularity for (2.4). Then (x̄, ȳ) is an M-stationary point of the optimization problem (2.4), and x̄ itself is an M-stationary point of the original problem (1.1).

Solution of Subproblems by Inexact Alternating Minimization
The Penalty Decomposition approach consists of approximately solving the sequence of penalty subproblems arising at step (S.2) by a two-block decomposition method.
The alternating minimization loop can be stopped, at each iteration, as soon as an approximate stationary point of the penalty function w.r.t. the first block of variables x is attained. The instructions of the (inexact) Alternating Minimization loop at a fixed iteration k of the Penalty Decomposition method are detailed in Algorithm 4.1.
As already pointed out, if we assume that projections onto the set D are easily computable, the update of the second block of variables can be carried out exactly by (3.4).
On the other hand, an exact x-update step may be prohibitive in most applications. For this reason, the x-variable is only updated by a single step along a descent direction, with a step size selected by an Armijo-type line search.
Note that the direction d^ℓ, obtained in Algorithm 4.1 as the solution of the linear system

H^ℓ d^ℓ = −∇_x q_{τ_k}(u^ℓ, v^ℓ),    (4.2)

is a descent direction for q_{τ_k}(·, v^ℓ) at u^ℓ whenever the linear map H^ℓ is positive definite. The Armijo line search provides a sufficient decrease granting, under suitable assumptions on the sequence of maps {H^ℓ}, the convergence of the entire alternating minimization scheme.
Note that by properly choosing H^ℓ we can retrieve the descent directions employed in the most widely used nonlinear optimization solvers. This point, which we will emphasize again later on, is particularly relevant from the computational point of view.
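As a concrete illustration of such an x-update, the following sketch performs one descent step with an Armijo-type backtracking line search; it assumes the simplest choice H^ℓ = I (i.e., a plain gradient step), and the quadratic penalty model used here is a hypothetical stand-in for q_{τ_k}:

```python
import numpy as np

def armijo_x_update(u, v, grad_q, q, beta=0.5, gamma=1e-4, max_backtracks=50):
    """One inexact x-step: descent direction d solving H d = -grad with H = I,
    step size alpha = beta^j chosen by an Armijo-type sufficient decrease test."""
    g = grad_q(u, v)
    d = -g                                  # gradient direction (H = identity)
    slope = g @ d                           # negative unless u is stationary
    alpha = 1.0
    for _ in range(max_backtracks):
        if q(u + alpha * d, v) <= q(u, v) + gamma * alpha * slope:
            break
        alpha *= beta                       # backtrack
    return u + alpha * d

# hypothetical smooth penalty model: q(u, v) = 0.5 ||A u - b||^2 + tau/2 ||u - v||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
tau = 1.0
q = lambda u, v: 0.5 * np.sum((A @ u - b) ** 2) + 0.5 * tau * np.sum((u - v) ** 2)
grad_q = lambda u, v: A.T @ (A @ u - b) + tau * (u - v)

u0, v0 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
u1 = armijo_x_update(u0, v0, grad_q, q)
assert q(u1, v0) < q(u0, v0)               # sufficient decrease achieved
```

Replacing the identity by (the inverse of) a Hessian approximation recovers Newton-type or quasi-Newton directions, which is exactly the flexibility emphasized above.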
Throughout this section, we make the following assumption.
Assumption 4.2. The function f is bounded below and has compact level sets, i.e., L_f(η) := {x ∈ X : f(x) ≤ η} is compact for every η ∈ R.

We then begin by proving that, under Assumption 4.2, the penalty function has bounded level sets for any positive value of the penalty parameter τ.

Lemma 4.3. Let Assumption 4.2 hold, and let τ > 0 and η ∈ R. Then the level set L_{q_τ}(η) := {(x, y) ∈ X × X : q_τ(x, y) ≤ η} is bounded.

Proof. Assume by contradiction that L_{q_τ}(η) is not bounded, i.e., there exists {(x_t, y_t)} such that (x_t, y_t) ∈ L_{q_τ}(η) for all t and ||(x_t, y_t)|| → ∞. Then, either ||x_t|| → ∞ (at least on a subsequence), or {x_t} is bounded and ||y_t|| → ∞. In the former case, since f has compact level sets, we get f(x_t) → ∞ along this subsequence; as the penalty terms are nonnegative, this implies q_τ(x_t, y_t) → ∞, contradicting q_τ(x_t, y_t) ≤ η. Thus, ||y_t|| → ∞ while {x_t} stays bounded. However, for t sufficiently large, ||x_t − y_t||² → ∞, dist²_C(G(x_t)) ≥ 0 and f is bounded below, so that again q_τ(x_t, y_t) → ∞. This again is a contradiction, which completes the proof.
It can be easily seen that step (S.2) of Algorithm 4.1 is well-defined, i.e., there exists a finite integer j such that β^j satisfies the acceptability condition (4.1). Moreover, the following result can be readily obtained by standard results in nonlinear optimization [10].

Lemma 4.4. Let {(u^ℓ, v^ℓ)}_{ℓ∈T}, T ⊆ N, be a subsequence of the iterates generated by Algorithm 4.1 such that (u^ℓ, v^ℓ) →_T (ū, v̄), and let {d^ℓ} be a sequence of directions such that ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ < 0, and assume that ||d^ℓ|| ≤ M for some M > 0 and for all ℓ ∈ T. If, for any fixed (outer iteration) k, the following equation holds:

lim_{ℓ∈T, ℓ→∞} ( q_{τ_k}(u^ℓ, v^ℓ) − q_{τ_k}(u^ℓ + α^ℓ d^ℓ, v^ℓ) ) = 0,

then lim_{ℓ∈T, ℓ→∞} ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ = 0.

Proof. Since, for any ℓ, α^ℓ is chosen according to (4.1), we have

q_{τ_k}(u^ℓ, v^ℓ) − q_{τ_k}(u^ℓ + α^ℓ d^ℓ, v^ℓ) ≥ −γ α^ℓ ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ ≥ 0,

where the last inequality comes from the fact that γ > 0, α^ℓ ≥ 0 and ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ < 0 by assumption. From the hypotheses, we also have that the leftmost term goes to 0; hence, taking the limits for ℓ ∈ T, ℓ → ∞, we obtain

lim_{ℓ∈T, ℓ→∞} α^ℓ ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ = 0.    (4.3)

Assume, by contradiction, that {⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩}_T does not converge to zero. Subsequencing if necessary, we may assume that lim_{ℓ∈T, ℓ→∞} ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ = −ν for some number ν > 0. On the other hand, we have (u^ℓ, v^ℓ) →_T (ū, v̄) by assumption, and {d^ℓ} is bounded, so we may also assume that {d^ℓ}_T → d̄ for some limit point d̄. Altogether, we then have

⟨∇_x q_{τ_k}(ū, v̄), d̄⟩ = −ν < 0.

Exploiting (4.3), we see that α^ℓ →_T 0 holds. Consequently, for all ℓ ∈ T sufficiently large, we have α^ℓ < β^0 = 1 and thus the step size α^ℓ/β violates the acceptability condition (4.1):

q_{τ_k}(u^ℓ + (α^ℓ/β) d^ℓ, v^ℓ) > q_{τ_k}(u^ℓ, v^ℓ) + γ (α^ℓ/β) ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩.

By the mean-value theorem, we can write

q_{τ_k}(u^ℓ + (α^ℓ/β) d^ℓ, v^ℓ) = q_{τ_k}(u^ℓ, v^ℓ) + (α^ℓ/β) ⟨∇_x q_{τ_k}(z^ℓ, v^ℓ), d^ℓ⟩

for some z^ℓ = u^ℓ + θ^ℓ (α^ℓ/β) d^ℓ, θ^ℓ ∈ (0, 1). Subtracting the last two relations and dividing by α^ℓ/β, we get

⟨∇_x q_{τ_k}(z^ℓ, v^ℓ), d^ℓ⟩ > γ ⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩.

On the other hand, lim_{ℓ∈T, ℓ→∞} z^ℓ = ū, since α^ℓ →_T 0 and d^ℓ →_T d̄. Taking the limits in the previous inequality, we finally get

−ν ≥ −γν,

which is absurd since γ ∈ (0, 1) and ν > 0.

In order to ensure that the sequence generated by the Alternating Minimization scheme properly converges, we need the sequence of directions {d^ℓ} to satisfy suitable properties. Here, in particular, we assume that the entire sequence of linear mappings {H^ℓ} satisfies the bounded eigenvalues condition [10, Sec. 1.2]: there exist constants c_2 ≥ c_1 > 0 such that, for all ℓ,

c_1 ||w||² ≤ ⟨w, H^ℓ w⟩ ≤ c_2 ||w||²  for all w ∈ X.    (4.4)

We are finally able to show that the inexact alternating minimization loop stops in a finite number of iterations, providing a point (x^{k+1}, y^{k+1}) satisfying conditions (3.2)-(3.3).
Proposition 4.5. Assume the sequence of linear maps {H^ℓ} in Algorithm 4.1 satisfies the bounded eigenvalues condition (4.4). Then the algorithm cannot cycle infinitely and determines in a finite number of iterations a point (x^{k+1}, y^{k+1}) such that conditions (3.2)-(3.3) hold.

Proof. Suppose, by contradiction, that, for some values of τ_k and δ_k, the sequence {(u^ℓ, v^ℓ)} is infinite. From the instructions of the algorithm, it is easy to see that we have

q_{τ_k}(u^{ℓ+1}, v^{ℓ+1}) ≤ q_{τ_k}(u^{ℓ+1}, v^ℓ) ≤ q_{τ_k}(u^ℓ, v^ℓ)  for all ℓ,    (4.5)

cf. (4.1) and the exactness of the v-update. Hence, for all ℓ ≥ 0, the point (u^ℓ, v^ℓ) belongs to the level set L_{q_{τ_k}}(q_{τ_k}(u^0, v^0)). Lemma 4.3 implies that this is a bounded set. Therefore, the sequence {(u^ℓ, v^ℓ)} admits cluster points. Let K ⊆ N be an infinite subset such that

(u^ℓ, v^ℓ) →_K (ū, v̄).

Recalling the continuity of the gradient, we have lim_{ℓ∈K} ∇_x q_{τ_k}(u^ℓ, v^ℓ) = ∇_x q_{τ_k}(ū, v̄). We now show that ∇_x q_{τ_k}(ū, v̄) = 0. Taking into account the instructions of the algorithm, we have

H^ℓ d^ℓ = −∇_x q_{τ_k}(u^ℓ, v^ℓ)  for all ℓ.

By (4.4), it can be easily seen that ||d^ℓ|| ≤ (1/c_1) ||∇_x q_{τ_k}(u^ℓ, v^ℓ)||; since {∇_x q_{τ_k}(u^ℓ, v^ℓ)}_K is convergent, we see that there exists a constant M > 0 such that ||d^ℓ|| ≤ M for all ℓ ∈ K. Since the entire sequence {q_{τ_k}(u^ℓ, v^ℓ)} is monotonically decreasing by (4.5), and the subsequence {q_{τ_k}(u^ℓ, v^ℓ)}_K converges to q_{τ_k}(ū, v̄), it follows that the whole sequence of function values converges to this limit; together with (4.5), this yields

lim_{ℓ∈K, ℓ→∞} ( q_{τ_k}(u^ℓ, v^ℓ) − q_{τ_k}(u^ℓ + α^ℓ d^ℓ, v^ℓ) ) = 0.

Thus, the hypotheses of Lemma 4.4 are satisfied. Moreover, from (4.2) and (4.4), we have

⟨∇_x q_{τ_k}(u^ℓ, v^ℓ), d^ℓ⟩ ≤ −(1/c_2) ||∇_x q_{τ_k}(u^ℓ, v^ℓ)||² < 0  whenever ∇_x q_{τ_k}(u^ℓ, v^ℓ) ≠ 0.

Using Lemma 4.4, we therefore obtain

lim_{ℓ∈K, ℓ→∞} ||∇_x q_{τ_k}(u^ℓ, v^ℓ)|| = 0,

which implies that, for ℓ ∈ K sufficiently large, we have ||∇_x q_{τ_k}(u^ℓ, v^ℓ)|| ≤ δ_k, i.e., the stopping criterion of step (S.1) is satisfied in a finite number of iterations, and this contradicts the fact that {(u^ℓ, v^ℓ)} is an infinite sequence. Condition (3.2) is then satisfied by the stopping criterion, whereas condition (3.3) follows by construction, since the v-update is an exact projection, cf. (3.4).
In order for the theoretical analysis to hold, we only need to ensure that H^ℓ satisfies condition (4.4). This assumption can be guaranteed a priori by different ways of defining H^ℓ. Among these valid choices, we can find classical setups leading back to iterations of standard algorithmic schemes such as the gradient method (H^ℓ = I), Newton's method (H^ℓ = ∇²_xx q_{τ_k}(u^ℓ, v^ℓ), provided f is uniformly convex), quasi-Newton methods and limited-memory BFGS-type methods.
This aspect is crucial in practice: we are allowed to employ the most efficient solvers for nonlinear optimization to carry out step (S.3) of the Alternating Minimization algorithm and thus speed up the computation of step (S.2) of Algorithm 3.1, which is the most burdensome one. As a comparison, the Augmented Lagrangian algorithm from [25] has to resort to a gradient-based method to solve the (constrained) sequential subproblems, possibly resulting in an inefficient method especially for ill-conditioned problems. Another difference is pointed out in the following comment.
Remark 4.6. The Augmented Lagrangian algorithm from [25] has to compute projections onto the set D within the computation of the stepsizes, i.e., it may require many projections for a single (inner) iteration. This is a notable difference to our Algorithm 4.1, which requires only a single projection after the computation of the new iterate u^{ℓ+1}. In fact, it would also be possible to apply several iterations of an unconstrained optimization solver to the subproblem of minimizing the penalty function q_{τ_k}(·, v^ℓ) before updating the v-component, i.e., before using a single projection step.

Particular Instances
The idea of this section, similar to [25], is to present some difficult optimization problems where projections onto the complicated set D can be carried out easily. This section does not contain any proofs since the corresponding results are known from the literature. However, since these particular instances will be used in our numerical section, they have to be discussed in some detail.

The case of Sparsity Constraints
A particular case of problem (1.1) is that of sparsity constrained optimization problems, i.e., optimization problems of the form

min_x f(x)  s.t.  G(x) ∈ C,  ||x||_0 ≤ s,

where s < n and ||x||_0 denotes the zero pseudo-norm of x, i.e., the number of nonzero components of x; here, D = {x ∈ R^n : ||x||_0 ≤ s}. The Penalty Decomposition approach was originally proposed in [33] for this class of problems, and an inexact version was then proposed in [31]. In fact, from the analysis in Section 3, we can deduce that the convergence results continue to hold for the inexact version of the algorithm even in the presence of additional constraints.
The Penalty Decomposition method is particularly appealing, from a computational perspective, for this class of problems since the Euclidean projection onto the sparse set D is easily obtainable in closed form, as outlined e.g. in [31,33]. Let us denote by G_s(x̄) the index set of the s largest components of x̄ in absolute value; for simplicity, we furthermore assume that cases of tie are handled unambiguously. Then, a projection of x̄ onto D is given by

ŷ_i := x̄_i if i ∈ G_s(x̄),  ŷ_i := 0 otherwise.

In other words, the projection can be simply computed by setting to zero the n − s smallest (in absolute value) components of x̄. Note that M-stationarity, as defined in Definition 2.1, coincides with Lu-Zhang stationarity [30], which is the property guaranteed to hold for cluster points obtained by the original Penalty Decomposition method [33]. Hence, we can conclude, from the results shown in Section 3, that the inexact Penalty Decomposition method has the same convergence properties as its exact counterpart, and that the M-stationarity concept includes a corresponding stationarity condition particularly designed for cardinality constrained problems. We note, however, that there exist further stationarity concepts in this setting, see the corresponding discussions in, e.g., [6,29,30,33].
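The closed-form projection described above can be sketched as follows (a minimal illustration; tie handling is simply delegated to the sorting routine):

```python
import numpy as np

def project_sparse(x, s):
    """Projection onto D = {x : ||x||_0 <= s}: zero out the n - s smallest
    components in absolute value (ties broken by argsort's ordering)."""
    out = np.zeros_like(x)
    keep = np.argsort(-np.abs(x))[:s]   # index set G_s(x) of the s largest |x_i|
    out[keep] = x[keep]
    return out

x = np.array([0.5, -3.0, 0.1, 2.0])
# the two largest entries in modulus (-3.0 and 2.0) survive, the rest are zeroed
assert np.array_equal(project_sparse(x, 2), np.array([0.0, -3.0, 0.0, 2.0]))
```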

Low-Rank Approximation Problems
Here we consider the space X = R m×n with given n, m ∈ N, n, m ≥ 2; equipped with the standard Frobenius inner product, this is a Euclidean space.
In applications like computer vision, machine learning, computer algebra or signal processing, there is a strong interest in low-rank matrix optimization problems, see, e.g., [7,14,15,34,39]. Specifically, letting q := min{m, n} and given κ ≤ q − 1, we are interested in problems of the form

min_X f(X)  s.t.  G(X) ∈ C,  rank(X) ≤ κ,    (5.3)

i.e., D = {X ∈ R^{m×n} : rank(X) ≤ κ}. The set D has been thoroughly analyzed from a geometrical point of view, see e.g. [24] for a formula for N^lim_D(X). Interestingly, elements of Π_D(X) can be easily constructed exploiting the singular value decomposition of X [34,43].
Proposition 5.1. Let X ∈ X = R^{m×n} and let X = U Σ V^T be its singular value decomposition, with orthogonal matrices U ∈ R^{m×m}, V ∈ R^{n×n} and Σ ∈ R^{m×n} diagonal with entries in non-increasing order, i.e., Σ_{ii} = σ_i, with σ_1 ≥ ... ≥ σ_q being the singular values of X. Moreover, let Σ̂ be the matrix obtained by setting to zero the q − κ bottom-right diagonal elements of Σ, i.e., Σ̂_{ii} = σ_i for i ≤ κ and Σ̂_{ii} = 0 for i > κ. Then X̂ := U Σ̂ V^T ∈ Π_D(X).

Of course, the computation of the SVD for a matrix X is not a costless operation, so obtaining an element of Π_D(X), even though conceptually simple, requires a non-negligible amount of computing resources.
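A sketch of this construction via numpy's SVD routine (illustrative only; the final assertion checks the well-known Eckart-Young identity for the Frobenius error):

```python
import numpy as np

def project_low_rank(X, kappa):
    """Element of Pi_D(X), D = {X : rank(X) <= kappa}, via truncated SVD:
    keep the kappa largest singular values, zero out the rest."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    sigma = sigma.copy()
    sigma[kappa:] = 0.0                 # drop the q - kappa smallest singular values
    return (U * sigma) @ Vt             # rescale columns of U, recompose

X = np.random.default_rng(0).standard_normal((5, 4))
X_hat = project_low_rank(X, 2)
assert np.linalg.matrix_rank(X_hat) <= 2
# Eckart-Young: the Frobenius error equals the norm of the discarded singular values
sigma = np.linalg.svd(X, compute_uv=False)
assert np.isclose(np.linalg.norm(X - X_hat, "fro"), np.linalg.norm(sigma[2:]))
```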
If we restrict the discussion to the case of symmetric positive semi-definite matrices, i.e., D = {X ∈ R n×n | X 0, rank(X) ≤ κ}, we can resort to the eigenvalue decomposition instead of the SVD [25,43].
Proposition 5.2. Let X ∈ R^{n×n} be symmetric and let X = Σ_{i=1}^n λ_i v_i v_i^T be its eigenvalue decomposition, where λ_1 ≥ ... ≥ λ_n are the non-increasingly ordered eigenvalues with corresponding orthonormal eigenvectors v_1, ..., v_n. Then, we have X̂ := Σ_{i=1}^κ max{0, λ_i} v_i v_i^T ∈ Π_D(X).

We can thus observe that, in this particular case, in order to compute the projection onto the set D we only need to find the κ largest eigenvalues with the corresponding eigenvectors; this can be done efficiently, especially when κ is small, as in most applications.
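A corresponding sketch for the positive semi-definite case, based on numpy's symmetric eigendecomposition (illustrative; for large n and small κ one would instead compute only the leading eigenpairs by iterative methods):

```python
import numpy as np

def project_psd_low_rank(X, kappa):
    """Element of Pi_D(X) for D = {X psd, rank(X) <= kappa}: keep the kappa
    largest eigenvalues, clipped at zero, with their eigenvectors."""
    X_sym = 0.5 * (X + X.T)                      # symmetrize (assumes symmetric input)
    lam, V = np.linalg.eigh(X_sym)               # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]               # reorder non-increasingly
    lam_kept = np.maximum(lam[:kappa], 0.0)      # clip negative eigenvalues at zero
    return (V[:, :kappa] * lam_kept) @ V[:, :kappa].T

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, -2.0]])
X_hat = project_psd_low_rank(A, 1)
assert np.all(np.linalg.eigvalsh(X_hat) >= -1e-10)     # positive semi-definite
assert np.linalg.matrix_rank(X_hat, tol=1e-8) <= 1     # rank at most kappa
```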
An (exact) Penalty Decomposition scheme was developed in [43] to tackle low-rank optimization problems, exploiting the above closed-form rules for the projection onto D both in the general and the positive semi-definite cases. The analysis in Section 3 shows that the algorithmic framework maintains the same convergence properties even when the X-update step is carried out in an inexact fashion.

Box-Switching Constrained Problems
A wide class of relevant optimization problems with difficult geometric constraints is constituted by the so-called box-switching constrained problems [25], which can be formalized as follows:

min_{x,y} f(x, y)  s.t.  G(x, y) ∈ C,  (x, y) ∈ D,
D := {(x, y) ∈ R^n × R^n : l_x ≤ x ≤ u_x, l_y ≤ y ≤ u_y, x_i y_i = 0 for all i = 1, ..., n},

where, for simplicity, we assume that l_x ≤ 0 ≤ u_x and l_y ≤ 0 ≤ u_y. This setting covers various disjunctive programming problems such as problems with
• switching constraints [37]: x_i y_i = 0 for all i;
• complementarity constraints [42]: x ≥ 0, y ≥ 0, x_i y_i = 0 for all i;
• relaxed sparsity constraints [13], where the cardinality constraint is reformulated by means of auxiliary variables y_i ∈ [0, 1] with x_i y_i = 0 for all i.

It is easy to realize that the projection onto the set D is simple in this case. Indeed, let us first consider the projection of a vector w onto classical bound constraints [l, u]. Since the constraints are separable, we can immediately obtain the projection by computing, for each component i, the value

P_{[l,u]}(w)_i = min{max{w_i, l_i}, u_i}.

With this in mind, noting that the set D is also (pairwise) separable, we can obtain an element (x̂, ŷ) ∈ Π_D[(x̄, ȳ)] by first computing

p_i := P_{[l_x,u_x]}(x̄)_i  and  q_i := P_{[l_y,u_y]}(ȳ)_i

and then setting, for each i,

(x̂_i, ŷ_i) := (p_i, 0)  if  (x̄_i − p_i)² + ȳ_i² ≤ x̄_i² + (ȳ_i − q_i)²,  (x̂_i, ŷ_i) := (0, q_i)  otherwise.

Computing the projection onto D thus amounts to computing 2n projections onto real intervals, which can be done with low computational effort. For this reason, a Penalty Decomposition-type scheme again appears particularly appealing for this class of problems.
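The pairwise projection rule above can be sketched as follows (illustrative; the bounds are passed as scalars here for brevity):

```python
import numpy as np

def project_box_switching(x, y, lx, ux, ly, uy):
    """Componentwise projection onto
    D = {(x, y) : lx <= x <= ux, ly <= y <= uy, x_i * y_i = 0 for all i},
    assuming lx <= 0 <= ux and ly <= 0 <= uy. For each pair (x_i, y_i) the two
    feasible candidates are (clip(x_i), 0) and (0, clip(y_i)); the closer one wins."""
    px = np.clip(x, lx, ux)
    py = np.clip(y, ly, uy)
    d1 = (x - px) ** 2 + y ** 2        # squared distance to candidate (px_i, 0)
    d2 = x ** 2 + (y - py) ** 2        # squared distance to candidate (0, py_i)
    choose_x = d1 <= d2
    return np.where(choose_x, px, 0.0), np.where(choose_x, 0.0, py)

x = np.array([2.0, 0.1])
y = np.array([0.5, 3.0])
x_hat, y_hat = project_box_switching(x, y, -1.0, 1.0, -2.0, 2.0)
assert np.allclose(x_hat * y_hat, 0.0)                 # switching constraint holds
assert np.allclose(x_hat, np.array([1.0, 0.0]))
assert np.allclose(y_hat, np.array([0.0, 2.0]))
```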

General Disjunctive Programs
A broad class of optimization problems with geometric constraints is represented by programs where variables are required to satisfy at least one among several sets of constraints:

min_x f(x)  s.t.  G(x) ∈ C,  x ∈ D := D_1 ∪ ... ∪ D_N,

where D_i, i = 1, ..., N, are closed convex sets. The resulting overall feasible set of these disjunctive programming problems [18] typically takes the structure of a nonconvex, disconnected set. Projections onto D in this case can be computed by finding the closest among the N projections onto D_1, ..., D_N:

P_{D_î}(x) ∈ Π_D(x)  for any  î ∈ argmin_{i=1,...,N} ||x − P_{D_i}(x)||.

Since, in general, the projection onto a convex set is already an expensive operation, the projection onto D is consequently a costly task. We shall observe that, in fact, the settings analyzed in the previous subsections are particular instances of this setting where the peculiar structure of the sets D_i allows the projection to be computed efficiently in smart ways. The Penalty Decomposition approach might be appealing for problems of this form when the constraints G(x) ∈ C are numerous and/or nontrivial and N is also large. In these cases, the brute-force strategy of solving N problems with convex constraints may become computationally unsustainable and PD might represent an appealing alternative.
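A sketch of this brute-force projection, with two boxes in the plane as a hypothetical instance of the sets D_i:

```python
import numpy as np

def project_disjunctive(x, convex_projections):
    """Element of Pi_D(x) for D = D_1 ∪ ... ∪ D_N: project onto each convex
    piece and keep the closest of the N candidates."""
    candidates = [proj(x) for proj in convex_projections]
    dists = [np.linalg.norm(x - c) for c in candidates]
    return candidates[int(np.argmin(dists))]

# hypothetical instance: D is the union of two axis-aligned boxes in R^2
proj_box1 = lambda z: np.clip(z, [0.0, 0.0], [1.0, 1.0])
proj_box2 = lambda z: np.clip(z, [3.0, 3.0], [4.0, 4.0])
p = project_disjunctive(np.array([2.6, 2.6]), [proj_box1, proj_box2])
assert np.array_equal(p, np.array([3.0, 3.0]))   # the second box is closer
```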

Computational Experiments
In this section, we report the results of an extensive experimentation aimed at demonstrating the potential and the benefits of the Penalty Decomposition algorithm on various classes of problems. The experiments have two main goals:
• analyze the behavior of penalty decomposition in different settings and understand how to make it as efficient as possible;
• compare the penalty decomposition approach with the augmented Lagrangian method proposed in [25], which is, to the best of our knowledge, the only algorithm available in the literature designed to handle the general setting (1.1).
The code for the experiments has been implemented entirely in Python 3.9, and all experiments have been run on a machine with the following specifications: Intel Xeon Processor E5-2430 v2, 6 physical cores (12 threads), 2.50 GHz, 16 GB RAM.
We considered benchmarks of problems from the classes discussed in Section 5, i.e., cardinality constrained problems, low-rank approximation problems, and disjunctive programming problems.
For the Penalty Decomposition approach (Algorithm 3.1), we set an upper bound of 10^8 on the value of τ_k. For the inner loop (Algorithm 4.1) we employed the stopping criterion whereas for the outer loop we employed Both of the above stopping conditions are those suggested in [33], and we set ε_in = 10^{-5} and ε_out = 10^{-5}.
As the unconstrained optimization solver for the x-update step, we implemented the gradient descent algorithm with Armijo line search. We also ran experiments using the implementations of the conjugate gradient (CG, [10]), BFGS [10], and L-BFGS [32] methods available in the scipy library. For all of these algorithms, we used the stopping criterion ‖∇_x q_{τ_k}(u^{ℓ+1}, v^ℓ)‖ ≤ ε_solv, with ε_solv = 10^{-5} unless specified otherwise.
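As an illustration, gradient descent with an Armijo backtracking line search, as used for the x-update, can be sketched as follows. The names and default constants are illustrative; this is not the paper's code.

```python
# Sketch of the x-update inner solver: gradient descent with Armijo
# backtracking, stopped when the gradient norm falls below eps_solv.
import numpy as np

def gradient_descent_armijo(f, grad, x0, eps_solv=1e-5, gamma=1e-4,
                            beta=0.5, max_iter=10000):
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_solv:
            break
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - gamma * t * np.dot(g, g) and t > 1e-12:
            t *= beta
        x = x - t * g
    return x
```

In practice, as reported below, the quasi-Newton solvers from scipy replace this simple scheme whenever they are applicable.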
As for the augmented Lagrangian method (ALM) from [25], it employs as inner solver for the subproblems the spectral gradient method (SGM) proposed in the same work. With reference to [25, Algorithm 3.1], we set σ = 10^{-5}, γ_0 = 1, γ_max = 10^{12}, m = 10, τ = 2 (note that τ here does not denote the penalty parameter). As for the ALM ([25, Algorithm 4.1]), we set η = 0.8. We employed the multiplier safeguarding technique, projecting the values obtained by the standard Hestenes-Powell-Rockafellar updates onto the box [−10^8, 10^8]. For the spectral gradient loop, we used the stopping condition max_{j=0,...,m−1} where we have used the notation of the present paper. For the inner loop of the ALM we used the same stopping condition (6.1) as for the PD method, with ε_in = 10^{-5}, whereas for the outer loop we require dist_C(G(x^{k+1})) ≤ ε_out, with ε_out = 10^{-5}. The stopping conditions have been chosen to be as similar as possible for the two algorithms, in order to have a fair comparison.
We also ran experiments with a variant of our proposed approach employing safeguarded Lagrange multipliers in an augmented Lagrangian fashion, i.e., we take the ALM approach from [25] and combine it with the decomposition idea to solve the resulting subproblems. The settings for the multiplier and penalty parameter updates are the same as those we employed for the ALM itself. In the following, we will show that this modification (denoted PDLM), which does not have any major impact on the convergence analysis, leads to significant benefits in practice. It is interesting to note that this finding is in contrast with the remarks that can be found in the conclusions of [33].
Finally, we point out that we did not report above the values of the initial penalty parameter τ_0 and its growth rate α_τ. Indeed, these parameters are quite crucial for the overall performance of both the PD and ALM algorithms and have been suitably selected for each class of problems. Thus, we will report the specific values of these two parameters for each experiment.

Sparsity Constrained Optimization Problems
We begin our numerical analysis with sparsity constrained problems. The reason we start with this class of problems is twofold: a) the original PD approach was designed for these problems, and b) the results are more easily and intuitively interpretable.
We considered various experimental settings with problems of this class. We begin with the simplest possible problems, i.e., convex quadratic problems with only sparsity constraints: We randomly generated instances of problem (6.2) according to the following procedure: where n_cond denotes the desired condition number of the matrix Q. We generated three instances with n_cond = 10, n ∈ {10, 25, 50}, s = 3, ν = 5 to evaluate the impact of different solvers for the x-update step on the alternating minimization scheme and, in turn, on the overall PD approach. We report in Table 6.1 the results obtained by running PD equipped with different inner solvers, starting from the origin. We ran the variant of our approach with Lagrange multipliers only with L-BFGS as inner solver; here we set τ_0 = 1 and α_τ = 1.1. Moreover, we considered the spectral gradient method for comparison. Note that, since there are no additional constraints, there is no need to resort to the ALM.
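Since the generation procedure is only summarized above (the display equations are omitted here), the following is one plausible sketch of how a convex quadratic with prescribed condition number n_cond can be generated; the function name and the log-spaced spectrum are our assumptions, not the paper's exact recipe.

```python
# Plausible sketch: build a convex quadratic x'Qx + c'x whose Hessian Q
# has prescribed condition number n_cond, via a random orthogonal basis
# and a log-spaced spectrum in [1, n_cond].
import numpy as np

def random_quadratic(n, n_cond, seed=0):
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal U
    eigvals = np.logspace(0, np.log10(n_cond), n)     # spectrum in [1, n_cond]
    Q = U @ np.diag(eigvals) @ U.T
    c = rng.standard_normal(n)
    return 0.5 * (Q + Q.T), c                         # symmetrize for safety
```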
As expected, the use of quasi-Newton type solvers is highly beneficial: BFGS leads to much faster convergence than the simple gradient method, and L-BFGS provides an additional, substantial speed-up. The presence of Lagrange multipliers also seems to be beneficial, both in terms of efficiency and of the quality of the obtained solution. Based on these results, in the following we always use L-BFGS for the x-update step in Algorithm 4.1.
Note that the spectral gradient method clearly outperforms the PD approach in this case. This is indeed not surprising: since there are no additional constraints, the SGM does not need to adopt a costly sequential penalty strategy.
Next, we turn to a simple verification of the convergence properties of the PD approach. In particular, we consider the artificial example [6, Example 2.2], which is an instance of (6.2) with Q = E + I, where E is the matrix of all ones, c = −(3, 2, 3, 12, 5)^T, ν = 1, and s = 2. We ran both PD and PDLM, with α_τ = 1.1, from 1000 different starting points randomly generated in the hyper-box [−10, 10]^5. We observed that the result strongly depends on the choice of τ_0, as reported in Table 6.2. Interestingly, both algorithms always converged to the global minimum f(x*) = −41.33 when we set τ_0 = 0.1; in fact, we observed the same result for smaller values of τ_0. On the other hand, as τ_0 grows, worse local minimizers become increasingly probable; the presence of Lagrange multipliers seems to alleviate, but not suppress, this inconvenience. We argue that large values of τ_0 make PD schemes more dependent on the starting point: since usually x^0 = y^0, the penalty term at the first iteration equals τ_0‖x − x^0‖^2, which binds the variable x close to the starting point. At this point, we have devised a setting that apparently makes the PD approach efficient and effective. We therefore expect the algorithm to indeed be a good choice to resort to when: a) additional constraints are present, and/or b) the projection operator is costly. In the former case, the SGM needs to be employed within another sequential scheme, namely the ALM, which is the only alternative to PD available in the literature; in the latter case, the advantage of PD over the ALM may not be straightforward. In fact, the two algorithms share a similar structure, sequentially solving penalized subproblems; in order to do so, unconstrained continuous optimization steps and projections onto D are repeatedly carried out. However, in PD many descent steps can be carried out before turning to the projection step; on the contrary, in the ALM we need to perform the projection after every gradient step (in fact, we do it many times per iteration of the SGM to satisfy the acceptance criterion), cf. the discussion in Remark 4.6.
We now turn to sparsity constrained problems with additional constraints. In particular, we keep considering convex quadratic problems, but with simplex constraints, i.e., problems of the form where e ∈ R^n denotes the vector of all ones. This is a classical sparse portfolio optimization problem [11], where Q and c denote the covariance matrix and the mean return vector of n possible assets. Portfolio optimization problems are particularly useful to test the proposed algorithm, since we can easily obtain the global optimum to be used as a reference. Indeed, we can do so by exploiting the mixed-integer reformulation of the problem with binary indicator variables and big-M type constraints, and employing efficient software solvers such as Gurobi [22].
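The mixed-integer reformulation mentioned above can be sketched as follows; this is the standard big-M formulation with binary indicators z, written here with a generic quadratic objective, and the paper's exact model may differ in details. Note that, since e^T x = 1 and x ≥ 0 already force x_i ≤ 1, the big-M constant can simply be taken as M = 1.

```latex
\begin{aligned}
\min_{x \in \mathbb{R}^n,\; z \in \{0,1\}^n} \quad & x^\top Q x + c^\top x \\
\text{s.t.} \quad & e^\top x = 1, \qquad x \ge 0, \\
& x_i \le M z_i, \quad i = 1, \dots, n, \\
& \textstyle\sum_{i=1}^n z_i \le s .
\end{aligned}
\]
```

Any feasible x of the original problem is feasible here with z_i = 1 whenever x_i > 0, and conversely the indicator constraints enforce the cardinality bound ‖x‖_0 ≤ s.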
We first consider synthetic problems. Using (6.3)-(6.4), we generated 10 problems for each combination of n ∈ {20, 40, 60} and n_cond ∈ {10, 100, 500}, for a total of 90 problems. We set s = 4 for n = 20, s = 7 for n = 40, and s = 9 for n = 60. We set ν = 1 and used x = (1/n, ..., 1/n)^T as the starting point of the experiments; we ran PD, PDLM, and the ALM, all with τ_0 = 1 and α_τ = 1.1. We also ran Gurobi on all instances to obtain the global optimizer to be used as a reference; note that Gurobi indeed finds the certified global minimum in tens of seconds. The overall results of the experiments are reported in Figure 6.1. The results concerning efficiency (runtime) are presented in the form of performance profiles [17] in Figure 6.1(a). We can observe that Penalty Decomposition with Lagrange multipliers is generally faster than the other two considered algorithms. As for the quality of the retrieved solutions, we plot in Figure 6.1(b) the cumulative distribution of the relative gap between the solution found by a solver and the certified global optimum found with Gurobi. The result of PDLM is remarkable, as it almost always reached a value very close, and often equal, to the global optimum; on the contrary, both PD and the ALM end up with substantially suboptimal solutions in almost half of the cases.
We conclude the analysis on sparsity constrained problems by looking at the results on 6 instances of real-world portfolio selection problems. In particular, the data used in the experiments consist of daily data for securities from the FTSE 100 index, from 01/2003 to 12/2007. The three datasets are referred to as DTS1, DTS2, and DTS3, and are formed by n = 12, 24, and 48 securities, respectively. We also included three datasets from the Fama/French benchmark collection (FF10, FF17, and FF48, with n equal to 10, 17, and 48), using monthly returns starting from 07/1971. The datasets are generated as in [16]. For each dataset, we define an instance of problem (6.1): the values of s and ν are set as reported in Table 6.3, and are such that the cardinality constraint is active at the optimal solution. We again used x = (1/n, ..., 1/n)^T as the starting point. As for the penalty parameter, we set τ_0 = 0.01 for the Penalty Decomposition methods, whereas we found that a larger value τ_0 = 1 was beneficial for the ALM. The parameter α_τ was set to 1.01 for all methods.
The results are reported in Table 6.3; we can observe that the trends outlined by the previous experiments are substantially confirmed.

Low-Rank Optimization Problems
In this section, we study problems as discussed in Section 5.2, where X = R^{m×n} and D is a set of low-rank matrices.
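When D is a plain rank constraint {X : rank(X) ≤ κ}, the Euclidean (Frobenius) projection is available in closed form via a truncated singular value decomposition, by the Eckart-Young theorem. A minimal sketch follows; note that the correlation matrix problems below involve additional constraints (symmetry, positive semi-definiteness, unit diagonal) that this sketch deliberately ignores.

```python
# Sketch: Frobenius-norm projection of a matrix onto
# {X : rank(X) <= kappa} via truncated SVD (Eckart-Young).
import numpy as np

def project_rank(X, kappa):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[kappa:] = 0.0            # keep only the kappa largest singular values
    return (U * s) @ Vt        # scale columns of U by s, recombine
```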
To begin with, we consider the class of nearest low-rank correlation matrix problems, which was already used as a benchmark in [43]. In detail, the problem can be formulated as where A is a given symmetric correlation matrix. The test problems we consider are the same as in [43], with the matrix A defined as follows:
• (P1) A_ij = 0.5 + 0.5 exp(−0.05|i − j|) for all i, j;
• (P2) A_ij = exp(−|i − j|) for all i, j;
• (P3) A_ij = 0.6 + 0.4 exp(−0.1|i − j|) for all i, j.
For each of the above problems, we considered instances with n = 200 and n = 500 and values κ = 5, 10, and 20. We experimentally compared the ALM and several implementations of the Penalty Decomposition approach; for all of these algorithms, we set τ_0 = 1 and α_τ = 1.2. For these experiments, we also needed to raise the upper bound on the value of τ_k to 10^{12}.
Note that solving the X-update subproblem with an iterative solver has a significant cost, as we are considering problems with up to n × n = 250,000 variables; moreover, we are dealing with ill-conditioned quadratic problems, and we thus found it convenient to switch from L-BFGS to the CG method. We tested two settings for the X-update step with CG: a strongly inexact setting, where the CG method is stopped after at most 5 steps or when the norm of the gradient is smaller than ε_solv = 0.1 (PD-cg-inaccurate), and a more accurate setting, where up to 20 CG steps are carried out and the tolerance for the gradient norm stopping condition is set to 0.001 (PD-cg-accurate).
In addition, we note that the X-update subproblem can in fact be solved to global optimality in closed form; we thus also carried out experiments with this option (PD-exact). Moreover, we also consider the strategy adopted in [43], where the constraint diag(X) = e is kept as a lower-level constraint and the X-update subproblem is still solved in closed form (PD-exact-lower-level).
We finally report that we found the introduction of Lagrange multipliers associated with the constraints G(x) ∈ C useful. We instead noticed that multipliers associated with the constraint X = Y are not helpful. This observation is in line with the work in [43], where in practice only the constraint X = Y was handled by the penalty approach and multipliers were reported not to be beneficial. In the experiments described in the following, only multipliers associated with the original problem constraints have been employed. The results of the experiments are reported in Tables 6.4, 6.5, and 6.6.
We can observe that the exact versions of the PD approach are the best performing ones from all perspectives, with the PD-exact-lower-level variant originally used in [43] standing out. This is in fact not surprising: in this case, the exact method solves the subproblems not only with higher accuracy, but also in much less time than an iterative solver.
Interestingly, however, we observe that the "inaccurate" version of the inexact PD attains runtimes that are comparable with those of the exact approaches, with only small drops in the quality of the retrieved solution. On the other hand, with a slightly more accurate inexact minimization we are always able to retrieve the same best solution as the exact methods, with a computational effort generally comparable to that of the ALM.
We can thus deduce that a suitable configuration exists for the inexact PD approach that provides a good trade-off between solution quality and runtime.
These results are encouraging for all those settings where the exact version of the Penalty Decomposition approach is not employable by construction.
We then turn to a new class of problems, where the matrices are not symmetric positive semi-definite and the X-update step requires an iterative solver. Specifically, we consider the low-rank based multi-task training [44] of logistic models [23]. Given a collection of somewhat related binary classification tasks T_1, ..., T_m, where X_i represents the data matrix of each task and Y_i the corresponding labels, we can formalize the multi-task logistic regression training problem as where the t-th row W_t of W denotes the weights of the logistic model for the t-th task; each model is defined as the sum of a component independently characterizing the particular task, which is regularized, and a second component that lies in a linear subspace shared by all tasks. The stronger the regularization parameter η, the higher the similarity of the obtained models. By L(W_t; X_t, Y_t) we denote the binary cross-entropy loss function of the logistic model for task t, which is a convex function that, however, cannot be minimized in closed form. We can easily observe that the problem can be solved by Penalty Decomposition, duplicating the variable V. The constraint W = U + V can also be tackled by the penalty approach. The update step of the original variables W, U, V cannot be carried out in closed form; thus we need to resort to the inexact version of the PD method.
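For reference, the binary cross-entropy loss L(W_t; X_t, Y_t) for a single task can be written compactly as follows; this is a numerically stable sketch with illustrative names, assuming labels in {0, 1} and no intercept term.

```python
# Sketch: binary cross-entropy loss of a logistic model for one task.
# For z = Xw, the loss is mean( log(1 + exp(z)) - y*z ), which equals
# -mean( y*log(sigma(z)) + (1-y)*log(1-sigma(z)) ) for the sigmoid sigma.
import numpy as np

def bce_loss(w, X, y):
    z = X @ w
    # np.logaddexp(0, z) = log(1 + exp(z)), computed without overflow
    return float(np.mean(np.logaddexp(0.0, z) - y * z))
```

Being smooth and convex in w, this loss is well suited to the L-BFGS-type inner solvers discussed above, even though no closed-form minimizer exists.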
For the experiments, we used the landmine dataset [41], consisting of 9-dimensional data points representing radar images from 29 landmine fields/tasks. Each task aims to classify points as landmine or clutter. There are 14,820 data points in total. The tasks can approximately be clustered into two classes of ground surface conditions, so we expect r = 2 to be a reasonable bound for the low-rank component of the solution. We defined four instances of problem (6.5), corresponding to the values of η in {0.01, 0.1, 0.5, 2}. We examined the behavior of the inexact PD and ALM methods under different parameter configurations. In particular, we considered the following settings:
• Penalty Decomposition - Lagrange multipliers associated with all constraints; τ_0 = 10^{-3}, α_τ = 1.
We also report the results obtained by optimizing each task independently. For all PD and ALM configurations, we used as the starting solution the one retrieved by single-task optimization. Note that both configurations for the ALM have lower precision than the default one reported at the beginning of Section 6; with this particular problem, we found the default configuration to be remarkably inefficient; however, we shall underline that, in other test cases considered in this paper, these alternative configurations had led to convergence issues related to numerical errors. The results of the experiment are reported in Figure 6.2. Note that here we are interested in the optimization process metrics, not in the out-of-sample prediction performance of the obtained models. We can observe that different setups of the algorithms yield different trade-offs between speed and solution quality. In particular, for the PD method, the trend observed with the correlation matrix problems is confirmed: solving the x-update subproblem to lower accuracy saves computing time, at the cost of a small yet non-negligible sacrifice in solution quality. A similar and even stronger trend can be observed for the ALM. Finally, we can observe that PD appears to be superior to the ALM both in terms of efficiency and effectiveness.

Disjunctive Programming Problems
In this section, we computationally analyze the performance of the Penalty Decomposition approach on problems of the form (5.5). The main goal of this section is to compare the performance of inexact PD with that of the ALM in a setting where the projection operation is costly and is in fact responsible for the largest part of the computational burden: as highlighted in Section 5.4, it amounts to computing the projection onto each of the convex sets D_i.
For the experiments, we defined the following test problem: where L denotes the (convex) logistic loss on a randomly generated dataset of 200 examples. We assume that A_q ∈ R^{s×n} and b_q ∈ R^s have the same dimensions for q = 1, ..., N, with coefficients uniformly drawn from [−1, 1]; the coefficients c are randomly picked from [0, 1], whereas the values of p are from [−0.5, 0.5]. We set t_j = 0.1 for all j. First, we consider the problem with n = 10, s = 12, and m = 1, for values of N varying in {2, 5, 10, 20, 50, 100}. In Table 6.7 we report the results obtained running PD (parameters: τ_0 = 0.1, α_τ = 1.2, ε_in = 0.01, Lagrange multipliers employed) and the ALM (parameters: τ_0 = 1, α_τ = 1.2, m = 4, σ = 0.01, ε_in = 0.01), together with reference values obtained with the enumeration approach (subproblems are solved using the SLSQP method available in scipy), which allows us to retrieve the certified global optimizer. For the projection steps onto the sets D_i we used the Gurobi solver. We can observe that PD was always much faster than the ALM; moreover, it always ended up finding the actual global optimizer, which does not hold true for the ALM. We also verified that, as expected, the computing time is indeed entirely dominated by the projection steps.
We then turn to experiments on an instance where the nonlinear constraints shared by all components of the feasible set are numerous and dominate the complexity of solving each subproblem in the enumeration approach. In particular, we consider the previous problem with N = 50, n = 5, m = 80, and s = 7. Here we set τ_0 = 0.1, α_τ = 1.5, ε_in = 0.02 for PD, and τ_0 = 1, α_τ = 1.5, m = 4, σ = 0.05, ε_in = 0.1 for the ALM. The experiment is repeated 20 times with different random seeds. The results are reported in the form of performance profiles in Figure 6.3. We deduce that Penalty Decomposition can indeed be a good choice in particularly complicated settings: the global optimizer was always reached, and this result was obtained in a consistently more efficient way than with the brute-force approach; the ALM does not have a comparable appeal in this context, the reason arguably being its much higher frequency of resorting to the projection operation.

Conclusions
The current paper considers a penalty decomposition scheme for optimization problems with geometric constraints. It generalizes existing penalty decomposition schemes both by allowing a general abstract (and usually complicated) constraint (as opposed to only particular instances like cardinality constraints) and by including further (though supposedly simple) constraints. The idea and the convergence theory of this method are also related to recent augmented Lagrangian techniques, but the decomposition idea turns out to be numerically superior, since it allows more efficient subproblem solvers and uses many fewer projection steps.
In principle, it should be possible to extend the decomposition idea to the class of (safeguarded) augmented Lagrangian methods. Another, related, question is whether one can exploit additional properties of augmented Lagrangian methods in order to improve the existing convergence theory. For example, augmented Lagrangian techniques have very strong convergence properties in the convex case. The particular classes of problems discussed in this paper are nonconvex, but the nonconvexity mainly arises from the fact that the abstract set D is nonconvex. Since we deal with the complicated set D explicitly, so that all iterates are feasible with respect to this set, a natural question is whether one can prove stronger convergence properties in those situations where the remaining functions and constraints are convex. This will be part of our future research.
The Euclidean projection P_C : Y → Y onto the nonempty, closed, and convex set C is defined by P_C(y) := argmin_{z∈C} ‖z − y‖. The corresponding distance function dist_C : Y → R can then be written as dist_C(y) := min_{z∈C} ‖z − y‖ = ‖P_C(y) − y‖.
lim sup_{x→x̄} S(x) := {y ∈ X : ∃ x^k → x̄, ∃ y^k → y with y^k ∈ S(x^k) for all k ∈ N}. This allows us to define the limiting normal cone at a point x̄ ∈ D by N^lim_D(x̄) := lim sup_{v→x̄} cone(v − Π_D(v)); see [38, Sect. 1.1] for further details. Writing x →_D x̄ :⇔ x → x̄, x ∈ D for sequences converging to an element x̄ ∈ D such that the whole sequence belongs to D, the limiting normal cone has the important robustness property lim sup_{x→_D x̄} N^lim_D(x) = N^lim_D(x̄). (2.2)

Theorem 2.4.
The following statements hold: (a) Every local minimum of (1.1) is an AM-stationary point. (b) If x is an AM-stationary point satisfying AM-regularity, then x is an M-stationary point of (1.1).
Figure 6.1: Results of experiments on 90 randomly generated sparse portfolio selection problems, running PD, PDLM, and the ALM. Panel (b): cumulative distribution of the relative gap to the certified global optimum.

Figure 6.2: Runtime/quality trade-off for different set-ups of PD and the ALM on low-rank multi-task logistic regression problems. The four problems are obtained from the landmine dataset for different values of the regularization parameter η, setting κ = 2.

Figure 6.3: Performance profiles of runtime attained by PD, the ALM, and the enumeration strategy (with SLSQP) on 20 disjunctive programming problems. When a solver does not attain the global minimum, the corresponding runtime is considered infinite when building the profile.

Table 6.2: Convergence of Penalty Decomposition methods on [6, Example 2.2] for different values of τ_0. We report the number of times each objective value has been obtained out of 1000 runs from different starting points chosen randomly in [−10, 10]^n. Column headers: τ_0 | Algorithm | f = −41.33 | f = −39 | f = −36.33 | Others.

Table 6.3: Results of experiments on 6 real-world sparse portfolio selection problems, solved using different algorithmic approaches.

Table 6.7: Results of experiments on disjunctive programming problems, for an increasing number of feasible set components N.