Level Constrained First Order Methods for Function Constrained Optimization

We present a new feasible proximal gradient method for constrained optimization in which both the objective and the constraint functions are given by the sum of a smooth, possibly nonconvex function and a convex simple function. The algorithm converts the original problem into a sequence of convex subproblems. Formulating each subproblem requires the evaluation of at most one gradient of the original objective and constraint functions. Either exact or approximate subproblem solutions can be computed efficiently in many cases. An important feature of the algorithm is the constraint level parameter: by carefully increasing this level for each subproblem, we provide a simple way to overcome the challenge of bounding the Lagrange multipliers, and we show that the algorithm follows a strictly feasible solution path until convergence to a stationary point. We develop a simple, proximal-gradient-descent-type analysis, showing that the complexity bound of this new algorithm is comparable to that of gradient descent in the unconstrained setting, which is new in the literature. Exploiting this new design and analysis technique, we extend our algorithms to more challenging constrained optimization problems in which 1) the objective is a stochastic or finite-sum function, and 2) structured nonsmooth functions replace the smooth components of both the objective and the constraint functions. Complexity results for these problems also appear to be new in the literature. Finally, our method can be applied to convex function-constrained problems, for which we show complexities similar to those of the proximal gradient method.


Introduction
In this paper, we study the following constrained optimization problem:
$$\min_{x \in \mathbb{R}^d} \ \psi_0(x) := f_0(x) + \chi_0(x) \quad \text{s.t.} \quad \psi_i(x) := f_i(x) + \chi_i(x) \le \eta_i, \quad i = 1, \dots, m, \qquad (1.1)$$
where $\psi_i(x)$ is a composite function that sums up the functions $f_i(x)$ and $\chi_i(x)$. Here, $f_i$, $i = 0, 1, \dots, m$, are smooth functions, $\chi_0(x)$ is a proper, convex, lower-semicontinuous (lsc) function, and $\chi_i(x)$, $i = 1, \dots, m$, are convex continuous functions over the domain of $\chi_0$ (i.e., $\mathrm{dom}\,\chi_0$). We consider $\chi_i$, $i = 0, \dots, m$, to be "simple" functions, namely, a feasible optimization problem of the form (1.2), which minimizes a proximal (linearized plus quadratic) approximation of the objective subject to proximal approximations of the constraints, can be solved to efficiently obtain either an exact solution or an inexact solution of the desired accuracy. Note that if $\chi_i = 0$, $i = 1, \dots, m$, then (1.2) becomes a proximal operator for the function $\chi_0$ on the intersection of balls. If we further assume $\chi_0 = 0$, then (1.2) is a special type of quadratically constrained quadratic program (QCQP) that can be solved efficiently because all the Hessians are identity matrices. In addition, we consider the case where the constraints $f_i$, $i = 1, \dots, m$, are structured nonsmooth functions that can be approximated by smooth functions (also called smoothable functions). Note that problem (1.1) covers a variety of convex and nonconvex function-constrained optimization problems depending on the assumptions on $f_i$, $i = 0, \dots, m$.

Nonlinear optimization with function constraints is a classical topic in continuous optimization. While earlier studies focused on asymptotic performance, recent work has put more emphasis on the complexity analysis of algorithms, mainly driven by the growing interest in large-scale optimization and machine learning. For most of our discussion on complexity analysis, we generally require convergence to an $\epsilon$-approximate KKT point (c.f. Definition 3). Penalty methods [9,33,25], including augmented Lagrangian methods [22,23,34,24], are one popular approach for constrained optimization. In [8], Cartis et al.
presented an exact penalty method that minimizes a sequence of convex composite functions. When the penalty weight is bounded, this method solves $O(1/\epsilon)$ trust-region subproblems. If the penalty weight is unbounded, the complexity is $O(1/\epsilon^{2.5})$ to reach an $\epsilon$-KKT point. In a subsequent work [9], the authors provided a target-following method that achieves a complexity of $O(1/\epsilon)$ regardless of the growth of the penalty parameter. In [33], Wang et al. extended the penalty method to constrained problems in which the objective takes an expectation form. Sequential quadratic programming (SQP) is another important approach for constrained optimization. Typically, SQP involves linearization of the constraints, a quadratic approximation of the objective, and possibly some trust-region constraint for the convergence guarantee [6,7]. The recent work [12] established a unified convergence analysis of generalized SQP (GSQP) in more general settings where feasibility and constraint qualification may or may not hold. Different from standard SQP, the Moving Balls Approximation (MBA) method [1] follows a feasible solution path and transforms the initial problem into a diagonal QCQP. A subsequent work [3] presented a unified analysis of MBA and other variants of SQP methods. Under the Kurdyka-Łojasiewicz (KL) property, they establish global convergence rates that depend on the Łojasiewicz exponent.
Despite much progress in prior works, some significant issues remain. Specifically, most of the analysis is carried out only for smooth optimization and requires that the exact optimal solution of the convex subproblem be readily available. Unfortunately, both assumptions can be unrealistic in many large-scale applications. To overcome these issues, [4,26,25] presented new proximal point algorithms that iteratively solve strongly convex proximal subproblems inexactly using first-order methods. A significant computational advantage is that first-order methods only need to compute a relatively easy proximal gradient mapping in each iteration. In particular, [4] proposed to solve the proximal point subproblem by a new first-order primal-dual method called ConEx. Under some strict feasibility assumption, they derived the total complexities of the overall algorithm for settings where the objective and constraints can be either stochastic or deterministic, and either nonsmooth or smooth. Notably, for nonconvex and smooth constrained problems, the inexact proximal point method [4] requires $O(1/\epsilon^{1.5})$ function/gradient evaluations. A similar complexity bound is obtained by the proximal point penalty method [25] when a feasible point is available. Nevertheless, at this point, it may be difficult to directly compare the efficiency of the proximal point approach with the earlier approaches, given that very different oracles are employed in each method. The inexact proximal point method appears to be less efficient in terms of gradient and function value computations, since the first-order penalty method [9] and a variant of SQP [12] (where the surrogate is formed by a first-order approximation) have an $O(1/\epsilon)$ complexity bound. Nevertheless, it might be more efficient if the corresponding proximal mapping is much easier to solve than the subproblems in penalty or SQP methods.
In this paper, we attempt to alleviate some of the aforementioned issues in solving nonconvex constrained optimization.Our main contribution is the development of a novel Level Constrained Proximal Gradient (LCPG) method for constrained optimization, based on the following key ideas.
First, we convert the original problem (1.1) into a sequence of simple convex subproblems of the form (1.2) for which an exact or an approximate solution can be computed efficiently. In particular, solving the subproblem requires at most one gradient and function value computation for $f_i$, $i = 0, \dots, m$. This is quite similar to simple single-loop methods, even though the LCPG method can be multi-loop, since we allow for an inexact solution of (1.2) obtained by some iterative scheme.
Second, starting from a strictly feasible initial point and carefully controlling the feasibility levels of the subproblem constraints, we ensure that LCPG follows a strictly feasible solution path. This also allows us to deal with nonsmooth constraints where $\chi_i$ is not necessarily $0$, and further extends LCPG to the inexact case where the subproblem admits only an approximate solution. Even though subtle, the level-control design is crucial for bounding the Lagrange multipliers under the well-known Mangasarian-Fromovitz constraint qualification (MFCQ) [27,4]. Subsequently, we also show asymptotic convergence of the LCPG method.
Third, we offer a new insight into the complexity analysis of LCPG as a gradient-descent-type method, which could be of independent interest. When the objective and constraints are nonconvex composite functions, we aim to find a first-order $\epsilon$-KKT point (c.f. Definition 3) under the aforementioned MFCQ assumption. We show that the LCPG method converges in $O(1/\epsilon)$ iterations. Furthermore, each subproblem requires at most one function value and gradient computation. The net outcome is that the gradient complexity of our method is $O(1/\epsilon)$. Notice that the number of iterations required by the proximal point method under MFCQ is also $O(1/\epsilon)$ (see [4, Theorem 5]). However, each iteration of that method requires $O(1/\epsilon^{0.5})$ gradient computations, and hence its total gradient complexity can be bounded by $O(1/\epsilon^{1.5})$, which is much worse than that of the LCPG method. We compare with some significant lines of work in Table 1.
Exploiting the intrinsic connection between LCPG and proximal gradient descent (without function constraints), we extend LCPG to a variety of cases. 1) We show a similar $O(1/\epsilon)$ gradient complexity for an inexact LCPG method in which the subproblem is solved to a pre-specified accuracy. If we assume $\chi_i = 0$, then the corresponding subproblem (1.2) (i.e., a diagonal QCQP) can be efficiently solved by a customized interior point method in logarithmic time. In the more general setting where $\chi_i \neq 0$, we propose to solve (1.2) by the first-order method ConEx, which has very cheap iterations. 2) We also extend the LCPG method to stochastic (LCSPG) and variance-reduced (LCSVRG) variants for the cases where $f_0$ is a stochastic or finite-sum function, respectively. LCSPG and LCSVRG require $O(1/\epsilon^2)$ (similar to SGD [15]) and $O(\sqrt{n}/\epsilon)$ (similar to SVRG [18]) stochastic gradients, respectively, where $n$ is the number of components in the finite-sum objective. The complexities of the variants of the LCPG method for stochastic cases can also be seen in Table 2. 3) We consider the case when the functions $f_i$, $i = 0, 1, \dots, m$, are nondifferentiable but contain a smooth saddle structure (referred to as structured nonsmooth). We extend the LCPG method to such nonsmooth nonconvex function-constrained problems using Nesterov's smoothing scheme [29]. In this case, the LCPG method requires $O(1/\epsilon^2)$ gradients.
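To illustrate the smoothing idea behind item 3) (a generic sketch, not the paper's exact construction), consider Nesterov's smoothing of the absolute value: writing $|t| = \max_{|u| \le 1} ut$ and subtracting the prox term $\frac{\mu}{2}u^2$ inside the maximization yields the Huber function, a smooth approximation with $1/\mu$-Lipschitz gradient and uniform error at most $\mu/2$.

```python
import numpy as np

def huber(t, mu):
    """Nesterov smoothing of |t|: max_{|u|<=1} (u*t - mu*u**2/2).

    Closed form: t**2/(2*mu) for |t| <= mu, and |t| - mu/2 otherwise.
    """
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= mu, t**2 / (2 * mu), np.abs(t) - mu / 2)

def huber_grad(t, mu):
    """Gradient of the smoothed function; it is 1/mu-Lipschitz."""
    t = np.asarray(t, dtype=float)
    return np.clip(t / mu, -1.0, 1.0)
```

The uniform error bound $0 \le |t| - \mathrm{huber}(t,\mu) \le \mu/2$ is exactly the trade-off used in smoothing schemes: a smaller $\mu$ gives a better approximation but a larger gradient Lipschitz constant $1/\mu$.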
We show that the GD-type analysis of the LCPG method can be extended to the convex case. In particular, when the objective and constraint functions are convex, we show that the LCPG method requires $O(1/\epsilon)$ gradient computations for smooth and composite constrained problems, and this complexity improves to $O(\log(1/\epsilon))$ when the objective is smooth and strongly convex. Furthermore, we develop the complexity of inexact variants of the LCPG method by leveraging the analysis of gradient descent with inexact projection oracles [31]. The inexact LCPG method maintains the gradient complexity of $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ for convex and strongly convex problems, respectively.
Throughout our analysis, we require that the Lagrange multipliers of the convex subproblems of type (1.2) be bounded. This issue is addressed in different ways in arguably all works in the literature. In this paper, we show that under the MFCQ assumption, the Lagrange multipliers associated with the sequence of subproblems remain bounded by a quantity denoted $B$. Even then, the value of $B$ cannot be estimated a priori. Fortunately, this bound is not needed in the implementation of our methods; however, it plays a role in the complexity analysis. Hence, our comparison with the existing complexity literature (e.g., the proximal point method of [4]) is valid when the bound $B$ on the sequence of Lagrange multipliers depends largely on the problem itself and not on the sequence of subproblems. One can easily see that such uniform bounds on the Lagrange multipliers hold under the strong feasibility constraint qualification [4], a similar uniform Slater's condition [26], or for the nonsmooth nonconvex relaxation arising in sparsity-constrained optimization [5]. Comparing the bounds $B$ on the Lagrange multipliers requires getting into specific applications, which is not the purpose of this paper. Hence, throughout our comparison with the existing literature, we assume that the bound $B$ for different methods is of a similar order.
Comparison with MBA method Auslender et al. [1] provided the Moving Balls Approximation (MBA) method for smooth constrained problems, i.e., when $\chi_i(x)$, $i = 0, \dots, m$, are not present. They use the Lipschitz continuity of the constraint gradients along with MFCQ to ensure that the subproblems satisfy Slater's condition (see [1, Proposition 2.1(iii)]). A similar result is also used in [35], which provides a line-search version of MBA for functions satisfying certain KL properties. The MBA method was studied for semi-algebraic functions in [3], using the KL property of semi-algebraic functions. The work [1] also provides a complexity guarantee for constrained programs with a smooth and strongly convex objective. Our results differ from these past studies in several aspects. First, we do not assume any KL property on the class of functions, making the method applicable to a wider class of problems. Second, we provide a complexity analysis for a variety of cases, e.g., stochastic, finite-sum, and structured nonsmooth settings. Note that complexity results are not known for MBA-type methods even for the purely smooth problem. Third, we show complexity results for both convex and strongly convex cases, which strictly subsumes the results in [1]. Fourth, it should be noted that [1] also considered the efficiency of solving the subproblems. They proposed an accelerated gradient method that obtains $O(1/\sqrt{\epsilon})$ complexity for solving the dual of the QCQP subproblem. However, it is unclear what accuracy suffices to ensure asymptotic convergence of the whole algorithm.

[Table 1, comparing the algorithms, appears here. Caption:] For convex problems, we consider the complexity to reach a feasible solution with an $O(\epsilon)$-optimality gap. For nonconvex problems, we consider the complexity to reach an approximate KKT solution that satisfies $\|\partial L\|_-^2 \le \epsilon$. Note that different works use quite different error measurements for the complementary slackness. For example, in our translation, [12,25] require an $O(\sqrt{\epsilon})$ error on complementary slackness and feasibility, whereas our measure requires $0$ feasibility error and an $O(\epsilon)$ complementary slackness error. *: IQRC does not explicitly discuss the composite case; their subproblem oracle can be upgraded to handle proximal cases relatively easily. : Different methods have different costs for solving the subproblem. Some methods require explicit gradient computations for solving the subproblems and hence are expected to be quite computationally costly; some methods (including ours) have simple subproblems (see Remark 11 for a detailed discussion). Hence, comparing total computational complexity is not possible; we instead compare gradient complexities to provide a realistic estimate of the computational effort of each of these methods.
Comparison with generalized SQP The work [12] developed the first complexity analysis of the generalized SQP (GSQP) method by using a novel ghost penalty approach. Different from our feasible method, they consider a general setting where feasibility and constraint qualification may or may not hold. They show that SQP-type methods have an $O(1/\epsilon^2)$ complexity for reaching an $\epsilon$-approximate generalized stationary point. Under an extended MFCQ condition, they established an improved complexity of $O(1/\epsilon)$ for reaching a scaled-KKT point, which matches our complexity result under a similar MFCQ assumption. Notably, both their analysis and ours rely on MFCQ to show that a global upper bound (the constant $B$) on the multipliers of the subproblems exists. However, to obtain the best $O(1/\epsilon)$ complexity, GSQP explicitly relies on the value of this unknown upper bound to determine the stepsize, which appears rather challenging in practical use. In contrast, our algorithm does not involve the constant $B$ in its implementation; we only require the Lipschitz constants of the gradients, which is standard for gradient descent methods.
Outline This paper is organized as follows: Section 2 describes notation and assumptions, and provides various definitions used throughout the paper. Section 3 presents the LCPG method, which uses exact solutions of the subproblems, and establishes asymptotic convergence and convergence rate results. Sections 4.1 and 4.2 provide the LCSPG and LCSVRG methods for stochastic and finite-sum problems, respectively. Section 5 shows the extension of LCPG to nonsmooth nonconvex function constraints. Section 6 introduces the inexact LCPG method and provides its complexity analysis when the subproblems are inexactly solved by an interior point method or a first-order method. Finally, Section 7 extends the LCPG method to convex optimization problems and establishes its complexity for both strongly convex and convex problems.

Notations and assumptions
Notations. $\mathbb{R}^n_+$ stands for the non-negative orthant in $\mathbb{R}^n$. We use $\|\cdot\|$ to denote the Euclidean norm. For a set $X$, we define $\|X\|_- := \mathrm{dist}(0, X) = \inf\{\|x\| : x \in X\}$. If $X$ is a convex set, we denote its normal cone at $x$ by $N_X(x)$, and the dual cone of the normal cone at $x$ by $N_X^*(x)$. Let $e$ be the vector of all ones. For simplicity, we denote $[m] = \{1, 2, \dots, m\}$, $f(x) = [f_1(x), \dots, f_m(x)]^T$, $\chi(x) = [\chi_1(x), \dots, \chi_m(x)]^T$, and $\psi(x) = [\psi_1(x), \psi_2(x), \dots, \psi_m(x)]^T$. For vectors $x, y \in \mathbb{R}^m$, $x \le y$ is understood as $x_i \le y_i$ for $i \in [m]$.

Assumption 1 (General). We assume that the optimal value of problem (1.1) is finite: $\psi_0^* > -\infty$. Furthermore, the objective and constraint functions have the following properties.
(2.1)

For the functions $\psi_i$, we denote the subdifferential by $\partial \psi_i(x) = \{\nabla f_i(x)\} + \partial \chi_i(x)$, $i = 0, \dots, m$, where the sum is in the Minkowski sense. Note that this definition of the subdifferential for nonconvex functions was first proposed in [4]. Moreover, $\partial \psi_i = \{\nabla f_i\}$ when $\psi_i$ is a "purely" smooth nonconvex function, and $\partial \psi_i = \partial \chi_i$ when $\psi_i$ is a nonsmooth convex function. Hence, it is a valid definition of the subdifferential of a nonconvex function. Below, we define the KKT condition using this subdifferential.
Definition 1 (KKT condition). We say $x \in \mathrm{dom}\,\chi_0$ is a KKT point of problem (1.1) if $x$ is feasible and there exists a vector $\lambda \in \mathbb{R}^m_+$ such that
$$0 \in \partial_x L(x, \lambda), \qquad \lambda_i [\psi_i(x) - \eta_i] = 0, \ i \in [m]. \qquad (2.2)$$
The values $\{\lambda_i\}$ are called Lagrange multipliers.
It is known that the KKT condition is necessary for optimality under the assumption of certain constraint qualifications (c.f.[2]).Our result will be based on a variant of the Mangasarian-Fromovitz constraint qualification, which is formally given below.
Proposition 1 (Necessary condition). Let $x$ be a local optimal solution of problem (1.1). If it satisfies MFCQ (2.3), then there is a vector $\lambda \in \mathbb{R}^m_+$ such that the KKT condition of Definition 1 holds.

Next, we introduce some optimality measures before formally presenting any algorithms. It is natural to characterize algorithm performance by measuring the error in satisfying the KKT condition. Towards this goal, we have the following definition.

Definition 3. We say that $x$ is an $\epsilon$ type-I (approximate) KKT point if it is feasible (i.e., $\psi(x) \le \eta$) and there exists a vector $\lambda \in \mathbb{R}^m_+$ satisfying
$$\|\partial_x L(x, \lambda)\|_-^2 \le \epsilon, \qquad -\textstyle\sum_{i=1}^m \lambda_i [\psi_i(x) - \eta_i] \le \epsilon.$$
Moreover, $x$ is a randomized $\epsilon$ type-I KKT point if both $x$ and $\lambda$ are feasible randomized primal-dual solutions that satisfy
$$\mathbb{E}\big[\|\partial_x L(x, \lambda)\|_-^2\big] \le \epsilon, \qquad \mathbb{E}\big[-\textstyle\sum_{i=1}^m \lambda_i [\psi_i(x) - \eta_i]\big] \le \epsilon,$$
where the expectation is taken over the randomness of $x$ and $\lambda$.
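For purely smooth constraints ($\chi_i = 0$), the set $\partial_x L(x,\lambda)$ is the single Lagrangian gradient, so the type-I residuals reduce to a gradient-norm check plus the complementary slackness error. Below is a minimal sketch on a toy instance of our own (not from the paper): $\min (x_1-2)^2 + (x_2-2)^2$ subject to $x_1 \le 1$, $x_2 \le 1$, whose exact KKT point is $(1,1)$ with $\lambda = (2,2)$.

```python
import numpy as np

def kkt_residuals(grad_f0, grads_f, psi_vals, eta, lam):
    """Type-I KKT residuals for smooth constraints (chi_i = 0):
    the squared norm of the Lagrangian gradient and the complementary
    slackness error -sum_i lam_i * (psi_i(x) - eta_i)."""
    grad_L = grad_f0 + sum(l * g for l, g in zip(lam, grads_f))
    stat = float(np.dot(grad_L, grad_L))
    comp = float(-np.dot(lam, psi_vals - eta))
    return stat, comp

# Toy check: min (x1-2)^2 + (x2-2)^2  s.t.  x1 <= 1, x2 <= 1.
x = np.array([1.0, 1.0])
grad_f0 = 2 * (x - 2)                                  # = (-2, -2)
grads_f = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
stat, comp = kkt_residuals(grad_f0, grads_f, x, np.array([1.0, 1.0]),
                           np.array([2.0, 2.0]))
```

Here both residuals vanish, certifying an exact (hence $\epsilon$ type-I for every $\epsilon \ge 0$) KKT point.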
Besides the above definition, we invoke a second optimality measure, which helps analyze the performance of proximal algorithms (see, for example, [4]); there, it is arguably more convenient to measure the proximity to some nearly stationary point.

Definition 4. We say that $x$ is an $(\epsilon, \nu)$ type-II KKT point if there exists an $\epsilon$ type-I KKT point $\hat{x}$ such that $\|x - \hat{x}\|^2 \le \nu$. Similarly, $x$ is a randomized $(\epsilon, \nu)$ type-II KKT point if $x$ is a random vector and $\mathbb{E}[\|x - \hat{x}\|^2] \le \nu$.

A proximal gradient method
We present the level constrained proximal gradient (LCPG) method in Algorithm 1. The main idea of this algorithm is to turn the original nonconvex problem into a sequence of relatively easier subproblems that involve convex surrogate functions $\psi_i^k(x)$ ($0 \le i \le m$) and variable constraint levels $\eta^k$:
$$\min_x \ \big\{ \psi_0^k(x) : \psi_i^k(x) \le \eta_i^k, \ i \in [m] \big\}. \qquad (3.1)$$
Above, we take the surrogate function $\psi_i^k(x)$ ($0 \le i \le m$) by partially linearizing $\psi_i(x)$ at $x^k$ and adding the proximal term $\frac{L_i}{2}\|x - x^k\|^2$:
$$\psi_i^k(x) := f_i(x^k) + \langle \nabla f_i(x^k), x - x^k \rangle + \tfrac{L_i}{2}\|x - x^k\|^2 + \chi_i(x). \qquad (3.2)$$
It should be noted that our algorithm may not be well-defined if it were initialized at an infeasible solution $x^0$. Furthermore, we require the initial point to be strictly feasible with respect to the nonlinear constraints $\psi(x) \le \eta$. Therefore, we explicitly state this assumption below and assume it holds throughout the paper.
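For $m = 1$ and $\chi_i \equiv 0$, subproblem (3.1) is a QCQP with two quadratics whose stationary point for a fixed multiplier $\lambda$ is $x(\lambda) = x^k - (g_0 + \lambda g_1)/(L_0 + \lambda L_1)$, and the surrogate constraint value at $x(\lambda)$ is nonincreasing in $\lambda$; the exact multiplier can therefore be found by bisection. The sketch below is our own illustration of this special case (the paper's actual subproblem solvers are a customized interior point method and ConEx); all names are ours.

```python
import numpy as np

def lcpg_subproblem_1con(xk, g0, L0, g1, L1, slack, lam_max=1e6):
    """Solve  min <g0, x - xk> + L0/2 ||x - xk||^2
              s.t. <g1, x - xk> + L1/2 ||x - xk||^2 + slack <= 0
    (one smooth constraint, chi = 0) by bisection on the multiplier.
    `slack` stands for f1(xk) - eta^k, which is < 0 at a strictly
    feasible xk."""
    def x_of(lam):
        return xk - (g0 + lam * g1) / (L0 + lam * L1)
    def con(lam):
        d = x_of(lam) - xk
        return g1 @ d + 0.5 * L1 * (d @ d) + slack
    if con(0.0) <= 0:                 # unconstrained prox step is feasible
        return x_of(0.0), 0.0
    lo, hi = 0.0, 1.0
    while con(hi) > 0 and hi < lam_max:  # grow the bracket
        hi *= 2
    for _ in range(200):              # bisection: con is nonincreasing
        mid = 0.5 * (lo + hi)
        if con(mid) > 0:
            lo = mid
        else:
            hi = mid
    return x_of(hi), hi

# Toy instance: xk = 0, g0 = (-4, 0), L0 = L1 = 1, g1 = (1, 0), slack = -2.
x_sol, lam = lcpg_subproblem_1con(np.zeros(2), np.array([-4.0, 0.0]), 1.0,
                                  np.array([1.0, 0.0]), 1.0, -2.0)
```

On this instance the active constraint $d + d^2/2 = 2$ gives the step length $d = \sqrt{5} - 1$ along the first coordinate, which the bisection recovers to high accuracy.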
Given a strictly feasible initial solution, we assume that the constraint levels $\{\eta^k\}$ are incrementally updated and converge to the constraint levels of the original problem: $\lim_{k \to \infty} \eta_i^k = \eta_i$, $i \in [m]$. The following lemma will be used many times throughout the rest of the paper.
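The specific level update used by Algorithm 1 is given there; purely as one hypothetical schedule satisfying the requirements $\eta_i^{k-1} < \eta_i^k < \eta_i$ and $\eta_i^k \to \eta_i$ (scalar case, names ours), one can close a fixed fraction of the remaining gap at every iteration:

```python
def level_schedule(eta, eta0, beta=0.5, K=10):
    """One hypothetical choice of strictly increasing levels eta^k -> eta:
        eta^k = eta - beta**k * (eta - eta^0),  beta in (0, 1),
    where eta^0 < eta reflects the strict feasibility margin at x^0."""
    assert 0.0 < beta < 1.0 and eta0 < eta
    return [eta - beta**k * (eta - eta0) for k in range(K + 1)]

levels = level_schedule(eta=3.0, eta0=2.0, beta=0.5, K=10)
```

Any such schedule keeps every subproblem strictly feasible at the previous iterate, which is exactly what the induction in the proof of Proposition 2 uses.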
Next, we present some important properties of the generated solutions in the following proposition.

Proposition 2. Suppose that Assumption 2 holds; then Algorithm 1 has the following properties.
1. The sequence $\{x^k\}$ is well-defined and feasible for problem (1.1). Moreover, $\{x^k\}$ satisfies the sufficient descent property
$$\psi_0(x^{k+1}) \le \psi_0(x^k) - \tfrac{L_0}{2}\|x^{k+1} - x^k\|^2, \qquad (3.5)$$
so the sequence of objective values $\{\psi_0(x^k)\}$ is monotonically decreasing and $\lim_{k \to \infty} \psi_0(x^k)$ exists.
2. There exists a vector $\lambda^{k+1} \in \mathbb{R}^m_+$ such that the KKT condition (3.6) holds.

Proof. Part 1). First, we show that $\{x^k\}$ is a well-defined sequence, namely, $X_k \cap \mathrm{dom}\,\chi_0$ is a nonempty set, where $X_k := \{x : \psi_i^k(x) \le \eta_i^k, \ i \in [m]\}$. This result clearly holds for $k = 0$ by Assumption 2. We show the general case ($k > 0$) by induction. Suppose that $x^k$ is well-defined, i.e., $X_{k-1} \cap \mathrm{dom}\,\chi_0$ is nonempty. Then, by the definitions of $\psi_i^k$, $\psi_i^{k-1}$ and $x^k$, we have $x^k \in \mathrm{dom}\,\chi_0$ and
$$\psi_i^k(x^k) = \psi_i(x^k) \le \psi_i^{k-1}(x^k) \le \eta_i^{k-1} < \eta_i^k \quad \text{for all } i \in [m]. \qquad (3.7)$$
Here, the first inequality follows from the smoothness of $f_i(x)$, which ensures that for all $x$,
$$f_i(x) \le f_i(x^{k-1}) + \langle \nabla f_i(x^{k-1}), x - x^{k-1} \rangle + \tfrac{L_i}{2}\|x - x^{k-1}\|^2, \quad \forall i \in [m].$$
Relation (3.7) is equivalent to $x^k \in X_k \cap \mathrm{dom}\,\chi_0$, implying that $X_k \cap \mathrm{dom}\,\chi_0$ is nonempty, and hence $x^{k+1}$ is well-defined. By induction, we conclude that $\{x^k\}$ is a well-defined sequence. Furthermore, in view of $x^k \in \mathrm{dom}\,\chi_0$, relation (3.7) and the fact that $\eta_i^k < \eta_i$, we have $x^k \in \mathrm{dom}\,\chi_0 \cap \{x : \psi_i(x) \le \eta_i, \ i = 1, \dots, m\}$. Hence, the whole sequence $\{x^k\}$ remains feasible for the original problem. Now, let us apply Lemma 1 with $g(x) = \langle \nabla f_0(x^k), x \rangle + \chi_0(x) + 1_{X_k}(x)$, $y = x^k$ and $\gamma = L_0$. Then, for any $x \in \mathrm{dom}\,\chi_0 \cap X_k$, we have (3.8). Moreover, since $f_0(\cdot)$ is Lipschitz smooth, we have $f_0(x^{k+1}) \le f_0(x^k) + \langle \nabla f_0(x^k), x^{k+1} - x^k \rangle + \frac{L_0}{2}\|x^{k+1} - x^k\|^2$. Summing up the above two inequalities and using the definition $\psi_0 = f_0 + \chi_0$, we conclude (3.5). Hence, the sequence $\{\psi_0(x^k)\}$ is monotonically decreasing, and its convergence follows from the lower boundedness assumption.

Part 2). Note that (3.7) ensures the strict feasibility of $x^k$ with respect to the constraint set $X_k$ of the $k$-th subproblem. Therefore, Slater's condition for (3.1) and the optimality of $x^{k+1}$ imply that there must exist a vector $\lambda^{k+1} \in \mathbb{R}^m_+$ satisfying the KKT condition (3.6). Hence, we complete the proof.

In order to show convergence to KKT solutions, we need the following constraint qualification.

Assumption 3 (Uniform MFCQ). All feasible points of problem (1.1) satisfy MFCQ.
We state the main asymptotic convergence property of LCPG in the following theorem.
Theorem 1. Suppose that Assumption 3 holds; then we have the following conclusions.
1. The dual solutions $\{\lambda^{k+1}\}$ are bounded from above; namely, there exists a constant $B > 0$ such that
$$\sup_{k} \|\lambda^{k+1}\| \le B. \qquad (3.9)$$
2. Every limit point of Algorithm 1 is a KKT point.
Proof. Part 1). First, observe that $\{x^k\}$ is a bounded sequence, and hence a limit point exists; this follows from Assumption 1.3 and the feasibility of $x^k$ for problem (1.1) (c.f. Proposition 2, Part 1). Without loss of generality, we can assume $\lim_{k \to \infty} x^k = \bar{x}$. For the sake of contradiction, suppose that $\{\lambda^{k+1}\}$ is unbounded; then, passing to a subsequence if necessary, we can assume $\|\lambda^{k+1}\| \to \infty$ for simplicity.
(3.10) Let $X := \bigcap_{i \in [m]} \{x : \psi_i(x) \le \eta_i\} \cap \mathrm{dom}\,\chi_0$ be the feasible region of problem (1.1). Due to the fact that $x^k \in X$ (Proposition 2), the boundedness of $X$ (Assumption 1.3) and the strong convexity of $\psi_0^k$, there exists $l_0 \in \mathbb{R}$ such that $X \subset \{x : \psi_0^k(x) < l_0\}$ for all $k$. Then, using (3.10) for all $x \in \mathrm{dom}\,\chi_0 \cap \{\psi_0^k(x) \le l_0\}$ and dividing both sides by $\|\lambda^{k+1}\|$, we obtain (3.11). Let us take $k \to \infty$ on both sides of (3.11). Note that for all $x \in \mathrm{dom}\,\chi_0 \cap \{\psi_0^k(x) \le l_0\}$, the limits (3.12)-(3.14) hold, where (3.12) is due to the boundedness of $\psi_0^k(x)$ on $\mathrm{dom}\,\chi_0 \cap \{\psi_0^k(x) \le l_0\}$ and the boundedness of $\psi_0^k(x^{k+1})$, since $x^{k+1}$ lies in the bounded set $X$; moreover, (3.13) and (3.14) use the continuity of $f_i(x)$ and $\chi_i(x)$, $i \in [m]$. Next, we consider the sequence $u^k := \lambda^{k+1}/\|\lambda^{k+1}\|$. Since $\{u^k\}$ is a bounded sequence, it has a convergent subsequence. Let $\bar{u}$ be a limit point of $\{u^k\}$ and let $\{j_k\} \subseteq \{1, 2, \dots\}$ be a subsequence such that $\lim_{k \to \infty} u^{j_k} = \bar{u}$. Since any subsequence of a convergent sequence is also convergent, we pass to the subsequence $\{j_k\}$ in (3.11) and apply (3.12), (3.13) and (3.14), yielding that $\bar{x}$ minimizes $\sum_{i=1}^m \bar{u}_i \psi_i(x)$ over $x \in \mathrm{dom}\,\chi_0 \cap \{\psi_0^k(x) \le l_0\}$. Now, noting $\bar{x} \in X \subset \{\psi_0^k(x) < l_0\}$ and using the stationarity condition for the optimality of $\bar{x}$, we have
$$0 \in \textstyle\sum_{i=1}^m \bar{u}_i \, \partial \psi_i(\bar{x}) + N_{\mathrm{dom}\,\chi_0}(\bar{x}), \qquad (3.16)$$
where we dropped the constraint $\psi_0^k(x) \le l_0$ due to complementary slackness and the fact that $\psi_0^k(\bar{x}) < l_0$. Let $\mathcal{B} := \{i : \bar{u}_i > 0\}$; then we must have $\lim_{k \to \infty} \lambda_i^{j_k} = \lim_{k \to \infty} \bar{u}_i \|\lambda^{j_k}\| = \infty$ for $i \in \mathcal{B}$. Based on complementary slackness, we have $\psi_i^{j_k}(x^{j_k + 1}) = \eta_i^{j_k}$ for any $i \in \mathcal{B}$ for large enough $k$. Due to (3.14), we have in the limit $\psi_i(\bar{x}) = \eta_i$; therefore, the $i$-th constraint is active at $\bar{x}$, i.e., $i \in A(\bar{x})$. In view of (3.16), there exist subgradients $v_i \in \partial \psi_i(\bar{x})$, $i \in [m]$, and $v_0 \in N_{\mathrm{dom}\,\chi_0}(\bar{x})$ such that $0 = v_0 + \sum_{i \in \mathcal{B}} \bar{u}_i v_i$.
(3.17) However, equation (3.17) contradicts the MFCQ assumption. Indeed, MFCQ guarantees the existence of $z \in -N_{\mathrm{dom}\,\chi_0}(\bar{x})$ such that $\langle z, v_i \rangle < 0$ for all $i \in A(\bar{x})$, which implies
$$0 = \langle z, v_0 \rangle + \textstyle\sum_{i \in \mathcal{B}} \bar{u}_i \langle z, v_i \rangle \le \textstyle\sum_{i \in \mathcal{B}} \bar{u}_i \langle z, v_i \rangle \le \textstyle\sum_{i \in \mathcal{B}} \bar{u}_i \max_{v \in \partial \psi_i(\bar{x})} \langle z, v \rangle < 0,$$
where the first inequality follows since $z \in -N_{\mathrm{dom}\,\chi_0}(\bar{x})$ and $v_0 \in N_{\mathrm{dom}\,\chi_0}(\bar{x})$, implying $\langle z, v_0 \rangle \le 0$; the second inequality follows since $\bar{u}_i \ge 0$ and $v_i \in \partial \psi_i(\bar{x})$; and the last strict inequality follows from uniform MFCQ (c.f. Assumption 3 and relation (2.3)) together with $\bar{u}_i > 0$ for at least one $i \in \mathcal{B}$. This is a contradiction. Hence, $\{\lambda^{k+1}\}$ is a bounded sequence, which concludes the proof of Part 1).

Part 2). Without loss of generality, we assume that $\bar{x}$ is the only limit point of $\{x^k\}$. Since $\{\lambda^{k+1}\}$ is a bounded sequence, it has a limit point $\bar{\lambda}$. Passing to a subsequence if necessary, we have $\lambda^{k+1} \to \bar{\lambda}$.

From the optimality condition of the subproblem, we have (3.18). Let us take $k \to \infty$ on both sides of (3.18). We note that $\lim_{k \to \infty} \|x^k - x^{k+1}\| = 0$ due to Proposition 2, $\lim_{k \to \infty} \chi_i(x^{k+1}) = \chi_i(\bar{x})$ due to the continuity of $\chi_i$ ($i \in [m]$), and $\chi_0(\bar{x}) \le \liminf_{k \to \infty} \chi_0(x^k)$ due to the lower semi-continuity of $\chi_0(\cdot)$. It then follows that
$$\big\langle \nabla f_0(\bar{x}) + \textstyle\sum_{i=1}^m \bar{\lambda}_i \nabla f_i(\bar{x}), \ \bar{x} - x \big\rangle + \chi_0(\bar{x}) - \chi_0(x) + \big\langle \bar{\lambda}, \ \chi(\bar{x}) - \chi(x) \big\rangle \le 0.$$
Remark 1. To show the existence of a limit point $\bar{x}$, we use Assumption 1.3 to ensure that the sequence $\{x^k\}$ remains in a bounded domain. For conciseness, we henceforth assume the existence of a limit point $\bar{x}$ and do not delve into the technical assumptions used to ensure this condition. Moreover, it should be noted that the boundedness property can be obtained under other assumptions; e.g., assuming the compactness of the sublevel set $\{x : \psi_0(x) \le \psi_0(x^0)\}$ and using the sufficient descent condition (3.5), we can immediately show the existence of $\bar{x}$. However, it appears to be more challenging to show convergence via this approach when the sufficient descent condition fails (e.g., in the forthcoming stochastic optimization).
3.1 Dependence of B on the constraint qualification. In Theorem 1, we proved the existence of a bound $B$ on the dual multipliers. However, the size of that bound remains unknown. Through Example 1 below, we observe that the limiting behaviour of the sequence $\{\lambda^k\}$ (which largely governs the size of $B$) is closely tied to the magnitude of the number
$$c(\bar{x}) := -\min_{\|z\| \le 1} \max_{v \in \partial \psi_i(\bar{x})} \langle v, z \rangle.$$
Here, the inner max follows from relation (2.3), and the outer min tries to find the best possible $z$ that ensures MFCQ. If $c(\bar{x})$ is a large positive number, then MFCQ is strongly satisfied and $B$ is a reasonable bound. In contrast, if $c(\bar{x})$ is close to $0$, then $B$ can get quite large.
Example 1. Consider a two-dimensional optimization problem with a SCAD constraint: $\min_x \psi_0(x)$ subject to $\psi_1(x) \le \eta_1$, where $\psi_0(x) = 7 - x_1$ and $\psi_1(x) = \sum_{j=1}^2 h_{\beta,\theta}(x_j)$ is the (standard) SCAD penalty with
$$h_{\beta,\theta}(t) = \begin{cases} \beta |t|, & |t| \le \beta, \\ \frac{-t^2 + 2\theta\beta|t| - \beta^2}{2(\theta - 1)}, & \beta < |t| \le \theta\beta, \\ \frac{(\theta + 1)\beta^2}{2}, & |t| > \theta\beta. \end{cases}$$
This function fits our framework with smoothness parameter $\frac{1}{\theta - 1}$. Let us consider $\beta = 1$, $\theta = 5$, the level $\eta_1 = 3$, and the limit point $\bar{x} = (5, 0)$. Clearly, the constraint is active at $\bar{x}$, and $\partial \psi_1(\bar{x}) = \{[0, t]^T : t \in [-1, 1]\}$, implying $c(\bar{x}) = 0$. Furthermore, since $\nabla \psi_0(\bar{x}) = (-1, 0)^T$, no $\lambda \ge 0$ and $t \in [-1, 1]$ satisfy the KKT condition $\nabla \psi_0(\bar{x}) + \lambda [0, t]^T = 0$. Hence, as we approach this limit point, the bound on $\|\lambda^k\|$ gets arbitrarily large. An easy way to see this fact is to construct a subproblem at the limit point itself: inspecting the feasible region of the subproblem at $(5, 0)$, it is clear that it has only one feasible solution, $(5, 0)$, which gives rise to degeneracy. See Figure 1 for more details: Figure 1a shows the well-behaved subproblem at an interior point, while Figure 1b shows the degeneracy at the limit. However, if we change the level $\eta_1$ to any value either above or below $3$, we do not get any violation of MFCQ; this also gives nondegenerate feasible sets at the limit point, and $\lambda^k$ remains bounded for all $k$. See Figure 2 for more details. In particular, if $\eta_1 = 2.5 < 3$, then $\bar{x} = (3, 0)$ is the limit point. At this point, we have $\partial \psi_1(\bar{x}) = \{[0.5, t]^T : t \in [-1, 1]\}$. Choosing the unit vector $z = (z_1, z_2) = (-1, 0)$, we obtain that the point $\bar{x}$ satisfies MFCQ with $c(\bar{x}) = 0.5$. Hence, even as the search point reaches the limit point $\bar{x}$ (i.e., $\epsilon \to 0$), the multipliers $\lambda^k$ remain bounded (see, in particular, Figure 2b, whose subproblem at $\bar{x}$ has a nonempty interior). In view of the above example, we see that the limiting behavior of $\|\lambda^k\|$ (and by implication the order of $B$) is closely related to the "strength" of the constraint qualification MFCQ at the limit point. In order to get an a priori bound $B$, we use a somewhat stronger yet verifiable constraint qualification called strong feasibility, which is a slight modification of the CQ proposed in [4, Assumption 3].
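The numbers in Example 1 can be checked directly. The small sketch below assumes the standard piecewise SCAD penalty $h_{\beta,\theta}$; it verifies the constraint values and slopes at the two limit points discussed above.

```python
import numpy as np

def scad(t, beta=1.0, theta=5.0):
    """Standard SCAD penalty, applied componentwise to form psi_1."""
    a = abs(t)
    if a <= beta:
        return beta * a
    if a <= theta * beta:
        return (-a**2 + 2 * theta * beta * a - beta**2) / (2 * (theta - 1))
    return (theta + 1) * beta**2 / 2

def scad_grad(t, beta=1.0, theta=5.0):
    """Derivative of the SCAD penalty (one-sided sign at t = 0)."""
    a, s = abs(t), np.sign(t)
    if a <= beta:
        return s * beta
    if a <= theta * beta:
        return s * (theta * beta - a) / (theta - 1)
    return 0.0

# At x = (3, 0): psi_1 = 2.5 (active for eta_1 = 2.5), slope 0.5 in x_1,
# so the subdifferential is {(0.5, t) : t in [-1, 1]} and c(x) = 0.5
# with z = (-1, 0).  At x = (5, 0): psi_1 = 3 with zero x_1-slope.
```

With $\beta = 1$, $\theta = 5$ this reproduces exactly the values used in the example: $\psi_1((3,0)) = 2.5$ with slope $0.5$, and $\psi_1((5,0)) = 3$ with slope $0$, i.e., $c(\bar{x}) = 0$ at the degenerate level $\eta_1 = 3$.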

Assumption 4 (Strong feasibility CQ).
There exists $\hat{x} \in X := \dots$ In view of Assumption 1.3, we note that $X$ is a bounded set. Hence, $D_X$ and (consequently) Assumption 4 are well-defined. See [4] for a connection between Assumption 4 and Assumption 3. Below, we show that the strong feasibility CQ leads to a fixed a priori bound on $\lambda^k$.

Lemma 2. Suppose Assumption 4 is satisfied. Then, $\|\lambda^k\|_1 \le B := \dots$, where the first inequality uses $f_i(x) \ge f_i(x^k) + \langle \nabla f_i(x^k), x - x^k \rangle - \frac{L_i}{2}\|x - x^k\|^2$ (which follows from the $L_i$-smoothness of $f_i$), and the second inequality follows from Assumption 4 along with the fact that $x^k \in X$ (see Proposition 2).
In view of (3.24), we have strict feasibility of subproblem (3.1) for all $k$, implying that $\lambda^{k+1}$ exists. Hence, we have $x^{k+1} = \mathrm{argmin}_x\, \psi_0^k(x) + \langle \lambda^{k+1}, \psi^k(x) \rangle$. Then, for all $x \in \mathrm{dom}\,\chi_0$, we have
$$\psi_0^k(x^{k+1}) = \psi_0^k(x^{k+1}) + \langle \lambda^{k+1}, \psi^k(x^{k+1}) - \eta^k \rangle \le \psi_0^k(x) + \langle \lambda^{k+1}, \psi^k(x) - \eta^k \rangle,$$
where the equality follows from the complementary slackness of the KKT condition, and the inequality is due to the optimality of $x^{k+1}$. Using $x = \hat{x}$ in the above relation and combining with (3.24), we obtain a bound in which the first inequality follows from (3.24) for $i = 0$ and $\psi_0^k(x) \ge \psi_0(x)$, and the second inequality follows from the definition of $\psi_0^*$ and $D_X$. Combining this with (3.25), we get the result. Hence, we conclude the proof.
The discussion above implies that the value of B is intricately related to the constraint qualification. While uniform MFCQ is unverifiable and does not allow for a priori bounds on B, it is widely used in nonlinear programming to ensure the existence of such a bound [2]. As observed in Figures 1b and 2b, the actual value of B depends on how close the limit point is to violating MFCQ. This situation is rare, but the current assumptions do not eliminate the possibility. Problems of this nature are ill-conditioned, and, to our knowledge, no algorithm can ensure bounds on the dual in such a situation. The existing literature deals with this issue in two ways: one track assumes the existence of B (similar to Theorem 1) and performs the complexity or convergence analysis; a second track assumes a stronger constraint qualification that removes the ill-conditioned problems and shows a more explicit bound on the dual (similar to Lemma 2). We perform our analysis for both cases. To conclude, we henceforth treat the bound B as a constant and do not delve further into the discussion of related constraint qualifications. To substantiate that the bound B is small in practice, we perform detailed numerical experiments in Section 8.

3.2
Convergence rate analysis of LCPG method. Our main goal now is to develop non-asymptotic convergence rates for Algorithm 1.
Lemma 3. In Algorithm 1, for k = 1, 2, ..., the bound (3.26) holds. Proof. Let L_k be the Lagrangian function of subproblem (3.1), defined in (3.27). Using (2.1) and (3.27), we obtain an intermediate relation; then, using the smoothness of f_i(x), the optimality condition 0 ∈ ∂_x L_k(x_{k+1}, λ_{k+1}), and the triangle inequality, we obtain (3.29). Hence we conclude the proof.
In view of Lemma 3, we derive the complexity of LCPG for attaining approximate KKT solutions in the following theorem. Theorem 2. Let α_k > 0 (k = 0, 1, ..., K) be a non-decreasing sequence and suppose that Assumption 3 holds. Then there exists a constant B > 0 such that the bound (3.30) holds with the stated constant D. Moreover, if we choose the index k̂ ∈ {0, 1, ..., K} with probability P(k̂ = k) = α_k / (Σ_{i=0}^K α_i), then x_{k̂+1} is a randomized ϵ_K type-I KKT point for ϵ_K defined in (3.32). Proof. From the sufficient descent property (3.5), we have (3.33), where the second inequality uses the monotonicity of the sequence ψ_0(x_k). In view of Theorem 1 and the Cauchy–Schwarz inequality, we have ⟨λ_{k+1}, L⟩ ≤ ∥λ_{k+1}∥∥L∥ ≤ B∥L∥. This relation and (3.26) imply the stated estimate, and combining it with (3.33) immediately yields (3.30).
Next, we bound the error in complementary slackness. We have (3.34), where the first inequality uses the triangle inequality, the second uses complementary slackness and the Lipschitz smoothness of f_i(·), and the last follows from the Cauchy–Schwarz inequality and the boundedness of λ_{k+1}. Summing up (3.34) weighted by α_k for k = 0, ..., K and combining the result with (3.33) gives (3.31). Finally, the fact that x_{k̂+1} is a randomized ϵ_K type-I KKT point, for ϵ_K defined in (3.32), follows immediately from (3.30), (3.31) and Definition 3.
The following corollary shows that the output of Algorithm 1 is a randomized O(1/K) KKT point under a more specific parameter selection.
Corollary 1. In Algorithm 1, suppose that all the assumptions of Theorem 2 hold, and set δ_k = (η − η_0)/((k+1)(k+2)) and α_k = k + 1. Then, for i ∈ [m] and k ≥ 0, the stated bounds hold, and plugging these values into (3.32) gives the desired conclusion.
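To illustrate why the levels remain strictly feasible under this kind of schedule, the sketch below runs the level update η_{k+1} = η_k + δ_k (assumed from (3.3)) with δ_k = (η − η_0)/((k+1)(k+2)), the choice used later in the proof of Theorem 6; the values η_0 = 1 and η = 3 are illustrative. Since Σ_{k=0}^{K−1} 1/((k+1)(k+2)) = 1 − 1/(K+1), we get η_K = η − (η − η_0)/(K+1) < η for every K.

```python
def levels(eta0, eta, K):
    """Run the level schedule eta_{k+1} = eta_k + delta_k with
    delta_k = (eta - eta0) / ((k + 1) * (k + 2))."""
    etas = [eta0]
    for k in range(K):
        delta_k = (eta - eta0) / ((k + 1) * (k + 2))
        etas.append(etas[-1] + delta_k)
    return etas

seq = levels(eta0=1.0, eta=3.0, K=1000)
# The levels increase strictly but never reach eta: the increments telescope to
# eta_K = eta - (eta - eta0) / (K + 1).
assert all(a < b for a, b in zip(seq, seq[1:]))
assert all(e < 3.0 for e in seq)
assert abs(seq[-1] - (3.0 - 2.0 / 1001)) < 1e-9
```

This is precisely the mechanism by which the algorithm follows a strictly feasible solution path while the levels converge to η.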
Remark 2. Corollary 1 shows that the gradient complexity of LCPG for smooth composite constrained problems is on a par with that of gradient descent for unconstrained optimization problems. To the best of our knowledge, this is the first complexity result for a constrained problem in which the constraint functions can be nonsmooth and nonconvex. Note that the convergence rate involves the unknown bound B on the Lagrange multipliers. The presence of such a constant is not new in the nonlinear programming literature [1,8,13,9,12]. Fortunately, the LCPG method can be implemented safely since the step-size scheme does not rely on B. On the other hand, the bound B is often a problem-dependent quantity: e.g., [4] shows a class of problems for which an a priori bound B can be established, and [5] gives the exact value of B for a class of nonconvex relaxations of sparse optimization problems. In such cases, our comparisons are arguably fair. Hence, throughout the paper, we make comparative statements under the understanding that B largely depends on the problem.

Stochastic optimization
The goal of this section is to extend our proposed framework to stochastic constrained optimization, where the objective f_0 is an expectation function: f_0(x) := E_{ξ∈Ξ}[F(x, ξ)]. (4.1) Here, F(x, ξ) is differentiable and ξ denotes a random variable following a certain distribution over Ξ. Directly evaluating either the objective f_0 or its gradient can be computationally challenging due to the stochastic nature of the problem. To address this, we introduce the following additional assumptions.

Level constrained stochastic proximal gradient
In Algorithm 2, we present a stochastic variant of LCPG for solving problem (1.1) with f_0 defined by (4.1). As observed in (4.2) and (4.3), the LCSPG method uses a mini-batch of random samples to estimate the true gradient in each iteration. It should be noted that the value f_0(x_k) appears in (4.3) only for ease of description; it is not required when solving (3.1). Note that the proximal point method of [4] does not need to treat the stochastic nonconvex problem separately, since it solves the corresponding stochastic convex subproblems using the ConEx method developed in that work. In contrast, LCSPG applies directly to stochastic nonconvex function constrained problems, and its convex subproblems are deterministic in nature. Hence, we develop the asymptotic convergence and convergence rates of the LCSPG method separately. Let ζ_k := G_k − ∇f(x_k) denote the error of the gradient estimate.

Algorithm 2 Level constrained stochastic proximal gradient (LCSPG)
  Set ψ_0^k(x) by (4.3); for i = 1, ..., m, set ψ_i^k(x) by (3.2);

The following proposition summarizes some important properties of the solutions generated by LCSPG. Proposition 3. In Algorithm 2, for any β_k ∈ (0, 2γ_k − L_0), the descent relation (4.4) holds. Moreover, there exists a vector λ_{k+1} ∈ R_+^m such that the KKT condition (3.6) (with ψ_0^k defined in (4.3)) holds.
Proof. By the KKT conditions, x_{k+1} is the minimizer of L_k(·, λ_{k+1}). Therefore, we have (4.5). Setting x = x_k in (4.5) and using (3.27), we obtain the next relation, where the second inequality is due to complementary slackness; then, using the Lipschitz smoothness of f_0, we arrive at the claimed bound, where the last inequality uses the fact that −(a/2)x² + bx ≤ b²/(2a) for any x, b ∈ R and a > 0. The existence of the KKT multiplier follows from an argument similar to that of part 2 of Proposition 2.
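The role of the mini-batch size b in the gradient estimate (4.2) can be seen on a toy instance (our own example, not from the paper): with F(x, ξ) = (x − ξ)²/2 and ξ ∼ N(0, 1), each sample gradient x − ξ has variance σ² = 1, and the variance of the mini-batch average decays like σ²/b.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_grad(x, batch):
    """Mini-batch estimate of grad f0(x) = E[grad F(x, xi)] for the toy
    objective F(x, xi) = 0.5 * (x - xi)^2, so grad F(x, xi) = x - xi."""
    return np.mean(x - batch)

# xi ~ N(0, 1): grad f0(x) = x and each sample gradient has variance 1.
x, sigma2 = 2.0, 1.0
for b in (1, 16, 256):
    est = [minibatch_grad(x, rng.standard_normal(b)) for _ in range(20000)]
    emp_var = np.var(est)
    # The empirical variance of the estimator tracks sigma^2 / b.
    assert abs(emp_var - sigma2 / b) < 0.2 * sigma2 / b
```

This σ²/b scaling is what allows LCSPG to trade batch size against the number of outer iterations in the complexity bounds below.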
We prove a technical result in the following lemma, which plays a crucial role in proving dual boundedness. Lemma 4. Let {X_k}_{k≥1} be a sequence of random vectors such that E[X_k] = 0 for all k ≥ 1 and Σ_k E∥X_k∥² < ∞. Then lim_{k→∞} X_k = 0 a.s. Proof. We prove this result by contradiction. If the result does not hold, then there exist ϵ > 0 and c > 0 such that the events {∥X_k∥ ≥ ϵ} occur with positive probability infinitely often. However, by Chebyshev's inequality, we have P(∥X_k∥ ≥ ϵ) ≤ E∥X_k∥²/ϵ². Therefore, the probabilities P(∥X_k∥ ≥ ϵ) are summable, which contradicts (4.8). Hence, lim_{k→∞} X_k = 0 a.s.
In the following theorem, we present the main asymptotic property of LCSPG.
Proof. First, we fix notation. Let (Ω, F, P) be the probability space defined by the sampled mini-batches B_0, B_1, ..., and let E_k[·] be the expectation conditioned on the σ-algebra generated by B_0, B_1, ..., B_{k−1}. Applying it to (4.4) gives the conditional descent relation. In view of the super-martingale convergence theorem [30], lim_{k→∞} ψ_0(x_k) exists and is finite a.s. Defining C_k as the weighted accumulation of ∥x_{s+1} − x_s∥² for k ≥ 0 with C_0 = 0 and applying the super-martingale convergence theorem [30] again, we can show that the limit of ψ_0(x_k) + C_k exists a.s. Together with (4.9) and the lower boundedness of β_k and 2γ_k − β_k − L_0, we conclude that lim_{k→∞} ∥x_{k+1} − x_k∥² = 0 a.s.
Next, we prove the boundedness of ∥λ_k∥. To this end, consider the event A, on which the gradient estimates {G_k} remain bounded, the event B, on which ∥x_{k+1} − x_k∥ → 0, and the event U, on which the sequence {λ_k} is unbounded.
We just argued PpBq " 1.It is easy to see that if both conditions (i) PpAq " 1 and (ii) U Ď A c Y B c hold, then we have PpUq ď PpA c q `PpB c q " 0. Hence tλ k u is a bounded sequence a.s.Since tb ´1 k u is summable, we have Hence, using Lemma 4, we have that lim kÑ8 ζ k " 0 a.s.Due to the boundedness of ∇f px k q, we have that G k " ζ k `∇f px k q is bounded, a.s.
We prove condition (ii) by contradiction. Suppose that the claim fails. Take an element ω ∈ U ∩ (A ∩ B) and pass to a subsequence {j_k} such that lim_{k→∞} ∥λ_{j_k}(ω)∥ = ∞. In the rest of the proof, we omit ω for brevity. Passing to another subsequence if necessary, let x̄ be a limit point of {x_{j_k}}. By our presumption, x̄ satisfies MFCQ. Moreover, the KKT condition implies (4.11), where we denote u_k = λ_{j_k+1}/∥λ_{j_k+1}∥. Since {u_k} is bounded, passing to a further subsequence if needed, we have lim_{k→∞} u_k = ū. Since ω ∈ A ∩ B, {G_{j_k}} is bounded and {(γ_{j_k}/2)∥x_{j_k+1} − x_{j_k}∥²} converges to 0. Therefore, taking k → ∞ on both sides of (4.11), we have ⟨ū, χ(x̄)⟩ ≤ ⟨ū, ψ(x̄) + ⟨∇ψ(x̄), x − x̄⟩ + (L/2)∥x − x̄∥² + χ(x)⟩ for all x ∈ dom χ_0. (4.12) Analogous to the proof of Theorem 1, it is easy to show that x̄ violates MFCQ, which, however, contradicts Assumption 3. As a consequence of this argument, we have U ⊆ A^c ∪ B^c. Hence, the event sup_k ∥λ_k∥ < ∞ happens a.s., which completes the proof of the boundedness condition.
Next, we prove asymptotic convergence to KKT solutions. For any random element ω, let x̄(ω) be any limit point of {x_k}. Passing to a subsequence if necessary, we assume that lim_{k→∞} x_k = x̄ and lim_{k→∞} λ_{k+1} = λ̄.
Moreover, we have a second relation; combining the two results yields an inequality involving the term ⟨ζ_k, x − x_k⟩. Taking k → ∞ in this relation and noting that, almost surely, lim_{k→∞} ζ_k = 0 and lim_{k→∞} ∥x_k − x_{k+1}∥ = 0, we obtain ⟨∇f_0(x̄) + ∇f(x̄)λ̄, x − x̄⟩ + χ_0(x) − χ_0(x̄) + ⟨λ̄, χ(x) − χ(x̄)⟩ ≥ 0 for all x ∈ dom χ_0, a.s. Using an argument similar to the one in Theorem 1, we can show that x̄ is almost surely a KKT point.
Our next goal is to develop the iteration complexity of Algorithm 2. To achieve this goal, we need to assume that the dual is uniformly bounded; namely, condition (3.9) holds for all random events. While this condition is stronger than the almost-sure boundedness of λ_{k+1} shown in Theorem 3, it is indeed satisfied in many scenarios, e.g., when strong feasibility (Assumption 4) holds, or in the other scenarios described in [4,5]. We present the main complexity result in the following theorem.
We next obtain a more specific convergence rate by choosing the parameters appropriately.
Proof. Plugging the values of γ_k, α_k and β_k into relation (4.13) and taking expectation over all the randomness, we obtain the first bound, which involves the factor 16L_0(K+1); moreover, due to the random sampling of k̂, we obtain the corresponding expected bound. Combining these two results, and then plugging the values of γ_k, β_k and δ_k into (4.14), the conclusion follows from (4.18) and the definition of k̂. This completes our proof.
Remark 3. In order to attain an ϵ-error in satisfying the type-I KKT condition, LCSPG requires O(ε^{−2}) calls to the SFO, which matches the complexity bound of stochastic gradient descent for unconstrained nonconvex optimization [15]. Moreover, due to mini-batching, LCSPG attains an even better O(ε^{−1}) complexity in the number of evaluations of f_i(x) and ∇f_i(x) (i ∈ [m]).

Level constrained stochastic variance reduced gradient descent
We consider the finite-sum problem (4.19), where each F(x, ξ_i) is Lipschitz smooth with parameter L_0, i = 1, 2, ..., n. To further improve convergence in this setting, we present a new variant of the stochastic gradient method that extends stochastic variance-reduced gradient descent to the constrained setting.
Algorithm 3 Level constrained stochastic variance-reduced gradient descent (LCSVRG)
  if k mod T = 0 then
    G_k = ∇f_0(x_k);
  else
    Sample a mini-batch B_k of size b uniformly at random and compute G_k by (4.20);
  end if
  Update x_{k+1} by (3.1) and update η_{k+1} by (3.3);
end for

We present the level constrained stochastic variance-reduced gradient descent (LCSVRG) method in Algorithm 3, which extends the nonconvex variance-reduced mirror descent method (see [20]) to handle nonlinear constraints. Algorithm 3 can be viewed as a double-loop algorithm in which the outer loop computes the full gradient ∇f_0(x_k) once every T iterations and the nested loop performs stochastic proximal gradient updates based on an unbiased estimator of the true gradient. In this view, we let k denote the t-th iteration of the r-th epoch for some values t and r, and we use k and (r, t) interchangeably throughout the rest of this section. We keep the notation ζ_k (or ζ_{(r,t)}) for G_k − ∇f(x_k) and note that ζ_{(r,0)} = 0.
Our next goal is to develop iteration complexity results for LCSVRG; we skip the asymptotic analysis since it is similar to that of LCSPG. The following lemma (see [20, Lemma 6.10]) presents a key insight of Algorithm 3: the variance is controlled by the distances between successive iterates. We provide a proof for completeness.
Lemma 5. In Algorithm 3, G_k is an unbiased estimator of ∇f_0(x_k). Moreover, let (r, t) correspond to k; if t > 0, then the stated variance bound holds. Proof. We prove the first part by induction. When k = 0, we have G_0 = ∇f_0(x_0). For k > 0, if k mod T = 0, we have G_k = ∇f(x_k) by definition; otherwise, unbiasedness follows from the induction hypothesis. Next, we estimate the variance of the stochastic gradient. Appealing to (4.20), we obtain the recursion, where the third equality uses the independence of B_k and ζ_{k−1}, the first inequality uses the bound Var(x) ≤ E∥x∥², and the second inequality uses the Lipschitz smoothness of F(·, ξ). Taking expectation over all the randomness generating B_{(r,1)}, B_{(r,2)}, ..., B_{(r,t)} completes the argument. The next lemma shows that the generated solutions satisfy a sufficient descent property in expectation.
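The epoch structure of Algorithm 3 and the conclusions of Lemma 5 can be checked on a one-dimensional finite sum. The recursive update below is our reading of the estimator (4.20) (full gradient at epoch starts, a mini-batch gradient-difference correction otherwise); the toy data and the fixed iterate path are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b, T = 50, 5, 4
h = rng.uniform(0.5, 2.0, n)                    # per-sample curvatures
a = rng.standard_normal(n)
grad_i = lambda x, idx: h[idx] * (x - a[idx])   # grad F(x, xi_i)
full_grad = lambda x: np.mean(h * (x - a))      # grad f0(x)

def run_estimator(path):
    """Variance-reduced estimator: full gradient every T steps, otherwise
    G_k = G_{k-1} + mean_{i in B_k}[grad F(x_k, i) - grad F(x_{k-1}, i)]."""
    G = []
    for k, x in enumerate(path):
        if k % T == 0:
            G.append(full_grad(x))
        else:
            B = rng.integers(0, n, b)
            G.append(G[-1] + np.mean(grad_i(x, B) - grad_i(path[k - 1], B)))
    return G

path = list(np.linspace(0.0, 1.0, 8))           # a fixed toy iterate path
runs = np.array([run_estimator(path) for _ in range(20000)])
for k, x in enumerate(path):
    if k % T == 0:
        # exact at epoch starts: zeta_(r,0) = 0
        assert np.allclose(runs[:, k], full_grad(x))
    else:
        # unbiased between restarts: E[G_k] = grad f0(x_k)
        assert abs(runs[:, k].mean() - full_grad(x)) < 5e-3
```

The per-step correction also makes the accumulated error scale with the squared step lengths ∥x_k − x_{k−1}∥², which is exactly the mechanism exploited in the variance bound of Lemma 5.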
We present the main convergence property of Algorithm 3 in the next theorem.
Above, the first inequality applies (4.26) and uses x_{(r,T)} = x_{(r+1,0)}, while the second inequality uses the monotonicity of {ψ_0(x_k)} and an argument similar to (3.33).
Using the provided parameter setting, we obtain the stated value of L. Remark 4. It is interesting to compare the performance of LCSVRG with the other level constrained first-order methods in the finite-sum setting (4.19). Similar to LCPG, LCSVRG runs for O(ε^{−1}) iterations to compute a type-I ϵ-KKT point. Moreover, LCSVRG has the appealing feature that the number of stochastic gradients ∇F(x, ξ) computed can be significantly reduced for a large value of n. Specifically, Algorithm 3 requires a full gradient ∇f_0(x) every T iterations, which contributes N_1 = O(n⌈K/T⌉) = O(√n K) stochastic gradient computations. During the other iterations, Algorithm 3 invokes a batch of size b = O(T) each time, exhibiting a complexity of N_2 = O(bK) = O(√n K). Therefore, the total number of stochastic gradient computations is N = N_1 + N_2 = O(√n K). This is better than the O(nK) stochastic gradients needed by LCPG. Moreover, it is better than the O(K²) bound of LCSPG when K is of order larger than Ω(√n), which corresponds to the higher accuracy regime ϵ ≪ 1/√n. The complexities of all the proposed algorithms for computing ε-KKT solutions are listed in Table 2.
Remark 5. While we mainly discuss the finite-sum objective (4.19), it is possible to extend the variance reduction technique to handle the expectation-based objective (4.1) and improve the O(ε^{−2}) bound of LCSPG to O(ε^{−3/2}). To achieve this goal, we impose the additional assumption that F(x, ξ) is L_0-Lipschitz smooth for each ξ in the support set. We choose to omit a detailed discussion of this extension, as the technical development can be readily derived from the arguments in Section 6.5.2 of [21] and our previous analysis.
g_i(x) − h_i(x): 1) h_i is an L_{h_i}-Lipschitz-smooth convex function, and 2) g_i is a structured nonsmooth convex function: g_i(x) = max_{y_i∈Y_i} ⟨A_i x, y_i⟩ − p_i(y_i), where A_i ∈ R^{a_i×n} is a linear mapping, Y_i ⊂ R^{a_i} is a convex compact set, and p_i : Y_i → R is a convex continuous function. In view of such a nonsmooth structure, we cannot simply apply the LCPG method, as the crucial quadratic upper bound on f_i(x) does not hold in the nonsmooth case. However, as pointed out by Nesterov [29], the nonsmooth convex function g_i can be closely approximated by a smooth convex function. Let us denote ŷ_i := argmin_{y_i∈Y_i} ∥y_i∥ and D_{Y_i} := max_{y_i∈Y_i} ∥y_i − ŷ_i∥, and define the approximation functions g_i^{β_i}(x) := max_{y_i∈Y_i} ⟨A_i x, y_i⟩ − p_i(y_i) − (β_i/2)∥y_i − ŷ_i∥² and f_i^{β_i}(x) := g_i^{β_i}(x) − h_i(x), where β_i > 0. Given properly chosen smoothing parameters β_i, we propose to apply LCPG to the following smooth approximation problem: min_x ψ_0^{β_0}(x) = f_0^{β_0}(x) + χ_0(x) s.t. ψ_i^{β_i}(x) = f_i^{β_i}(x) + χ_i(x) ≤ η_i, i = 1, ..., m. (5.1) Prior to the analysis of our algorithm, we develop some properties of the smooth functions f_i^{β_i}. We first present a key lemma that connects the quadratic approximation of a smooth function with Lipschitz smoothness; the proof is left to Appendix A. Lemma 7. Suppose p(·) is a continuously differentiable function satisfying −(μ/2)∥x − y∥² ≤ p(x) − p(y) − ⟨∇p(y), x − y⟩ ≤ (L/2)∥x − y∥² (5.2) for all x, y. Then p(·) satisfies ∥∇p(x) − ∇p(y)∥ ≤ max{L, μ}∥x − y∥. (5.3) Regarding the smooth approximation, it is shown in [29] that g_i^{β_i} is a Lipschitz smooth function and approximates the value of g_i within an O(β_i) error (cf. (5.4)): ∥∇g_i^{β_i}(x) − ∇g_i^{β_i}(z)∥ ≤ L_{g_i}^{β_i}∥x − z∥ for all x, z, where L_{g_i}^{β_i} := ∥A_i∥²/β_i. (5.5) Similar properties of f_i^{β_i} are developed in the following proposition.
Proposition 4. We have the following properties of the approximation function f_i^{β_i} (β_i > 0).
1. Let β̄_i ∈ [0, β_i]; then the comparison inequality (5.6) holds. 2. f_i^{β_i} has upper curvature L_{g_i}^{β_i} and negative lower curvature −L_{h_i}, in the sense of (5.7) and (5.8). 3. f_i^{β_i} is Lipschitz smooth with modulus L_i^{β_i} := max{L_{g_i}^{β_i}, L_{h_i}}; namely, for any x, y, we have ∥∇f_i^{β_i}(x) − ∇f_i^{β_i}(y)∥ ≤ L_i^{β_i}∥x − y∥. (5.9) Proof. Part 1. If β̄_i < β_i, then by definition we have f_i^{β̄_i}(x) ≥ f_i^{β_i}(x). On the other hand, using the boundedness of Y_i, we obtain the reverse estimate up to the additive term involving D_{Y_i}. Combining the two results gives the desired inequality. Part 2. Since g_i^{β_i} and h_i are both convex and smooth functions, we have g_i^{β_i}(x_1) ≤ g_i^{β_i}(x_2) + ⟨∇g_i^{β_i}(x_2), x_1 − x_2⟩ + (L_{g_i}^{β_i}/2)∥x_1 − x_2∥² and −h_i(x_1) ≤ −h_i(x_2) − ⟨∇h_i(x_2), x_1 − x_2⟩.

Summing up the above two inequalities and noting the definitions of f_i^{β_i} and ∇f_i^{β_i}, we conclude that f_i^{β_i} has an upper curvature of L_{g_i}^{β_i}. Similarly, using the convexity of g_i^{β_i} and the smoothness of h_i, we obtain that f_i^{β_i} has a negative lower curvature of −L_{h_i}.
Part 3. The Lipschitz continuity (5.9) is an immediate consequence of part 2 and Lemma 7.
Remark 6.When βi " 0, Relation (5.6) reads Together with Assumption 2, it can be seen that x 0 is also strictly feasible for problem (5.1).This justifies that LCPG is well-defined for problem (5.1).
Remark 7. The Lipschitz constant of ∇f_i^{β_i} can be derived in a different way. Since ∇g_i^{β_i} and ∇h_i are L_{g_i}^{β_i}- and L_{h_i}-Lipschitz continuous, respectively, the triangle inequality shows that ∇f_i^{β_i}(x) is (L_{g_i}^{β_i} + L_{h_i})-Lipschitz continuous. In contrast, by exploiting the asymmetry between the lower and upper curvature, Proposition 4 derives a slightly sharper bound on the gradient Lipschitz constant.
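As a concrete instance of the smoothing scheme (a standard textbook example, not taken from the paper's experiments), take g(x) = |x| = max_{y∈[−1,1]} xy, so that ŷ = 0, D_Y = 1, ∥A∥ = 1 and p ≡ 0. Then g^β is the Huber function, and one can check the approximation bound (5.4) and the (1/β)-Lipschitz continuity of ∇g^β from (5.5) on a grid:

```python
def g(x):            # g(x) = |x| = max_{y in [-1,1]} x*y
    return abs(x)

def g_smooth(x, beta):
    """Nesterov smoothing g^beta(x) = max_{|y|<=1} x*y - (beta/2)*y^2,
    which evaluates in closed form to the Huber function."""
    return x * x / (2 * beta) if abs(x) <= beta else abs(x) - beta / 2

def g_smooth_grad(x, beta):
    return x / beta if abs(x) <= beta else (1.0 if x > 0 else -1.0)

beta, D_Y = 0.1, 1.0
xs = [i / 100.0 for i in range(-300, 301)]
# (5.4): 0 <= g(x) - g^beta(x) <= beta * D_Y^2 / 2 for all x
assert all(-1e-12 <= g(x) - g_smooth(x, beta) <= beta * D_Y**2 / 2 + 1e-12 for x in xs)
# (5.5): the gradient is (1/beta)-Lipschitz (here ||A|| = 1, so L^beta = 1/beta)
assert all(abs(g_smooth_grad(x2, beta) - g_smooth_grad(x1, beta)) <= (x2 - x1) / beta + 1e-12
           for x1, x2 in zip(xs, xs[1:]))
```

Decreasing β tightens the function-value approximation but increases the gradient Lipschitz constant, which is exactly the trade-off balanced by the choice of β_i in this section.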
Throughout this section, we choose the parameters β_i so that β_i D_{Y_i}² is constant for all i ∈ [m]. Hence, we can define the additive approximation factor above as ν := β_i D_{Y_i}²/2, i ∈ [m].
(5.10) Note that (5.4) provides an approximation error for function values, i.e., for the zeroth-order oracle of the function g_i. However, convergence results for nonconvex optimization are generally given in terms of a first-order stationarity measure, implying that we need an approximation of the first-order oracle for the function f_i, and consequently for the function g_i. Below, we discuss a widely used approximate subdifferential for convex functions and generalize it to nonsmooth nonconvex functions.
Definition 5 (ν-subdifferential). We say that a vector v ∈ R^n is a ν-subgradient of the convex function p(·) at x if for any z we have p(z) ≥ p(x) + ⟨v, z − x⟩ − ν.
The set of all ν-subgradients of p at x is called the ν-subdifferential, denoted by ∂_ν p(x). Moreover, we define the ν-subdifferential of the nonconvex function f_i as ∂_ν f_i(x) := ∂_ν g_i(x) + {−∇h_i(x)}, where the addition of sets is in the Minkowski sense.
Finally, we define a generalization of the type-I KKT convergence criterion for structured nonsmooth nonconvex function constrained optimization problems. Definition 6. We say that a point x is an (ϵ, ν) type-III KKT point of (1.1) if there exists λ ≥ 0 satisfying the stated stationarity and complementarity conditions together with ∥[ψ(x) − η]_+∥_1 ≤ ϵ.
The ν-subdifferential and the type-III KKT point are essential for associating the smooth approximation with the original nonsmooth problem. We establish some important properties in the following proposition. Proposition 5. Let β_i and ν satisfy (5.10).
Proof. Part 1. It suffices to show that ∇g_i^{β_i}(x) ∈ ∂_ν g_i(x). Due to the convexity of g_i^{β_i} and (5.4), we have g_i(z) ≥ g_i^{β_i}(z) ≥ g_i^{β_i}(x) + ⟨∇g_i^{β_i}(x), z − x⟩ ≥ g_i(x) + ⟨∇g_i^{β_i}(x), z − x⟩ − β_i D_{Y_i}²/2, where the first inequality follows from the first relation in (5.4) and the third inequality follows from the second relation in (5.4). Noting the definition of a ν-subgradient, we conclude the proof.
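For g(x) = |x| = max_{|y|≤1} xy (so D_Y = 1), the smoothed function g^β is the Huber function, and the chain of inequalities above can be verified numerically: the smoothed gradient is a ν-subgradient of |·| with ν = βD_Y²/2. This is an illustrative grid check, not taken from the paper.

```python
def huber(x, beta):
    """Closed form of g^beta for g(x) = |x| (cf. the smoothing in Section 5)."""
    return x * x / (2 * beta) if abs(x) <= beta else abs(x) - beta / 2

def huber_grad(x, beta):
    return x / beta if abs(x) <= beta else (1.0 if x > 0 else -1.0)

beta = 0.2
nu = beta * 1.0**2 / 2                    # nu = beta * D_Y^2 / 2 with D_Y = 1
grid = [i / 50.0 for i in range(-200, 201)]
# Proposition 5 (Part 1): |z| >= |x| + <grad g^beta(x), z - x> - nu for all x, z.
assert all(abs(z) >= abs(x) + huber_grad(x, beta) * (z - x) - nu - 1e-12
           for x in grid for z in grid)
```

The slack in the inequality is exactly the function-value gap g(x) − g^β(x) ∈ [0, ν], mirroring the proof.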
Proof. Our analysis resembles the proof of Theorem 2. Using an argument similar to (3.33), we obtain a first estimate; combining it with Lemma 3 yields (5.15), where ∆ = ψ_0(x_0) − ψ_0(x*), α_k ≥ 0, and L^β(x, λ) := ψ_0^{β_0}(x) + Σ_{i=1}^m λ_i(ψ_i^{β_i}(x) − η_i). Taking δ_k = (η − η_0)/((k+1)(k+2)) and α_k = k + 1 in (5.15), we see that x_{k+1} is a type-I ϵ-KKT point for the smoothed problem. Using the definition of k̂ and Proposition 5, we obtain the desired result.
6 Inexact LCPG
LCPG requires the exact optimal solution of subproblem (3.1), which poses a great challenge when the subproblem is difficult to solve. To alleviate this issue, we consider an inexact variant of the LCPG method in which the update of x_{k+1} solves problem (3.1) only approximately. This section is organized as follows.
First, we present a general convergence property of inexact LCPG when the subproblem solutions satisfy a certain approximation criterion. Next, we analyze the efficiency of inexact LCPG when the subproblems are handled by different external solvers. When the subproblem is a quadratically constrained quadratic program (QCQP), we propose an efficient interior point algorithm that exploits the diagonal structure. When the subproblem has general proximal components, we propose to solve it by first-order methods; in particular, we consider solving the subproblem by the constraint extrapolation (ConEx) method and develop the total iteration complexity of ConEx-based LCPG.
6.1 Convergence analysis under an inexactness criterion
Throughout the rest of this section, we denote the exact primal-dual solution of (3.1) by (x̃_{k+1}, λ̃_{k+1}). We use the following criterion for measuring the accuracy of subproblem solutions. Definition 7. We say that a point x is an ϵ-solution of (3.1) if it satisfies the stated optimality and feasibility tolerances. The following theorem shows asymptotic convergence to stationarity for the inexact LCPG method under mild assumptions. Since the proof is similar to the previous arguments, we present the details in Appendix B for the sake of completeness. Note that the theorem applies to general nonconvex problems and hence to convex problems as well.
Theorem 7. Suppose that Assumption 3 holds and let x_{k+1} be an ϵ_k-solution of (3.1) satisfying ϵ_k < min_{i∈[m]} δ_k^i. Then all the conclusions of Theorem 1 still hold: the dual sequence {λ̃_k} is uniformly bounded by a constant B > 0, and every limit point of inexact LCPG is a KKT point.
Under the inexactness condition in Definition 7, we establish the complexity of inexact LCPG in the following theorem. Theorem 8. Under the assumptions of Theorem 7, we have the stated bound with ∆ = Σ_{k=0}^K α_k[ψ_0(x_k) − ψ_0(x_{k+1}) + ε_k]. Moreover, if we choose the index k̂ ∈ {0, 1, ..., K} with probability P(k̂ = k) = α_k/(Σ_{i=0}^K α_i), then x_{k̂} is a randomized (ϵ, δ) type-II KKT point with the stated parameters. Proof. Using (3.8) with x_{k+1} replaced by x̃_{k+1} (the optimal solution of problem (3.1)) and adding f(x_k) + (L_0/2)∥x_k − x̃_{k+1}∥² on both sides, we obtain the first estimate, where the second inequality follows from ψ_0^k(x_k) = ψ_0(x_k) together with x_{k+1} being an ϵ_k-solution of subproblem (3.1) (see Definition 7), and the third inequality follows from the fact that ψ_0^k(x) ≥ ψ_0(x) for all x ∈ dom χ_0.
Remark 8. Compared to the convergence result (3.31) for exact LCPG, we must control the accumulated error in ∆ in the inexact case (6.4). However, we need an even more stringent condition on the error to ensure asymptotic convergence: specifically, we require ε_k to be smaller than the level increments δ_k^i so that each subsequent subproblem is strictly feasible. As long as the subproblems are solved deterministically to sufficient accuracy, we can ensure such feasibility as well as the boundedness of the dual.
Remark 9. Note that the convergence analysis of the inexact method in the stochastic case goes through in a similar fashion. In particular, the subproblems of LCSPG are still deterministic in nature, so a deterministic error can easily be incorporated into the analysis of the stochastic outer loop: Proposition 3 then has an additional ϵ_k on the right-hand side. We can use ϵ_k = min_{i∈[m]} δ_k^i / 2 to ensure strict feasibility. Following the analysis in Theorem 4, we pick up an additional term Σ_{k=0}^K α_k ϵ_k. Since the policies for α_k in the above analysis and in Corollary 2 are identical, and the δ_k used above and in Corollary 2 are the same, the values of ϵ_k are identical as well. Following the above development, we can easily bound the additional Σ_{k=0}^K α_k ϵ_k term.
6.2 Solving the subproblem with the interior point method
Our goal is to develop an efficient interior point algorithm for solving problem (3.1) when χ_i(x) = 0, i ∈ [m]. Without loss of generality, we express the subproblem as the QCQP (6.11). We assume that the initial solution x̄ of this problem is strictly feasible; namely, there exists δ > 0 such that g_i(x̄) ≤ −δ, i = 1, 2, ..., m. (6.12) Let e_1 = [1, 0, ..., 0]^T ∈ R^{d+1}. With a slight abuse of notation, we can formulate (6.11) as the following problem: min e_1^T u s.t. ḡ_0(u) = g_0(x) − η ≤ 0, ḡ_i(u) = g_i(x) ≤ 0, i ∈ [m], ḡ_{m+1}(u) = (L_{m+1}/2)∥u − (0, a_{m+1})^T∥² − b_{m+1} ≤ 0, u = (η, x) ∈ R × R^d. (6.13) Here we set the artificial parameters L_{m+1} = 1, a_{m+1} = 0 and b_{m+1} = R²/2 for some sufficiently large R. We explicitly add this ball constraint to ensure that (η, x) is bounded; note that the bound R always exists since our domain is compact and the objective is Lipschitz continuous. Our goal is to apply the path-following method to solve (6.13). We denote ϕ(u) = −Σ_{i=0}^{m+1} log(−ḡ_i(u)). Since each ḡ_i(u) is convex quadratic in u, ϕ(u) is a self-concordant barrier with parameter υ = m + 2. The key idea of the path-following algorithm is to approximately solve a sequence of penalized problems min_u {ϕ_τ(u) := τη + ϕ(u)} with increasing values of τ, generating a sequence of strictly feasible solutions close to the central path, that is, the trajectory composed of the minimizers u_τ = argmin_u ϕ_τ(u). We apply a standard path-following algorithm (see [28, Chapter 4]) to solve (6.13) and outline the overall procedure in Algorithm 4. The algorithm consists of two main steps: 1. Initialization: We seek a solution u_0 near the analytic center (i.e., the minimizer of ϕ(u)). To this end, we solve a sequence of auxiliary problems φ_τ(u) = τ w^T u + ϕ(u), where w = −∇ϕ(û). It can readily be seen that û lies on the central path of this auxiliary problem for τ = 1. Performing a reverse path-following scheme (decreasing rather than increasing τ), we gradually converge to the analytic center.
2. Path-following: We solve a sequence of penalized problems with increasing values of τ by a damped version of Newton's method, which keeps the iterates in the proximity of the central path.

Solving the Newton Equation
First, we calculate the gradient and Hessian maps of ϕ_τ(·). Note that computing the gradient ∇ϕ_τ(u) takes O(dm) operations; hence, the computational burden comes from forming and solving the Newton systems. This is divided into two cases.
1. m < d. Then the Hessian is the sum of a low-rank matrix and a diagonal matrix. Based on the Sherman–Morrison–Woodbury formula, we have (6.16). Computing the product N^T Γ^{−1} N takes O(m²d) operations, while performing the Cholesky factorization takes O(m³). Therefore, the overall complexity of each Newton step is O(m³ + m²d) = O(m²d).
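A minimal sketch of the low-rank solve in (6.16), assuming the Newton system has the form (Γ + N Nᵀ)s = rhs with diagonal Γ ≻ 0 and N ∈ R^{d×m} (the sizes and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 200, 10                       # m < d: low-rank-plus-diagonal case
gamma = rng.uniform(1.0, 2.0, d)     # positive diagonal part Gamma
N = rng.standard_normal((d, m))
rhs = rng.standard_normal(d)

def smw_solve(gamma, N, rhs):
    """Solve (Gamma + N N^T) s = rhs via Sherman-Morrison-Woodbury:
    s = G^{-1} rhs - G^{-1} N (I + N^T G^{-1} N)^{-1} N^T G^{-1} rhs,
    where G = diag(gamma). Forming the m x m capacitance matrix costs
    O(m^2 d); factoring it costs O(m^3)."""
    Ginv_rhs = rhs / gamma
    Ginv_N = N / gamma[:, None]
    cap = np.eye(m) + N.T @ Ginv_N           # m x m capacitance matrix
    return Ginv_rhs - Ginv_N @ np.linalg.solve(cap, N.T @ Ginv_rhs)

direct = np.linalg.solve(np.diag(gamma) + N @ N.T, rhs)
assert np.allclose(smw_solve(gamma, N, rhs), direct)
```

The agreement with the dense solve confirms the identity, while the work is dominated by the O(m²d) product, matching the per-step estimate above.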

2. m ≥ d. In this case, we can directly compute N N^T in O(md²) operations and then perform the Cholesky factorization ∇²ϕ_τ(u) = LL^T in O(d³), followed by solving two triangular systems. Hence, the overall complexity of a Newton step is O(d³ + md²) = O(md²).
Let t `" pM `1q}û´u ˚} pM `1q}û´u ˚}`δ , then from the above analysis, we know that the point u `" u ˚`1 t `p û ´u˚q " û `δpû´u ˚q pM `1q}û´u ˚} must be a feasible solution.Using the last constraint }u} ď R, we immediately obtain the bound (6.18).
Using [28, Theorem 4.5.1] and Proposition 6, we can derive the total complexity of solving the diagonal QCQP.
Theorem 9. Under the assumptions of Proposition 6, the total number of Newton steps required to obtain an ε-solution is as stated. Corollary 3. In the inexact LCPG method, assume that the subproblems are solved by Algorithm 4 and that the returned solutions satisfy the inexactness requirement of Theorem 8. Then, to obtain an O(ϵ, ϵ) type-II KKT point, the overall arithmetic cost of Algorithm 4 is as stated. Proof. According to Theorem 8, the total number of LCPG iterations is K = O(1/ε). In the k-th iteration of LCPG, we set the error criteria ν = O(1/k²) and ε = O(1/k²). Theorem 9 then implies that the number of Newton steps is N_k = O(√m ln k); therefore, the total number of Newton steps over K iterations is Σ_{k=1}^K N_k = O(√m K ln K). Combining this result with (6.17) gives us the desired bound. Remark 10. First, at the k-th step of LCPG, we need O(ln k) iterations of the interior point method, whose complexity is contributed equally by the two phases of the IPM: we first require O(ln k) Newton steps to pull the iterates from near the boundary to the proximity of the central path, and then another O(ln k) steps to obtain an O(1/k²)-accurate solution. Second, it is interesting to consider the case where the number of constraints is far smaller than the feature dimensionality, namely m ≪ d; we observe that the total computation O(dm^{2.5} K ln K) is linear in the dimensionality. Third, despite its simplicity, the basic barrier method offers a stronger approximate solution than is needed in Theorem 8: the feasibility of the solution path allows us to weaken the assumption to ε̄_k = 0. Nevertheless, besides our approach, it is possible to employ long-step and infeasible primal-dual interior point methods, which may give better empirical performance.
6.3 Solving subproblems with the first-order method
In this section, we use the previously proposed ConEx method [4] to solve subproblem (3.1) when general proximal functions χ_i are present, and we analyze the overall complexity of the LCPG method with ConEx as the subproblem solver. First, we formally state the extended version of problem (6.11) as follows: min_{x∈X} ϕ_0(x) := g_0(x) + χ_0(x) s.t. ϕ_i(x) := g_i(x) + χ_i(x) ≤ 0, i = 1, ..., m.
For the application of ConEx to the subproblem, we need access to a convex compact set X such that ∩_i dom χ_i ⊆ X. Moreover, X is a "simple" set in the sense that it allows easy computation of the proximal operator of χ_0(x) + Σ_{i=1}^m w_i χ_i(x) for any given weights w_i, i = 1, ..., m. Such assumptions are not very restrictive, as many machine learning and engineering problems explicitly seek an optimal solution in a bounded set. Under these assumptions, we apply ConEx to solve subproblem (3.1) of LCPG. We now reproduce a simplified optimality guarantee of the ConEx method without going into the details of the algorithm.
Theorem 10 ([4]). Let $x$ be the output of ConEx after $T$ iterations for problem (6.19). Assume that $\phi_0$ is strongly convex and let $(\tilde x, \tilde\lambda)$ be the optimal primal-dual solution. Moreover, let $B$ be a parameter of the ConEx method satisfying $B > \|\tilde\lambda\|$. Then the solution $x$ satisfies the stated accuracy bound.

Even though ConEx can be applied to a wide variety of convex function-constrained problems, two intricate issues need to be addressed in our context:
1. The solution path of ConEx can be arbitrarily infeasible in the early iterations, with the infeasibility shrinking only as the iterations proceed. Since the approximation criterion in Definition 7 requires guarantees on the amount of infeasibility, ConEx has to run a significant number of iterations before getting sufficiently close to the feasible set.
2. Since ConEx is a primal-dual method, its convergence guarantees depend on the optimal dual solution $\lambda^*$. Moreover, a bound $B\ (> \|\lambda^*\|)$ on the dual is required to implement the algorithm and achieve the accelerated convergence rate of $O(1/T^2)$ for strongly convex problems.
From Theorem 10, it is clear that ConEx requires a bound $B$. This requirement naturally leads to two cases: (1) $B$ can be estimated a priori, e.g., see Lemma 2; and (2) $B$ is known to exist but cannot be estimated, e.g., see Theorem 1. The two cases yield different convergence rates for the subproblem, and hence different overall computational complexity.
Case 1: $B$ can be estimated a priori. In this case, we do not need to estimate $B_k$ as in (6.20). Using the bound $B$, ConEx attains the accelerated convergence rate of Theorem 10, which leads to better performance of the LCPG method. The corollary below formally states the total computational complexity of the LCPG method for this case.
Corollary 4. If an explicit value of $B$ is known, the LCPG method with ConEx as the subproblem solver obtains an $(O(1/K), O(1/K))$ Type-II KKT point in $O(K^2)$ computations.

Proof. According to Theorem 10, the number of ConEx iterations required for each subproblem can be bounded by $T_k = O(\epsilon_k^{-1/2})$. Since $B$ is a constant, we have $T_k = O(\epsilon_k^{-1/2}) = O(k)$. Finally, the total computation is $\sum_{k=1}^K T_k = O(K^2)$. Hence, we conclude the proof.
Case 2: $B$ is known to exist but cannot be estimated. For the subproblem (3.1), we can easily find $B_k > \|\tilde\lambda_{k+1}\|$ by using the difference in levels between successive iterations. This bound is weak, especially in the limiting case, as it does not take Proposition 7 into account.

Proposition 7. For the subproblem (3.1), letting $i^* \in \arg\min_{i\in[m]} \delta_k^i$, we have
$$\|\tilde\lambda_{k+1}\| \le \frac{\psi_0(x_k) - \psi_0^*}{\delta_k^{i^*}}. \quad (6.20)$$

Proof. By Slater's condition, $\tilde\lambda_{k+1}$ exists. Then, by the saddle point property of $(\tilde x_{k+1}, \tilde\lambda_{k+1})$, we have for all $x \in X$
$$\psi_0^k(x) + \langle\tilde\lambda_{k+1}, \psi^k(x) - \eta^{k+1}\rangle \ \ge\ \psi_0^k(\tilde x_{k+1}) + \langle\tilde\lambda_{k+1}, \psi^k(\tilde x_{k+1}) - \eta^{k+1}\rangle \ =\ \psi_0^k(\tilde x_{k+1}),$$
where the equality follows from complementary slackness. Using $x = x_k$ in the above relation and noting that $x_k$ satisfies $\psi(x_k) \le \psi^{k-1}(x_k) \le \eta^k$, we have
$$\eta^{k+1} - \psi^k(x_k) = \eta^{k+1} - \psi(x_k) \ge \eta^{k+1} - \eta^k = \delta_k,$$
implying that
$$\psi_0^k(x_k) - \psi_0^k(\tilde x_{k+1}) \ \ge\ \langle\tilde\lambda_{k+1}, \eta^{k+1} - \psi^k(x_k)\rangle \ \ge\ \delta_k^{i^*}\|\tilde\lambda_{k+1}\|_1 \ \ge\ \delta_k^{i^*}\|\tilde\lambda_{k+1}\|,$$
where the second inequality follows from $\tilde\lambda_{k+1} \ge 0$ and $\delta_k > 0$, and the last inequality follows from the fact that $\|\tilde\lambda_{k+1}\|_1 \ge \|\tilde\lambda_{k+1}\|$. We can further upper bound the left-hand side of the above relation as
$$\psi_0^k(x_k) - \psi_0^k(\tilde x_{k+1}) = \psi_0(x_k) - \psi_0^k(\tilde x_{k+1}) \le \psi_0(x_k) - \psi_0(\tilde x_{k+1}) \le \psi_0(x_k) - \psi_0^*,$$
where the last inequality follows since $\tilde x_{k+1}$ is feasible for the original problem (1.1). Combining the above two relations, we obtain (6.20). Hence, we conclude the proof.
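The mechanics behind this dual bound can be illustrated on a one-dimensional toy instance of our own construction (not from the paper): for $\min (x-2)^2$ s.t. $x \le \eta$, with a strictly feasible point $x_k$ satisfying $\eta - x_k \ge \delta_k$, the saddle point inequality forces $\tilde\lambda \le (\psi_0(x_k) - \psi_0^*)/\delta_k$.

```python
# Toy check of the Proposition-7-style dual bound:
# minimize (x - 2)^2 subject to x <= eta. For eta < 2, the optimum is
# x* = eta with multiplier lambda* = 2 * (2 - eta) (from stationarity).
def solve(eta):
    x_star = min(2.0, eta)
    lam_star = max(0.0, 2.0 * (2.0 - eta))
    return x_star, lam_star

psi0 = lambda x: (x - 2.0) ** 2

eta, x_k = 1.0, 0.5               # x_k strictly feasible
delta_k = eta - x_k               # level gap: 0.5
x_star, lam_star = solve(eta)
bound = (psi0(x_k) - psi0(x_star)) / delta_k  # (2.25 - 1.0) / 0.5 = 2.5
assert lam_star <= bound                       # 2.0 <= 2.5
```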
We now state the final computational complexity of LCPG with ConEx, which uses the bound in (6.20).
Corollary 5. If an explicit value of $B$ is not known, the LCPG method with ConEx as the subproblem solver obtains an $(O(1/K), O(1/K))$ Type-II KKT point in $O(K^4)$ computations.
Proof. Using Proposition 7, we can set $B_k$ to the right-hand side of (6.20). In view of (6.9) and the fact that $\sum_{i=0}^{\infty}\epsilon_i \le \|\eta - \eta^0\|$, this implies $B_k \le \frac{1}{\delta_k^{i^*}}\big[\psi_0(x_0) - \psi_0^* + \|\eta - \eta^0\|\big]$ for all $k$. Moreover, for all $k \le K$, we have $\epsilon_k = \frac{\delta_k^{i^*}}{2}$. Hence, we get $T_k = O(\epsilon_k^{-3/2}) = O(k^3)$. Finally, $\sum_{k=1}^K T_k = O(K^4)$, which is the overall computational complexity of the LCPG method with ConEx as the subproblem solver to obtain an $(O(1/K), O(1/K))$ Type-II KKT point.

Remark 11 (Gradient complexity vs. computational complexity). Note that evaluating the gradient of $\psi_i^k(x)$ is relatively cheap since it does not involve any new computation of $\nabla f_i(x)$. In that sense, the entire inner loop requires only one computation of $\nabla f_i$; hence the total gradient complexity in terms of $\nabla f_i$ equals the total number of outer loops of inexact LCPG. On the other hand, the inner-loop computation does contribute to the problem's computational complexity. However, these iterations are expected to be very cheap, given the ease of obtaining gradients for the QP subproblem (3.1) with identity Hessian matrices.
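To make the gap between Corollaries 4 and 5 concrete, the following sketch tallies cumulative inner-iteration counts $T_k = \Theta(k)$ (known $B$) versus $T_k = \Theta(k^3)$ (estimated $B_k$), with unit constants assumed for illustration, reproducing the $O(K^2)$ versus $O(K^4)$ totals.

```python
def cumulative_cost(K, inner):
    # Total inner (ConEx) iterations over K outer LCPG iterations.
    return sum(inner(k) for k in range(1, K + 1))

K = 200
case1 = cumulative_cost(K, lambda k: k)       # Corollary 4: sum k   ~ K^2 / 2
case2 = cumulative_cost(K, lambda k: k ** 3)  # Corollary 5: sum k^3 ~ K^4 / 4

assert case1 == K * (K + 1) // 2              # closed form for sum k
assert case2 == (K * (K + 1) // 2) ** 2       # sum-of-cubes identity
```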

LCPG for convex optimization
In this section, we establish the complexity of LCPG (i.e., Algorithm 1) when the objective $f_0$ and the constraints $f_i$, $i \in [m]$, are convex. In particular, we consider two convex settings, depending on whether $f_0$ is convex or strongly convex. To provide a unified analysis of the two cases, we make the following assumption.

Assumption 6. $f_0(x)$ is a $\mu_0$-convex function for some $\mu_0 \ge 0$; namely,
$$f_0(x) \ge f_0(y) + \langle\nabla f_0(y), x - y\rangle + \tfrac{\mu_0}{2}\|x - y\|^2 \quad \text{for any } x, y \in \mathbb{R}^d.$$
Note that if $\mu_0 = 0$, then $f_0$ is simply a convex function. We now provide the convergence rate of LCPG to optimality.
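Assumption 6 can be checked numerically for a quadratic. The sketch below uses an illustrative positive definite quadratic of our own choosing (not from the paper) and verifies the $\mu_0$-convexity inequality at random point pairs, with $\mu_0 = \lambda_{\min}(Q)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)              # positive definite Hessian
mu0 = np.linalg.eigvalsh(Q).min()    # strong-convexity modulus

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# f(x) >= f(y) + <grad f(y), x - y> + (mu0/2) ||x - y||^2 at random pairs.
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    rhs = f(y) + grad(y) @ (x - y) + 0.5 * mu0 * np.linalg.norm(x - y) ** 2
    assert f(x) >= rhs - 1e-9
```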
For more generality, we consider an inexact variant of LCPG in which an approximate solution in the sense of Definition 7 is returned at each iteration. Let $(\tilde x_{k+1}, \tilde\lambda_{k+1})$ be the saddle point of $\mathcal{L}_k(x, \lambda)$; i.e., $\tilde x_{k+1}$ is an exact solution of the subproblem (3.1). First, we extend the three-point inequality in Lemma 1 to an inexact solution.
Using the above lemma, we provide the main convergence property of LCPG for convex optimization.

Lemma 9. Let $x$ be a feasible solution. Then we have
$$\psi_0(x_{k+1}) - \psi_0(x) \le \langle\tilde\lambda_{k+1}, \psi(x) - \eta^k\rangle + \frac{L_0 + \langle\tilde\lambda_{k+1}, L\rangle}{2}\|x - x_k\|^2 - \frac{L_0 + \langle\tilde\lambda_{k+1}, L\rangle + \mu_0}{2}\|x - x_{k+1}\|^2 + \epsilon_k.$$

Proof. In the chain of inequalities establishing the claim, (i) follows from the definition of $\psi_0^k$, (ii) follows since $x_{k+1}$ is an $\epsilon_k$-solution of (3.1), (iii) follows from complementary slackness for the optimal primal-dual solution of (3.1), and (iv) follows from Lemma 8. In particular, we apply Lemma 8 with $g(x) + \frac{\gamma}{2}\|x - z\|^2 = \psi_0^k(x) + \langle\tilde\lambda_{k+1}, \psi^k(x) - \eta^k\rangle$, $z = x_k$, $z^+ = x_{k+1}$, $\epsilon = \epsilon_k$, and $\gamma = L_0 + \langle\tilde\lambda_{k+1}, L\rangle$. Note that $x_{k+1}$ is an $\epsilon_k$-approximate solution of $\min_{x\in\mathbb{R}^d}\ \psi_0^k(x) + \langle\tilde\lambda_{k+1}, \psi^k(x) - \eta^k\rangle$ by Definition 7.
Theorem 11. Consider the general convex setting with $\mu_0 = 0$. Suppose Assumption 3 is satisfied and set $\delta_k = \frac{\eta - \eta^0}{(k+1)(k+2)}$. Then the bound (7.8) holds.

Proof. From Lemma 9 with $\mu_0 = 0$ for the convex part, together with $\psi(x^*) \le \eta$, we obtain a per-iteration bound. Dividing both sides by $\frac{L_0 + \langle\tilde\lambda_{k+1}, L\rangle}{2}$, and noting that the sequence $\{\tilde\lambda_{k+1}\}$ is uniformly bounded, i.e., $\|\tilde\lambda_{k+1}\| \le B$ for all $k \ge 0$, we arrive via (7.9) at the relation (7.10). Due to the optimality of the exact solution $\tilde x_{k+1}$, we have $\psi_0^k(\tilde x_{k+1}) \le \psi_0^k(x_k) = \psi_0(x_k)$. We also have $\psi_0(x_{k+1}) \le \psi_0^k(x_{k+1}) \le \psi_0^k(\tilde x_{k+1}) + \epsilon_k$. Combining these two relations, we get $\psi_0(x_{k+1}) \le \psi_0(x_k) + \epsilon_k$; in effect, the inexact LCPG method is almost a descent method, up to the additive error $\epsilon_k$. Applying this relation recursively, substituting the result into (7.10), multiplying the resulting relation by $k + 1$, and summing from $k = 0$ to $K - 1$, we obtain (7.8) after rearranging. Hence, we conclude the proof.

Let us denote
Multiplying both sides of the above inequality by the appropriate weight, and noting that $\eta - \eta^k = \rho^k(\eta - \eta^0)$ (which follows from the choice of $\delta_k$), we obtain $\Gamma_k \le \big(\frac{L_0 - a\mu_0}{L_0 - \mu_0}\big)^k$, together with a matching lower bound on $\Gamma_k$ in terms of $L_0 + B\|L\| - \mu_0$. Moreover, we have $2\epsilon_k \le a(1-\rho)\rho^k\|\eta - \eta^0\|$ with $\rho = \frac{L_0 - \mu_0}{2(L_0 - a\mu_0)}$. Using these relations in (7.13), and arguing as in the convex case that inexact LCPG is an almost-descent method, we can use the resulting relation in (7.14). Combining the above two relations, we obtain the desired result (7.11).

Numerical study
In this section, we conduct preliminary numerical studies to examine our theoretical results and the practical performance of the LCPG method. The experiments are run on CentOS with an Intel Xeon (2.60 GHz) processor and 128 GB of memory.
8.1 A simulated study on the QCQP

In the first experiment, we compare LCPG with established open-source solvers, namely CVXPY [11] and DCCP [32]. We consider the penalized quadratically constrained quadratic program (QCQP)
$$\min_{\|x\| \le r}\ \tfrac12 x^T Q_0 x + b_0^T x + \alpha\|x\|_1 \quad \text{s.t.}\ \tfrac12 x^T Q_i x + b_i^T x + c_i \le 0,\ i = 1, \ldots, m, \quad (8.1)$$
where each $Q_i$ ($0 \le i \le m$) is an $n \times n$ matrix, $b_0, b_1, \ldots, b_m$ are $n$-dimensional real vectors, and $\alpha$ is a positive weight on the $\ell_1$-norm penalty, which promotes a sparse solution. In the first setting, we consider a convex constrained problem where each $Q_i$ is positive semidefinite. We set $Q_i = VDV^T$, where $V$ is an $n \times n$ random sparse matrix with density 0.01 whose nonzero entries are uniformly distributed in $[0, 1]$, and $D$ is a diagonal matrix whose diagonal elements are uniformly distributed in $[0, 100]$. We set $b_i = 10e + v$, where $e$ is the vector of all ones and $v \sim N(0, I_{n\times n})$ is sampled from the standard Gaussian distribution. We set $c_i = -10$ so that $x = 0$ is a strictly feasible initial solution. Furthermore, we add a ball constraint to ensure that the domain is compact. We set $r = \sqrt{20}$ and $\alpha = 1$. We fix $m = 10$ and vary the dimension $n$ over $\{500, 1000, 2000, 3000, 4000\}$.
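For concreteness, the synthetic instance described above can be generated as follows. This is a sketch: the paper uses sparse matrices, and here a dense random mask emulates the density-0.01 sparsity; the seed is arbitrary.

```python
import numpy as np

def make_instance(n, m=10, seed=0):
    # Sketch of the Section 8.1 data generation (dense emulation of sparsity).
    rng = np.random.default_rng(seed)
    Qs, bs = [], []
    for _ in range(m + 1):                        # Q_0, Q_1, ..., Q_m
        mask = rng.random((n, n)) < 0.01          # density 0.01
        V = np.where(mask, rng.random((n, n)), 0.0)    # nonzeros ~ U[0, 1]
        D = np.diag(rng.uniform(0.0, 100.0, size=n))   # diagonal ~ U[0, 100]
        Qs.append(V @ D @ V.T)                    # Q_i = V D V^T, PSD
        bs.append(10.0 * np.ones(n) + rng.standard_normal(n))  # b_i = 10e + v
    c = -10.0 * np.ones(m)                        # constraint value at x = 0
    return Qs, bs, c

Qs, bs, c = make_instance(n=50)
assert np.linalg.eigvalsh(Qs[0]).min() >= -1e-8   # each Q_i is PSD
assert np.all(c < 0)                              # x = 0 is strictly feasible
```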
We solve problem (8.1) with both CVXPY and LCPG, starting from the initial solution $x = 0$. For CVXPY, we use MOSEK as the internal solver due to its superior performance on quadratic optimization. In LCPG, for simplicity, we also solve the diagonal quadratic subproblem with MOSEK through CVXPY. Note that calling the external API repeatedly for each LCPG subproblem only adds overhead to LCPG. Nonetheless, as we shall see, the standard IPM solver can still fully leverage the diagonal structure and exhibits fast convergence.
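To illustrate why the diagonal subproblems are cheap, consider the simplified case where only the ball constraint is active in the model: the prox-linear update then has the closed form soft-thresholding followed by projection onto the ball (this is our own simplified sketch of a single step, not the full LCPG subproblem with quadratic level constraints).

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_linear_step(x_k, grad_f0, L0, alpha, r):
    # argmin_x <grad_f0, x - x_k> + (L0/2)||x - x_k||^2 + alpha*||x||_1
    #          s.t. ||x|| <= r
    # The KKT conditions show the minimizer is the soft-thresholded gradient
    # step, radially scaled back onto the ball if necessary.
    z = soft_threshold(x_k - grad_f0 / L0, alpha / L0)
    nrm = np.linalg.norm(z)
    return z if nrm <= r else (r / nrm) * z

x = prox_linear_step(np.array([1.0, -2.0, 0.1]), np.array([0.5, 0.5, 0.5]),
                     L0=1.0, alpha=0.3, r=1.0)
assert np.linalg.norm(x) <= 1.0 + 1e-12
```

The composition is exact here because, by the KKT conditions, the constrained minimizer is a nonnegative rescaling of the unconstrained soft-thresholded point.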
In Table 3, we present the results of the compared algorithms. We report the final objective, the norm of the dual solution (DNorm), and, for LCPG, the maximum dual norm along the solution path (Max DNorm). All values are averaged over 5 independent runs. From the results, we observe that while LCPG does not outperform CVXPY on the small problem ($n = 500$), LCPG becomes increasingly favorable as the problem dimension grows. This confirms the empirical advantage of our approach, since we never need to construct a full Hessian matrix. Moreover, interestingly, we observe that the dual norm $\{\|\lambda_k\|\}$ is increasing, reaching its maximum at the last iteration; this accounts for the equal values of DNorm and Max DNorm. Meanwhile, in all cases the dual remains bounded, and the reported dual norm closely matches the solution returned by CVXPY. This result confirms our intuition that the dual bound is intricately tied to the nature of the problem. In the second setting of this experiment, we examine the performance of LCPG on nonconvex constrained optimization. Specifically, we express $Q_i$ as the difference of two matrices, $Q_i = P_i - S_i$, where $P_i$ is generated in the same manner as $Q_i$ in the first setting and $S_i = 10 I_{n\times n}$. Given this construction, it is natural to view the function $\frac12 x^T Q_i x + b_i^T x + c_i$ as a difference of two convex quadratic functions: $\frac12 x^T P_i x + b_i^T x + c_i - \frac12 x^T S_i x$. Leveraging this decomposition, we apply DC programming, and more specifically the DCCP framework, to solve (8.1). Each convex subproblem of DCCP is solved by MOSEK through the CVXPY interface. In Table 4, we describe the performance of LCPG and the DCCP algorithm. It can be observed that LCPG compares favorably against the DCCP solver. Furthermore, the boundedness of the dual is observed for both algorithms, which is consistent with our intuition.

8.2 Study of gradient complexities
In the next experiment, our primary goal is to examine the main theoretical results, namely the gradient complexities of LCPG and its stochastic variants LCSPG and LCSVRG. We apply all these algorithms to a sparsity-inducing finite-sum problem in which a nonconvex constraint is incorporated into the supervised learning framework to actively enforce a sparse solution:
$$\min_x\ \tfrac1n \sum_{i=1}^n f_i(x) \quad \text{s.t.}\ \psi_1(x) := \beta\|x\|_1 - g(x) \le \eta_1, \quad (8.2)$$
where $f_i(x)$ is a smooth loss function associated with the $i$-th sample and $\psi_1(x)$ is the difference between the $\ell_1$-penalty and a convex smooth function $g(x)$. The difference-of-convex constraint is a tighter relaxation of the cardinality constraint $\|x\|_0 \le \kappa$ than the $\ell_1$ relaxation. The appealing properties of difference-of-convex penalties have been demonstrated in various studies [14,36,17,16,5,37]. In view of the concave structure of $-g(x)$, there is a strong asymmetry between the lower and upper curvature of $-g(x)$; namely,
$$-\tfrac{L_g}{2}\|y - x\|^2 - \nabla g(y)^T(x - y) - g(y) \ \le\ -g(x) \ \le\ -g(y) - \nabla g(y)^T(x - y) \quad (8.3)$$
holds for some $L_g > 0$. Note that this is much stronger than the $L_g$-smoothness condition, which adds an extra $\frac{L_g}{2}\|y - x\|^2$ on the right-hand side of (8.3). Thanks to this feature, one can impose the tighter piecewise-linear surrogate constraint $\beta\|x\|_1 - g(x_k) - \nabla g(x_k)^T(x - x_k) \le \eta_1^k$ in the LCPG subproblem. Our analysis is readily adaptable to this scenario, since it is smoothness, as opposed to concavity/convexity, that plays the central role in our convergence analysis, and that property remains valid. An empirical advantage of this approach is a tractable subproblem solvable in nearly linear time; see further discussion in [5].
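Property (8.3) can be checked numerically. Since the specific $g$ used in the experiment is given by (3.22), the sketch below uses the Huber function as a stand-in smooth convex $g$ (an assumption for illustration only, with $L_g = 1$) and verifies both sides of (8.3) at random point pairs.

```python
import numpy as np

delta = 1.0  # Huber threshold; g below is convex and 1-smooth (L_g = 1)
def g(x):
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x**2, delta * (a - 0.5 * delta)).sum()

def grad_g(x):
    return np.clip(x, -delta, delta)

L_g = 1.0
rng = np.random.default_rng(1)
for _ in range(200):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lin = -g(y) - grad_g(y) @ (x - y)
    # (8.3): concavity of -g gives the upper bound, L_g-smoothness the lower.
    assert lin - 0.5 * L_g * np.linalg.norm(x - y) ** 2 <= -g(x) + 1e-9
    assert -g(x) <= lin + 1e-9
```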
Our experiment considers the task of binary classification with the logistic loss $f_i(x) = \log(1 + \exp(-b_i\, a_i^T x))$, where $a_i \in \mathbb{R}^d$, $b_i \in \{1, -1\}$, $1 \le i \le n$. We use the SCAD penalty $g(x) = \sum_{j=1}^d h_{\beta,\theta}(x_j)$, where $h_{\beta,\theta}(\cdot)$ is defined in (3.22). We use the real-sim dataset from the LibSVM repository [10] and the covtype dataset from the UCI repository [19]. For the latter, we formulate a binary classification task by distinguishing class "3" from the other classes. We set $\beta = 2$, $\theta = 5$, and $\eta_1 = \sigma d$, with $\sigma \in \{0.4, 0.6\}$ for covtype and $\sigma \in \{0.1, 0.2\}$ for real-sim. For each algorithm, we use its theoretically suggested batch size and stepsize. For a fair comparison, we count $n$ evaluations of the stochastic gradient as one effective pass over the dataset and plot the objective value against the number of effective passes. Figure 3 plots the convergence of the compared algorithms. It can be readily seen that LCSPG performs better than LCPG in terms of the number of gradient samples, and LCSVRG achieves the best performance among the three algorithms. These empirical findings further confirm our theoretical complexity analysis.
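The gradient oracle for the logistic loss used above is straightforward; below is a minimal sketch on random data (not the real-sim/covtype datasets), with a finite-difference check of the analytic gradient.

```python
import numpy as np

def logistic_loss(x, A, b):
    # f(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x))
    z = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, z))

def logistic_grad(x, A, b):
    z = -b * (A @ x)
    sig = 1.0 / (1.0 + np.exp(-z))        # sigma(-b_i * a_i^T x)
    return A.T @ (-b * sig) / len(b)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 8))
b = rng.choice([-1.0, 1.0], size=50)
x = rng.standard_normal(8)

# Central finite differences agree with the analytic gradient.
grd = logistic_grad(x, A, b)
eps = 1e-6
for j in range(8):
    e = np.zeros(8); e[j] = eps
    fd = (logistic_loss(x + e, A, b) - logistic_loss(x - e, A, b)) / (2 * eps)
    assert abs(fd - grd[j]) < 1e-5
```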

Conclusion
In this work, we presented a new LCPG method for nonconvex function-constrained optimization that achieves a gradient complexity of the same order as that of unconstrained nonconvex problems. The key ingredient of our algorithm design is the use of constraint levels to ensure subproblem feasibility, which allows us to overcome a well-known difficulty in bounding the Lagrange multipliers in the presence of nonsmooth constraints. Moreover, a merit of our convergence analysis is its striking similarity to that of gradient descent for unconstrained problems. As a consequence, we can easily extend our method to minimizing stochastic, finite-sum, and structured nonsmooth functions with nonconvex function constraints; many of the resulting complexity results were not known before. Another important feature of our work is that the method can handle scenarios where the subproblems are not exactly solvable. To the best of our knowledge, existing work on sequential convex optimization (SQP, MBA) assumes that the subproblems are solved exactly. We provided a detailed complexity analysis of LCPG when the subproblems are solved inexactly by a customized interior point method or a first-order method. Finally, we clearly distinguished the notion of gradient complexity from that of computational complexity. In terms of gradient complexity, all of our proposed methods are state of the art and easy to implement. Whether the computational complexity can be further improved in the composite case remains an open problem, which we leave as a future direction.

$$\le \int_0^1 \|\nabla^2 p(y + t(x-y))(x-y)\|\,dt \le \int_0^1 \|\nabla^2 p(y + t(x-y))\|\,\|x-y\|\,dt \le \max\{L, \mu\}\|x - y\|.$$
Now, taking $n \to \infty$ and noting that $\nabla p_n(x) \to \nabla p(x)$ for all $x$, we have (5.3). Hence, we conclude the proof.

B Proof of Theorem 7
First of all, using the definition of $\epsilon_k$ and the fact that $\psi_i^k(x_{k+1}) \le \eta_i^k + \epsilon_k$ for all $i \in [m]$, we have
$$\psi_i^k(x_k) = \psi_i(x_k) \le \psi_i^{k-1}(x_k) \le \eta_i^{k-1} + \epsilon_{k-1} < \eta_i^{k-1} + \delta_i^{k-1} = \eta_i^k, \quad i = 1, 2, \ldots, m.$$
Hence $x_k$ is a strictly feasible solution of (3.1). By Slater's condition, there exists a pair of optimal primal and dual solutions, which we denote by $\bar x_{k+1}$ and $\bar\lambda_{k+1}$. Our argument will be based on the following key result.

Lemma 10. We have $\lim_{k\to\infty}\|x_k - \bar x_{k+1}\| = 0$ and $\lim_{k\to\infty}\|x_{k+1} - x_k\| = 0$.

Proof. Using the optimality of $\bar x_{k+1}$, the strong convexity of $\psi_0^k$, the feasibility of $x_k$ for the subproblem (3.1), and $\psi_0^k(x_{k+1}) \le \psi_0^k(\bar x_{k+1}) + \epsilon_k$, we have
$$\psi_0(x_{k+1}) \le \psi_0^k(x_{k+1}) \le \psi_0^k(\bar x_{k+1}) + \epsilon_k \le \psi_0^k(x_k) - \tfrac{L_0}{2}\|x_k - \bar x_{k+1}\|^2 + \epsilon_k = \psi_0(x_k) - \tfrac{L_0}{2}\|x_k - \bar x_{k+1}\|^2 + \epsilon_k.$$
Since $\epsilon_k$ is summable, $\|x_k - \bar x_{k+1}\|^2$ is summable, which implies $\|x_k - \bar x_{k+1}\| \to 0$. Since $\epsilon_k \to 0$ and $\bar x_{k+1}$ is the unique optimal solution of (3.1), we have $\|x_{k+1} - \bar x_{k+1}\| \to 0$. Then, by the triangle inequality, $\|x_{k+1} - x_k\| \le \|x_{k+1} - \bar x_{k+1}\| + \|\bar x_{k+1} - x_k\|$, and hence $\|x_{k+1} - x_k\| \to 0$.
Hence, we conclude the proof.

Now, we show the boundedness of $\bar\lambda_{k+1}$. Assume, for the sake of contradiction, that $\bar\lambda_{k+1}$ is unbounded. Let $\bar x$ be a limit point of $\{x_k\}$. Passing to a subsequence if necessary, we have $x_k \to \bar x$. Using Lemma 10, we have $\bar x_{k+1} \to \bar x$ and $x_{k+1} \to \bar x$. Then, we have
$$\psi_0^k(x_{k+1}) + \langle\bar\lambda_{k+1}, \psi^k(x_{k+1})\rangle \le \psi_0^k(\bar x_{k+1}) + \langle\bar\lambda_{k+1}, \psi^k(\bar x_{k+1})\rangle + \epsilon_k \le \psi_0^k(x) + \langle\bar\lambda_{k+1}, \psi^k(x)\rangle + \epsilon_k.$$
Note that the above relation is comparable to (3.10) up to an error term of $\epsilon_k$. Following the arguments in the proof of Theorem 1 (Part 1, from (3.10) onwards) and noting that $\epsilon_k$ is summable, we conclude that $\{\bar\lambda_{k+1}\}$ is bounded.

Now, we prove that every limit point of $\{x_k\}$ is a KKT point. Since $\mathcal{L}_k(x_{k+1}, \bar\lambda_{k+1}) \le \mathcal{L}_k(\bar x_{k+1}, \bar\lambda_{k+1}) + \epsilon_k$, we can rewrite (3.18) as
$$\big\langle\nabla f_0(x_k) + \textstyle\sum_{i=1}^m \bar\lambda_i^{k+1}\nabla f_i(x_k),\, x_{k+1} - x\big\rangle + \chi_0(x_{k+1}) - \chi_0(x) + \langle\bar\lambda_{k+1}, \chi(x_{k+1}) - \chi(x)\rangle$$
$$\le \big\langle\nabla f_0(x_k) + \textstyle\sum_{i=1}^m \bar\lambda_i^{k+1}\nabla f_i(x_k),\, \bar x_{k+1} - x\big\rangle + \chi_0(\bar x_{k+1}) - \chi_0(x) + \langle\bar\lambda_{k+1}, \chi(\bar x_{k+1}) - \chi(x)\rangle + \epsilon_k$$
$$\le \frac{L_0 + \langle\bar\lambda_{k+1}, L\rangle}{2}\big[\|x - x_k\|^2 - \|\bar x_{k+1} - x_k\|^2 - \|x - \bar x_{k+1}\|^2\big] + \epsilon_k. \quad (B.2)$$
Let $\bar x$ be a limit point of the sequence $\{x_k\}$. Since $\bar\lambda_{k+1}$ is bounded, it has a limit point $\bar\lambda$. Without loss of generality, $x_k \to \bar x$ and $\bar\lambda_{k+1} \to \bar\lambda$. Then, in view of Lemma 10, we have $\lim_{k\to\infty} x_{k+1} = \bar x$ and $\lim_{k\to\infty} \bar x_{k+1} = \bar x$. Taking the limit $k \to \infty$ in (B.2), we obtain
$$\big\langle\nabla f_0(\bar x) + \textstyle\sum_{i=1}^m \bar\lambda_i\nabla f_i(\bar x),\, \bar x - x\big\rangle + \chi_0(\bar x) - \chi_0(x) + \langle\bar\lambda, \chi(\bar x) - \chi(x)\rangle \le 0.$$
Note that the above relation matches (3.19) exactly. From here, we follow the proof of Theorem 1 (Part 2, from (3.19) onwards) to conclude the first-order stationarity of $(\bar x, \bar\lambda)$. A similar argument shows complementary slackness. Hence, $(\bar x, \bar\lambda)$ is a KKT solution, and we conclude the proof.

Figure 1 :
Figure 1: The nonconvex constraint $\psi_1(x) \le \eta_1$ with $\eta_1 = 3$. The dotted blue curves are the subproblem constraints at two different points. Since the MFCQ is violated at $(5, 0)$, the subproblem constraint reduces to a single feasible point at the limit point $(5, 0)$.

Figure 2 :
Figure 2: The nonconvex constraint $\psi_1(x) \le \eta_1$ with $\eta_1 = 2.5$. The dotted blue curves are the subproblem constraints at two different points. Since the MFCQ is satisfied, the limiting subproblem constraint at $(3, 0)$ is still a full-dimensional set with nonempty interior.

Using $\delta_k = \frac{\eta - \eta^0}{(k+1)(k+2)}$ and $\epsilon_k = \frac{\|\eta - \eta^0\|}{2(k+1)(k+2)}$, we have that $x_k$ is a strictly feasible solution of (3.1) for all $k$. Hence, under Assumption 3, we can follow the steps of Theorem 1 to establish a uniform bound $B$ on the sequence $\{\|\tilde\lambda_k\|\}$.
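This level schedule telescopes cleanly: since $\sum_{k\ge0} \frac{1}{(k+1)(k+2)} = 1$, the levels $\eta^k = \eta^0 + \sum_{j<k}\delta_j$ increase strictly toward $\eta$, and the errors $\epsilon_k$ sum to $\|\eta - \eta^0\|/2$. A quick numerical check in the scalar case (illustrative values $\eta^0 = 0$, $\eta = 1$):

```python
eta0, eta = 0.0, 1.0
delta = lambda k: (eta - eta0) / ((k + 1) * (k + 2))
eps = lambda k: abs(eta - eta0) / (2 * (k + 1) * (k + 2))

K = 100_000
eta_K = eta0 + sum(delta(k) for k in range(K))
assert eta_K < eta                    # levels stay strictly below the target
assert eta - eta_K < 1e-4             # sum_{k<K} 1/((k+1)(k+2)) = 1 - 1/(K+1)
assert abs(sum(eps(k) for k in range(K)) - 0.5 * (eta - eta0)) < 1e-4
```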
(4.16) Putting this relation and (4.15) together, we have

Table 3 :
Comparison of algorithms on convex quadratic problems.Running time is measured in seconds.