Abstract
We present a new feasible proximal gradient method for constrained optimization where both the objective and constraint functions are given by the sum of a smooth, possibly nonconvex function and a convex simple function. The algorithm converts the original problem into a sequence of convex subproblems. Formulating those subproblems requires at most one gradient and function value evaluation of the original objective and constraint functions. Either exact or approximate subproblem solutions can be computed efficiently in many cases. An important feature of the algorithm is the constraint level parameter. By carefully increasing this level for each subproblem, we provide a simple solution to overcome the challenge of bounding the Lagrange multipliers and show that the algorithm follows a strictly feasible solution path until convergence to a stationary point. We develop a simple, proximal-gradient-descent-type analysis, showing that the complexity bound of this new algorithm is comparable to that of gradient descent in the unconstrained setting, which is new in the literature. Exploiting this new design and analysis technique, we extend our algorithms to some more challenging constrained optimization problems where (1) the objective is a stochastic or finite-sum function, and (2) structured nonsmooth functions replace smooth components of both objective and constraint functions. Complexity results for these problems also seem to be new in the literature. Finally, our method can also be applied to convex function constrained problems, for which we show complexities similar to those of the proximal gradient method.
1 Introduction
In this paper, we study the following constrained optimization problem
where \(\psi _{i}(x)\) is a composite function that sums up the functions \(f_i(x)\) and \(\chi _{i}(x)\). Here, \(f_i, i ={0}, 1, \dots , m\), are smooth functions, \(\chi _0(x)\) is a proper, convex, lower-semicontinuous (lsc) function, and \(\chi _{i}(x), i=1,\dots ,m\), are convex continuous functions over the domain of \(\chi _{0}\) (i.e. \({\textrm{dom}}_{\chi _{0}}\)). We assume that \(\chi _{i}, i =0, \dots , m\), are ‘simple’ functions, namely, that a feasible optimization problem of the form below
can be solved to efficiently obtain either an exact solution or an inexact solution of desired accuracy. Note that if \(\chi _{i} =0, i= 1, \dots , m\), then (1.2) becomes a proximal operator for function \(\chi _{0}\) on the intersection of balls. If we further assume \(\chi _{0} = 0\), then (1.2) is a special type of quadratically constrained quadratic programming (QCQP) that can be solved efficiently because all the Hessians are identity matrices. In addition, we consider the case where constraints \(f_i, i =1, \dots , m\), are structured nonsmooth functions which can be approximated by smooth functions (also called smoothable functions). Note that problem (1.1) covers a variety of convex and nonconvex function constrained optimization depending on the assumptions of \(f_i, i = 0, \dots , m\).
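For intuition, the simplest instance of such a subproblem — a single constraint (\(m=1\)) with \(\chi _0 = \chi _1 = 0\), i.e., a quadratic objective with one ball constraint — can be solved by bisection on the Lagrange multiplier of the constraint. The sketch below is illustrative only (it is not the solver used in this paper, and all names are hypothetical); it assumes the subproblem is strictly feasible, as guaranteed later by the level-control design.

```python
import numpy as np

def solve_subproblem(x_bar, g0, L0, f1_val, g1, L1, eta1, tol=1e-10):
    """Solve  min <g0, x> + L0/2 ||x - x_bar||^2
       s.t.   f1_val + <g1, x - x_bar> + L1/2 ||x - x_bar||^2 <= eta1
    by bisection on the multiplier lam >= 0 (assumes strict feasibility)."""
    def x_of(lam):
        # minimizer of the weighted sum of the two quadratics
        return x_bar - (g0 + lam * g1) / (L0 + lam * L1)
    def viol(lam):
        # constraint value at x(lam); decreasing in lam
        d = x_of(lam) - x_bar
        return f1_val + g1 @ d + 0.5 * L1 * (d @ d) - eta1
    if viol(0.0) <= 0:          # unconstrained minimizer already feasible
        return x_of(0.0), 0.0
    lo, hi = 0.0, 1.0
    while viol(hi) > 0:         # grow bracket until constraint is satisfied
        hi *= 2.0
    while hi - lo > tol * (1 + hi):   # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if viol(mid) > 0 else (lo, mid)
    return x_of(hi), hi
```

With identity Hessians, each trial multiplier costs only a few vector operations, which is why diagonal QCQP subproblems of this type are cheap.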
Nonlinear optimization with function constraints is a classical topic in continuous optimization. While earlier studies focused on asymptotic performance, recent work has put more emphasis on the complexity analysis of algorithms, mainly driven by the growing interest in large-scale optimization and machine learning. For most of our discussion on complexity analysis, we generally require convergence to an \(\epsilon \)-approximate KKT point (c.f. Definition 3). Penalty methods [9, 25, 33], including augmented Lagrangian methods [22,23,24, 34], are one popular approach for constrained optimization. In [8], Cartis et al. presented an exact penalty method by minimizing a sequence of convex composition functions. When the penalty weight is bounded, this method solves \({\mathcal {O}}(1/\epsilon )\) trust region subproblems. If the penalty weight is unbounded, the complexity is of \({\mathcal {O}}(1/\epsilon ^{2.5})\) to reach an \(\epsilon \)-KKT point. In a subsequent work [9], the authors provided a target-following method that achieves the complexity of \({\mathcal {O}}(1/\epsilon )\), regardless of the growth of the penalty parameter. In [33], Wang et al. extended the penalty method to constrained problems where the objective takes the expectation form. Sequential quadratic programming (SQP) is another important approach for constrained optimization. Typically, SQP involves linearization of the constraints, quadratic approximation of the objective, and possibly some trust region constraint for the convergence guarantee [6, 7]. The recent work [12] established a unified convergence analysis of generalized SQP (GSQP) in more general settings where feasibility and constraint qualification may or may not hold. Different from standard SQP, the Moving Balls Approximation (MBA) method [1] follows a feasible solution path and transforms the initial problem into a diagonal QCQP.
A subsequent work [3] presented a unified analysis of MBA and other variants of SQP methods. Under the assumption of the Kurdyka–Łojasiewicz (KL) property, they establish global convergence rates which depend on the Łojasiewicz exponent.
Despite much progress in prior works, some significant issues remain. Specifically, most of the analysis is carried out only for smooth optimization and requires that the exact optimal solution of the convex subproblem be readily available. Unfortunately, both assumptions can be unrealistic in many large-scale applications. To overcome these issues, [4, 25, 26] presented some new proximal point algorithms that iteratively solve strongly convex proximal subproblems inexactly using first-order methods. A significant computational advantage is that first-order methods only need to compute a relatively easy proximal gradient mapping in each iteration. In particular, [4] proposed to solve the proximal point subproblem by a new first-order primal–dual method called ConEx. Under some strict feasibility assumption, they derived the total complexities of the overall algorithm, for which the objective and constraints can be either stochastic or deterministic, and either nonsmooth or smooth. Notably, for nonconvex and smooth constrained problems, the inexact proximal point method [4] requires \({\mathcal {O}}(1/\epsilon ^{1.5})\) function/gradient evaluations. A similar complexity bound is obtained by the proximal point penalty method [25] when a feasible point is available. Nevertheless, at this point, it may be difficult to directly compare the efficiency of the proximal point approach with the earlier approaches, given that very different oracles are employed in each method. The inexact proximal point method appears to be less efficient in terms of gradient and function value computations, since the first-order penalty method [9] and a variant of SQP [12] (where the surrogate is formed by first-order approximation) have an \({\mathcal {O}}(1/\epsilon )\) complexity bound. Nevertheless, it might be more efficient if the corresponding proximal mapping is much easier to solve than the subproblems in penalty or SQP methods.
In this paper, we attempt to alleviate some of the aforementioned issues in solving nonconvex constrained optimization. Our main contribution is the development of a novel Level Constrained Proximal Gradient (LCPG) method for constrained optimization, based on the following key ideas.
First, we convert the original problem (1.1) into a sequence of simple convex subproblems of the form (1.2) for which an exact or an approximate solution can be computed efficiently. In particular, solving the subproblem requires at most one gradient and function value computation for \(f_i, i =0, \dots , m\). This phenomenon is quite similar to simple single-loop methods, even though the LCPG method can be multi-loop since we allow for an inexact solution of (1.2) using some kind of iterative scheme.
Second, starting from a strictly feasible initial point and carefully controlling the feasibility levels of the subproblem constraints, we ensure that LCPG follows a strictly feasible solution path. This also allows us to deal with nonsmooth constraints where \(\chi _i\) is not necessarily 0 and further extends LCPG to the inexact case where the subproblem admits an approximate solution. Even though subtle, the level-control design is crucial in bounding the Lagrange multipliers under the well-known Mangasarian–Fromovitz constraint qualification (MFCQ) [4, 27]. Subsequently, we also show asymptotic convergence of the LCPG method.
Third, we offer a new insight into the complexity analysis of LCPG as a gradient-descent-type method, which could be of independent interest. When the objective and constraints are nonconvex composite, we aim to find a first-order \(\epsilon \)-KKT point (c.f. Definition 3) under the aforementioned MFCQ assumption. We show that the LCPG method converges in \(O(1/\epsilon )\) iterations. Furthermore, each subproblem requires at most one function value and gradient computation. The net outcome is that the gradient complexity of our method is \(O(1/\epsilon )\). Notice that the number of iterations required by the proximal point method under MFCQ is also \(O(1/\epsilon )\) (see [4, Theorem 5]). However, each iteration of that method requires \(O(1/\epsilon ^{0.5})\) gradient computations, and hence its total gradient complexity can be bounded by \(O(1/\epsilon ^{1.5})\), which is much worse than that of the LCPG method. We compare with some significant lines of work in Table 1.
Exploiting the intrinsic connection between LCPG and proximal gradient (without function constraints), we extend LCPG to a variety of cases. (1) We show a similar \(O(1/\epsilon )\) gradient complexity for an inexact LCPG method for which the subproblem is solved to a prespecified accuracy. If we assume \(\chi _{i} = 0\), then the corresponding subproblem (1.2) (i.e. diagonal QCQP) can be efficiently solved by a customized interior point method in logarithmic time. In the more general setting where \(\chi _{i} \ne 0\), we propose to solve (1.2) by the first-order method ConEx, which has very cheap iterations. (2) We also extend the LCPG method to stochastic (LCSPG) and variance-reduced (LCSVRG) variants when \(f_0\) is either a stochastic or a finite-sum function, respectively. LCSPG and LCSVRG require \(O(1/\epsilon ^2)\) (similar to SGD [15]) and \(O(\sqrt{n}/\epsilon )\) (similar to SVRG [18]) stochastic gradients, respectively, where n is the number of components in the finite-sum objective. The complexities of the variants of the LCPG method for stochastic cases can also be seen in Table 2. (3) We consider the case when the functions \(f_i, i =0,1, \dots , m\), are nondifferentiable but contain a smooth saddle structure (referred to as structured nonsmooth). We extend the LCPG method to such nonsmooth nonconvex function constrained problems using Nesterov’s smoothing scheme [29]. In this case, the LCPG method requires \(O(1/\epsilon ^2)\) gradients.
We show that the GD-type analysis of the LCPG method can be extended to the convex case. In particular, when the objective and constraint functions are convex, we show that the LCPG method requires \(O(1/\epsilon )\) gradient computations for smooth and composite constrained problems, and this complexity improves to \(O(\log {({1}/{\epsilon })})\) when the objective is smooth and strongly convex. Furthermore, we develop the complexity of inexact variants of the LCPG method by leveraging the analysis of gradient descent with inexact projection oracles [31]. The inexact LCPG method maintains the gradient complexity of \(O(1/\epsilon )\) and \(O(\log (1/\epsilon ))\) for convex and strongly convex problems, respectively.
Throughout our analysis, we require that the Lagrange multipliers for the convex subproblems of type (1.2) are bounded. This problem is addressed in different ways in arguably all works in the literature. In this paper, we show that under the assumption of MFCQ, Lagrange multipliers associated with the sequence of subproblems remain bounded by a quantity specified as B. Even then, the value of B cannot be estimated a priori. Fortunately, this bound is not needed in the implementation of our methods. However, it plays a role in complexity analysis. Hence, our comparison with the existing complexity literature (e.g., proximal point method of [4]) is valid when bound B on the sequence of Lagrange multipliers largely depends on the problem itself and not on the sequence of subproblems. One can easily see that such uniform bounds on Lagrange multipliers hold under the strong feasibility constraint qualification [4], a similar uniform Slater’s condition [26] or for nonsmooth nonconvex relaxation in the application of sparsity constrained optimization [5]. The problem of comparing bounds B on Lagrange multipliers requires getting into specific applications, which is not the purpose of this paper. Hence, throughout our comparison with existing literature, we assume that bound B for different methods is of a similar order.
Comparison with the MBA method Auslender et al. [1] provided a Moving Balls Approximation (MBA) method for smooth constrained problems, i.e., where \(\chi _i(x), i=0, \dots , m\), are not present. They use Lipschitz continuity of the constraint gradients along with MFCQ to ensure that the subproblems satisfy Slater’s condition (see [1, Proposition 2.1(iii)]). A similar result is also used in [35], where the authors provide a line-search version of MBA for functions satisfying certain KL properties. The MBA method was studied for semi-algebraic functions in [3], where the KL property of semi-algebraic functions was used. The work [1] also provides a complexity guarantee for constrained programs with a smooth and strongly convex objective. Our results differ from these past studies in several aspects. First, we do not assume any KL property on the class of functions, hence making the method applicable to a wider class of problems. Second, we show a complexity analysis for a variety of cases, e.g., stochastic, finite-sum, or structured nonsmooth cases. Note that complexity results are not known for MBA-type methods even for the purely smooth problem. Third, we show complexity results for both convex and strongly convex cases, which strictly subsume the results in [1]. Fourth, it should be noted that [1] also considered the efficiency of solving the subproblems. They proposed an accelerated gradient method that obtains \({\mathcal {O}}(1/\sqrt{\epsilon })\) complexity for solving the dual of the QCQP subproblem. However, it is unclear what accuracy suffices to ensure asymptotic convergence of the whole algorithm.
Comparison with generalized SQP The work [12] developed the first complexity analysis of the generalized SQP (GSQP) method by using a novel ghost penalty approach. Different from our feasible method, they consider a general setting where feasibility and constraint qualification may or may not hold. They show that SQP-type methods have an \({\mathcal {O}}(1/\epsilon ^2)\) complexity for reaching some \(\epsilon \)-approximate generalized stationary point. Under an extended MFCQ condition, they established an improved complexity of \({\mathcal {O}}(1/\epsilon )\) for reaching a scaled KKT point, which matches our complexity result under a similar MFCQ assumption. Notably, both their analysis and ours rely on MFCQ to show that a global upper bound (constant B) on the multipliers of the subproblems exists. However, to obtain the best \({\mathcal {O}}(1/\epsilon )\) complexity, GSQP explicitly relies on the value of this unknown upper bound to determine the stepsize, which appears to be rather challenging in practical use. In contrast, our algorithm does not involve the constant B in its implementation; we only require the Lipschitz constants of the gradients, which is standard for gradient descent methods.
Outline This paper is organized as follows: Sect. 2 describes notations and assumptions. It also provides various definitions used throughout the paper. Section 3 presents the LCPG method, which uses exact solutions of the subproblems. It also establishes asymptotic convergence and convergence rate results. Sections 4.1 and 4.2 provide the LCSPG and LCSVRG methods for stochastic and finite-sum problems, respectively. Section 5 shows the extension of LCPG to nonsmooth nonconvex function constraints. Section 6 introduces the inexact LCPG method and provides its complexity analysis when the subproblems are inexactly solved by an interior point method or a first-order method. Finally, Sect. 7 extends the LCPG method to convex optimization problems and establishes its complexity for both strongly convex and convex problems.
2 Notations and assumptions
Notations. \(\mathbb {R}^n_+\) stands for the nonnegative orthant in \(\mathbb {R}^n\). We use \(\Vert \cdot \Vert _{}^{}\) to denote the Euclidean norm. For a set \({\mathcal {X}}\), we define \(\Vert {\mathcal {X}}\Vert _{}^{}=\text {dist}(0,{\mathcal {X}})=\inf \big \{\Vert x\Vert _{}^{}:x\in {\mathcal {X}}\big \}\). If \({\mathcal {X}}\) is a convex set, then we denote its normal cone at x by \(N_{\mathcal {X}}(x)\). Furthermore, we denote the dual cone of the normal cone at x by \(N^*_{\mathcal {X}}(x)\). Let e be the vector of all ones. For simplicity, we denote \([m]=\{1,2,\ldots , m\}\), \(f(x)=[f_{1}(x),\ldots ,f_{m}(x)]^{\textrm{T}}\), \(\chi (x)=[\chi _{1}(x),\ldots ,\chi _{m}(x)]^{\textrm{T}}\), and \(\psi (x)=[\psi _{1}(x),\psi _{2}(x),\ldots ,\psi _{m}(x)]^{\textrm{T}}\). For vectors \(x, y\in \mathbb {R}^m\), \(x\le y\) is understood as \(x_i\le y_i\) for all \(i\in [m]\).
Assumption 1
(General) We assume that the optimal value of problem (1.1) is finite: \(\psi _{0}^{*}>-\infty \). Furthermore, the objective and constraint functions have the following properties.

1:
\(\chi _{0}\) is a proper, convex, and lower semicontinuous (lsc) function. Moreover, we assume that for all \(i = 1, \dots , m\), the function \(\chi _{i}(x)\) is convex continuous over \({\textrm{dom}}_{\chi _{0}}\).

2:
\(f_{i}(x)\) is \(L_{i}\)-Lipschitz smooth on \({\textrm{dom}}_{\chi _{0}}\): \(\Vert \nabla f_{i}(x)-\nabla f_{i}(y)\Vert _{}^{}\le L_{i}\Vert x-y\Vert _{}^{}\) for any \(x,y\in {\textrm{dom}}_{\chi _{0}}\). For brevity, we denote \(L=[L_{1},\ldots ,L_{m}]^{\textrm{T}}\).

3:
The feasible set for (1.1), i.e., \(\bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}\), is nonempty and compact.
The Lagrangian function of problem (1.1) is denoted by
For the functions \(\psi _i\), we denote their subdifferentials as
where the sum is in the Minkowski sense. Note that this definition of the subdifferential for nonconvex functions was first proposed in [4]. Moreover, \(\partial \psi _i = \{\nabla f_i \}\) when \(\psi _i\) is a “purely” smooth nonconvex function and \(\partial \psi _i = \partial \chi _i\) when \(\psi _i\) is a nonsmooth convex function. Hence, it is a valid definition for the subdifferential of a nonconvex function. Below, we define the KKT condition using the above subdifferential.
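Spelling out the Minkowski sum just described, the subdifferential can be reconstructed (consistently with the two special cases noted above) as:

```latex
\partial \psi_i(x) \;:=\; \{\nabla f_i(x)\} + \partial \chi_i(x),
\qquad i = 0, 1, \dots, m,
```

so that it reduces to \(\{\nabla f_i(x)\}\) when \(\chi _i \equiv 0\) and to \(\partial \chi _i(x)\) when \(f_i \equiv 0\).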
Definition 1
(KKT condition) We say \(x\in {\textrm{dom}}_{\chi _{0}}\) is a KKT point of problem (1.1) if x is feasible and there exists a vector \(\lambda \in \mathbb {R}_{+}^{m}\) such that
The values \(\{\lambda _{i}\}\) are called Lagrange multipliers.
It is known that the KKT condition is necessary for optimality under the assumption of certain constraint qualifications (c.f. [2]). Our result will be based on a variant of the Mangasarian–Fromovitz constraint qualification, which is formally given below.
Definition 2
(MFCQ) We say that a point x satisfies the Mangasarian–Fromovitz constraint qualification for (1.1) if there exists a vector \(z \in N^*_{{\textrm{dom}}_{\chi _{0}}}(x)\) such that
where \({\mathcal {A}}(x)=\{i:1\le i\le m,\psi _{i}(x)=\eta _{i}\}\).
Proposition 1
(Necessary condition) Let x be a local optimal solution of Problem (1.1). If it satisfies MFCQ (2.3), then there is a vector \(\lambda \in \mathbb {R}_{+}^{m}\) such that the KKT condition (1) holds.
Next, we introduce some optimality measures before formally presenting any algorithms. It is natural to characterize algorithm performance by measuring the error of satisfying the KKT condition. Towards this goal, we have the following definition.
Definition 3
We say that x is an \(\epsilon \) type-I (approximate) KKT point if it is feasible (i.e. \(\psi (x)\le \eta \)) and there exists a vector \(\lambda \in \mathbb {R}_{+}^{m}\) satisfying the following conditions:
Moreover, x is a randomized \({\epsilon }\) type-I KKT point if both x and \(\lambda \) are feasible randomized primal–dual solutions that satisfy
where the expectation is taken over the randomness of x and \(\lambda \).
Besides the above definition, we invoke a second optimality measure, which will help analyze the performance of a proximal algorithm (see, for example, [4]). Therein, it is arguably more convenient to measure the proximity to some nearly stationary point.
Definition 4
We say that x is an \({(\epsilon , \nu )}\) type-II KKT point if there exists an \({\epsilon }\) type-I KKT point \(\hat{x}\) such that \(\Vert x-\hat{x}\Vert _{}^{2}\le \nu \). Similarly, x is a randomized \({(\epsilon , \nu )}\) type-II KKT point if \(\hat{x}\) is a random vector and \(\mathbb {E}[\Vert x-\hat{x}\Vert _{}^{2} ]\le {\nu }\).
3 A proximal gradient method
We present the level constrained proximal gradient (LCPG) method in Algorithm 1. The main idea of this algorithm is to turn the original nonconvex problem into a sequence of relatively easier subproblems that involve some convex surrogate functions \(\psi ^k_i(x)\) (\(0\le i\le m\)) and variable constraint levels \(\eta ^k\):
Above, we take the surrogate function \(\psi _{i}^{k}(x)\) (\(0\le i\le m\)) by partially linearizing \(\psi _i(x)\) at \(x^k\) and adding the proximal term \(\tfrac{L_i}{2}\Vert x-x^k\Vert _{}^{2}\):
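Concretely, linearizing the smooth part \(f_i\) while keeping \(\chi _i\) intact gives the surrogate (a reconstruction consistent with how \(\psi _i^k\) is used in the proofs below):

```latex
\psi_i^k(x) \;=\; f_i(x^k) + \langle \nabla f_i(x^k),\, x - x^k\rangle
\;+\; \tfrac{L_i}{2}\,\|x - x^k\|^2 \;+\; \chi_i(x),
\qquad 0 \le i \le m.
```

Each \(\psi _i^k\) is strongly convex and, by \(L_i\)-smoothness of \(f_i\), an upper bound on \(\psi _i\).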
It should be noted that our algorithm may not be well-defined if it were initialized with an infeasible solution \(x^0\). Furthermore, we require the initial point to be strictly feasible with respect to the nonlinear constraints \(\psi (x)\le \eta \). Therefore, we explicitly state this assumption below and assume it holds throughout the paper.
Assumption 2
(Strict feasibility) There exist a point \(x^0\in {\textrm{dom}}_{\chi _{0}}\) and a vector \(\eta ^0\in \mathbb {R}^m\) satisfying
With a strictly feasible solution, we assume that the constraint levels \(\{\eta ^k\}\) are incrementally updated and converge to the constraint levels for the original problem:
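The outer loop of LCPG can be sketched as follows. The geometric level schedule below is one hypothetical choice satisfying the requirement that \(\eta ^k\) increase monotonically from \(\eta ^0\) to \(\eta \) (the paper does not pin down a specific schedule), and `solve_subproblem` stands in for any routine returning the minimizer of the convex surrogate model with levels \(\eta ^k\); all names are illustrative.

```python
import numpy as np

def lcpg(x0, eta0, eta, solve_subproblem, K=100, r=0.5):
    """Sketch of the LCPG outer loop.

    solve_subproblem(x_k, eta_k) should return the solution of the
    convex subproblem built at x_k with constraint levels eta_k.
    The levels eta_k increase monotonically from eta0 toward eta,
    here via a hypothetical geometric schedule."""
    x = x0
    for k in range(K):
        # eta^k = eta - r^{k+1} (eta - eta0): strictly below eta, -> eta
        eta_k = eta - r ** (k + 1) * (eta - eta0)
        x = solve_subproblem(x, eta_k)
    return x
```

Because every \(\eta ^k\) stays strictly below \(\eta \), each iterate produced this way remains strictly feasible for the original constraints, which is exactly the solution-path property established below.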
The following lemma will be used many times throughout the rest of the paper.
Lemma 1
[Three-point inequality] Let \(g:\mathbb {R}^d\rightarrow (-\infty , \infty ]\) be a proper lsc convex function and
then for any \(x\in \mathbb {R}^d\), we have
Next, we present some important properties of the generated solutions in the following proposition.
Proposition 2
Suppose that Assumption 2 holds, then Algorithm 1 has the following properties.

1.
The sequence \(\big \{ x^{k}\big \}\) is well-defined and is feasible for problem (1.1). \(\{x^k\}\) satisfies the sufficient descent property:
$$\begin{aligned} \tfrac{L_{0}}{2}\Vert x^{k+1}-x^{k}\Vert _{}^{2}\le \psi _{0}(x^{k})-\psi _{0}(x^{k+1}). \end{aligned}$$(3.5)

Moreover, the sequence of objective values \(\big \{\psi _{0}(x^{k})\big \}\) is monotonically decreasing and \(\lim _{k\rightarrow \infty }\psi _{0}(x^{k})\) exists.

2.
There exists a vector \(\lambda ^{k+1}\in \mathbb {R}_{+}^{m}\) such that the KKT condition holds:
$$\begin{aligned} \begin{aligned} \partial \psi _{0}^{k}(x^{k+1})+\sum _{i=1}^{m}\lambda _{i}^{k+1}\partial \psi _{i}^{k}(x^{k+1})&\ni 0\\ \lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ]&=0,\quad i\in [m]. \end{aligned} \end{aligned}$$(3.6)
Proof
Part 1). First, we show that \(\{x^k\}\) is a well-defined sequence, namely, that \({\mathcal {X}}_k\cap {\textrm{dom}}_{\chi _{0}}\) is a nonempty set, where \({\mathcal {X}}_k=\big \{ x:\psi _{i}^{k}(x)\le \eta _{i}^{k}\big \}\). This result clearly holds for \(k = 0\) by Assumption 2. We show the general case \((k>0)\) by induction. Suppose that \(x^k\) is well-defined, i.e., \({\mathcal {X}}_{k-1}\cap {\textrm{dom}}_{\chi _{0}}\) is nonempty. Then, by the definitions of \(\psi _{i}^{k}, \psi _{i}^{k-1}\) and of \(x^k\), we have \(x^k \in {\textrm{dom}}_{\chi _{0}}\) and
Here, the first inequality follows due to the smoothness of \(f_i(x)\) which ensures for all x,
This is equivalent to \(x^k \in {\mathcal {X}}_k\cap {\textrm{dom}}_{\chi _{0}}\), implying that \({\mathcal {X}}_k \cap {\textrm{dom}}_{\chi _{0}}\) is nonempty. Hence, \(x^{k+1}\) is well-defined, and by induction, \(\{x^k\}\) is a well-defined sequence. Furthermore, in view of \(x^k\in {\textrm{dom}}_{\chi _{0}}\), relation (3.7) and the fact that \(\eta ^k_i < \eta _i\), we have \(x^k \in {\textrm{dom}}_{\chi _{0}} \cap \{x: \psi _{i}(x)\le \eta _{i},\ i = 1, \dots ,m\}\). Hence, the whole sequence \(\{x^k\}\) remains feasible for the original problem.
Now, let us apply Lemma 1 with \(g(x)=\langle \nabla f_{0}(x^{k}), x\rangle +\chi _{0}(x)+\textbf{1}_{{\mathcal {X}}_k}(x)\), \(y=x^k\) and \(\gamma =L_{0}\). Then, for any \(x \in {\textrm{dom}}_{\chi _{0}} \cap {\mathcal {X}}_k\), we have
Placing \(x=x^{k}\) in the above relation, we have
Moreover, since \(f_{0}(\cdot )\) is Lipschitz smooth, we have that
Summing up the above two inequalities and using the definition \(\psi _{0}=f_{0}+\chi _{0}\), we conclude (3.5). Hence, the sequence \(\big \{\psi _{0}(x^{k})\big \}\) is monotonically decreasing. The convergence of \(\big \{\psi _{0}(x^{k})\big \}\) follows from the lower boundedness assumption.
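For concreteness, the two inequalities being summed can be reconstructed from Lemma 1 (applied with \(x = x^k\), \(y = x^k\), \(\gamma = L_0\)) and the \(L_0\)-smoothness of \(f_0\) as:

```latex
\begin{aligned}
\langle \nabla f_0(x^k),\, x^{k+1}-x^k\rangle + \chi_0(x^{k+1})
  + L_0\|x^{k+1}-x^k\|^2 &\le \chi_0(x^k),\\
f_0(x^{k+1}) &\le f_0(x^k) + \langle \nabla f_0(x^k),\, x^{k+1}-x^k\rangle
  + \tfrac{L_0}{2}\|x^{k+1}-x^k\|^2.
\end{aligned}
```

Adding the two and canceling the inner-product terms leaves \(\tfrac{L_0}{2}\Vert x^{k+1}-x^k\Vert ^2 \le \psi _0(x^k)-\psi _0(x^{k+1})\), i.e., (3.5).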
Part 2). Note that (3.7) ensures the strict feasibility of \(x^k\) w.r.t. the constraint set \({\mathcal {X}}_k\) for the kth subproblem. Therefore, Slater’s condition for (3.1) and the optimality of \(x^{k+1}\) imply that there must exist a vector \(\lambda ^{k+1} \in \mathbb {R}^m_+\) satisfying KKT condition (3.6). Hence, we complete the proof. \(\square \)
In order to show convergence to the KKT solutions, we need the following constraint qualifications.
Assumption 3
[Uniform MFCQ] All the feasible points of problem (1.1) satisfy MFCQ.
We state the main asymptotic convergence property of LCPG in the following theorem.
Theorem 1
Suppose that Assumption 3 holds, then we have the following conclusions.

1.
The dual solutions \(\{\lambda ^{k+1}\}\) are bounded from above. Namely, there exists a constant \(B>0\) such that
$$\begin{aligned} \sup _{0\le k\le \infty }\Vert \lambda ^{k+1}\Vert _{}^{}<B. \end{aligned}$$(3.9) 
2.
Every limit point of Algorithm 1 is a KKT point.
Proof
Part 1). First, we can immediately see that \(\{x^k\}\) is a bounded sequence, and hence a limit point exists. This result follows from Assumption 1.3 and the feasibility of \(x^k\) for problem (1.1) (c.f. Proposition 2, Part 1). Without loss of generality, we can assume \(\lim _{k\rightarrow \infty }x^{k}=\bar{x}\). For the sake of contradiction, suppose that \(\lambda ^{k+1}\) is unbounded; then, passing to a subsequence if necessary, we can assume \(\Vert \lambda ^{k+1}\Vert _{}^{}\rightarrow \infty \) for simplicity. Note that we also have \(\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^{k}\Vert _{}^{2}=0\) due to the sufficient descent property (3.5). From the KKT condition (3.6), we have
Let \(X:= \bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}\) be the feasible domain for problem (1.1). Due to the fact that \(x^k \in X\) (Proposition 2), boundedness of X (Assumption 1.3) and strong convexity of \(\psi ^k_{0}\), there exists \(l_0 \in \mathbb {R}\) such that \(X \subset \{x: \psi _{0}^k(x) < l_0\}\) for all k. Then, using (3.10) for all \(x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _{0}^k(x) \le l_0\}\) and dividing both sides by \(\Vert \lambda ^{k+1}\Vert _{}^{}\), we have
Let us take \(k\rightarrow \infty \) on both sides of (3.11). Note that for all \(x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _0^k(x) \le l_0\}\), we have
where (3.12) is due to the boundedness of \(\psi _{0}^k(x)\) on \({\textrm{dom}}_{\chi _{0}} \cap \{\psi _{0}^k(x) \le l_0\}\) and the boundedness of \(\psi _{0}^k(x^{k+1})\), since \(x^{k+1} \in X\), which is a bounded set. Moreover, (3.13) and (3.14) use the continuity of \(f_i(x)\) and \(\chi _i(x)\), \(i\in [m]\). Next, we consider the sequence \(\{u^{k}=\lambda ^{k+1}/\Vert \lambda ^{k+1}\Vert _{}^{}\}\). Since \(\{u^{k}\}\) is a bounded sequence, it has a convergent subsequence. Let \(\bar{u}\) be a limit point of \(\{u^{k}\}\) and let \(\{j_{k}\}\subseteq \{1,2,\ldots \}\) be a subsequence such that \(\lim _{k\rightarrow \infty }u^{j_{k}}=\bar{u}\). Since any subsequence of a convergent sequence is also convergent, we pass to the subsequence \(j_{k}\) in (3.11) and apply (3.12), (3.13) and (3.14), yielding
for all \(x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _0^k(x) \le l_0\}\). Hence, \(\bar{x}\) minimizes \(\sum _{i=1}^{m}\bar{u}_{i}\big [\langle \nabla f_{i}(\bar{x}),x-\bar{x}\rangle +\tfrac{L_{i}}{2}\Vert x-\bar{x}\Vert _{}^{2}+\chi _{i}(x)\big ]\) over \({\textrm{dom}}_{\chi _{0}}\cap \{\psi _0^k(x) \le l_0\}\). Now, noting \(\bar{x} \in X \subset \{\psi _0^k(x) < l_0\}\) and using the stationarity condition for the optimality of \(\bar{x}\), we have
where we dropped the constraint \(\psi _0^k(x) \le l_0\) due to complementary slackness and the fact that \(\psi _{0}^k(\bar{x}) < l_0\).
Let \({\mathcal {B}}=\{i:\bar{u}_{i}>0\}\); then we must have \(\lim _{k\rightarrow \infty }\lambda _{i}^{j_{k}}=\lim _{k\rightarrow \infty }\bar{u}_{i}\Vert \lambda ^{j_{k}}\Vert _{}^{}=\infty \) for \(i\in {\mathcal {B}}\). Based on complementary slackness, we have \(\psi _{i}^{j_{k}}(x^{j_{k}+1})=\eta _{i}^{j_{k}}\) for any \(i\in {\mathcal {B}}\) for large enough k. Due to (3.14), we have the limit \(\psi _{i}(\bar{x})=\eta _{i}\). Therefore, the ith constraint is active at \(\bar{x}\), i.e. \(i\in {\mathcal {A}}(\bar{x})\). In view of (3.16), there exist subgradients \(v_{i}\in \partial \psi _{i}(\bar{x}), i \in [m]\), and \(v_0 \in N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})\) such that
However, equation (3.17) contradicts the MFCQ assumption. This is because MFCQ guarantees the existence of \(z {\in N^*_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})}\) such that \(\langle z,v_{i}\rangle <0\) for all \(i\in {\mathcal {A}}(\bar{x})\), which implies
where the first inequality follows since \(z \in N^*_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})\) and \(v_0 \in N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})\), implying that \(\langle z, v_0\rangle \le 0\); the second inequality follows since \(\bar{u}_i \ge 0\) and \(v_i \in \partial \psi _i(\bar{x})\); and the last strict inequality follows due to uniform MFCQ (c.f. Assumption 3 and relation (2.3)) and \(\bar{u}_i > 0\) for at least one \(i \in {\mathcal {B}}\). This clearly leads to a contradiction. Hence, we conclude that \(\{\lambda ^{k+1}\}\) is a bounded sequence, which completes the proof of this part.
Part 2). Without loss of generality, we assume that \(\bar{x}\) is the only limit point. Since \(\{\lambda ^{k+1}\}\) is a bounded sequence, there exists a limit point \(\bar{\lambda }\). Passing to a subsequence if necessary, we have \(\lambda ^{k+1}\rightarrow \bar{\lambda }\).
From the optimality condition \(0\in \partial _{x}{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})\), for any x, we have
Let us take \(k\rightarrow \infty \) on both sides of (3.18). We note that \(\lim _{k\rightarrow \infty }\Vert x^{k}x^{k+1}\Vert _{}^{}=0\) due to Proposition 2, \(\lim _{k\rightarrow \infty }{\chi _i(x^{k+1})=\chi _i(\bar{x})}\) due to the continuity of \(\chi _{i}\) (\(i\in [m]\)), and \(\chi _{0}(\bar{x})\le \liminf _{k\rightarrow \infty }\chi _{0}(x^{k})\) due to the lower semicontinuity of \(\chi _{0}(\cdot )\). It then follows that
In other words, \(\bar{x}\) is the minimizer of the convex optimization problem \(\min _x\big \langle \nabla f_0(\bar{x})+\sum _{i=1}^{m}\bar{\lambda }_i\nabla f_{i}(\bar{x}),x\big \rangle +\chi _{0}(x)+\bar{\lambda }^{\textrm{T}}\big [\chi (x)-\eta \big ]\) over \({\textrm{dom}}_{\chi _{0}}\). Hence, we have
In addition, using the complementary slackness, we have
due to the convergence \(\lim _{k\rightarrow \infty }\lambda ^{k+1}=\bar{\lambda }\), \(\lim _{k\rightarrow \infty }\eta ^{k}=\eta \), \(\lim _{k\rightarrow \infty }\chi (x^{k+1})=\chi (\bar{x})\) and \(\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^{k}\Vert _{}^{}=0\). Putting (3.20) and (3.21) together, we conclude that \((\bar{x},\bar{\lambda })\) satisfies the KKT condition. \(\square \)
Remark 1
To show the existence of a limit point \(\bar{x}\), we use Assumption 1.3 to ensure that the sequence \(\{x^k\}\) remains in a bounded domain. For the sake of conciseness, we henceforth assume the existence of a limit point \(\bar{x}\) and do not delve into the technical assumptions used to ensure this condition. Moreover, it should be noted that the boundedness property can be obtained under other assumptions; e.g., assuming the compactness of the sublevel set \(\{x: \psi _{0}(x) \le \psi _{0}(x^0)\}\) and using the sufficient descent condition (3.5), we can immediately show the existence of \(\bar{x}\). However, it appears to be more challenging to show convergence via this approach when the sufficient descent condition fails (e.g., in the forthcoming stochastic optimization).
3.1 Dependence of B on the constraint qualification
In Theorem 1, we proved the existence of a bound B on the dual multipliers. However, the size of that bound remains unknown. Through Example 1 below, we observe that the limiting behavior of the sequence \(\lambda ^k\) (which largely governs the size of B) is closely tied to the magnitude of the number \(c(\bar{x})\), where
Here, the inner max follows from the relation (2.3) and outer min tries to find the best possible z that ensures MFCQ. It is observed that if \(c(\bar{x})\) is a large positive number, then MFCQ is strongly satisfied and B is a reasonable bound. In contrast, if \(c(\bar{x})\) is close to 0, then B can get quite large.
Example 1
Consider a two-dimensional optimization problem with a SCAD constraint: \(\min _{x} \psi _0(x)\) subject to \(\psi _1(x) \le \eta _1\), where \(\psi _0(x) = 7x_1\) and \(\psi _1(x) = \beta \Vert x\Vert _{1}^{} - \sum _{i =1}^2 h_{\beta , \theta } (x_i)\). Note that
This function fits our framework with the smoothness parameter \(\tfrac{1}{\theta -1}\). Let us consider \(\beta = 1, \theta = 5\), the level \(\eta _1 = 3\) and the limit point \(\bar{x} = (5,0)\). Clearly, the constraint is active at \(\bar{x}\) and h is a \(\tfrac{1}{4}\)-smooth function. Then, \(\partial \psi (\bar{x}) = \{\begin{bmatrix}0 \\ t \end{bmatrix}: t \in [-1,1]\}\) as per the definition. Then, one can see that uniform MFCQ is violated at the limit point \(\bar{x}\). Indeed,
implying \(c(\bar{x}) = 0\). Furthermore, no \(\lambda \) can be found satisfying the KKT condition
for all \(t \in [-1,1]\). Hence, as we get close to this limit point, the bound on \(\Vert \lambda ^k\Vert _{}^{}\) gets arbitrarily large. An easy way to see this is to construct a subproblem at the limit point itself. After inspecting the feasible region of the subproblem at (5, 0), it is clear that it has only one feasible solution, (5, 0), which gives rise to degeneracy. See Fig. 1 for more details. Figure 1a shows the well-behaved subproblem at an interior point, while Fig. 1b shows the degeneracy at the limit.
However, as we change the level \(\eta _1\) to any value either above or below 3, we do not get any violation of MFCQ. It also gives nondegenerate feasible sets at the limit point, and \(\lambda ^k\) remains bounded for all k. See Fig. 2 below for more details. In particular, if \(\eta _1 = 2.5 < 3\), then \(\bar{x} = (3,0)\) is the limit point. At this point, we have \(\partial \psi (\bar{x}) = \{\begin{bmatrix}0.5 \\ t \end{bmatrix}: t \in [-1,1]\}\). Moreover, we have
Choosing the unit vector \(z = (z_1, z_2) = (-1,0)\), we obtain that the point \(\bar{x}\) satisfies MFCQ with \(c(\bar{x}) = 0.5\). Hence, even when the search point reaches the limit point \(\bar{x}\) (i.e., \(\epsilon \rightarrow 0\)), the sequence \(\lambda ^k\) remains bounded. (See, in particular, Fig. 2b, whose subproblem at \(\bar{x}\) has a nonempty interior).
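The numbers in Example 1 can be checked directly. The sketch below assumes one standard piecewise form of the SCAD penalty (the paper's decomposition \(\beta |t| - h_{\beta ,\theta }(t)\) is only used implicitly, so treat the function `scad` as illustrative rather than the paper's exact definition):

```python
def scad(t, beta=1.0, theta=5.0):
    # One standard form of the SCAD penalty (assumed, not taken from the paper).
    a = abs(t)
    if a <= beta:
        return beta * a
    if a <= theta * beta:
        return (2 * theta * beta * a - a * a - beta * beta) / (2 * (theta - 1))
    return beta * beta * (theta + 1) / 2

def psi1(x, beta=1.0, theta=5.0):
    # Constraint function of Example 1: sum of SCAD penalties over coordinates.
    return sum(scad(t, beta, theta) for t in x)

# At the degenerate limit point (5, 0) the constraint is active: psi1 = eta_1 = 3.
assert abs(psi1([5.0, 0.0]) - 3.0) < 1e-12
# With eta_1 = 2.5 the limit point is (3, 0), where psi1 = 2.5 and the slope in
# the first coordinate is (theta*beta - t)/(theta - 1) = 0.5, matching c(xbar) = 0.5.
assert abs(psi1([3.0, 0.0]) - 2.5) < 1e-12
eps = 1e-6
slope = (scad(3.0 + eps) - scad(3.0 - eps)) / (2 * eps)
assert abs(slope - 0.5) < 1e-4
```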
In view of the above example, we see that the limiting behavior of \(\Vert \lambda ^k\Vert _{}^{}\) (and, by implication, the order of B) is closely related to the “strength” of the constraint qualification MFCQ at the limit point. In order to get an a priori bound B, we use a somewhat stronger yet verifiable constraint qualification called strong feasibility, which is a slight modification of the CQ proposed in [4, Assumption 3].
Assumption 4
[Strong feasibility CQ] There exists \(\hat{x} \in {\mathcal {X}}:= \bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}\) such that
where \(D_{\mathcal {X}}:= \max _{x_1, x_2 \in {\mathcal {X}}} \Vert x_1-x_2\Vert _{}^{}\) is the diameter of the set \({\mathcal {X}}\).
In view of Assumption 1.3, we note that \({\mathcal {X}}\) is a bounded set. Hence, \(D_{\mathcal {X}}\) and (consequently) Assumption 4 are well-defined. See [4] for a connection between Assumption 4 and Assumption 3. Below, we show that the strong feasibility CQ leads to a fixed a priori bound on \(\lambda ^k\).
Lemma 2
Suppose Assumption 4 is satisfied. Then, \(\Vert \lambda ^k\Vert _{1}^{} \le B:= \tfrac{\psi _0(\hat{x}) - \psi _0^* + L_0D_{\mathcal {X}}^2}{L_{\min }D_{\mathcal {X}}^2}\).
Proof
Note that
where the first inequality uses \(f_i(\hat{x}) \ge f_i(x^k) + \langle \nabla f_i(x^k), \hat{x} - x^k\rangle - \tfrac{L_i}{2} \Vert \hat{x} - x^k\Vert _{}^{2}\) (which follows from the \(L_i\)-smoothness of \(f_i\)), and the second inequality follows by Assumption 4 along with the fact that \(x^k \in {\mathcal {X}}\) (see Proposition 2).
In view of (3.24), we have strict feasibility of subproblem (3.1) for all k implying that \(\lambda ^{k+1}\) exists. Hence, we have \(x^{k+1} = \mathop {\text {argmin}}\limits _{x} \psi ^k_0(x) + \langle \lambda ^{k+1}, \psi ^k(x)\rangle \). Then, for all \(x \in {\textrm{dom}}_{\chi _{0}}\), we have
where the equality follows from the complementary slackness of the KKT condition, and the inequality is due to the optimality of \(x^{k+1}\). Using \(x = \hat{x}\) in the above relation and combining with (3.24), we obtain
Finally, note that
where the first inequality follows by (3.24) for \(i = 0\) and \(\psi ^k_0(x) \ge \psi _0(x)\), and the second inequality follows by the definitions of \(\psi _0^*\) and \(D_{\mathcal {X}}\). Combining the above relation with (3.25), we get the result and conclude the proof. \(\square \)
The discussion above implies that the value of B is intricately related to the constraint qualification. While uniform MFCQ is unverifiable and does not allow for a priori bounds on B, it is widely used in nonlinear programming to ensure the existence of such a bound [2]. As observed in Fig. 1b and Fig. 2b, the actual value of B depends on the closeness to an MFCQ violation at the limit point. This situation is rare, but the current assumptions do not eliminate that possibility. Problems of this nature are ill-conditioned, and to our knowledge, no algorithm can ensure bounds on the dual in such a situation. The existing literature deals with this issue in two ways: one track assumes the existence of B (similar to Theorem 1) and performs the complexity or convergence analysis; a second track assumes a stronger constraint qualification that removes the ill-conditioned problems and shows a more explicit bound on the dual (similar to Lemma 2). We perform our analysis for both cases. To conclude, we henceforth assume that the bound B is a constant and do not delve into the discussion on related constraint qualifications. To substantiate that the bound B is small, we perform detailed numerical experiments in Sect. 8.
3.2 Convergence rate analysis of LCPG method
Our main goal now is to develop some nonasymptotic convergence rates for Algorithm 1.
Lemma 3
In Algorithm 1, for \(k=1,2,...,\) we have
Proof
Let \({\mathcal {L}}_{k}\) be the Lagrangian function of subproblem (3.1):
Using (2.1) and (3.27), we have
Using the smoothness of \(f_i(x)\), the optimality condition \(0\in \partial _x{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})\) and the triangle inequality, we obtain
Hence we conclude the proof. \(\square \)
In view of Lemma 3, we derive the complexity of LCPG to attain approximate KKT solutions in the following theorem.
Theorem 2
Let \(\alpha _k>0\) \((k=0,1,..,K)\) be a nondecreasing sequence and suppose that Assumption 3 holds, then there exists a constant \(B>0\) such that
where \(D=\sqrt{\tfrac{\psi _{0}(x^{0})-\psi _{0}^{*}}{L_{0}}}\). Moreover, if we choose the index \(\hat{k}\in \{0,1,...,K\}\) with probability \(\mathbb {P}(\hat{k}=k)=\alpha _k/(\sum _{k=0}^K\alpha _k)\), then \(x^{\hat{k}+1}\) is a randomized \(\epsilon _K\) type-I KKT point with
Proof
From the sufficient descent property (3.5), we have
where the second inequality uses the monotonicity of the sequence \(\psi _0(x^k)\). In view of Theorem 1 and the Cauchy-Schwarz inequality, we have \(\langle \lambda ^{k+1},L\rangle \le \Vert \lambda ^{k+1}\Vert _{}^{}\Vert L\Vert _{}^{} \le B\Vert L\Vert _{}^{}\). This relation and (3.26) imply
Combining the above inequality with (3.33) immediately yields (3.30).
Next, we bound the error of complementary slackness. We have
where the first inequality uses the triangle inequality, the second inequality uses complementary slackness and the Lipschitz smoothness of \(f_{i}(\cdot )\), and the last inequality follows from the Cauchy-Schwarz inequality and the boundedness of \(\lambda ^{k+1}\). Summing up (3.34) weighted by \(\alpha _k\) for \(k=0,...,K\), we have
Combining the above result with (3.33) gives (3.31). Finally, the fact that \(x^{\hat{k}+1}\) is a randomized \(\epsilon _K\) type-I KKT point for \(\epsilon _K\) defined in (3.32) follows immediately from (3.30), (3.31) and Definition 3. \(\square \)
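As an implementation note, the randomized output rule \(\mathbb {P}(\hat{k}=k)=\alpha _k/\sum _{k=0}^K\alpha _k\) of Theorem 2 amounts to weighted sampling over the iteration indices. A minimal sketch with the weights \(\alpha _k=k+1\) of the forthcoming Corollary 1 (all names ours):

```python
import random

K = 100
alpha = [k + 1 for k in range(K + 1)]   # alpha_k = k + 1
rng = random.Random(0)

def sample_khat():
    # Draw k_hat with P(k_hat = k) proportional to alpha_k.
    return rng.choices(range(K + 1), weights=alpha, k=1)[0]

khat = sample_khat()
assert 0 <= khat <= K
# Later iterates are proportionally more likely to be returned.
counts = [0] * (K + 1)
for _ in range(20000):
    counts[sample_khat()] += 1
assert counts[K] > counts[0]
```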
The following corollary shows that the output of Algorithm 1 is a randomized \({\mathcal {O}}(1/K)\) KKT point under more specific parameter selection.
Corollary 1
In Algorithm 1, suppose that all the assumptions of Theorem 2 hold. Set \(\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}\) and \(\alpha _k=k+1\). Then \(x^{\hat{k}+1}\) is a randomized \(\epsilon \) type-I KKT point with
Proof
Notice that \(\alpha _K=K+1\), \(\sum _{k=0}^K \alpha _k = \tfrac{(K+1)(K+2)}{2}\). Moreover, for \(i\in [m]\) and \(k\ge 0\), we have
which implies that \(\sum _{k=0}^{K}\alpha _k\Vert \eta -\eta ^{k}\Vert _{}^{}=(K+1) \Vert \eta -\eta ^0\Vert _{}^{}\). Plugging these values into (3.32) gives us the desired conclusion. \(\square \)
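The telescoping computation behind the last identity can be verified numerically. Assuming the level update \(\eta ^{k+1}=\eta ^k+\delta ^k\) (our reading of Algorithm 1), one gets \(\eta -\eta ^k=(\eta -\eta ^0)/(k+1)\), so each weighted term \(\alpha _k\Vert \eta -\eta ^k\Vert \) equals \(\Vert \eta -\eta ^0\Vert \):

```python
eta0, eta = 0.0, 1.0                              # scalar levels for illustration
K = 50
etak, total = eta0, 0.0
for k in range(K + 1):
    total += (k + 1) * abs(eta - etak)            # alpha_k = k + 1
    etak += (eta - eta0) / ((k + 1) * (k + 2))    # level update by delta^k
# sum_{k=0}^{K} alpha_k |eta - eta^k| = (K + 1) |eta - eta^0|
assert abs(total - (K + 1) * abs(eta - eta0)) < 1e-9
```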
Remark 2
Corollary 1 shows that the gradient complexity of LCPG for smooth composite constrained problems is on a par with that of gradient descent for unconstrained optimization problems. To the best of our knowledge, this is the first complexity result for a constrained problem where the constraint functions can be nonsmooth and nonconvex. Note that the convergence rate involves the unknown bound B on the Lagrangian multipliers. The presence of such a constant is not new in the nonlinear programming literature [1, 8, 9, 12, 13]. Fortunately, we can safely implement the LCPG method since the stepsize scheme does not rely on B. On the other hand, the bound B is often a problem-dependent quantity. For example, the authors of [4] show a class of problems for which an a priori bound B can be established, and [5] gives the exact value of B for a class of nonconvex relaxations of sparse optimization problems. In such cases, our comparisons are arguably fair. Hence, throughout the paper, we make comparative statements under the assumption that B largely depends on the problem.
4 Stochastic optimization
The goal of this section is to extend our proposed framework to stochastic constrained optimization where the objective \(f_0\) is an expectation function:
Here, \(F(x,\xi )\) is differentiable and \(\xi \) denotes a random variable following a certain distribution \(\varXi \). Directly evaluating either the objective \(f_0\) or its gradient can be computationally challenging due to the stochastic nature of the problem. To address this, we introduce the following additional assumption.
Assumption 5
The information of \(f_{0}\) is available via a stochastic firstorder oracle (SFO). Given any input x and a random sample \(\xi \), SFO outputs a stochastic gradient \(\nabla F(x,\xi )\) such that
for some \(\sigma \in (0,\infty )\).
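The role of the minibatch size in the forthcoming method can be illustrated with a toy model of Assumption 5: averaging \(b\) i.i.d. unbiased stochastic gradients shrinks the error variance from \(\sigma ^2\) to \(\sigma ^2/b\). The scalar "gradient" and all names below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, true_grad = 2.0, 1.0          # toy SFO: unbiased output, variance sigma^2

for b in (1, 16, 256):
    # Minibatch estimator: mean of b independent SFO outputs.
    errs = rng.normal(true_grad, sigma, size=(20000, b)).mean(axis=1) - true_grad
    # The empirical variance is close to sigma^2 / b.
    assert abs(errs.var() - sigma**2 / b) < 0.5 * sigma**2 / b
```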
4.1 Level constrained stochastic proximal gradient
In Algorithm 2, we present a stochastic variant of LCPG for solving problem (1.1) with \(f_0\) defined by (4.1). As observed in (4.2) and (4.3), the LCSPG method uses a minibatch of random samples to estimate the true gradient in each iteration. It should be noted that the value \(f_0(x^k)\) is presented in (4.3) only for ease of description; it is not required when solving (3.1).
Note that the proximal point method of [4] does not need to account for a stochastic nonconvex problem separately, since it solves the corresponding stochastic convex subproblems using the ConEx method developed in that work. In contrast, LCSPG directly applies to stochastic nonconvex function constrained problems, and its convex subproblems are deterministic in nature. Hence, we need to develop the asymptotic convergence and convergence rates of the LCSPG method separately.
Let \(\zeta ^{k}=G^{k}-\nabla f(x^{k})\) denote the error of gradient estimation. We have
The following proposition summarizes some important properties of the generated solutions of LCSPG.
Proposition 3
In Algorithm 2, for any \(\beta _{k}\in (0,2\gamma _{k}-L_{0})\), we have
Moreover, there exists a vector \(\lambda ^{k+1}\in \mathbb {R}^m_+\) such that the KKT condition (3.6) (with \(\psi _0^k\) defined in (4.3)) holds.
Proof
By the KKT condition, \(x^{k+1}\) is the minimizer of \({\mathcal {L}}_{k}(\cdot ,\lambda ^{k+1})\). Therefore, we have
Placing \(x=x^{k}\) in (4.5) and using (3.27), we have
where the second inequality is due to the complementary slackness \(\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ]=0\) and the strict feasibility \(\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k})-\eta _{i}^{k}\big ]=\lambda _{i}^{k+1}\big [\psi _{i}(x^{k})-\eta _{i}^{k}\big ]<0.\) Using (4.6) and the Lipschitz smoothness of \(f_0\), we have
Above, the last inequality uses the fact that \(-\tfrac{a}{2} x^2+bx\le \tfrac{b^2}{2a}\) for any \(x,b\in \mathbb {R}, a>0\). The existence of \(\lambda ^{k+1}\) satisfying the KKT condition follows from an argument similar to that of part 2 of Proposition 2. \(\square \)
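For completeness, the elementary bound \(-\tfrac{a}{2}x^2+bx\le \tfrac{b^2}{2a}\) invoked in the last step follows by completing the square:

$$\begin{aligned} -\tfrac{a}{2}x^{2}+bx = -\tfrac{a}{2}\Big (x-\tfrac{b}{a}\Big )^{2}+\tfrac{b^{2}}{2a} \le \tfrac{b^{2}}{2a}. \end{aligned}$$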
We prove a technical result in the following lemma which plays a crucial role in proving dual boundedness.
Lemma 4
Let \(\{X_k\}_{k \ge 1}\) be a sequence of random vectors such that \(\mathbb {E}[X_k] = {0}\) for all \(k \ge 1\) and \(\sum _{k=1}^\infty \sigma _k^2 \le M < \infty \) where \(\sigma _k:= \sqrt{\mathbb {E}[\Vert X_k\Vert _{}^{2}]}\). Then, \(\lim _{k\rightarrow \infty } X_k = 0\) almost surely (a.s.).
Proof
We prove this result by contradiction. If the result does not hold, then there exist \(\epsilon >0\) and \(c > 0\) such that
However, due to Chebyshev’s inequality, we have \(\mathbb {P}(\Vert X_k\Vert _{}^{} \ge \epsilon ) \le \tfrac{\sigma _k^2}{\epsilon ^2}\). Since \(\sigma _k^2\) is summable, there exists \(k_0\) such that \(\sum _{k =k_0}^\infty \mathbb {P}(\Vert X_k\Vert _{}^{} \ge \epsilon ) \le \sum _{k =k_0}^\infty \tfrac{\sigma _k^2}{\epsilon ^2} < c\). Therefore, we have
The above relation contradicts (4.8). Hence, we have \(\lim _{k\rightarrow \infty } X_k = 0\) a.s. \(\square \)
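Lemma 4 can be sanity-checked by simulation: with \(\sigma _k = 1/k\) the variances are summable, and the late elements of a sampled path are uniformly tiny (a toy example under these assumptions, names ours):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5000
sigmas = 1.0 / np.arange(1, K + 1)   # sum of sigma_k^2 is finite
X = rng.normal(scale=sigmas)         # independent zero-mean X_k
# The tail of the path is already uniformly small.
assert np.abs(X[-100:]).max() < 0.01
```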
In the following theorem, we present the main asymptotic property of LCSPG.
Theorem 3
Suppose that \(\sum _{k=0}^\infty b_k^{-1}<\infty \); then we have that \(\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^k\Vert _{}^{}=0\), a.s. Moreover, suppose that Assumption 3 holds, \(\sup _k\gamma _k<\infty \), \(\beta _k\) is lower bounded and \(2\gamma _k-\beta _k-L_0>0\); then we have that (1) \(\sup _{k} \Vert \lambda ^k\Vert _{}^{}<\infty \) a.s., and (2) all the limit points of Algorithm 2 satisfy the KKT condition, a.s.
Proof
First, we fix notations. Let \((\varOmega , {\mathcal {F}}, \mathbb {P})\) be the probability space defined over the sampled minibatches \(B_0,B_1,\ldots \). Let \(\mathbb {E}_k[\cdot ]\) be the expectation conditioned on the sub-\(\sigma \)-algebra generated by \(B_0,B_1,\ldots , B_{k-1}\). Applying it to (4.4) gives
In view of the supermartingale convergence theorem [30], we have that
Let \(C_{k+1}=\sum _{s=0}^k\tfrac{2\gamma _s-\beta _s-L_0}{2}\Vert x^{s+1}-x^s\Vert _{}^{2}\) for \(k\ge 0\) and \(C_0=0\). We have
Applying the supermartingale convergence theorem [30] again, we can show that the limit of \(\psi _0(x^k)+C_k\) exists a.s. Together with (4.9) and the lower boundedness of \(\beta _k\) and \(2\gamma _k-\beta _k-L_0\), we have that \(\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^k\Vert _{}^{2}=0\), a.s.
Next, we prove the boundedness of \(\Vert \lambda ^k\Vert _{}^{}\). Let us consider the events
We just argued \(\mathbb {P}({\mathcal {B}})=1\). It is easy to see that if both conditions (i) \(\mathbb {P}({\mathcal {A}})=1\) and (ii) \({\mathcal {U}}\subseteq {\mathcal {A}}^c\cup {\mathcal {B}}^c\) hold, then we have \(\mathbb {P}({\mathcal {U}})\le \mathbb {P}({\mathcal {A}}^c)+\mathbb {P}({\mathcal {B}}^c)=0\). Hence \(\{\lambda ^k\}\) is a bounded sequence a.s.
Since \(\{b_k^{-1}\}\) is summable, we have \(\sum _{k=1}^\infty \mathbb {E}[\Vert \zeta ^k\Vert _{}^{2}] \le \sum _{k=1}^\infty \tfrac{\sigma ^2}{b_k} < \infty \). Hence, using Lemma 4, we have that \(\lim _{k\rightarrow \infty } \zeta ^k = 0\) a.s. Due to the boundedness of \(\nabla f(x^k)\), we have that \(G^k=\zeta ^k+\nabla f(x^k)\) is bounded, a.s.
We prove condition (ii) by contradiction. Suppose that our claim fails. We take an element \(\omega \in {\mathcal {U}}\cap ({\mathcal {A}}\cap {\mathcal {B}})\) and then pass to a subsequence \(\{j_k\}\) such that \(\lim _{k\rightarrow \infty } \Vert \lambda ^{j_k}(\omega )\Vert _{}^{}=\infty \). In the rest of the proof, we skip \(\omega \) for brevity. Passing to another subsequence if necessary, let \(\bar{x}\) be a limit point of \(\{x^{j_k}\}\). By our presumption, \(\bar{x}\) satisfies MFCQ. Moreover, the KKT condition implies that
Dividing both sides by \(\Vert \lambda ^{j_k+1}\Vert _{}^{}\) gives
where we denote \({u}^{k}=\tfrac{\lambda ^{j_k+1}}{\Vert \lambda ^{j_k+1}\Vert }\). Since \(\{u^k\}\) is bounded, passing to a subsequence if needed, we have \(\lim _{k\rightarrow \infty }u^{k}=\bar{u}\). Since \(\omega \in {\mathcal {A}}\cap {\mathcal {B}}\), \(\{G^{j_k}\}\) is bounded and \(\{\tfrac{\gamma _{j_k}}{2}\Vert x^{j_k+1}x^{j_k}\Vert _{}^{2}\}\) converges to 0. Therefore, taking \(k\rightarrow \infty \) on both sides of (4.11), we have
Analogous to the proof of Theorem 1, it is easy to show that \(\bar{x}\) violates MFCQ, which, however, contradicts Assumption 3. As a consequence of this argument, we have \({\mathcal {U}}\subseteq {\mathcal {A}}^c \cup {\mathcal {B}}^c\). Hence, the event \(\sup _k \Vert \lambda ^k\Vert _{}^{}<\infty \) happens a.s., which completes the proof of the boundedness claim.
Next, we prove asymptotic convergence to KKT solutions. For any random element \(\omega \), let \(\bar{x}(\omega )\) be any limit point of \(\{x^k\}\). Passing to some subsequence if necessary, we assume that \(\lim _{k\rightarrow \infty }x^k=\bar{x}\) and \(\lim _{k\rightarrow \infty }\lambda ^{k+1}=\bar{\lambda }\).
Moreover, we have
Combining the above two results, we have
Taking \(k\rightarrow \infty \) in the above relation and noting that almost surely we have \(\lim _{k\rightarrow \infty }\zeta ^k = 0\) and \(\lim _{k\rightarrow \infty } \Vert x^k-x^{k+1}\Vert =0\), we obtain
Using an argument similar to the one in Theorem 1, we can show that \(\bar{x}\) is almost surely a KKT point. \(\square \)
Our next goal is to develop the iteration complexity of Algorithm 2. To achieve this goal, we need to assume that the dual is uniformly bounded, namely, condition (3.9) holds for all the random events. While this condition is stronger than the almost sure boundedness of \(\lambda ^{k+1}\) shown by Theorem 3, it is indeed satisfied in many scenarios, e.g., when strong feasibility (Assumption 4) holds or other scenarios described in [4, 5]. We present the main complexity result in the following theorem.
Theorem 4
Suppose that condition (3.9) holds. Then, the sequence \(\{(x^{k+1},\lambda ^{k+1})\}\) satisfies that
Proof
First, appealing to (4.3), (2.1) and (3.27), we have
It follows that
In view of the above result and basic inequality \((a+b)^{2}\le 2a^{2}+2b^{2}\), we have
Let us denote an auxiliary sequence \(C_k= {\left\{ \begin{array}{ll} \psi _0(x^0) &{} k=0\\ \psi _0(x^k) - \sum _{s=0}^{k-1}\tfrac{\Vert \zeta ^s\Vert ^2}{2\beta _s} &{} k>0 \end{array}\right. }.\) Proposition 3 implies that
Putting this relation and (4.15) together, we have
Summing up (4.17) over \(k=0, 1,\ldots ,K\) weighted by \(\alpha _k\) leads to
Moreover, since \(\{C_k\}\) is monotonically decreasing, we have
Combining the above two relations leads to our first result (4.13).
For the second part, note that (3.34) remains valid in the stochastic setting. Putting (3.34) and (4.16) together, we obtain
Multiplying both ends by \(\alpha _k\) and then summing up the resulting terms over \(k=0,\ldots , K\) gives (4.14). \(\square \)
We next obtain a more specific convergence rate by choosing the parameters properly.
Corollary 2
In Algorithm LCSPG, set \(\gamma _{k}=L_{0}\), \(\beta _{k}=L_{0}/2\), \(\alpha _{k}=k+1\), \(b_{k}=K+1\) and \(\delta ^k = \tfrac{\eta -\eta ^0}{(k+1)(k+2)}\). Then \(x^{\hat{k}+1}\) is a randomized \(\epsilon \) type-I KKT point with
Proof
Plugging in the value of \(\gamma _k\), \(\alpha _k\), \(\beta _k\) in the relation (4.13) and taking expectation over all the randomness, we have
Moreover, due to the random sampling of \(\hat{k}\), we have
Combining the above two results, we have
Second, plugging in the values of \(\gamma _k\), \(\beta _k\) and \(\delta ^k\) in (4.14), we have
It then follows from (4.18) and the definition of \(\hat{k}\) that
This completes our proof. \(\square \)
Remark 3
In order to obtain some \(\epsilon \)-error in satisfying the type-I KKT condition, LCSPG requires \({\mathcal {O}}(\varepsilon ^{-2})\) calls to the SFO, which matches the complexity bound of stochastic gradient descent for unconstrained nonconvex optimization [15]. Moreover, due to minibatching, LCSPG obtains an even better \({\mathcal {O}}(\varepsilon ^{-1})\) complexity in the number of evaluations of \(f_i(x)\) and \(\nabla f_i(x)\) (\(i\in [m]\)).
4.2 Level constrained stochastic variance reduced gradient descent
We consider the finite sum problem:
where each \(F(x,\xi _{i})\) is Lipschitz smooth with the parameter \(L_{0}\), \(i=1,2,\ldots , n\). To further improve the convergence performance in this setting, we present a new variant of the stochastic gradient method by extending the stochastic variance reduced gradient descent to the constrained setting.
We present the level constrained stochastic variance-reduced gradient descent (LCSVRG) in Algorithm 3, which extends the nonconvex variance reduced mirror descent (see [20]) to handle nonlinear constraints. Algorithm 3 can be viewed as a double-loop algorithm in which the outer loop computes the full gradient \(\nabla f(x^{k})\) once every T iterations and the nested loop performs stochastic proximal gradient updates based on an unbiased estimator of the true gradient. In this view, we let k indicate the tth iteration of the rth epoch, for some values t and r, and use k and (r, t) interchangeably throughout the rest of this section. We keep the notation \(\zeta ^{k}\) (or \(\zeta ^{(r,j)}\)) for \(G^{k}-\nabla f(x^{k})\) and note that \(\zeta ^{(r,0)}=0\).
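To make the estimator concrete, here is a sketch of one nested-loop update as we read Algorithm 3: a full gradient at the epoch start, then the recursive correction \(G^{k}=\tfrac{1}{b}\sum _{\xi \in B_k}[\nabla F(x^{k},\xi )-\nabla F(x^{k-1},\xi )]+G^{k-1}\). This recursion and the quadratic test function below are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 64, 5, 16
A = rng.normal(size=(n, d))            # f(x) = (1/n) sum_i 0.5 * (a_i^T x)^2

def grad_i(x, i):
    # Component gradient: grad of 0.5 * (a_i^T x)^2.
    return A[i] * (A[i] @ x)

def full_grad(x):
    return A.T @ (A @ x) / n

x_prev = rng.normal(size=d)
x = x_prev + 0.01 * rng.normal(size=d)  # one (toy) proximal-gradient step
G = full_grad(x_prev)                   # epoch start: full gradient
batch = rng.choice(n, size=b, replace=False)
G = G + sum(grad_i(x, i) - grad_i(x_prev, i) for i in batch) / b
# The estimation error is O(||x - x_prev||): the key fact behind Lemma 5.
L = np.linalg.norm(A, 2) ** 2           # a valid smoothness constant here
assert np.linalg.norm(G - full_grad(x)) <= 2 * L * np.linalg.norm(x - x_prev)
```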
Our next goal is to develop some iteration complexity results for LCSVRG. We skip the asymptotic analysis since it is similar to that of LCSPG. The following lemma (see [20, Lemma 6.10]) presents a key insight of Algorithm 3: the variance is controlled by the point distances. We provide a proof for completeness.
Lemma 5
In Algorithm 3, \(G^{k}\) is an unbiased estimator of \(\nabla f_0(x^{k})\). Moreover, let (r, t) correspond to k. If \(t>0\), then we have
Proof
We prove the first part by induction. When \(k=0\), we have \(G^{0}=\nabla f_0(x^{0})\). Then for \(k>0\), if \(k \bmod T = 0\), we have \(G^{k}=\nabla f(x^{k})\) by definition. Otherwise, we have
by the induction hypothesis \(\mathbb {E}_{k-1}\big [G^{k-1}\big ]=\nabla f(x^{k-1})\).
Next, we estimate the variance of the stochastic gradient. Appealing to (4.20), we have
where the third equality uses the independence of \(B_{k}\) and \(\zeta ^{k1}\), the first inequality uses the bound \({\text {Var}}(x)\le \mathbb {E}\Vert x\Vert ^{2}\), and the second inequality uses the Lipschitz smoothness of \(F(\cdot ,\xi )\). Taking expectation over all the randomness generating \(B_{(r,1)},B_{(r,2)},\ldots ,B_{(r,t)}\), we have
\(\square \)
The next lemma shows that the generated solutions satisfy a sufficient descent property in expectation.
Lemma 6
Assume that \(\gamma _{k}=\gamma \) and \(\beta _{k}=\beta \) and \(\tilde{L} :=\tfrac{2\gamma - \beta - L_{0}}{2}-\tfrac{L_{0}^{2}(T-1)}{2\beta b}>0.\) Then we have
Proof
In view of (3), at the jth iteration of the rth epoch, we have
Summing up the above result over \(j=0,1,2,...,t\) (\(t<T\)) and using Lemma 5, we have
Here we use the convention \({\textstyle {\sum }}_{j=0}^{-1}\cdot =0\). \(\square \)
We present the main convergence property of Algorithm 3 in the next theorem.
Theorem 5
Suppose that condition (3.9) and the assumptions of Lemma 6 hold, \(b\ge 2T\) and \(K=r_0T+j_0\) for some \(r_0, j_0 \ge 0\). Let \(\{\alpha _k\}\) be a nondecreasing sequence and \(\{\alpha _{(r,j)}\}\) be its equivalent form in (r, j) notation. Suppose that \(\alpha _{(r,j)}=\alpha _{(r,0)}\) for \(j=1,2,...,T-1\). Then we have
Moreover, if we take \(T=\lceil \sqrt{n}\rceil \), \(b=8T,\gamma =L_{0},\beta =L_{0}/2,\) and \(\alpha _k= T\,\lfloor k/T\rfloor +1\), and set \(\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}\), then \(x^{\hat{k}+1}\) is a randomized type-I \(\epsilon \)-KKT point with
Proof
First, using Lemma 5 and the assumption that \(b\ge 2T\), for any \(t\le T-1\) we have
Note that (4.15) still holds. Therefore, combining (4.15) and (4.25) leads to
It then follows from Lemma 6 that
Let \(K=r_0T+j_0\). Summing up the above inequality weighted by \(\alpha _k\) and exchanging the notation \(\alpha _k \leftrightarrow \alpha _{(r,j)}\), then we have
Above, the first inequality applies (4.26) and uses \(x^{(r,T)}=x^{(r+1,0)}\) while the second inequality uses the monotonicity of \(\{\psi _0(x^k)\}\) and an argument similar to (3.33).
The second part is similar to the argument of Theorem 4. Particularly, combining (3.34) and (4.21) gives
Consequently, using the above relation and an argument similar to show (3.33), we deduce
Therefore, we complete the proof of (4.22) and (4.23).
Using the provided parameter setting, we have \(\tilde{L}=\tfrac{2\gamma -L_{0}-\beta }{2}-\tfrac{L_{0}^{2}(T-1)}{2\beta b}\ge \tfrac{L_{0}}{4}-\tfrac{L_{0}}{8}=\tfrac{L_{0}}{8}\). Moreover, since \(\alpha _k=T\,\lfloor k/T\rfloor + 1\), we have \(\alpha _{k}\le T\cdot k / T+1\le k+1\). It is easy to check
\(\square \)
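The parameter check in the last step of the proof can be verified numerically over a range of epoch lengths T:

```python
L0 = 1.0
for T in (1, 2, 10, 100, 10**6):
    gamma, beta, b = L0, L0 / 2, 8 * T           # Theorem 5 parameter setting
    # Ltilde from Lemma 6 with these parameters
    Ltilde = (2 * gamma - beta - L0) / 2 - L0**2 * (T - 1) / (2 * beta * b)
    # L0/4 - L0*(T-1)/(8T) >= L0/4 - L0/8 = L0/8
    assert Ltilde >= L0 / 8 - 1e-12
```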
Remark 4
It is interesting to compare the performance of LCSVRG with the other two level constrained first-order methods in the finite sum setting (4.19). Similar to LCPG, LCSVRG runs for \({\mathcal {O}}(\varepsilon ^{-1})\) iterations to compute a type-I \(\epsilon \)-KKT point. Moreover, LCSVRG has the appealing feature that the number of stochastic gradients \(\nabla F(x,\xi )\) computed can be significantly reduced for a large value of n. Specifically, Algorithm 3 requires a full gradient \(\nabla f_0(x)\) every T iterations, which contributes \(N_{1}={\mathcal {O}}\big (n\big \lceil \tfrac{K}{T}\big \rceil \big )={\mathcal {O}}(\sqrt{n}K)\) stochastic gradient computations. During the other iterations, Algorithm 3 invokes a batch of size \(b={\mathcal {O}}(T)\) each time, exhibiting a complexity of \(N_{2}={\mathcal {O}}\big (bK\big )={\mathcal {O}}(\sqrt{n}K)\). Therefore, the total number of stochastic gradient computations is \(N=N_{1}+N_{2}={\mathcal {O}}\big (\sqrt{n}K\big ).\) This is better than the O(nK) stochastic gradients needed by LCPG. Moreover, it is better than the \(O(K^2)\) bound of LCSPG when K is of an order larger than \(\varOmega (\sqrt{n})\), which corresponds to the higher accuracy regime \(\epsilon \ll \tfrac{1}{\sqrt{n}}\). The complexities of all the proposed algorithms for obtaining \(\varepsilon \)-KKT solutions are listed in Table 2.
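The bookkeeping in Remark 4 is easy to reproduce. A sketch of the stochastic-gradient count of LCSVRG under \(T=\lceil \sqrt{n}\rceil \) and \(b=8T\) (the constant 8 follows the setting of Theorem 5; the comparison value \(nK\) is the per-iteration full-gradient cost of LCPG):

```python
import math

def lcsvrg_cost(n, K):
    T = math.ceil(math.sqrt(n))
    full = n * math.ceil(K / T)      # full gradients, once per epoch: N_1
    mini = 8 * T * K                 # minibatch of size b = 8T per iteration: N_2
    return full + mini

n, K = 10**4, 10**3
assert lcsvrg_cost(n, K) == 900_000  # = O(sqrt(n) K)
assert lcsvrg_cost(n, K) < n * K     # beats the O(nK) cost of LCPG
```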
Remark 5
While we mainly discuss the finite-sum objective (4.19), it is possible to extend the variance reduction technique to handle the expectation-based objective (4.1) and improve the \({\mathcal {O}}(\varepsilon ^{-2})\) bound of LCSPG to \({\mathcal {O}}(\varepsilon ^{-3/2})\). To achieve this goal, we impose the additional assumption that \(F(x,\xi )\) is \(L_0\)-Lipschitz smooth for each \(\xi \) in the support set. We choose to omit a detailed discussion of this particular extension, as the technical development can be readily derived from the arguments in Sect. 6.5.2 of [21] and our previous analysis.
5 Smooth optimization of nonsmooth constrained problems
In this section, we consider the constrained problem (1.1) with nonsmooth objective and nonsmooth constraint functions. We assume that \(f_i\) (\(i\in \{0,1,...,m\}\)) exhibits a difference-of-convex (DC) structure \(f_i(x):= g_i(x) - h_i(x)\): 1) \(h_i\) is an \(L_{h_i}\)-Lipschitz-smooth convex function and 2) \(g_i\) is a structured nonsmooth convex function:
where \(A_i \in \mathbb {R}^{a_i\times n}\) is a linear mapping, \({\mathcal {Y}}_i \subset \mathbb {R}^{a_i}\) is a convex compact set and \(p_i:{\mathcal {Y}}_i \rightarrow \mathbb {R}\) is a convex continuous function. In view of such a nonsmooth structure, we cannot simply apply the LCPG method, as the crucial quadratic upper bound on \(f_i(x)\) does not hold in the nonsmooth case. However, as pointed out by Nesterov [29], the nonsmooth convex function \(g_i\) can be closely approximated by a smooth convex function. Let us denote \(\widehat{y}_i :=\mathop {\text {argmin}}\limits _{y_i \in {\mathcal {Y}}_i} \Vert y_i\Vert _{}^{}\), \(D_{{\mathcal {Y}}_i} :=\max _{y_i \in {\mathcal {Y}}_i} \Vert y_i-\widehat{y}_i\Vert _{}^{}\) and define the approximation function
Given some properly chosen smoothing parameter \(\beta _i\), we propose to apply LCPG to solve the following smooth approximation problem:
Prior to the analysis of our algorithm, we need to develop some properties of the smooth function \(f_i^{\beta _i}\). We first present a key lemma that builds an important connection between the quadratic approximation of a smooth function and Lipschitz smoothness. The proof is deferred to Appendix A.
Lemma 7
Suppose \(p(\cdot )\) is continuously differentiable function satisfying
for all x, y. Then, \(p(\cdot )\) satisfies
It is shown in [29] that \(g_i^{\beta _i}\) is a Lipschitz smooth function and that it approximates the function value of \(g_i\) with some \({\mathcal {O}}(\beta _i)\) error:
Similar properties of \(f_i^{\beta _i}\) are developed in the following proposition.
Proposition 4
We have the following properties about the approximation function \(f_i^{\beta _i}\) \((\beta _i>0)\).

1.
Let \( \bar{\beta }_i \in [0, \beta _i]\), then we have
$$\begin{aligned} f_i^{\beta _i}(x) \le f_i^{\bar{\beta }_i}(x) \le f_i^{\beta _i}(x) + \tfrac{(\beta _i\bar{\beta }_i) D_{{\mathcal {Y}}_i}^2}{2}. \end{aligned}$$(5.6) 
2.
\(f_i^{\beta _i}(x)\) has upper curvature \(L^{\beta _i}_{g_i}\) and negative lower curvature \(L_{h_i}\), namely,
$$\begin{aligned} f_i^{\beta _i}(x)&\le f_i^{\beta _i}(y)+\langle \nabla f_i^{\beta _i}(y),x-y\rangle +\tfrac{L^{\beta _i}_{g_i}}{2}\Vert x-y\Vert ^{2}, \end{aligned}$$(5.7)$$\begin{aligned} f_i^{\beta _i}(x)&\ge f_i^{\beta _i}(y)+\langle \nabla f_i^{\beta _i}(y),x-y\rangle -\tfrac{L_{h_i}}{2}\Vert x-y\Vert ^{2}. \end{aligned}$$(5.8)
3.
\(f_i^{\beta _i}\) is Lipschitz smooth with modulus \(L_i^{\beta _i}:= \max \{ L_{g_i}^{\beta _i}, L_{h_i} \}\). Namely, for any x, y, we have
$$\begin{aligned} \Vert \nabla f_i^{\beta _i}(x) - \nabla f_i^{\beta _i}(y)\Vert _{}^{} \le L_i^{\beta _i} \Vert x-y\Vert _{}^{}. \end{aligned}$$(5.9)
Proof
Part 1. If \(\bar{\beta }_i<\beta _i\), then by definition we have \(f_i^{\bar{\beta }_i}(x) \ge f_i^{\beta _i}(x)\). On the other hand, using the boundedness of \({\mathcal {Y}}_i\), we have
Combining the above two results gives the desired inequality.
Part 2. Since \(g^{\beta _i}_{i}\) and \(h_i\) are both convex and smooth functions, we have
Summing up the above two inequalities and noting the definition of \(f^{\beta _i}_i, \nabla f^{\beta _i}_i\), we conclude that \(f^{\beta _i}_{i}\) has an upper curvature of \(L^{\beta _i}_{g_i}\). Similarly, using convexity of \(g^{\beta _i}_{i}\) and smoothness of \(h_i\), we obtain that \(f^{\beta _i}_{i}\) has a negative lower curvature \(L_{h_i}\).
Part 3. The Lipschitz continuity (5.9) is an immediate consequence of Part 2 and Lemma 7. \(\square \)
Remark 6
When \(\bar{\beta }_i=0\), relation (5.6) reads \(f_i^{\beta _i}(x) \le f_i(x) \le f_i^{\beta _i}(x) + \tfrac{\beta _i D_{{\mathcal {Y}}_i}^2}{2}.\) Together with Assumption 2, it can be seen that \(x^0\) is also strictly feasible for problem (5.1). This justifies that LCPG is well-defined for problem (5.1).
Remark 7
The Lipschitz constant of \(\nabla f^{\beta _i}_i\) can be derived in a different way. Since \(\nabla g^{\beta _i}_i\) and \(\nabla h_i\) are \(L_{g_i}^{\beta _i}\)- and \(L_{h_i}\)-Lipschitz continuous, respectively, we can show by the triangle inequality that \(\nabla f^{\beta _i}_{i}(x)\) is \((L_{g_i}^{\beta _i} + L_{h_i})\)-Lipschitz continuous. In contrast, by exploiting the asymmetry between the lower and upper curvature, Proposition 4 derives a slightly sharper bound on the gradient Lipschitz constant.
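To make the construction concrete, the following NumPy sketch (our illustration, not part of the paper) smooths \(g(x)=\Vert Ax\Vert _1=\max _{y\in [-1,1]^m}\langle y, Ax\rangle \), for which \(\widehat{y}=0\) and \(D_{{\mathcal {Y}}}^2=m\); the inner maximizer has a closed form, yielding the familiar Huber function. The assertions check the sandwich bound \(g^{\beta }\le g \le g^{\beta }+\beta D_{{\mathcal {Y}}}^2/2\) of (5.6) and the gradient Lipschitz modulus \(\Vert A\Vert ^2/\beta \).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 8
A = rng.standard_normal((m, d))
beta = 0.1                      # smoothing parameter

def g(x):
    # nonsmooth g(x) = ||Ax||_1 = max_{y in [-1,1]^m} <y, Ax>
    return np.abs(A @ x).sum()

def g_beta(x):
    # smoothed g^beta(x) = max_y <y, Ax> - (beta/2)||y||^2, solved coordinatewise (Huber)
    t = A @ x
    return np.where(np.abs(t) <= beta, t**2 / (2 * beta), np.abs(t) - beta / 2).sum()

def grad_g_beta(x):
    # gradient A^T y*(x) with maximizer y*(x) = clip(Ax/beta, -1, 1)
    return A.T @ np.clip(A @ x / beta, -1.0, 1.0)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
# sandwich bound (5.6) with bar_beta = 0; here D^2 = m
gap = g(x1) - g_beta(x1)
assert 0.0 <= gap <= beta * m / 2 + 1e-12
# gradient Lipschitz modulus ||A||^2 / beta
L = np.linalg.norm(A, 2) ** 2 / beta
assert np.linalg.norm(grad_g_beta(x1) - grad_g_beta(x2)) <= L * np.linalg.norm(x1 - x2) + 1e-9
```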
Throughout this section, we choose specific \(\beta _i\) to ensure that \(\beta _iD_{{\mathcal {Y}}_i}^2\) is constant for all \(i \in [m]\). Hence, we can define the additive approximation factor as
Note that (5.4) provides an approximation error for function values, i.e., the so-called zeroth-order oracle of the function \(g_i\). However, convergence results for nonconvex optimization are generally given in terms of a first-order stationarity measure, implying that we need an approximation of the first-order oracle for the function \(f_i\), and consequently for the function \(g_i\). Below we discuss a widely used approximate subdifferential for convex functions and generalize it to nonsmooth nonconvex functions.
Definition 5
[\(\nu \)-subdifferential] We say that a vector \(v \in \mathbb {R}^n\) is a \(\nu \)-subgradient of the convex function \(p(\cdot )\) at x if for any z, we have
The set of all \(\nu \)-subgradients of p at x is called the \(\nu \)-subdifferential, denoted by \(\partial ^{\nu }p(x)\). Moreover, we define the \(\nu \)-subdifferential of the nonconvex function \(f_i\) as \(\partial ^{\nu } f_i(x):= \partial ^{\nu } g_i(x) + \{\nabla h_i(x)\}\), where the addition of sets is in the Minkowski sense.
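As a concrete illustrative instance (our example, not the paper's), take \(g(x)=|x|=\max _{y\in [-1,1]} xy\): the gradient of the Huber smoothing \(g^{\beta }\) is a \(\nu \)-subgradient of g with \(\nu = \beta D_{{\mathcal {Y}}}^2/2 = \beta /2\), which the sketch below verifies on a grid.

```python
import numpy as np

beta = 0.05                 # smoothing parameter; here D^2 = 1, so nu = beta/2
nu = beta / 2

def g(x):
    return abs(x)           # g(x) = |x| = max_{y in [-1,1]} x*y

def grad_g_beta(x):
    # gradient of the Huber smoothing g^beta(x) = max_y x*y - (beta/2)*y^2
    return float(np.clip(x / beta, -1.0, 1.0))

x0 = 0.03                   # a point inside the smoothed region |x| <= beta
v = grad_g_beta(x0)
# nu-subgradient inequality: g(z) >= g(x0) + v*(z - x0) - nu for all z
ok = all(g(z) >= g(x0) + v * (z - x0) - nu - 1e-12 for z in np.linspace(-2, 2, 401))
assert ok
```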
Finally, we define a generalization of the type-I KKT convergence criterion for structured nonsmooth nonconvex function constrained optimization problems:
Definition 6
We say that a point x is an \((\epsilon ,\nu )\) type-III KKT point of (1.1) if there exists \(\lambda \ge 0\) satisfying the following conditions:
Moreover, we say that x is a randomized \((\epsilon ,\nu )\) type-III KKT point of (1.1) if (5.11), (5.12) and (5.13) are satisfied in expectation.
The \(\nu \)-subdifferential and the type-III KKT point are essential for relating the smooth approximation to the original nonsmooth problem. We establish some important properties in the following proposition.
Proposition 5
Let \(\beta _i\) and \(\nu \) satisfy (5.10).

1.
For any \(x\in \mathbb {R}^d\), we have \(\nabla f_i^{\beta _i}(x) \in \partial ^{\nu }f_i(x)\), \(i=0,1,\dots ,m\).

2.
Suppose that x is a (randomized) type-I \(\epsilon \)-KKT point of problem (5.1) and \(\lambda \) is the associated dual variable with bound \(\Vert \lambda \Vert \le B\); then x is a (randomized) type-III \((\bar{\epsilon }, \nu )\)-KKT point of problem (1.1) for \(\bar{\epsilon }=\max \{\epsilon +B\nu , m\nu \}\).
Proof
Part 1. It suffices to show \(\nabla g_i^{\beta _i}(x) \in \partial ^{\nu }g_i(x)\). Due to the convexity of \(g_i^{\beta _i}\) and (5.4), we have
where the first inequality follows from the first relation in (5.4), and the third inequality follows from the second relation in (5.4). Noting the definition of the \(\nu \)-subgradient, we conclude the proof.
Part 2. It suffices to show the conversion of randomized type-I KKT points to randomized type-III KKT points. Suppose that x is a randomized type-I \(\epsilon \)-KKT point and that \(\Vert \lambda \Vert _1\le B\). Using Part 1, it is easy to see that \(\partial {\mathcal {L}}(x, \lambda ) \subseteq \partial ^\nu \psi _0(x)+{\textstyle {\sum }}_{i=1}^m \partial ^\nu \psi _i(x)\); therefore, we have
Using Proposition 4, we have
This implies
Summing up this inequality over \(i=1,2,\dots ,m\) and taking expectation with respect to all the randomness, we have
Moreover, we have
\(\square \)
Now, we are ready to discuss the convergence rate of LCPG for nonsmooth nonconvex function constrained optimization.
Theorem 6
Assume that \(\beta _i, \nu \) satisfy (5.10) and set \(\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}\) when running LCPG to solve problem (5.1). Denote \(c_i = \Vert A_i\Vert _{}^{2}D_{{\mathcal {Y}}_i}^2\) \((0\le i \le m)\), \(c=[c_1,c_2,\dots , c_m]^T\) and let \(\Vert \lambda ^k\Vert _1\le B\). Suppose that \(\nu =o(\tfrac{c_i}{L_{h_i}})\) for \(i=0,1,\dots ,m\); then \(x^{\hat{k}+1}\) is a randomized type-III \((\bar{\epsilon },\nu )\)-KKT point with
Proof
Our analysis resembles the proof of Theorem 2. Using an argument similar to (3.33), we have
Combining this result with Lemma 3 we obtain
where \(\varDelta =\psi _0(x^0)-\psi _0(x^*)\), \(\alpha _k\ge 0\) and \({\mathcal {L}}^\beta (x, \lambda ) :=\psi _{0}^{\beta _0}(x) + {\textstyle {\sum }}_{i=1}^m\lambda _{i}(\psi ^{\beta _i}_i(x)- \eta _i)\). Taking \(\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}\) and \(\alpha _k=k+1\) in (5.15), we see that \(x^{\hat{k}+1}\) is a type-I \(\epsilon \)-KKT point for
Noting that \(L_{g_i}^{\beta _i} = \tfrac{\Vert A_i\Vert _{}^{2}}{\beta _i} = \tfrac{\Vert A_i\Vert _{}^{2}D_{{\mathcal {Y}}_i}^2}{2\nu }=\tfrac{c_i}{2\nu }\) and \(\nu =o(\tfrac{c_i}{L_{h_i}})\), we have \(\tfrac{(L_{0}^{\beta _0}+B\Vert L^{\beta }\Vert _{}^{})^{2}}{L_0^{\beta _0}}={\mathcal {O}}\big (\tfrac{(c_0+B\Vert c\Vert _{}^{})^2}{c_0\nu }\big )\) and \(\tfrac{\Vert L^{\beta }\Vert _{}^{}}{L_0^{\beta _0}}=\tfrac{\Vert c\Vert _{}^{}}{c_0}\). Using the definition of \(\hat{k}\) and Proposition 5 we obtain the desired result. \(\square \)
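The level schedule \(\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}\) used here (and elsewhere in the paper) telescopes: since \(\tfrac{1}{(k+1)(k+2)}=\tfrac{1}{k+1}-\tfrac{1}{k+2}\), the partial sums equal \(1-\tfrac{1}{K+2}\), so the levels \(\eta ^k\) increase toward \(\eta \) while always remaining strictly below it. A quick scalar sketch (our illustration):

```python
# Scalar illustration of the level schedule delta^k = (eta - eta^0) / ((k+1)(k+2)).
eta0, eta = 0.0, 1.0
eta_k = eta0
for k in range(10_000):
    delta_k = (eta - eta0) / ((k + 1) * (k + 2))
    eta_k += delta_k
    assert eta_k < eta          # the strict feasibility margin is preserved at every step
# partial sums are 1 - 1/(K+2), so eta^k -> eta from below
assert abs((eta - eta_k) - 1.0 / 10_001) < 1e-9
```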
6 Inexact LCPG
LCPG requires the exact optimal solution of subproblem (3.1), which poses a great challenge when the subproblem is difficult to solve. To alleviate this issue, we consider an inexact variant of the LCPG method for which the update of \(x^{k+1}\) only solves problem (3.1) approximately. This section is organized as follows. First, we present a general convergence property of inexact LCPG when the subproblem solutions satisfy a certain approximation criterion. Next, we analyze the efficiency of inexact LCPG when the subproblems are handled by different external solvers. When the subproblem is a quadratically constrained quadratic program (QCQP), we propose an efficient interior point algorithm that exploits the diagonal structure. When the subproblem has general proximal components, we propose to solve it by first-order methods. In particular, we consider solving the subproblem by the constraint extrapolation (ConEx) method and develop the total iteration complexity of ConEx-based LCPG.
6.1 Convergence analysis under an inexactness criterion
Throughout the rest of this section, we will denote the exact primal-dual solution of (3.1) by \((\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})\). We use the following criterion for measuring the accuracy of subproblem solutions.
Definition 7
We say that a point x is an \(\epsilon \)-solution of (3.1) if
The following theorem shows asymptotic convergence to stationarity for inexact LCPG method under mild assumptions. Since the proof is similar to the previous argument, we present the details in Appendix B for the sake of completeness. Note that the theorem applies to a general nonconvex problem and hence applies to convex problems as well.
Theorem 7
Suppose that Assumption 3 holds and let \(x^{k+1}\) be an \(\epsilon _k\)-solution of (3.1) satisfying \(\epsilon _k < \min _{i \in [m]} \delta _{i}^k\). Then all the conclusions of Theorem 1 still hold. In particular, the dual sequence \(\{\tilde{\lambda }^{k}\}\) is uniformly bounded by a constant \(B>0\). Moreover, every limit point of inexact LCPG is a KKT point.
Under the inexactness condition in Definition 7, we establish the complexity of inexact LCPG in the following theorem.
Theorem 8
Under the assumptions of Theorem 7, we have
where \(\tilde{\varDelta }={\textstyle {\sum }}_{k=0}^K\alpha _{k}[\psi _0(x^k)-\psi _0(x^{k+1})+{\varepsilon }_k]\). Moreover, if we choose the index \(\hat{k} \in \{0,1, \dots , K\}\) with probability \(\mathbb {P}(\hat{k} = k) = \alpha _k/({\textstyle {\sum }}_{i=0}^K\alpha _{i}) \), then \(x^{\hat{k}}\) is a randomized \((\epsilon ,\delta )\) type-II KKT point with
In particular, using \(\alpha _k = k+1\), \(\epsilon _k = \min _{i \in [m]} \tfrac{\delta ^k_i}{2}\) and \(\delta _{i}^k = \tfrac{\eta _i - \eta ^k_i}{(k+1)(k+2)}\), we have that \(x^{\hat{k}}\) is an \((\epsilon , \delta )\) type-II KKT point of (1.1) where
Proof
Using (3.8) with \(x^{k+1}\) replaced by \(\widetilde{x}^{k+1}\) (the optimal solution of problem (3.1)) and adding \(f(x^k) + \tfrac{L_0}{2}\Vert x^k-\widetilde{x}^{k+1}\Vert _{}^{2}\) on both sides, we have
where the second inequality follows from \(\psi _{0}^k(x^k) = \psi _{0}(x^k)\) as well as \(x^{k+1}\) being an \(\epsilon _k\)solution (see Definition 7) of subproblem (3.1), and the third inequality follows from the fact that \(\psi _{0}^k(x) \ge \psi _{0}(x)\) for all \(x \in {\textrm{dom}}{\chi _{0}}\).
Using Lemma 3 (again \(x^{k+1}\) is replaced by \(\widetilde{x}^{k+1}\)), noting that \(\epsilon _{k}\) satisfies the requirements of Theorem 7 implying that \(\Vert \widetilde{\lambda }^k\Vert _{}^{} \le B\) and using (6.6), we have
Similar to the argument of (3.34), we have
Multiplying (6.6), (6.7) and (6.8) by \(\alpha _{k}\) and summing over \(k=0, 1, \ldots , K\) give (6.1), (6.2) and (6.3).
We derive a convergence rate based on the specified parameters. First, from relation (6.6), we note that \(\psi _{0}(x^{k+1}) \le \psi _{0}(x^k) + \epsilon _k\). Hence, we have by induction that
By setting \(\alpha _k = (k+1)\) and \(\epsilon _{k} = \min _{i \in [m]} \tfrac{\delta _{i}^k}{2} = \min _{i \in [m]} \tfrac{\eta _i-\eta ^k_i}{2(k+1)(k+2)}\) for all \(k \ge 0\) (note that \(\epsilon _k\) satisfies the requirement of Theorem 7), we have
Here, (i) and (ii) follow from (6.9) and \(\alpha _{k+1}-\alpha _k = 1 (> 0)\), (iii) follows since \((K-i)\) is 0 at \(i = K\), (iv) follows by observing \(\epsilon _k \le \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2(k+1)(k+2)}\), and the last inequality follows since \({\textstyle {\sum }}_{k =0}^\infty \tfrac{1}{(k+1)(k+2)} = 1\) and \(\alpha _K = K+1\).
Applying the same arguments as those in Corollary 1, we have \({\textstyle {\sum }}_{k =0}^K\alpha _k\Vert \eta -\eta ^k\Vert _{}^{} = \alpha _K\Vert \eta -\eta ^0\Vert _{}^{}\). Using this relation along with \({\textstyle {\sum }}_{k =0}^K\alpha _k = \tfrac{(K+1)(K+2)}{2}\) and (6.10) inside (6.4), we have (6.5). Hence, we conclude the proof. \(\square \)
Remark 8
Compared to the convergence result (3.31) for exact LCPG, we have to control the accumulated error in \(\tilde{\varDelta }\) for the inexact case (6.4). Moreover, we need an even more stringent condition on the error to ensure asymptotic convergence. Specifically, we require \({\varepsilon }_k\) to be smaller than the level increments \(\delta _{i}^k\) to ensure that each subsequent subproblem is strictly feasible. As long as the subproblems are solved deterministically with sufficient accuracy, we can ensure such feasibility as well as the boundedness of the dual.
Remark 9
Note that the convergence analysis of the inexact method for the stochastic case will go through in a similar fashion. In particular, the subproblems of LCSPG are still deterministic in nature. Hence, a deterministic error can be easily incorporated into the analysis of the stochastic outer loop. In particular, Proposition 3 will have an additional \(\epsilon _k\) in the RHS. We can use \(\epsilon _k = \min _{i \in [m]} \tfrac{\delta ^k_i}{2}\) to ensure the strict feasibility. Following the analysis in Theorem 4, we will get the additional term \({\textstyle {\sum }}_{k=0}^K \alpha _k\epsilon _k \). Note that we have identical policies for \(\alpha _k\) in the above analysis and Corollary 2. Furthermore, since \(\delta _k\) used above and in Corollary 2 are the same, we have identical values of \(\epsilon _k\) as well. Following the above development, we can easily bound the additional \({\textstyle {\sum }}_{k=0}^K \alpha _k\epsilon _k \) term.
6.2 Solving the subproblem with the interior point method
Our goal is to develop an efficient interior point algorithm to solve problem (3.1) when \(\chi _i(x)=0\), \(i\in [m]\). Without loss of generality, we express the subproblem as the following QCQP:
We assume that the initial solution \(\hat{x}\) of this problem is strictly feasible, namely, there exists \(\delta >0 \) such that
Let \(e_1=[1,0,\ldots ,0]^\textrm{T}\in \mathbb {R}^{d+1}\). With a slight abuse of notation, we can formulate (6.11) as the following problem
Here we set artificial variables \(L_{m+1}=1, a_{m+1}=0\) and \(b_{m+1}=\tfrac{1}{2}R^2\) for some sufficiently large R. We explicitly add such a ball constraint to ensure a bound on \((\eta , x)\). Note that the bound R always exists since our domain is compact and the objective is Lipschitz continuous. Our goal is to apply the path-following method to solve (6.13). We denote
Since each \(\tilde{g}_i(u)\) is convex quadratic in u, \(\phi (u)\) is a self-concordant barrier with \(\upsilon =m+2\). The key idea of the path-following algorithm is to approximately solve a sequence of penalized problems
with increasing values of \(\tau \), and generate a sequence of strictly feasible solutions \(u_\tau \) close to the central path, a trajectory composed of the minimizers \(u_\tau ^*=\mathop {\text {argmin}}\limits _u \phi _\tau (u)\).
We apply a standard path-following algorithm (see [28, Chapter 4]) for solving (6.13), and outline the overall procedure in Algorithm 4. This algorithm consists of two main steps:

1.
Initialization: We seek a solution \(u^0\) near the analytic center (i.e., the minimizer of \(\phi (u)\)). To this end, we solve a sequence of auxiliary problems \(\hat{\phi }_\tau (u)=\tau w^\textrm{T}u + \phi (u)\) where \(w=-\nabla \phi (\hat{u})\). It can readily be seen that \(\hat{u}\) is on the central path of this auxiliary problem with \(\tau =1\). Performing a reverse path-following scheme (decreasing rather than increasing \(\tau \)), we gradually converge to the analytic center.

2.
Path-following: We solve a sequence of penalized problems with an increasing value of \(\tau \) by a damped version of Newton’s method, which keeps the solutions in the proximity of the central path.
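The two phases above can be illustrated on a toy one-dimensional problem (our construction, not Algorithm 4 itself): minimize u subject to \(u^2\le 1\), with barrier \(\phi (u)=-\ln (1-u^2)\) and penalized function \(\phi _\tau (u)=\tau u+\phi (u)\). Damped Newton steps with step size \(1/(1+\lambda )\), where \(\lambda \) is the Newton decrement, keep the iterates strictly inside the domain while \(\tau \) is increased.

```python
import math

# Toy path-following: minimize u subject to u^2 <= 1 (optimum u* = -1),
# with barrier phi(u) = -ln(1 - u^2) and penalized phi_tau(u) = tau*u + phi(u).
def damped_newton(u, tau):
    grad = tau + 2 * u / (1 - u * u)              # phi_tau'(u)
    hess = (2 + 2 * u * u) / (1 - u * u) ** 2     # phi_tau''(u) > 0 on (-1, 1)
    lam = abs(grad) / math.sqrt(hess)             # Newton decrement
    return u - grad / (hess * (1 + lam))          # damped step keeps u in (-1, 1)

u, tau = 0.0, 1.0
for _ in range(60):
    tau *= 1.25                                   # increase the penalty parameter
    for _ in range(5):                            # a few damped Newton steps per tau
        u = damped_newton(u, tau)

# the central-path minimizer is (1 - sqrt(1 + tau^2))/tau, roughly -1 + 1/tau
assert -1.0 < u < -0.999
```

The update factor 1.25 and the five inner steps are illustrative choices; a short-step scheme with the \({\mathcal {O}}(\sqrt{\upsilon })\) guarantee would increase \(\tau \) more conservatively.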
6.2.1 Solving the Newton equation
First, we calculate the gradient and Hessian map of \(\phi _\tau (\cdot )\):
where \(\theta _i = -\tilde{g}_i(u)^{-1}\), and
Note that computing the gradient \(\nabla \phi _\tau (u)\) takes \({\mathcal {O}}(dm)\) operations; hence the main computational burden comes from forming and solving the Newton system. This is divided into two cases.

1.
\(m < d\). Then the Hessian is the sum of a low-rank matrix and a diagonal matrix. Based on the Sherman-Morrison-Woodbury formula, we have
$$\begin{aligned}{}[\nabla ^2\phi _\tau (u)]^{-1} = \varGamma ^{-1}-\varGamma ^{-1}N\big (I+N^\textrm{T}\varGamma ^{-1} N\big )^{-1}N^\textrm{T}\varGamma ^{-1}. \end{aligned}$$(6.16)Computing the product \(N^\textrm{T}\varGamma ^{-1} N\) takes \({\mathcal {O}}(m^2d)\) while performing the Cholesky factorization takes \({\mathcal {O}}({m^3})\). Therefore, the overall complexity of each Newton step is \({\mathcal {O}}(m^3+m^2d)={\mathcal {O}}(m^2d)\).

2.
\(m \ge d\). In this case, we can directly compute \(NN^\textrm{T}\) in \({\mathcal {O}}(md^{2})\) and then perform the Cholesky factorization \(\nabla ^2\phi _\tau (u)=LL^\textrm{T}\) in \({\mathcal {O}}(d^{3})\), followed by solving two triangular systems. Hence the overall complexity of a Newton step is \({\mathcal {O}}(d^{3}+md^2)={\mathcal {O}}(md^2)\).
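The \(m<d\) case can be sketched with NumPy (our illustration, on a synthetic diagonal-plus-low-rank matrix): only an \(m\times m\) system is ever factorized, in line with the \({\mathcal {O}}(m^2d)\) count.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 200, 10                         # m << d: Hessian-like matrix = diagonal + rank-m part
gamma = rng.uniform(1.0, 2.0, d)       # positive diagonal of Gamma
N = rng.standard_normal((d, m))
H = np.diag(gamma) + N @ N.T
rhs = rng.standard_normal(d)

# Sherman-Morrison-Woodbury:
#   H^{-1} = Gamma^{-1} - Gamma^{-1} N (I + N^T Gamma^{-1} N)^{-1} N^T Gamma^{-1}
gi = 1.0 / gamma                       # O(d)
GiN = gi[:, None] * N                  # O(dm)
small = np.eye(m) + N.T @ GiN          # O(m^2 d); only this m x m matrix is factorized
x = gi * rhs - GiN @ np.linalg.solve(small, GiN.T @ rhs)

assert np.allclose(H @ x, rhs, atol=1e-8)
```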
Due to the above discussion, the cost of computing each Newton system is
6.2.2 Complexity
Before deriving the complexity of solving the subproblems, we require some additional assumptions. We assume that \(M=\max _x\big \{ \Vert \nabla g_i(x)\Vert \big \}\) and \(\max _x g_0({x})-\min _x g_0(x)\le V\). Note that these assumptions are easily satisfied if we assume the functions in the original problem have bounded level sets.
According to [28, Theorem 4.5.1], the complexity of interior point methods depends not only on the time to follow the central path, but also on the time to arrive near the analytic center from an arbitrary initial point. Let us put it in the context of Algorithm 1. Despite the strict feasibility guarantee, we do not know whether \(x^k\) is near the analytic center of each subproblem. It remains to show how to control the complexity of approximating the analytic center.
To measure the strict feasibility of the initial point, we use the Minkowski function of the domain, which is defined by \(\pi _x(y)=\inf \{t>0: \, x+t^{-1}(y-x)\in D\}\) for any given x in the interior of the domain. With the help of the Minkowski function, we bound the distance between the initial point and the boundary in the following proposition.
Proposition 6
Let \(\hat{u}=(\hat{\eta }, \hat{x})\) where \(\hat{\eta }=g_0(\hat{x})+\delta \). If \(\Vert u-\hat{u}\Vert \le \tfrac{\delta }{M+1}\), then u is feasible for problem (6.13). Moreover, we have
where \(u^*\) is defined in phase zero of Algorithm 4.
Proof
We have
Analogously, for \(i=0,1,\ldots ,m\), we have
Using the triangle inequality, we have \(\tilde{g}_i(u)=g_i(x)\le g_i(\hat{x})+\delta \le 0\). The last constraint in (6.13) is trivially satisfied for sufficiently large R. Therefore, u is a feasible point of (6.13).
Let \(t^+=\tfrac{(M+1)\Vert \hat{u}-u^* \Vert }{(M+1)\Vert \hat{u}-u^*\Vert +\delta }\); then from the above analysis, we know that the point
must be a feasible solution. Using the last constraint \(\Vert u\Vert \le R\), we immediately obtain the bound (6.18). \(\square \)
Using [28, Theorem 4.5.1] and Proposition 6, we can derive the total complexity of solving the diagonal QCQP.
Theorem 9
Under the assumptions of Proposition 6, the total number of Newton steps to get an \(\varepsilon \)-solution is
Corollary 3
In the inexact LCPG method, assume that the subproblems are solved by Algorithm 4 and the returned solution satisfies the inexactness requirement in Theorem 8. Then, to get an \({\mathcal {O}}(\epsilon , \epsilon )\) type-II KKT point, the overall arithmetic cost of Algorithm 4 is
Proof
According to Theorem 8, the total number of LCPG iterations is \(K={\mathcal {O}}(1/\varepsilon )\). In the kth iteration of LCPG, we set the error criteria \(\nu ={\mathcal {O}}(\tfrac{1}{k^2})\) and \(\varepsilon ={\mathcal {O}}(\tfrac{1}{k^2})\). Theorem 9 implies that the number of Newton steps is \(N_k={\mathcal {O}}(\sqrt{m}\ln (k))\). Therefore, the total number of Newton steps in LCPG is \(T_K = {\textstyle {\sum }}_{k=0}^K N_k = {\mathcal {O}}\big (\sqrt{m}\tfrac{1}{\varepsilon }\ln \big (\tfrac{1}{\varepsilon }\big )\big ). \) Combining this result with (6.17) gives us the desired bound. \(\square \)
Remark 10
First, at the kth step of LCPG, we need \({\mathcal {O}}(\ln (k))\) iterations of the interior point method, whose complexity order is contributed equally by the two phases of the IPM. Specifically, we first require \({\mathcal {O}}(\ln (k))\) Newton steps to pull the iterates from near the boundary to the proximity of the central path, and then require \({\mathcal {O}}(\ln (k))\) more to obtain an \({\mathcal {O}}(1/k^2)\)-accurate solution. Second, it is interesting to consider the case when the number of constraints is far smaller than the feature dimension, namely, \(m \ll d \). We observe that the total computation
is linear in the dimension. Third, despite its simplicity, the basic barrier method offers a stronger approximate solution than what is needed in Theorem 8; in particular, the feasibility of the solution path allows us to weaken the assumption to \(\hat{\varepsilon }_k=0\). Nevertheless, besides our approach, it is possible to employ long-step and infeasible primal-dual interior point methods, which may give better empirical performance.
6.3 Solving subproblems with the firstorder method
In this section, we use the previously proposed ConEx method [4] to solve the subproblem (3.1) when general proximal functions \(\chi _{i}\) are present. Then, we analyze the overall complexity of the LCPG method with the ConEx method as a subproblem solver. First, we formally state the extended version of problem (6.11) as follows:
For the application of ConEx to the subproblem, we need access to a convex compact set X such that \(\cap _i{\textrm{dom}}{\chi _{i}} \subseteq X\). Moreover, X is a “simple” set in the sense that it allows easy computation of the proximal operator of \(\chi _{0}(x) + {\textstyle {\sum }}_{i=1}^m w_i\chi _{i}(x)\) for any given weights \(w_i, i =1, \dots , m\). Such assumptions are not very restrictive since many machine learning and engineering problems explicitly seek the optimal solution from a bounded set. Under these assumptions, we apply ConEx to solve the subproblem (3.1) of LCPG. We now reproduce a simplified optimality guarantee of the ConEx method below without going into the details of the algorithm.
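As an example of such a “simple” configuration (our illustration; the paper only assumes the prox is easy to compute), take \(X=[-r,r]^d\), \(\chi _0(x)=w\Vert x\Vert _1\) and \(\chi _i=0\): the prox reduces to coordinatewise soft-thresholding followed by clipping, since each one-dimensional piece is convex and minimized over an interval.

```python
import numpy as np

def prox_l1_box(z, w, r):
    """argmin_x  w*||x||_1 + 0.5*||x - z||^2  subject to  x in [-r, r]^d.
    Soft-threshold, then project: valid because each coordinate problem is a
    one-dimensional convex function whose constrained minimizer is the clamp
    of its unconstrained minimizer."""
    soft = np.sign(z) * np.maximum(np.abs(z) - w, 0.0)   # soft-thresholding
    return np.clip(soft, -r, r)                          # projection onto the box

z = np.array([2.5, -0.3, 0.8, -4.0])
x = prox_l1_box(z, w=0.5, r=2.0)
assert np.allclose(x, [2.0, 0.0, 0.3, -2.0])
```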
Theorem 10
[4] Let x be the output of ConEx after T iterations for problem (6.19). Assume that \(\phi _0\) is a strongly convex function and \((\widetilde{x}, \widetilde{\lambda })\) is the optimal primal-dual solution. Moreover, let B be a parameter of the ConEx method which satisfies \(B> \Vert \widetilde{\lambda }\Vert _{}^{}\). Then, the solution x satisfies
Even though ConEx can be applied to a wider variety of convex function constrained problems, it has two vital and intricate issues that need to be addressed in our context:

1.
The solution path of ConEx can be arbitrarily infeasible in the early iterations, while successive iterations make the solutions’ infeasibility smaller. Note that the approximation criterion in Definition 7 requires guarantees on the amount of infeasibility. This implies that ConEx has to run a significant number of iterations before getting sufficiently close to the feasible set.

2.
Since ConEx is a primal-dual method, its convergence guarantees depend on the optimal dual solution \(\lambda ^*\). Moreover, a bound on the dual, \(B (> \Vert \lambda ^*\Vert _{}^{})\), is required to implement the algorithm to achieve an accelerated convergence rate of \(O(1/T^2)\) for strongly convex problems.
From Theorem 10, it is clear that ConEx requires a bound B. This requirement naturally leads to two cases: (1) the bound B can be estimated a priori, e.g., see Lemma 2; and (2) the bound B is known to exist but cannot be estimated, e.g., see Theorem 1. The two cases have different convergence rates for the subproblem, which leads to different overall computational complexity.
Case 1: B can be estimated a priori. In this case, we do not need to estimate \(B^k\) as in (6.20). Using the bound B, we can get the accelerated convergence of ConEx in accordance with Theorem 10, which leads to better performance of the LCPG method. The corollary below formally states the total computational complexity of the LCPG method for this case.
Corollary 4
If an explicit value of B is known, the LCPG method with ConEx as the subproblem solver obtains an \(O(\tfrac{1}{K},\tfrac{1}{K})\) type-II KKT point in \(O(K^2)\) computations.
Proof
According to Theorem 10, the required ConEx iterations for each subproblem can be bounded by
Since B is a constant, we have \(T^k = O(\epsilon _k^{-1/2}) = O(k)\). Finally, we have total computations \({\textstyle {\sum }}_{k=1}^K T^k = O(K^2)\). Hence, we conclude the proof. \(\square \)
Case 2: B is known to exist but cannot be estimated. For the subproblem (3.1), we can easily find \(B^k > \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{}\) by using the difference in levels of successive iterations. This bound is weak, especially in the limiting case, as it does not take into account
Proposition 7
For subproblem (3.1), we have
Proof
By Slater’s condition, we know that \(\widetilde{\lambda }^{k+1}\) exists. Then, due to the saddle-point property of \((\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})\), we have for all \(x \in X\)
where the equality follows by complementary slackness. Using \(x = x^k\) in the above relation and noting that \(x^k\) satisfies \(\psi (x^k) \le \psi ^{k-1}(x^k) \le \eta ^k\), we have \(\eta ^{k+1} - \psi ^{k}(x^k) = \eta ^{k+1} - \psi (x^k) \ge \eta ^{k+1} - \eta ^k = \delta ^k\), implying that
where the second inequality follows from \(\widetilde{\lambda }^{k+1} \ge 0\) and \(\delta ^k > 0\), and the last inequality follows from the fact that \(\Vert \widetilde{\lambda }^{k+1}\Vert _{1}^{} \ge \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{}\). We can further upper bound the LHS of the above relation as follows
where the last inequality follows since \(\widetilde{x}^{k+1}\) is feasible for the original problem (1.1). Combining the above two relations, we obtain (6.20). Hence, we conclude the proof. \(\square \)
We now state the final computational complexity of LCPG with ConEx, which uses the bound in (6.20).
Corollary 5
If an explicit value of B is not known, the LCPG method with ConEx as the subproblem solver obtains an \(O(\tfrac{1}{K},\tfrac{1}{K})\) type-II KKT point in \(O(K^4)\) computations.
Proof
Using Proposition 7, we can set \(B^k:= \tfrac{\psi _{0}(x^k)-\psi _{0}^*}{\delta _{i_*}^k}\) where \(i_*:= \mathop {\text {argmin}}\limits _{i \in [m]} \eta _i - \eta ^0_i\). Then, the required number of ConEx iterations \(T^k\) can be bounded by
Finally, in view of (6.9) and the fact that \({\textstyle {\sum }}_{i=0}^\infty \epsilon _i \le \Vert \eta - \eta ^0\Vert _{}^{}\), we have \(B^k \le \tfrac{1}{\delta _{i_*}^k}[\psi _{0}(x^0) - \psi _{0}^*+ \Vert \eta -\eta ^0\Vert _{}^{}]\) for all k. Moreover, for all \(k \le K\), we have \(\epsilon _k = \tfrac{\delta ^k_{i_*}}{2}\). Hence, we get \(T^k = O({\epsilon _k^{-3/2}}) = O(k^3)\). Finally, we have \(\sum _{k= 1}^{K} T^k = O(K^4)\), which is the overall computational complexity of the LCPG method with ConEx as the subproblem solver to obtain an \((O(\tfrac{1}{K}), O(\tfrac{1}{K}))\) type-II KKT point. \(\square \)
Remark 11
[Gradient complexity vs. computational complexity] Note that evaluating the gradient of \(\psi _{i}^k(x)\) is relatively simple since it does not involve any new computation of \(\nabla f_i(x)\). In that sense, the entire inner loop requires only one \(\nabla f_i\) computation; hence the total gradient complexity of \(\nabla f_i\) equals the total number of outer loops of inexact LCPG. On the other hand, the inner loop computation does contribute to the problem’s computational complexity. However, such iterations are expected to be very cheap given the ease of obtaining gradients for the QP subproblem (3.1) with identity Hessian matrices.
7 LCPG for convex optimization
In this section, we establish the complexity of LCPG (i.e., Algorithm 1) when the objective \(f_0\) and constraint \(f_i, i \in [m]\) are convex. In particular, we consider two convex problems, depending on whether \(f_0\) is convex or strongly convex. To provide a combined analysis of the two cases, we assume the following:
Assumption 6
\(f_0(x)\) is a \(\mu _0\)-convex function for some \(\mu _0\ge 0\). Namely,
Note that if \(\mu _0= 0\) then \(f_0\) is simply a convex function. Now we provide the convergence rate of LCPG to optimality.
For more generality, we consider an inexact variant of LCPG for which an approximate solution in the sense of Definition 7 is returned in each iteration. Let \((\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})\) be the saddle-point solution of \({\mathcal {L}}_k(x, \lambda )\), i.e., \(\widetilde{x}^{k+1}\) is an exact solution of the subproblem (3.1). First, we extend the three-point inequality in Lemma 1 to an inexact solution.
Lemma 8
Let \(z^+\) be an \(\epsilon \)-approximate solution of the problem \(\min _{x \in \mathbb {R}^d}\{g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2}\}\), where g(x) is a proper, lsc and convex function. Then,
Proof
First, let \(x^+\) be the optimal solution of \(\min _{x \in \mathbb {R}^d}\{g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2}\}\). In view of Lemma 1, for any x, we have
Setting \(x=z^+\) above, we have
On the other hand, by the definition of an \(\epsilon \)-solution, we have
Combining the above two inequalities gives
Summing up (7.2) and (7.4) again and then rearranging the terms, we get
where the last inequality uses the fact that \(-\tfrac{1}{2}\Vert a+b\Vert ^2\le -\tfrac{1}{2}\Vert a\Vert ^2-\langle a, b\rangle \le -\tfrac{1}{2}\Vert a\Vert ^2+\Vert a\Vert \Vert b\Vert \) with \(a=x-z^+\) and \(b=z^+-x^+\). Finally, combining the above two results gives the desired inequality (7.1). \(\square \)
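The elementary inequality used in this last step, \(-\tfrac{1}{2}\Vert a+b\Vert ^2\le -\tfrac{1}{2}\Vert a\Vert ^2-\langle a,b\rangle \le -\tfrac{1}{2}\Vert a\Vert ^2+\Vert a\Vert \Vert b\Vert \) (expand the square, drop \(-\tfrac{1}{2}\Vert b\Vert ^2\), then apply Cauchy-Schwarz), can be sanity-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
ok = True
for _ in range(1000):
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    lhs = -0.5 * np.dot(a + b, a + b)              # -||a+b||^2/2
    mid = -0.5 * np.dot(a, a) - np.dot(a, b)       # drop the -||b||^2/2 term
    rhs = -0.5 * np.dot(a, a) + np.linalg.norm(a) * np.linalg.norm(b)
    ok &= (lhs <= mid + 1e-12) and (mid <= rhs + 1e-12)
assert ok
```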
Using the above lemma, we provide the main convergence property of LCPG for convex optimization.
Lemma 9
Let x be a feasible solution. Then, we have
Proof
Note that
where (i) follows from the definition of \(\psi _{0}^k\), (ii) follows since \(x^{k+1}\) is an \(\epsilon _k\)-solution of (3.1), (iii) follows by complementary slackness for the optimal primal-dual solution of (3.1), and (iv) follows from Lemma 8. In particular, we use \(g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2} = \psi _0^k(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) - \eta ^k\rangle \) with \(z = x^k\), \(z^+ = x^{k+1}\), \(\epsilon = \epsilon _k\) and \(\gamma = L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle \). Note that \(x^{k+1}\) is an \(\epsilon _k\)-approximate solution of \(\min _{x \in \mathbb {R}^d} \psi _{0}^k(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) -\eta ^k\rangle \) due to Definition 7.
Finally, note that
Using the above two relations in (7.7), we obtain (7.6). Hence, we conclude the proof. \(\square \)
Let \(x^*\) be an optimal solution of (1.1) and \(\widetilde{D}:= \max \{\Vert x-y\Vert _{}^{}:x, y \in {\textrm{dom}}{\chi _{0}}, \psi _i(x) \le \eta _i, \psi _{i}(y) \le \eta _{i}, \text { for all } i\in [m] \}\). Now, we show the convergence rate guarantees.
Theorem 11
Consider general convex optimization problems with \(\mu _0 = 0\). Suppose Assumption 3 is satisfied and set \(\delta ^k = \tfrac{\eta -\eta ^0}{(k+1)(k+2)}\). Then we have
Proof
From Lemma 9 with \(\mu _0 = 0\) for convex part and \(\psi (x^*) \le \eta \), we have
Dividing both sides by \(\tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2}\), we have
Note that the sequence \(\{\widetilde{\lambda }^{k+1}\}\) is uniformly bounded above such that \(\Vert \widetilde{\lambda }^{k+1}\Vert _{}^{} \le B\) for all \(k \ge 0\). Using this fact and the above relation, we have
Using \(\delta ^k = \tfrac{\eta -\eta ^0}{(k+1)(k+2)}\) and \(\epsilon _k = \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2(k+1)(k+2)}\), we have that \(x^{k}\) is a strictly feasible solution of (3.1) for all k. Hence, under Assumption 3, we can follow the steps of Theorem 1 to show a uniform bound B on the sequence \(\{\Vert \widetilde{\lambda }^k\Vert _{}^{}\}\). Using these values in (7.9), we have
Due to the optimality of the exact solution \(\widetilde{x}^{k+1}\), we have \(\psi ^k_{0}(\widetilde{x}^{k+1}) \le \psi ^k_{0}(x^k) = \psi _{0}(x^k)\). We also have \(\psi _0(x^{k+1}) \le \psi _0^k(x^{k+1}) \le \psi _{0}^k(\widetilde{x}^{k+1}) + \epsilon _k\). Combining these two relations, we get:
Effectively, the inexact LCPG method is almost a descent method, up to an additive error of \(\epsilon _k\). Using this relation recursively, we have
Using the above relation in (7.10), we have
Multiplying the above relation by \(k+1\) and summing from \(k = 0\) to \(K-1\), we have
After rearranging, this relation implies (7.8). Hence, we conclude the proof. \(\square \)
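To see why the schedules in Theorem 11 keep the levels valid and the inexactness controlled, note that both sequences telescope. We write this under the assumption (consistent with the identity \(\eta - \eta^k = \rho^k(\eta - \eta^0)\) stated in the proof of Theorem 12) that the levels are updated as \(\eta^{k+1} = \eta^k + \delta^k\):

\[
\sum_{k=0}^{K-1} \frac{1}{(k+1)(k+2)} \;=\; \sum_{k=0}^{K-1}\Big(\frac{1}{k+1}-\frac{1}{k+2}\Big) \;=\; 1 - \frac{1}{K+1} \;\le\; 1,
\]

so \(\eta^K = \eta^0 + \sum_{k=0}^{K-1}\delta^k = \eta - \tfrac{\eta-\eta^0}{K+1} < \eta\) and \(\sum_{k=0}^{K-1}\epsilon _k \le \tfrac{\Vert \eta -\eta ^0\Vert }{2}\). In words, the levels increase monotonically toward \(\eta\) without ever reaching it, while the cumulative inexactness in the almost-descent recursion stays bounded by a constant.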
Theorem 12
Consider strongly convex problems (\(\mu _0>0\)) and suppose that Assumption 3 is satisfied. Set \(\delta ^k = \rho ^k(1-\rho )(\eta -\eta ^0)\), where \(\rho =\tfrac{L_0 - \mu _0}{2(L_0-a\mu _0)}\), \(2\epsilon _k \le a (1-\rho )\rho ^k\Vert \eta -\eta ^0\Vert \), and \(a\in (0,1)\). Then we have
Moreover, if \(\epsilon _k=0\), we have
Proof
Proceeding similar to the convex case, using Lemma 9, we obtain
For \(0< a < 1\), we have
Combining the above two results, we have
Let us denote
Multiplying both sides of the above inequality by \(\tfrac{\varGamma _{k}}{L_0+\langle \widetilde{\lambda }^{k+1},L\rangle -\mu _0}\) and noting that \(\eta -\eta ^k = \rho ^k (\eta -\eta ^0)\) (which follows from the choice of \(\delta ^k\)), we obtain
Since \(\Vert \widetilde{\lambda }^k\Vert \le B\), we have \(\big (\tfrac{L_0 + B\Vert L\Vert -a\mu _0}{L_0 + B\Vert L\Vert - \mu _0}\big )^k \le \varGamma _k \le \big (\tfrac{L_0-a\mu _0}{L_0 - \mu _0}\big )^k\). Moreover, we have \(2\epsilon _k \le a (1-\rho )\rho ^k\Vert \eta -\eta ^0\Vert \) and \(\rho =\tfrac{L_0 - \mu _0}{2(L_0-a\mu _0)}\). Using these relations in (7.13), we have
Similar to the convex part, we also have
Using the above relation into (7.14), we have
Summing the above relation from \(k = 0\) to \(K-1\), we have
Note that
Combining the above two relations we obtain the desired result (7.11). \(\square \)
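For completeness, the geometric identity \(\eta - \eta^k = \rho^k(\eta - \eta^0)\) invoked in this proof follows by induction, assuming the level update \(\eta^{k+1} = \eta^k + \delta^k\) with \(\delta^k = \rho^k(1-\rho)(\eta-\eta^0)\):

\[
\eta - \eta^{k+1} \;=\; (\eta - \eta^k) - \delta^k \;=\; \rho^k(\eta-\eta^0) - \rho^k(1-\rho)(\eta-\eta^0) \;=\; \rho^{k+1}(\eta-\eta^0),
\]

with the base case \(k=0\) holding trivially. Moreover, \(\rho < \tfrac12\): since \(a<1\) we have \(L_0-\mu_0 < L_0-a\mu_0\), so the levels converge linearly to \(\eta\).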
8 Numerical study
In this section, we conduct preliminary experiments to examine our theoretical results and the performance of the LCPG method. The experiments are run on a CentOS machine with an Intel Xeon (2.60 GHz) CPU and 128 GB of memory.
8.1 A simulated study on the QCQP
In the first experiment, we compare LCPG with established open-source solvers, namely CVXPY [11] and DCCP [32]. We consider the penalized Quadratically Constrained Quadratic Program (QCQP) described as follows,
where each \(Q_i\) (\(0\le i \le m\)) is an \(n\times n\) matrix, \(b_0,b_1,\ldots , b_m\) are n-dimensional real vectors, and \(\alpha \) is a positive weight on the \(\ell _1\)-norm penalty, which promotes a sparse solution. In the first setting, we consider a convex constrained problem where each \(Q_i\) is a positive semidefinite matrix. We set \(Q_i=VDV^\textrm{T}\), where V is an \(n\times n\) random sparse matrix with density 0.01 whose nonzero entries are uniformly distributed in [0, 1], and D is a diagonal matrix whose diagonal elements are uniformly distributed in [0, 100]. We set \(b_i=10e+v\), where e is the vector of all ones and \(v\sim {\mathcal {N}}(0, I_{n\times n})\) is sampled from the standard Gaussian distribution. We set \(c_i=-10\) so that \(x=0\) is a strictly feasible initial solution. Furthermore, we add a ball constraint to ensure that the domain is a compact set. We set \(r=\sqrt{20}\) and \(\alpha =1\). We fix \(m=10\) and explore dimensions n from the set \(\{500, 1000, 2000, 3000, 4000\}\).
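The instance-generation recipe above can be sketched in a few lines of NumPy. This is our own reconstruction (function and variable names are ours, not the authors'); it follows the stated distributions for \(V\), \(D\), and \(b_i\), and positive semidefiniteness of each \(Q_i = VDV^\textrm{T}\) holds by construction since \(D \succeq 0\).

```python
import numpy as np

def make_qcqp_instance(n, m, seed=0):
    """Generate one convex instance of problem (8.1) following the text's recipe.

    Returns m+1 PSD matrices (index 0 is the objective, 1..m the constraints)
    and the corresponding linear terms b_i = 10*e + v with v ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    Qs, bs = [], []
    for _ in range(m + 1):
        mask = rng.random((n, n)) < 0.01            # sparse pattern, density 0.01
        V = np.where(mask, rng.uniform(0.0, 1.0, (n, n)), 0.0)
        D = np.diag(rng.uniform(0.0, 100.0, n))     # diagonal entries uniform in [0, 100]
        Qs.append(V @ D @ V.T)                      # PSD since D is PSD
        bs.append(10.0 * np.ones(n) + rng.standard_normal(n))
    return Qs, bs
```

A quick check such as `np.linalg.eigvalsh(Qs[0]).min() >= 0` (up to round-off) confirms convexity of each quadratic piece.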
We solve Problem (8.1) by both CVXPY and LCPG, each starting from the initial solution \(x=0\). For CVXPY, we use MOSEK as the internal solver due to its superior performance in quadratic optimization. In LCPG, for simplicity, we also solve the diagonal quadratic subproblem by MOSEK through CVXPY. Note that calling the external API repeatedly for each LCPG subproblem only adds overhead to LCPG. Nonetheless, as we shall see, standard IPM solvers can still fully leverage the diagonal structure and exhibit fast convergence.
In Table 3, we present the experiment results of the compared algorithms. The final objective, the norm of the dual solution (DNorm), and for LCPG, the maximum dual norm in the solution path (Max DNorm) are reported. All values represent the average of 5 independent runs. From the results, we observe that while LCPG does not outperform CVXPY for the smallsize problem (\(n=500\)), LCPG becomes increasingly favorable as the problem dimension increases. This justifies the empirical advantage of our proposed approach as we do not need to construct a full Hessian matrix. Moreover, interestingly, we observe that the dual solution norm \(\{\Vert \lambda ^k\Vert \}\) is increasing, reaching the maximum at the last iteration. This accounts for the equal values of DNorm and Max DNorm. Meanwhile, in all the cases, the dual remains bounded and the reported dual norm closely aligns with the solution returned by CVXPY. This result confirms our intuition that the dual bound is intricately tied to the nature of problems.
In the second setting of this experiment, we examine the performance of LCPG on nonconvex constrained optimization. Specifically, we express \(Q_i\) as the difference of two matrices: \({Q}_i=P_i-S_i\), where \(P_i\) is generated in the same manner as \(Q_i\) in the first setting, and \(S_i = 10 I_{n\times n}\). Given this construction, it is natural to view the quadratic function \( \tfrac{1}{2}x^TQ_i x + b_i^T x +c_i\) as a difference of two convex quadratic functions: \((\tfrac{1}{2}x^TP_i x + b_i^T x +c_i)-\tfrac{1}{2}x^T S_i x\). Leveraging this decomposition, we apply DC programming, specifically the DCCP framework, to solve (8.1). Each convex subproblem of DCCP is solved by MOSEK through the CVXPY interface. In Table 4, we report the performance of LCPG and the DCCP algorithm. It can be observed that LCPG compares favorably against the DCCP solver. Furthermore, the dual remains bounded for both algorithms, which is consistent with our intuition.
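The DC step underlying this comparison replaces the concave part \(-\tfrac12 x^T S x\) with its linearization at the current point \(x^k\), which yields a convex majorant of the quadratic. The sketch below (our own illustration of this standard majorization, not the solvers' internal code; the exact surrogate DCCP builds may differ) verifies the majorization property numerically: since \(\tfrac12 x^T S x\) is convex, its first-order lower bound at \(x^k\) makes the surrogate an upper bound, tight at \(x^k\).

```python
import numpy as np

def dc_quadratic(P, S, x):
    """Original (possibly nonconvex) quadratic q(x) = 0.5 x^T (P - S) x."""
    return 0.5 * x @ (P - S) @ x

def dc_quadratic_surrogate(P, S, x, xk):
    """Convex majorant of q at the point xk, obtained by linearizing
    the concave part -0.5 x^T S x:
        -0.5 x^T S x  <=  -xk^T S x + 0.5 xk^T S xk   (S PSD)."""
    return 0.5 * x @ P @ x - xk @ S @ x + 0.5 * xk @ S @ xk
```

Minimizing the surrogate over a convex set, updating \(x^k\), and repeating is exactly the convexify-and-solve loop that DCCP (and, with its own level/proximal modifications, LCPG) performs.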
8.2 Study of gradient complexities
In the next experiment, our primary goal is to examine the main theoretical results, namely, the gradient complexities of LCPG and its stochastic variants LCSPG and LCSVRG. We apply these algorithms to a sparsity-induced finite-sum problem, wherein a nonconvex constraint is incorporated into the supervised learning framework to actively enforce a sparse solution. The optimization problem is as follows
where \(f_i(x)\) is a smooth loss function associated with the i-th sample, and \(\psi _1(x)\) is the difference between the \(\ell _1\)-penalty and a convex smooth function \(g(x)\). Employing a difference-of-convex constraint yields a tighter relaxation of the cardinality constraint \(\Vert x\Vert _0\le \kappa \) than the \(\ell _1\) relaxation. The appealing properties of difference-of-convex penalties have been demonstrated in various studies [5, 14, 16, 17, 36, 37].
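Since the definition (3.22) of \(h_{\beta,\theta}\) is not reproduced in this excerpt, the sketch below is an assumption: it implements the standard convex smooth component of the SCAD decomposition \(\textrm{SCAD}(t) = \beta|t| - h_{\beta,\theta}(t)\) (cf. Fan and Li [14]), so that the constraint reads \(\Vert x\Vert _1 - g(x) \le \eta _1\) with \(g(x) = \sum_j h_{\beta,\theta}(x_j)\). All names here are ours.

```python
import numpy as np

def scad_convex_part(t, beta=2.0, theta=5.0):
    """Convex, continuously differentiable component h_{beta,theta}:
       0                                for |t| <= beta,
       (|t|-beta)^2 / (2(theta-1))      for beta < |t| <= theta*beta,
       beta*|t| - beta^2*(theta+1)/2    for |t| > theta*beta.
    The pieces match in value and slope at both breakpoints."""
    a = np.abs(t)
    return np.where(
        a <= beta, 0.0,
        np.where(a <= theta * beta,
                 (a - beta) ** 2 / (2.0 * (theta - 1.0)),
                 beta * a - beta ** 2 * (theta + 1.0) / 2.0))

def g(x, beta=2.0, theta=5.0):
    """Separable convex smooth part: g(x) = sum_j h_{beta,theta}(x_j)."""
    return float(np.sum(scad_convex_part(x, beta, theta)))
```

Note that \(h_{\beta,\theta}\) vanishes on \([-\beta,\beta]\), so for small entries the constraint behaves exactly like the \(\ell_1\) penalty, while large entries are discounted toward the cardinality count.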
In view of the concave structure of \(-g(x)\), there is a strong asymmetry between its lower and upper curvature; namely, the following
holds for certain \(L_g>0\). Note that this is much stronger than the \(L_g\)-smoothness condition, which adds an extra \(\tfrac{L_g}{2}\Vert y-x\Vert ^2\) on the right-hand side of (8.3). Due to this feature, one can impose a tighter piecewise-linear surrogate constraint
in the LCPG subproblem. Our analysis readily accommodates this scenario, since it is smoothness, rather than concavity/convexity, that plays the central role in our convergence analysis, and it remains valid here. An empirical advantage of this approach is that the resulting subproblem is tractable and solvable in nearly linear time. See [5] for further discussion.
Our experiment considers binary classification with the logistic loss \(f_i(x)=\log (1+\exp (-b_i a_i^Tx))\), where \(a_i\in \mathbb {R}^d\), \(b_i\in \{-1,1\}\), \(1\le i \le n\). We use the SCAD penalty \(g(x)=\sum _{j=1}^d h_{\beta ,\theta }(x_j) \), where \(h_{\beta ,\theta }(\cdot )\) is defined in (3.22). We use the real-sim dataset from the LibSVM repository [10] and the covtype data from the UCI repository [19]. For the latter, we formulate a binary classification task by distinguishing class “3” from the other classes. We set \(\beta =2\), \(\theta =5\), and \(\eta _1=\sigma d\), with \(\sigma \in \{0.4, 0.6\}\) for covtype and \(\sigma \in \{0.1, 0.2\}\) for real-sim. For each algorithm, we use its theoretically suggested batch size and stepsize. For a fair comparison, we count n evaluations of the stochastic gradient as one effective pass over the dataset and plot the objective value against the number of effective passes. Figure 3 plots the convergence of the compared algorithms. LCSPG performs better than LCPG in terms of the number of gradient samples, and LCSVRG achieves the best performance among the three. These empirical findings further confirm our theoretical complexity analysis.
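For readers unfamiliar with the variance-reduced variant, the gradient estimator presumably used by LCSVRG follows the standard SVRG template (the paper's exact recursion is not shown in this excerpt, so treat this as an illustrative sketch with names of our own choosing): a minibatch gradient at the current point, corrected by the same minibatch gradient at a snapshot point plus the full gradient at the snapshot.

```python
import numpy as np

def logistic_grad(x, A, b, idx=None):
    """Gradient of the average logistic loss (1/|idx|) sum_i log(1+exp(-b_i a_i^T x))."""
    if idx is None:
        idx = np.arange(A.shape[0])
    Ai, bi = A[idx], b[idx]
    s = -bi * (Ai @ x)                    # margins s_i = -b_i a_i^T x
    w = -bi / (1.0 + np.exp(-s))          # chain rule: sigma(s_i) * (-b_i)
    return Ai.T @ w / len(idx)

def lcsvrg_estimator(x, x_snap, g_snap, A, b, batch):
    """SVRG-style estimator: unbiased for the full gradient at x, with variance
    that shrinks as x approaches the snapshot x_snap."""
    return logistic_grad(x, A, b, batch) - logistic_grad(x_snap, A, b, batch) + g_snap
```

Two sanity properties are easy to verify: at \(x = \tilde{x}\) (snapshot) the estimator equals the full gradient for any minibatch, and averaging over all singleton batches recovers the exact full gradient at any \(x\).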
9 Conclusion
In this work, we presented a new LCPG method for nonconvex function constrained optimization which achieves gradient complexity of the same order as that of unconstrained nonconvex problems. The key ingredient in our algorithm design is the use of constraint levels to ensure subproblem feasibility, which allows us to overcome a well-known difficulty in bounding the Lagrange multipliers in the presence of nonsmooth constraints. Moreover, a merit of our convergence analysis is its striking similarity to that of gradient descent methods for unconstrained problems. Therefore, we can easily extend our method to minimizing stochastic, finite-sum, and structured nonsmooth functions with nonconvex function constraints; many of these complexity results were not known before. Another important feature of our work is that the method can handle scenarios where the subproblems are not exactly solvable. To the best of our knowledge, existing work on sequential convex optimization (SQP, MBA) assumes the subproblems to be solved exactly. We provided a detailed complexity analysis of LCPG when the subproblems are solved inexactly by a customized interior point method or a first-order method. Finally, we clearly distinguished the notion of gradient complexity from that of computational complexity. In terms of gradient complexity, all of our proposed methods are state-of-the-art and easy to implement. Whether the computational complexity can be further improved for composite cases remains an open problem that we leave as a future direction.
Notes
This assumption is used to ensure the existence of a limit point of a certain sequence in the analysis. Other conditions can also ensure the existence of a limit point, as we discuss in Remark 1.
References
Auslender, A., Shefi, R., Teboulle, M.: A moving balls approximation method for a class of smooth constrained minimization problems. SIAM J. Optim. 20(6), 3232–3259 (2010)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Nashua (1999)
Bolte, J., Pauwels, E.: Majorization-minimization procedures and convergence of SQP methods for semi-algebraic and tame programs. Math. Oper. Res. 41(2), 442–465 (2016)
Boob, D., Deng, Q., Lan, G.: Stochastic first-order methods for convex and nonconvex functional constrained optimization. Math. Program., pp. 1–65 (2022)
Boob, D., Deng, Q., Lan, G., Wang, Y.: A feasible level proximal point method for nonconvex sparse constrained optimization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 16773–16784. Curran Associates, Inc. (2020)
Burke, J.: A sequential quadratic programming method for potentially infeasible mathematical programs. J. Math. Anal. Appl. 139(2), 319–351 (1989)
Burke, J.V., Han, S.P.: A robust sequential quadratic programming method. Math. Program. 43(1), 277–303 (1989)
Cartis, C., Gould, N.I., Toint, P.L.: On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming. SIAM J. Optim. 21(4), 1721–1739 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of finding firstorder critical points in constrained nonlinear optimization. Math. Program. 144(1), 93–106 (2014)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
Diamond, S., Boyd, S.: CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17(1), 2909–2913 (2016)
Facchinei, F., Kungurtsev, V., Lampariello, L., Scutari, G.: Ghost penalties in nonconvex constrained optimization: diminishing stepsizes and iteration complexity. Math. Oper. Res. 46(2), 595–627 (2021)
Facchinei, F., Lampariello, L., Scutari, G.: Feasible methods for nonconvex nonsmooth problems with applications in green communications. Math. Program. 164(1), 55–90 (2017)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Gong, P., Zhang, C., Lu, Z., Huang, J., Ye, J.: A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In: International Conference on Machine Learning, pp. 37–45. PMLR (2013)
Gotoh, J.Y., Takeda, A., Tono, K.: DC formulations and algorithms for sparse optimization problems. Math. Program. 169, 141–176 (2018)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
Kelly, M., Longjohn, R., Nottingham, K.: The UCI machine learning repository (2019)
Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, New York (2020)
Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer Nature, New York (2020)
Lan, G., Monteiro, R.D.C.: Iteration-complexity of first-order penalty methods for convex programming. Math. Program. 138, 115–139 (2013)
Lan, G., Monteiro, R.D.C.: Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Math. Program. 155(1–2), 511–547 (2016)
Li, Z., Xu, Y.: Augmented Lagrangian-based first-order methods for convex-constrained programs with weakly convex objective. INFORMS J. Optim. 3(4), 373–397 (2021)
Lin, Q., Ma, R., Xu, Y.: Inexact proximal-point penalty methods for constrained nonconvex optimization. arXiv:1908.11518 (2019)
Ma, R., Lin, Q., Yang, T.: Quadratically regularized subgradient methods for weakly convex optimization with weakly convex constraints. In: International Conference on Machine Learning, pp. 6554–6564. PMLR (2020)
Mangasarian, O., Fromovitz, S.: The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. J. Math. Anal. Appl. 17, 37–47 (1967)
Nemirovski, A.: Interior point polynomial time methods in convex programming. Lecture notes (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics, pp. 111–135 (1971)
Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv:1109.2415 (2011)
Shen, X., Diamond, S., Gu, Y., Boyd, S.: Disciplined convex-concave programming. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1009–1014. IEEE (2016)
Wang, X., Ma, S., Yuan, Y.: Penalty methods with stochastic approximation for stochastic nonlinear programming. Math. Comput. 86(306), 1793–1820 (2017)
Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Math. Program. (2019)
Yu, P., Pong, T.K., Lu, Z.: Convergence rate analysis of a sequential convex programming method with line search for a class of constrained difference-of-convex optimization problems. SIAM J. Optim. 31(3), 2024–2054 (2021)
Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, C.H., Zhang, T.: A general theory of concave regularization for high-dimensional sparse estimation problems. Stat. Sci. 27(4), 576–593 (2012)
Funding
Open access funding provided by SCELC, Statewide California Electronic Library Consortium
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
DB was partially supported by NSF grant CCF 2245705. QD was partially supported by China NSFC grant 11831002, 72394364. GL was partially supported by AHA grant 23CSA1052735 and USDA grant 20206702131526. Part of the work was done while QD was at Shanghai University of Finance and Economics.
Appendices
Proof of Lemma 7
Let \(\{\delta _n\}_{n \ge 1}\) be a sequence of \({\mathcal {C}}^{\infty }\)-smooth, real-valued mollifier functions over \(\mathbb {R}^d\) where, for every \(n \ge 1\), we have: (i) \(\delta _n \ge 0\), (ii) \(\int \delta _n(\tau )d\tau = 1\), and (iii) \(\delta _n(\tau ) = 0\) for all \(\tau \) satisfying \(\Vert \tau \Vert \ge \tfrac{1}{n}\). Moreover, we define \(p_n = \delta _n * p\). It now follows that
Using (5.2) along with the above relation, noting that \(\delta _n \ge 0\) and the fact \(\int \delta _n(\tau )d\tau = 1\), we have
for all x, y. Note that \(p_n\) is \({\mathcal {C}}^{\infty }\)-smooth. Hence, by Taylor’s theorem, there exists \(\xi \in [x,y]\) such that \(p_n(x) - p_n(y) - \langle \nabla p_n(y) , x-y\rangle = \tfrac{1}{2} \langle x-y, \nabla ^2 p_n(\xi )(x-y)\rangle + o(\Vert x-y\Vert ^{2})\). Using this relation along with (A.1) and denoting \(v:= y-x\), we have
Now, dividing both sides of the above relation by \(\Vert v\Vert ^{2}\) and taking \(y \rightarrow x\), which implies \(\xi \rightarrow x\) and \(\nabla ^2 p_n(\xi ) \rightarrow \nabla ^2 p_n(x)\), we have
for all v and x. The above relation is equivalent to the fact that \(\Vert \nabla ^2 p_n(x)\Vert _{}^{} \le \max \{L, \mu \}\) for all x. From here, for any x, y, we have
Now, taking \(n \rightarrow \infty \) and noting that \(\nabla p_n(x) \rightarrow \nabla p(x)\) for all x, we have (5.3). Hence, we conclude the proof.
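For reference, the two convolution identities used throughout this argument are the standard mollifier facts

\[
p_n(x) = (\delta_n * p)(x) = \int_{\mathbb{R}^d} \delta_n(\tau)\, p(x-\tau)\, d\tau,
\qquad
\nabla p_n(x) = \int_{\mathbb{R}^d} \delta_n(\tau)\, \nabla p(x-\tau)\, d\tau .
\]

Because \(\delta_n\) vanishes outside the ball of radius \(1/n\), both integrals average over shrinking neighborhoods of \(x\); continuity of \(\nabla p\) then gives \(\nabla p_n(x) \rightarrow \nabla p(x)\) pointwise, which is exactly the limit invoked in the last step.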
Proof of Theorem 7
First, using the definition of \(\epsilon _k\) and the fact that \(\psi _i^k(x^{k+1}) \le \eta ^k _i + \epsilon _k\) for all \(i \in [m]\), we have
Hence \(x^{k}\) is a strictly feasible solution of (3.1). By the Slater condition, there exists a pair of optimal primal and dual solutions, which we denote by \(\tilde{x}^{k+1}\) and \(\tilde{\lambda }^{k+1}\). Our argument will be based on the following key result.
Lemma 10
Proof
Using the optimality of \(\widetilde{x}^{k+1}\), the strong convexity of \(\psi _{0}^k\), the feasibility of \(x^k\) for the subproblem (3.1), and \(\psi _{0}^k(x^{k+1}) \le \psi _{0}^k(\widetilde{x}^{k+1}) + \epsilon _k\), we have
Since \(\epsilon _k\) is summable, \(\Vert x^k-\widetilde{x}^{k+1}\Vert ^{2}\) is summable, which implies \(\Vert x^k-\widetilde{x}^{k+1}\Vert \rightarrow 0\). Since \(\epsilon _k \rightarrow 0\) and \(\widetilde{x}^{k+1}\) is the unique optimal solution of (3.1), we have \(\Vert x^{k+1} - \widetilde{x}^{k+1}\Vert \rightarrow 0\). Then, by the triangle inequality, \(\Vert x^{k+1} -x^k\Vert \le \Vert x^{k+1} - \widetilde{x}^{k+1}\Vert + \Vert \widetilde{x}^{k+1}-x^k\Vert \), and hence \(\Vert x^{k+1} -x^k\Vert \rightarrow 0\). Hence, we conclude the proof. \(\square \)
Now, we show the boundedness of \(\{\tilde{\lambda }^{k+1}\}\). Assume, for the sake of contradiction, that \(\{\tilde{\lambda }^{k+1}\}\) is unbounded. Let \(\bar{x}\) be a limit point of \(\{x^k\}\). Passing to a subsequence if necessary, we have \(x^{k} \rightarrow \bar{x}\). Using Lemma 10, we have \(\widetilde{x}^{k+1} \rightarrow \bar{x}\) and \(x^{k+1} \rightarrow \bar{x}\). Then, we have
Note that the above relation is comparable to (3.10) up to an error term of \(\epsilon _k\). Following the arguments in the proof of Theorem 1 (Part 1, (3.10) onwards) and noting that \(\epsilon _k\) is summable, we conclude that \(\{\tilde{\lambda }^{k+1}\}\) is bounded.
Now, we prove that any limit point of \(\{x^k\}\) is a KKT point. Since \({\mathcal {L}}_{k}(x^{k+1}, \tilde{\lambda }^{k+1}) \le {\mathcal {L}}_{k}(\widetilde{x}^{k+1}, \tilde{\lambda }^{k+1}) + \epsilon _k\), we can rewrite (3.18) as
Let \(\bar{x}\) be a limit point of the sequence \(\{x^k\}\). Since \(\{\tilde{\lambda }^{k+1}\}\) is bounded, it also has a limit point \(\bar{\lambda }\). Without loss of generality, passing to subsequences, we have \(x^k \rightarrow \bar{x}\) and \(\tilde{\lambda }^{k+1} \rightarrow \bar{\lambda }\). Then, in view of Lemma 10, we have \(x^{k+1} \rightarrow \bar{x}\) and \(\widetilde{x}^{k+1} \rightarrow \bar{x}\). Taking the limit \(k \rightarrow \infty \) in (B.2), we have
Note that the above equation matches (3.19) exactly. From here, we follow the proof of Theorem 1 (Part 2, (3.19) onwards) to conclude first-order stationarity of \((\bar{x}, \bar{\lambda })\). A similar argument shows complementary slackness. Hence, \((\bar{x}, \bar{\lambda })\) is a KKT solution, and we conclude the proof. \(\square \)
Boob, D., Deng, Q. & Lan, G. Level constrained first order methods for function constrained optimization. Math. Program. (2024). https://doi.org/10.1007/s10107024020574
Keywords
 Function constrained optimization
 Proximal gradient descent
 Moving balls method
 Stochastic optimization