## 1 Introduction

In this paper, we study the following constrained optimization problem

\begin{aligned} \min _{x\in \mathbb {R}^d}&\quad \psi _0(x):=f_0(x)+\chi _0(x) \\ \text {s.t.}&\quad \psi _i(x) :=f_i(x)+\chi _i(x) \le \eta _i,\quad i=1,\dots ,m. \end{aligned}
(1.1)

where each $$\psi _{i}(x)$$ is a composite function, namely the sum of $$f_i(x)$$ and $$\chi _{i}(x)$$. Here, $$f_i, i ={0}, 1, \dots , m$$, are smooth functions, $$\chi _0(x)$$ is a proper, convex, lower-semicontinuous (lsc) function, and $$\chi _{i}(x), i=1,\dots ,m$$, are convex continuous functions over the domain of $$\chi _{0}$$ (i.e., $${\textrm{dom}}_{\chi _{0}}$$). We assume that $$\chi _{i}, i =0, \dots , m$$, are ‘simple’ functions, in the sense that a feasible optimization problem of the form

\begin{aligned} \min _{x \in \mathbb {R}^d}\ \big \{\Vert x-a^0\Vert _{}^{2} + \chi _{0}(x): \Vert x-a^i\Vert _{}^{2} + \chi _{i}(x) \le b^i, i = 1, \dots , m\big \} \end{aligned}
(1.2)

can be solved efficiently to obtain either an exact solution or an inexact solution of the desired accuracy. Note that if $$\chi _{i} =0, i= 1, \dots , m$$, then (1.2) reduces to evaluating the proximal operator of $$\chi _{0}$$ over an intersection of balls. If we further assume $$\chi _{0} = 0$$, then (1.2) is a special type of quadratically constrained quadratic program (QCQP) that can be solved efficiently because all the Hessians are identity matrices. In addition, we consider the case where the constraint functions $$f_i, i =1, \dots , m$$, are structured nonsmooth functions which can be approximated by smooth functions (also called smoothable functions). Note that problem (1.1) covers a variety of convex and nonconvex function constrained optimization problems, depending on the assumptions on $$f_i, i = 0, \dots , m$$.
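To make the tractability of (1.2) concrete: in the simplest instance with $$m = 1$$ and $$\chi _0 = \chi _1 = 0$$, the subproblem is the Euclidean projection of $$a^0$$ onto the ball $$\Vert x-a^1\Vert ^2 \le b^1$$ and admits a closed form. A minimal sketch of this special case (the function name is ours, for illustration only):

```python
import numpy as np

def solve_single_ball_subproblem(a0, a1, b1):
    """Solve min ||x - a0||^2  s.t.  ||x - a1||^2 <= b1, i.e. the m = 1,
    chi_i = 0 instance of subproblem (1.2).  The minimizer is the Euclidean
    projection of a0 onto the ball of radius sqrt(b1) centered at a1."""
    r = np.sqrt(b1)
    d = a0 - a1
    dist = np.linalg.norm(d)
    if dist <= r:                 # a0 is already feasible: constraint inactive
        return a0.copy()
    return a1 + (r / dist) * d    # step from the center toward a0 by length r

# Example: project a0 = (3, 0) onto the unit ball centered at the origin.
x = solve_single_ball_subproblem(np.array([3.0, 0.0]),
                                 np.array([0.0, 0.0]), 1.0)
```

With several balls ($$m > 1$$) the subproblem loses its closed form but remains a diagonal QCQP whose Hessians are identities, which is why it can still be solved cheaply, e.g., via its m-dimensional dual.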

Nonlinear optimization with function constraints is a classical topic in continuous optimization. While earlier studies focused on asymptotic performance, recent work has put more emphasis on the complexity analysis of algorithms, mainly driven by the growing interest in large-scale optimization and machine learning. For most of our discussion on complexity analysis, we generally require convergence to an $$\epsilon$$-approximate KKT point (cf. Definition 3). Penalty methods [9, 25, 33], including augmented Lagrangian methods [22,23,24, 34], are one popular approach to constrained optimization. In [8], Cartis et al. presented an exact penalty method by minimizing a sequence of convex composition functions. When the penalty weight is bounded, this method solves $${\mathcal {O}}(1/\epsilon )$$ trust region subproblems. If the penalty weight is unbounded, the complexity is $${\mathcal {O}}(1/\epsilon ^{2.5})$$ to reach an $$\epsilon$$-KKT point. In a subsequent work [9], the authors provided a target-following method that achieves a complexity of $${\mathcal {O}}(1/\epsilon )$$, regardless of the growth of the penalty parameter. In [33], Wang et al. extended the penalty method to constrained problems where the objective takes an expectation form. Sequential quadratic programming (SQP) is another important approach to constrained optimization. Typically, SQP involves linearization of the constraints, quadratic approximation of the objective, and possibly a trust region constraint for the convergence guarantee [6, 7]. The recent work [12] established a unified convergence analysis of SQP (GSQP) in more general settings where feasibility and constraint qualification may or may not hold. Different from standard SQP, the Moving Balls Approximation (MBA) method [1] follows a feasible solution path and transforms the initial problem into a diagonal QCQP.
A subsequent work [3] presented a unified analysis of MBA and other variants of SQP methods. Under the Kurdyka–Łojasiewicz (KL) property, they establish global convergence rates that depend on the Łojasiewicz exponent.

Despite much progress in prior works, some significant issues remain. Specifically, most of the analysis is carried out only for smooth optimization and requires that the exact optimal solution of the convex subproblem be readily available. Unfortunately, both assumptions can be unrealistic in many large-scale applications. To overcome these issues, [4, 25, 26] presented new proximal point algorithms that iteratively solve strongly convex proximal subproblems inexactly using first-order methods. A significant computational advantage is that first-order methods only need to compute a relatively easy proximal gradient mapping in each iteration. In particular, [4] proposed to solve the proximal point subproblem by a new first-order primal-dual method called ConEx. Under a strict feasibility assumption, they derived the total complexities of the overall algorithm for settings in which the objective and constraints can be either stochastic or deterministic, and either nonsmooth or smooth. Notably, for nonconvex smooth constrained problems, the inexact proximal point method [4] requires $${\mathcal {O}}(1/\epsilon ^{1.5})$$ function/gradient evaluations. A similar complexity bound is obtained by the proximal point penalty method [25] when a feasible point is available. Nevertheless, at this point, it may be difficult to directly compare the efficiency of the proximal point approach with the earlier approaches, given that very different oracles are employed in each method. The inexact proximal point method appears to be less efficient in terms of gradient and function value computations, since the first-order penalty method [9] and a variant of SQP [12] (where the surrogate is formed by first-order approximation) have an $${\mathcal {O}}(1/\epsilon )$$ complexity bound. On the other hand, it might be more efficient if the corresponding proximal mapping is much easier to solve than the subproblems in penalty or SQP methods.

In this paper, we attempt to alleviate some of the aforementioned issues in solving nonconvex constrained optimization. Our main contribution is the development of a novel Level Constrained Proximal Gradient (LCPG) method for constrained optimization, based on the following key ideas.

First, we convert the original problem (1.1) into a sequence of simple convex subproblems of the form (1.2) for which an exact or an approximate solution can be computed efficiently. In particular, solving each subproblem requires at most one gradient and function value computation for $$f_i, i =0, \dots , m$$. In this respect, LCPG resembles simple single-loop methods, even though the LCPG method can be multi-loop, since we allow for an inexact solution of (1.2) computed by some iterative scheme.

Second, starting from a strictly feasible initial point and carefully controlling the feasibility levels of the subproblem constraints, we ensure that LCPG follows a strictly feasible solution path. This also allows us to deal with nonsmooth constraints where $$\chi _i$$ is not necessarily 0 and further extends LCPG to the inexact case where the subproblem admits an approximate solution. Even though subtle, the level-control design is crucial in bounding the Lagrange multipliers under the well-known Mangasarian–Fromovitz constraint qualification (MFCQ) [4, 27]. Subsequently, we also show asymptotic convergence of the LCPG method.

Third, we offer a new insight into the complexity analysis of LCPG as a gradient descent type method, which could be of independent interest. When the objective and constraints are nonconvex composite, we aim to find a first-order $$\epsilon$$-KKT point (cf. Definition 3) under the aforementioned MFCQ assumption. We show that the LCPG method converges in $$O(1/\epsilon )$$ iterations. Furthermore, each subproblem requires at most one function value and gradient computation. The net outcome is that the gradient complexity of our method is $$O(1/\epsilon )$$. Notice that the number of iterations required by the proximal point method under MFCQ is also $$O(1/\epsilon )$$ (see [4, Theorem 5]). However, each iteration of that method requires $$O(1/\epsilon ^{0.5})$$ gradient computations, and hence its total gradient complexity can be bounded by $$O(1/\epsilon ^{1.5})$$, which is much worse than that of the LCPG method. We compare with some significant lines of work in Table 1.

Exploiting the intrinsic connection between LCPG and proximal gradient descent (without function constraints), we extend LCPG to a variety of cases. (1) We show a similar $$O(1/\epsilon )$$ gradient complexity for an inexact LCPG method in which the subproblem is solved to a pre-specified accuracy. If we assume $$\chi _{i} = 0$$, then the corresponding subproblem (1.2) (i.e., a diagonal QCQP) can be efficiently solved by a customized interior point method in logarithmic time. In the more general setting where $$\chi _{i} \ne 0$$, we propose to solve (1.2) by the first-order method ConEx, which has very cheap iterations. (2) We also extend the LCPG method to stochastic (LCSPG) and variance-reduced (LCSVRG) variants when $$f_0$$ is a stochastic or finite-sum function, respectively. LCSPG and LCSVRG require $$O(1/\epsilon ^2)$$ (similar to SGD [15]) and $$O(\sqrt{n}/\epsilon )$$ (similar to SVRG [18]) stochastic gradients, respectively, where n is the number of components in the finite-sum objective. The complexity of the variants of the LCPG method for stochastic cases can also be seen in Table 2. (3) We consider the case where the functions $$f_i, i =0,1, \dots , m$$, are nondifferentiable but contain a smooth saddle structure (referred to as structured nonsmooth). We extend the LCPG method to such nonsmooth nonconvex function constrained problems using Nesterov’s smoothing scheme [29]. In this case, the LCPG method requires $$O(1/\epsilon ^2)$$ gradients.

We show that the GD-type analysis of the LCPG method extends to the convex case. In particular, when the objective and constraint functions are convex, we show that the LCPG method requires $$O(1/\epsilon )$$ gradient computations for smooth and composite constrained problems, and this complexity improves to $$O(\log {({1}/{\epsilon })})$$ when the objective is smooth and strongly convex. Furthermore, we develop the complexity of inexact variants of the LCPG method by leveraging the analysis of gradient descent with inexact projection oracles [31]. The inexact LCPG method maintains the gradient complexity of $$O(1/\epsilon )$$ and $$O(\log (1/\epsilon ))$$ for convex and strongly convex problems, respectively.

Throughout our analysis, we require that the Lagrange multipliers for the convex subproblems of type (1.2) be bounded. This issue is addressed in different ways in arguably all works in the literature. In this paper, we show that under the MFCQ assumption, the Lagrange multipliers associated with the sequence of subproblems remain bounded by a quantity denoted by B. Even then, the value of B cannot be estimated a priori. Fortunately, this bound is not needed in the implementation of our methods; however, it plays a role in the complexity analysis. Hence, our comparison with the existing complexity literature (e.g., the proximal point method of [4]) is valid when the bound B on the sequence of Lagrange multipliers largely depends on the problem itself and not on the sequence of subproblems. One can easily see that such uniform bounds on Lagrange multipliers hold under the strong feasibility constraint qualification [4], a similar uniform Slater’s condition [26], or for the nonsmooth nonconvex relaxation in the application of sparsity constrained optimization [5]. Comparing the bounds B across methods requires getting into specific applications, which is not the purpose of this paper. Hence, throughout our comparison with the existing literature, we assume that the bound B for different methods is of a similar order.

Comparison with MBA method Auslender et al. [1] provided a Moving Balls Approximation (MBA) method for smooth constrained problems, i.e., where $$\chi _i(x), i=0, \dots , m$$, are not present. They use Lipschitz continuity of the constraint gradients along with MFCQ to ensure that the subproblems satisfy Slater’s condition (see [1, Proposition 2.1(iii)]). A similar result is also used in [35], which provides a line-search version of MBA for functions satisfying certain KL properties. The MBA method was studied for semi-algebraic functions in [3], using the KL property of such functions. The work [1] also provides a complexity guarantee for constrained programs with a smooth and strongly convex objective. Our results differ from these past studies in several aspects. First, we do not assume any KL property on the class of functions, hence making the method applicable to a wider class of problems. Second, we give a complexity analysis for a variety of cases, e.g., the stochastic, finite-sum, and structured nonsmooth cases. Note that such complexity results are not known for MBA-type methods even for the purely smooth problem. Third, we show complexity results for both convex and strongly convex cases, which strictly subsume the results in [1]. Fourth, it should be noted that [1] also considered the efficiency of solving the subproblems. They proposed an accelerated gradient method that obtains $${\mathcal {O}}(1/\sqrt{\epsilon })$$ complexity for solving the dual of the QCQP subproblem. However, it is unclear what accuracy suffices to ensure asymptotic convergence of the whole algorithm.

Comparison with generalized SQP The work [12] developed the first complexity analysis of the generalized SQP (GSQP) method by using a novel ghost penalty approach. Different from our feasible method, they consider a general setting where feasibility and constraint qualification may or may not hold. They show that SQP-type methods have an $${\mathcal {O}}(1/\epsilon ^2)$$ complexity for reaching an $$\epsilon$$-approximate generalized stationary point. Under an extended MFCQ condition, they established an improved complexity of $${\mathcal {O}}(1/\epsilon )$$ for reaching a scaled-KKT point, which matches our complexity result under a similar MFCQ assumption. Notably, both their analysis and ours rely on MFCQ to show that a global upper bound (the constant B) on the multipliers of the subproblems exists. However, to obtain the best $${\mathcal {O}}(1/\epsilon )$$ complexity, GSQP explicitly relies on the value of this unknown upper bound to determine the stepsize, which can be challenging in practice. In contrast, our algorithm does not involve the constant B in its implementation; we only require the Lipschitz constants of the gradients, which is standard for gradient descent methods.

Outline This paper is organized as follows: Sect. 2 describes notations and assumptions. It also provides various definitions used throughout the paper. Section 3 presents the LCPG method, which uses exact solutions of the subproblems, and establishes asymptotic convergence and convergence rate results. Sections 4.1 and 4.2 provide the LCSPG and LCSVRG methods for stochastic and finite-sum problems, respectively. Section 5 shows the extension of LCPG to nonsmooth nonconvex function constraints. Section 6 introduces the inexact LCPG method and provides its complexity analysis when the subproblems are inexactly solved by an interior point method or a first-order method. Finally, Sect. 7 extends the LCPG method to convex optimization problems and establishes its complexity for both strongly convex and convex problems.

## 2 Notations and assumptions

Notations. $$\mathbb {R}^n_+$$ stands for the non-negative orthant in $$\mathbb {R}^n$$. We use $$\Vert \cdot \Vert _{}^{}$$ to denote the Euclidean norm. For a set $${\mathcal {X}}$$, we define $$\Vert {\mathcal {X}}\Vert _{-}^{}=\text {dist}(0,{\mathcal {X}})=\inf \big \{\Vert x\Vert _{}^{}:x\in {\mathcal {X}}\big \}$$. If $${\mathcal {X}}$$ is a convex set, then we denote its normal cone at x by $$N_{\mathcal {X}}(x)$$. Furthermore, we denote the dual cone of the normal cone at x by $$N^*_{\mathcal {X}}(x)$$. Let e denote the vector of all ones. For simplicity, we denote $$[m]=\{1,2,\ldots , m\}$$, $$f(x)=[f_{1}(x),\ldots ,f_{m}(x)]^{\textrm{T}}$$, $$\chi (x)=[\chi _{1}(x),\ldots ,\chi _{m}(x)]^{\textrm{T}}$$, and $$\psi (x)=[\psi _{1}(x),\psi _{2}(x),\ldots ,\psi _{m}(x)]^{\textrm{T}}$$. For vectors $$x, y\in \mathbb {R}^m$$, $$x\le y$$ is understood componentwise, i.e., $$x_i\le y_i$$ for $$i\in [m]$$.
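The quantity $$\Vert {\mathcal {X}}\Vert _{-}$$ reappears later as the stationarity residual $$\Vert \partial _{x}{\mathcal {L}}(x,\lambda )\Vert _{-}^{2}$$ in Definition 3. For a finite set of candidate subgradients it is simply the smallest norm; a tiny illustration with hypothetical vectors:

```python
import numpy as np

def set_dist_to_zero(vectors):
    """||X||_- = dist(0, X) = inf { ||x|| : x in X } for a finite set X,
    given as an iterable of numpy arrays."""
    return min(np.linalg.norm(v) for v in vectors)

# X = {(3, 4), (0.6, 0.8), (5, 12)} has norms {5, 1, 13}, so ||X||_- = 1.
d = set_dist_to_zero([np.array([3.0, 4.0]),
                      np.array([0.6, 0.8]),
                      np.array([5.0, 12.0])])
```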

### Assumption 1

(General) We assume that the optimal value of problem (1.1) is finite: $$\psi _{0}^{*}>-\infty$$. Furthermore, the objective and constraint functions have the following properties.

1.

$$\chi _{0}$$ is a proper, convex, and lower semi-continuous (lsc) function. Moreover, we assume that for all $$i = 1, \dots , m$$, the function $$\chi _{i}(x)$$ is convex continuous over $${\textrm{dom}}_{\chi _{0}}$$.

2.

$$f_{i}(x)$$ is $$L_{i}$$-Lipschitz smooth on $${\textrm{dom}}_{\chi _{0}}$$: $$\Vert \nabla f_{i}(x)-\nabla f_{i}(y)\Vert _{}^{}\le L_{i}\Vert x-y\Vert _{}^{}$$ for any $$x,y\in {\textrm{dom}}_{\chi _{0}}$$. For brevity, we denote $$L=[L_{1},\ldots ,L_{m}]^{\textrm{T}}$$.

3.

The feasible set for (1.1), i.e., $$\bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}$$, is nonempty and compact.

The Lagrangian function of problem (1.1) is denoted by

\begin{aligned} {\mathcal {L}}(x,\lambda )=\psi _{0}(x)+\sum _{i=1}^{m}\lambda _{i}[\psi _{i}(x)-\eta _{i}]. \end{aligned}
(2.1)

For functions $$\psi _i$$, we denote its subdifferential as

\begin{aligned} \partial \psi _{i}(x) = \{\nabla f_{i}(x) \}+ \partial \chi _{i}(x),i = 0, \dots , m, \end{aligned}

where the sum is in the Minkowski sense. Note that this definition of the subdifferential for nonconvex functions was first proposed in [4]. Moreover, $$\partial \psi _i = \{\nabla f_i \}$$ when $$\psi _i$$ is a “purely” smooth nonconvex function and $$\partial \psi _i = \partial \chi _i$$ when $$\psi _i$$ is a nonsmooth convex function. Hence, it is a valid definition for the subdifferential of a nonconvex function. Below, we define the KKT condition using the above subdifferential.

### Definition 1

(KKT condition) We say $$x\in {\textrm{dom}}_{\chi _{0}}$$ is a KKT point of problem (1.1) if x is feasible and there exists a vector $$\lambda \in \mathbb {R}_{+}^{m}$$ such that

\begin{aligned} 0 \in \partial _{x}{\mathcal {L}}(x,\lambda ), \quad 0 =\sum _{i=1}^m\lambda _{i}\big [\psi _{i}(x)-\eta _{i}\big ]. \end{aligned}
(2.2)

The values $$\{\lambda _{i}\}$$ are called Lagrange multipliers.

It is known that the KKT condition is necessary for optimality under certain constraint qualifications (cf. [2]). Our result will be based on a variant of the Mangasarian–Fromovitz constraint qualification, which is formally stated below.

### Definition 2

(MFCQ) We say that a point x satisfies the Mangasarian–Fromovitz constraint qualification for (1.1) if there exists a vector $$z \in -N^*_{{\textrm{dom}}_{\chi _{0}}}(x)$$ such that

\begin{aligned} \max _{v\in \partial \psi _{i}(x)}\langle v,z\rangle <0,\quad i\in {\mathcal {A}}(x), \end{aligned}
(2.3)

where $${\mathcal {A}}(x)=\{i:1\le i\le m,\psi _{i}(x)=\eta _{i}\}$$.

### Proposition 1

(Necessary condition) Let x be a local optimal solution of problem (1.1). If x satisfies MFCQ (2.3), then there is a vector $$\lambda \in \mathbb {R}_{+}^{m}$$ such that the KKT condition (2.2) holds.

Next, we introduce some optimality measures before formally presenting any algorithms. It is natural to characterize algorithm performance by measuring the error of satisfying the KKT condition. Towards this goal, we have the following definition.

### Definition 3

We say that x is an $$\epsilon$$ type-I (approximate) KKT point if it is feasible (i.e. $$\psi (x)\le \eta$$), and there exists a vector $$\lambda \in \mathbb {R}_{+}^{m}$$ satisfying the following conditions:

\begin{aligned} \Vert \partial _{x}{\mathcal {L}}(x,\lambda )\Vert _{-}^{2}&\le {\epsilon }\\ -\sum _{i=1}^m \lambda _{i}[\psi _{i}(x)-\eta _{i}]&\le {\epsilon }. \end{aligned}

Moreover, x is a randomized $${\epsilon }$$ type-I KKT point if both x and $$\lambda$$ are feasible randomized primal-dual solutions that satisfy

\begin{aligned} \mathbb {E} [ \Vert \partial _{x}{\mathcal {L}}(x,\lambda )\Vert _{-}^{2} ]&\le {\epsilon }\\ \mathbb {E} [ -\sum _{i=1}^m \lambda _{i}[\psi _{i}(x)-\eta _{i}] ]&\le {\epsilon }, \end{aligned}

where the expectation is taken over the randomness of x and $$\lambda$$.
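As a sanity check of Definition 3 on a toy smooth problem of our own (not from the paper): $$\min (x-2)^2$$ subject to $$x^2 \le 1$$, whose unique KKT point is $$x = 1$$ with multiplier $$\lambda = 1$$. Both residuals below then vanish, so this x is an $$\epsilon$$ type-I KKT point for every $$\epsilon > 0$$:

```python
def kkt_type1_residuals(x, lam):
    """Type-I KKT residuals for min (x - 2)^2  s.t.  x^2 <= 1 (eta = 1).
    Both functions are smooth, so the subdifferential of the Lagrangian is
    the single gradient 2(x - 2) + lam * 2x, and ||.||_- is its magnitude."""
    grad_lagrangian = 2.0 * (x - 2.0) + lam * 2.0 * x
    stationarity = grad_lagrangian ** 2        # ||partial_x L(x, lam)||_-^2
    compl_slack = -lam * (x ** 2 - 1.0)        # -lam * (psi_1(x) - eta_1)
    return stationarity, compl_slack

s, c = kkt_type1_residuals(1.0, 1.0)           # exact KKT point
```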

Besides the above definition, we invoke a second optimality measure, which helps analyze the performance of proximal algorithms (see, for example, [4]), where it is often more convenient to measure proximity to a nearly stationary point.

### Definition 4

We say that x is a $${(\epsilon , \nu )}$$ type-II KKT point if there exists an $${\epsilon }$$ type-I KKT point $$\hat{x}$$ and $$\Vert x-\hat{x}\Vert _{}^{2}\le \nu$$. Similarly, x is a randomized $${(\epsilon , \nu )}$$ type-II KKT point if $$\hat{x}$$ is a random vector and $$\mathbb {E}[\Vert x-\hat{x}\Vert _{}^{2} ]\le {\nu }$$.

## 3 A proximal gradient method

We present the level constrained proximal gradient (LCPG) method in Algorithm 1. The main idea of this algorithm is to turn the original nonconvex problem into a sequence of relatively easier subproblems that involve some convex surrogate functions $$\psi ^k_i(x)$$ ($$0\le i\le m$$) and variable constraint levels $$\eta ^k$$:

\begin{aligned} \begin{aligned} \min _{x\in {\mathbb {R}^d}}&\quad \psi _0^k(x) \\ \text {s.t.}&\quad \psi _i^k(x) \le \eta _i^k,\quad i\in [m]. \end{aligned} \end{aligned}
(3.1)

Above, we take the surrogate function $$\psi _{i}^{k}(x)$$ ($$0\le i\le m$$) by partially linearizing $$\psi _i(x)$$ at $$x^k$$ and adding the proximal term $$\tfrac{L_i}{2}\Vert x-x^k\Vert _{}^{2}$$:

\begin{aligned} \psi _{i}^{k}(x)&:=f_{i}(x^{k})+\big \langle \nabla f_{i}(x^{k}),x-x^{k}\big \rangle +\tfrac{L_{i}}{2}\Vert x-x^{k}\Vert _{}^{2}+\chi _{i}(x). \end{aligned}
(3.2)
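With $$m = 1$$ and $$\chi _0 = \chi _1 = 0$$, completing the square in (3.1)–(3.2) shows that one LCPG iteration is exactly a projection of the gradient step $$x^k - \nabla f_0(x^k)/L_0$$ onto a ball built from the constraint surrogate, i.e., an instance of (1.2). A sketch on a toy instance of our own:

```python
import numpy as np

def lcpg_step(xk, grad_f0, grad_f1, f1_val, L0, L1, eta_k):
    """One LCPG step for m = 1, chi_i = 0.  Completing the square in the
    surrogates (3.2) turns subproblem (3.1) into: project
    a0 = xk - grad_f0/L0 onto the ball ||x - a1||^2 <= b1, where
    a1 = xk - grad_f1/L1 and b1 = ||grad_f1||^2/L1^2 + (2/L1)(eta_k - f1(xk))."""
    a0 = xk - grad_f0 / L0
    a1 = xk - grad_f1 / L1
    b1 = np.dot(grad_f1, grad_f1) / L1 ** 2 + (2.0 / L1) * (eta_k - f1_val)
    d = a0 - a1
    dist = np.linalg.norm(d)
    r = np.sqrt(b1)
    return a0 if dist <= r else a1 + (r / dist) * d

# Toy instance (ours): f0(x) = ||x - t||^2, f1(x) = ||x||^2, eta^k = 1,
# so L0 = L1 = 2 and the quadratic surrogates are exact.
t, xk = np.array([3.0, 0.0]), np.array([0.5, 0.0])
x_next = lcpg_step(xk, 2.0 * (xk - t), 2.0 * xk,
                   float(np.dot(xk, xk)), 2.0, 2.0, 1.0)
```

On this instance the step lands on the boundary point (1, 0), which is feasible for the level-$$\eta ^k$$ constraint, while $$\psi _0$$ drops from 6.25 to 4.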

It should be noted that our algorithm would not be well-defined if it were initialized at an infeasible solution $$x^0$$. Furthermore, we require the initial point to be strictly feasible with respect to the nonlinear constraints $$\psi (x)\le \eta$$. Therefore, we explicitly state this assumption below and assume it holds throughout the paper.

### Assumption 2

(Strict feasibility) There exist a point $$x^0\in {\textrm{dom}}_{\chi _{0}}$$ and a vector $$\eta ^0\in \mathbb {R}^m$$ satisfying

\begin{aligned} \psi _i(x^0)<\eta _i^0 < \eta _i,\quad i\in [m]. \end{aligned}

With a strictly feasible solution, we assume that the constraint levels $$\{\eta ^k\}$$ are incrementally updated and converge to the constraint levels for the original problem:

\begin{aligned} \lim _{k\rightarrow \infty } \eta _i^k=\eta _i,\quad i\in [m]. \end{aligned}
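The text only requires that the levels increase monotonically to $$\eta$$. One concrete schedule consistent with Assumption 2 and this limit (our own choice, not prescribed by the paper) closes a fixed fraction of the remaining gap at every iteration:

```python
def level_schedule(eta, eta0, k, rho=0.5):
    """eta^k = eta - (eta - eta0) * rho^k with rho in (0, 1):
    strictly increasing from eta^0 < eta and converging to eta."""
    return eta - (eta - eta0) * rho ** k

# With eta = 1 and eta^0 = 0: levels 0, 0.5, 0.75, 0.875, ... -> 1.
levels = [level_schedule(1.0, 0.0, k) for k in range(4)]
```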

The following lemma will be used repeatedly throughout the rest of the paper.

### Lemma 1

(Three-point inequality) Let $$g:\mathbb {R}^d\rightarrow (-\infty , \infty ]$$ be a proper lsc convex function and

\begin{aligned} x^{+}=\mathop {\text {argmin}}\limits _{z\in {\mathbb {R}^d}}\left\{ g(z)+\tfrac{\gamma }{2}\Vert z-y\Vert _{}^{2}\right\} , \end{aligned}

then for any $$x\in \mathbb {R}^d$$, we have

\begin{aligned} g(x^{+})-g(x)\le \tfrac{\gamma }{2}\big (\Vert x-y\Vert _{}^{2} -\Vert x^{+}-y\Vert _{}^{2}-\Vert x-x^{+}\Vert _{}^{2}\big ). \end{aligned}
(3.4)
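For completeness, (3.4) is a one-line consequence of strong convexity (a standard argument): the function $$h(z):= g(z)+\tfrac{\gamma }{2}\Vert z-y\Vert _{}^{2}$$ is $$\gamma$$-strongly convex and is minimized at $$x^{+}$$, so for any $$x\in \mathbb {R}^d$$,

\begin{aligned} g(x)+\tfrac{\gamma }{2}\Vert x-y\Vert _{}^{2} = h(x)\ge h(x^{+})+\tfrac{\gamma }{2}\Vert x-x^{+}\Vert _{}^{2} = g(x^{+})+\tfrac{\gamma }{2}\Vert x^{+}-y\Vert _{}^{2}+\tfrac{\gamma }{2}\Vert x-x^{+}\Vert _{}^{2}, \end{aligned}

and rearranging gives (3.4).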

Next, we present some important properties of the generated solutions in the following proposition.

### Proposition 2

Suppose that Assumption 2 holds. Then Algorithm 1 has the following properties.

1.

The sequence $$\big \{ x^{k}\big \}$$ is well-defined and feasible for problem (1.1), and it satisfies the sufficient descent property:

\begin{aligned} \tfrac{L_{0}}{2}\Vert x^{k+1}-x^{k}\Vert _{}^{2}\le \psi _{0}(x^{k})-\psi _{0}(x^{k+1}). \end{aligned}
(3.5)

Moreover, the sequence of objective values $$\big \{\psi _{0}(x^{k})\big \}$$ is monotonically decreasing and $$\lim _{k\rightarrow \infty }\psi _{0}(x^{k})$$ exists.

2.

There exists a vector $$\lambda ^{k+1}\in \mathbb {R}_{+}^{m}$$ such that the KKT condition holds:

\begin{aligned} \begin{aligned} \partial \psi _{0}^{k}(x^{k+1})+\sum _{i=1}^{m}\lambda _{i}^{k+1}\partial \psi _{i}^{k}(x^{k+1})&\ni 0\\ \lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ]&=0,\quad i\in [m]. \end{aligned} \end{aligned}
(3.6)

### Proof

Part 1). First, we show that $$\{x^k\}$$ is a well-defined sequence, namely, that $${\mathcal {X}}_k\cap {\textrm{dom}}_{\chi _{0}}$$ is nonempty, where $${\mathcal {X}}_k=\big \{ x:\psi _{i}^{k}(x)\le \eta _{i}^{k}\big \}$$. This clearly holds for $$k = 0$$ by Assumption 2. We show the general case $$(k>0)$$ by induction. Suppose that $$x^k$$ is well-defined, i.e., $${\mathcal {X}}_{k-1}\cap {\textrm{dom}}_{\chi _{0}}$$ is nonempty. Then, by the definitions of $$\psi _{i}^{k}$$, $$\psi _{i}^{k-1}$$ and $$x^k$$, we have $$x^k \in {\textrm{dom}}_{\chi _{0}}$$ and

\begin{aligned} \psi _{i}^k(x^k) = \psi _{i}(x^k) \le \psi _{i}^{k-1}(x^{k}) \le \eta ^{k-1}_i < \eta _{i}^k \hspace{1em} \text {for all } i \in [m]. \end{aligned}
(3.7)

Here, the first inequality follows from the smoothness of $$f_i(x)$$, which ensures that for all x,

\begin{aligned} f_i(x) \le f_i(x^{k-1}) + \langle \nabla f_i(x^{k-1}), x-x^{k-1}\rangle + \tfrac{L_i}{2}\Vert x-x^{k-1}\Vert _{}^{2},\quad \forall i \in [m]. \end{aligned}

This is equivalent to $$x^k \in {\mathcal {X}}_k\cap {\textrm{dom}}_{\chi _{0}}$$, implying that $${\mathcal {X}}_k \cap {\textrm{dom}}_{\chi _{0}}$$ is nonempty. We conclude that $$x^{k+1}$$ is well-defined. Hence, by induction, we conclude that $$\{x^k\}$$ is a well-defined sequence. Furthermore, in view of $$x^k\in {\textrm{dom}}_{\chi _{0}}$$, relation (3.7) and the fact that $$\eta ^k_i < \eta _i$$, we have $$x^k \in {\textrm{dom}}_{\chi _{0}} \cap \{x: \psi _{i}(x)\le \eta _{i},\ i = 1, \dots ,m\}$$. Hence, the whole sequence $$\{x^k\}$$ remains feasible for the original problem.

Now, let us apply Lemma 1 with $$g(x)=\langle \nabla f_{0}(x^{k}), x\rangle +\chi _{0}(x)+\textbf{1}_{{\mathcal {X}}_k}(x)$$, $$y=x^k$$ and $$\gamma =L_{0}$$. Then, for any $$x \in {\textrm{dom}}_{\chi _{0}} \cap {\mathcal {X}}_k$$, we have

\begin{aligned}{} & {} \langle \nabla f_{0}(x^{k}), x^{k+1}-x\rangle + \chi _{0}(x^{k+1})-\chi _{0}(x)\\{} & {} \quad \le {\tfrac{L_0}{2}\big (\Vert x-x^k\Vert _{}^{2}-\Vert x-x^{k+1}\Vert _{}^{2}-\Vert x^k-x^{k+1}\Vert _{}^{2}\big )}. \end{aligned}

Setting $$x=x^{k}$$ in the above relation, we obtain

\begin{aligned} \langle \nabla f_{0}(x^{k}), x^{k+1}-x^{k}\rangle +\chi _{0}(x^{k+1})-\chi _{0}(x^{k})\le -L_{0}\Vert x^{k+1}-x^{k}\Vert _{}^{2}. \end{aligned}
(3.8)

Moreover, since $$f_{0}(\cdot )$$ is Lipschitz smooth, we have that

\begin{aligned} f_{0}(x^{k+1})\le f_{0}(x^{k})+\langle \nabla f_{0}(x^{k}), x^{k+1}-x^{k}\rangle +\tfrac{L_{0}}{2}\Vert x^{k+1}-x^{k}\Vert _{}^{2}. \end{aligned}

Summing up the above two inequalities and using the definition $$\psi _{0}=f_{0}+\chi _{0}$$, we conclude (3.5). Hence, the sequence $$\big \{\psi _{0}(x^{k})\big \}$$ is monotonically decreasing. The convergence of $$\big \{\psi _{0}(x^{k})\big \}$$ follows from the lower boundedness assumption.

Part 2). Note that (3.7) ensures the strict feasibility of $$x^k$$ w.r.t. the constraint set $${\mathcal {X}}_k$$ for the kth subproblem. Therefore, Slater’s condition for (3.1) and the optimality of $$x^{k+1}$$ imply that there must exist a vector $$\lambda ^{k+1} \in \mathbb {R}^m_+$$ satisfying KKT condition (3.6). Hence, we complete the proof. $$\square$$
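The descent argument in Part 1 is precisely the classical proximal-gradient inequality. A quick numerical check on an unconstrained 1-D composite toy of our own ($$f_0(x) = (x-3)^2$$, $$\chi _0 = |\cdot |$$, $$L_0 = 2$$, no function constraints), where the subproblem reduces to a soft-thresholding step:

```python
L0 = 2.0
f0 = lambda x: (x - 3.0) ** 2
grad_f0 = lambda x: 2.0 * (x - 3.0)
psi0 = lambda x: f0(x) + abs(x)                # composite objective f0 + chi0

def prox_step(xk):
    """x^{k+1} = argmin <grad f0(x^k), x> + |x| + (L0/2)(x - x^k)^2,
    i.e. soft-thresholding of the gradient step x^k - grad f0(x^k)/L0 at 1/L0."""
    y = xk - grad_f0(xk) / L0
    tau = 1.0 / L0
    return max(abs(y) - tau, 0.0) * (1.0 if y >= 0 else -1.0)

xk = 0.0
x_next = prox_step(xk)                          # = 2.5 on this instance
descent = psi0(xk) - psi0(x_next)               # right-hand side of (3.5)
quad = (L0 / 2.0) * (x_next - xk) ** 2          # left-hand side of (3.5)
```

On this instance the two sides of (3.5) coincide (both equal 6.25), so the bound is tight for quadratics with matching Lipschitz constant.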

In order to show convergence to KKT solutions, we need the following constraint qualification.

### Assumption 3

(Uniform MFCQ) All the feasible points of problem (1.1) satisfy MFCQ.

We state the main asymptotic convergence property of LCPG in the following theorem.

### Theorem 1

Suppose that Assumption 3 holds. Then we have the following conclusions.

1.

The sequence of dual solutions $$\{\lambda ^{k+1}\}$$ is bounded. Namely, there exists a constant $$B>0$$ such that

\begin{aligned} \sup _{k\ge 0}\Vert \lambda ^{k+1}\Vert _{}^{}<B. \end{aligned}
(3.9)
2.

Every limit point of Algorithm 1 is a KKT point.

### Proof

Part 1). First, we can immediately see that $$\{x^k\}$$ is a bounded sequence and hence a limit point exists. This follows from Assumption 1.3 and the feasibility of $$x^k$$ for problem (1.1) (cf. Proposition 2, Part 1). Without loss of generality, we can assume $$\lim _{k\rightarrow \infty }x^{k}=\bar{x}$$. For the sake of contradiction, suppose that $$\{\lambda ^{k+1}\}$$ is unbounded; then, passing to a subsequence if necessary, we can assume $$\Vert \lambda ^{k+1}\Vert _{}^{}\rightarrow \infty$$ for simplicity. Note that we also have $$\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^{k}\Vert _{}^{2}=0$$ due to the sufficient descent property (3.5). From the KKT condition (3.6), we have

\begin{aligned} \psi _{0}^k(x)+\langle \lambda ^{k+1}, \psi ^{k}(x)-\eta ^{k}\rangle \ge \psi _{0}^k(x^{k+1})+\langle \lambda ^{k+1}, \psi ^{k}(x^{k+1})-\eta ^{k}\rangle , \hspace{1em} \forall x {\ \in {\textrm{dom}}_{\chi _{0}} }. \end{aligned}
(3.10)

Let $$X:= \bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}$$ be the feasible domain for problem (1.1). Due to the fact that $$x^k \in X$$ (Proposition 2), boundedness of X (Assumption 1.3) and strong convexity of $$\psi ^k_{0}$$, there exists $$l_0 \in \mathbb {R}$$ such that $$X \subset \{x: \psi _{0}^k(x) < l_0\}$$ for all k. Then, using (3.10) for all $$x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _{0}^k(x) \le l_0\}$$ and dividing both sides by $$\Vert \lambda ^{k+1}\Vert _{}^{}$$, we have

\begin{aligned}&\psi _{0}^k(x)/\Vert \lambda ^{k+1}\Vert _{}^{}+\big \langle \lambda ^{k+1}/ \Vert \lambda ^{k+1}\Vert _{}^{},\psi ^{k}(x)\big \rangle \nonumber \\ {}&\quad \ge \psi _{0}^k(x^{k+1})/ \Vert \lambda ^{k+1}\Vert _{}^{}+\big \langle \lambda ^{k+1}/\Vert \lambda ^{k+1}\Vert _{}^{}, \psi ^{k}(x^{k+1})\big \rangle . \end{aligned}
(3.11)

Let us take $$k\rightarrow \infty$$ on both sides of (3.11). Note that for all $$x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _0^k(x) \le l_0\}$$, we have

\begin{aligned}&\lim _{k\rightarrow \infty }\psi _{0}^k(x)/\Vert \lambda ^{{k+1}}\Vert _{}^{} =0, \quad \lim _{k\rightarrow \infty }\psi _{0}^k(x^{{k+1}})/\Vert \lambda ^{{k+1}}\Vert _{}^{} =0, \end{aligned}
(3.12)
\begin{aligned}&\lim _{k\rightarrow \infty }\psi _{i}^{k}(x) =f_{i}(\bar{x})+\langle \nabla f_{i}(\bar{x}),x-\bar{x}\rangle +\tfrac{L_{i}}{2}\Vert x-\bar{x}\Vert _{}^{2}+\chi _{i}(x),\ i\in [m], \end{aligned}
(3.13)
\begin{aligned}&\lim _{k\rightarrow \infty }\psi _{i}^{k}(x^{k+1}) =f_{i}(\bar{x})+\chi _{i}(\bar{x})=\psi _{i}(\bar{x}),\ i\in [m], \end{aligned}
(3.14)

where (3.12) is due to the boundedness of $$\psi _{0}^k(x)$$ on $${\textrm{dom}}_{\chi _{0}} \cap \{\psi _{0}^k(x) \le l_0\}$$ and the boundedness of $$\psi _{0}^k(x^{k+1})$$, since $$x^{k+1} \in X$$, which is a bounded set. Moreover, (3.13) and (3.14) use the continuity of $$f_i(x)$$ and $$\chi _i(x)$$, $$i\in [m]$$. Next, we consider the sequence $$\{u^{k}=\lambda ^{k+1}/\Vert \lambda ^{k+1}\Vert _{}^{}\}$$. Since $$\{u^{k}\}$$ is a bounded sequence, it has a convergent subsequence. Let $$\bar{u}$$ be a limit point of $$\{u^{k}\}$$ and let $$\{j_{k}\}\subseteq \{1,2,\ldots \}$$ be a subsequence such that $$\lim _{k\rightarrow \infty }u^{j_{k}}=\bar{u}$$. Since every subsequence of a convergent sequence converges to the same limit, we pass to the subsequence $$\{j_{k}\}$$ in (3.11) and apply (3.12), (3.13) and (3.14), yielding

\begin{aligned} \sum _{i=1}^{m}\bar{u}_{i}\big [\langle \nabla f_{i}(\bar{x}),x-\bar{x}\rangle +\tfrac{L_{i}}{2}\Vert x-\bar{x}\Vert _{}^{2} +\chi _{i}(x)\big ]\ge \sum _{i=1}^{m}\bar{u}_{i}\chi _{i}(\bar{x}), \end{aligned}
(3.15)

for all $$x \in {\textrm{dom}}_{\chi _{0}} \cap \{\psi _0^k(x) \le l_0\}$$. Hence, $$\bar{x}$$ minimizes $$\sum _{i=1}^{m}\bar{u}_{i}\big [\langle \nabla f_{i}(\bar{x}),x-\bar{x}\rangle +\tfrac{L_{i}}{2}\Vert x-\bar{x}\Vert _{}^{2}+\chi _{i}(x)\big ]$$ on $${\textrm{dom}}_{\chi _{0}}\cap \{\psi _0^k(x) \le l_0\}$$. Now noting $$\bar{x} \in X \subset \{\psi _0^k(x) < l_0\}$$ and using the stationarity condition for optimality of $$\bar{x}$$, we have

\begin{aligned} 0\in \sum _{i=1}^{m}\bar{u}_{i}\big [\nabla f_{i}(\bar{x})+\partial \chi _{i}(\bar{x})\big ] + {N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x}) }=\sum _{i=1}^{m}\bar{u}_{i}\partial \psi _{i}(\bar{x}) + {N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x}) }, \end{aligned}
(3.16)

where we dropped the constraint $$\psi _0^k(x) \le l_0$$ due to complementary slackness and the fact that $$\psi _{0}^k(\bar{x}) < l_0$$.

Let $${\mathcal {B}}=\{i:\bar{u}_{i}>0\}$$. Then we must have $$\lim _{k\rightarrow \infty }\lambda _{i}^{j_{k}}=\lim _{k\rightarrow \infty }\bar{u}_{i}\Vert \lambda ^{j_{k}}\Vert _{}^{}=\infty$$ for $$i\in {\mathcal {B}}$$. By complementary slackness, we have $$\psi _{i}^{j_{k}}(x^{j_{k}+1})=\eta _{i}^{j_{k}}$$ for any $$i\in {\mathcal {B}}$$ for large enough k. Due to (3.14), we obtain the limit $$\psi _{i}(\bar{x})=\eta _{i}$$. Therefore, the ith constraint is active at $$\bar{x}$$, i.e. $$i\in {\mathcal {A}}(\bar{x})$$. In view of (3.16), there exist subgradients $$v_{i}\in \partial \psi _{i}(\bar{x}), i \in [m]$$, and $$v_0 \in N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})$$ such that

\begin{aligned} 0=v_0 + \sum _{i\in {\mathcal {B}}}\bar{u}_{i}v_{i} . \end{aligned}
(3.17)

However, equation (3.17) contradicts the MFCQ assumption. This is because MFCQ guarantees the existence of $$z {\in -N^*_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})}$$ such that $$\langle z,v_{i}\rangle <0$$ for all $$i\in {\mathcal {A}}(\bar{x})$$, which implies

\begin{aligned} 0 = \langle z,v_0+\sum _{i\in {\mathcal {B}}}\bar{u}_{i}v_{i}\rangle \le \sum _{i\in {\mathcal {B}}}\bar{u}_{i}\langle z,v_{i}\rangle \le \sum _{i\in {\mathcal {B}}} \bar{u}_i \max _{v \in \partial \psi _{i}(\bar{x})}\langle z, v\rangle <0, \end{aligned}

where the first inequality follows since $$z \in -N^*_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})$$ and $$v_0 \in N_{{\textrm{dom}}_{\chi _{0}}}(\bar{x})$$, implying that $$\langle z, v_0\rangle \le 0$$, the second inequality follows since $$\bar{u}_i \ge 0$$ and $$v_i \in \partial \psi _i(\bar{x})$$, and the last strict inequality follows from uniform MFCQ (c.f. Assumption 3 and relation (2.3)) and $$\bar{u}_i > 0$$ for at least one $$i \in {\mathcal {B}}$$. Hence, $$\{\lambda ^{k+1}\}$$ is a bounded sequence, which concludes the proof of part 1).

Part 2). Without loss of generality, we assume that $$\bar{x}$$ is the only limit point. Since $$\{\lambda ^{k+1}\}$$ is a bounded sequence, there exists a limit point $$\bar{\lambda }$$. Passing to a subsequence if necessary, we have $$\lambda ^{k+1}\rightarrow \bar{\lambda }$$.

From the optimality condition $$0\in \partial _{x}{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})$$, for any x, we have

\begin{aligned}&\big \langle \nabla f_{0}(x^{k}){+}\sum _{i=1}^{m}\lambda _i^{k+1}\nabla f_{i}(x^{k}),x^{k+1}{-}x\big \rangle {+}\chi _{0}(x^{k+1}){-}\chi _{0}(x) {+}\langle \lambda ^{k+1},\chi (x^{k+1}){-}\chi (x)\big \rangle \nonumber \\&\quad \le \tfrac{L_{0}+\langle \lambda ^{k+1},L\rangle }{2}\big [\Vert x-x^{k}\Vert _{}^{2} -\Vert x^{k+1}-x^{k}\Vert _{}^{2}-\Vert x-x^{k+1}\Vert _{}^{2}\big ]. \end{aligned}
(3.18)

Let us take $$k\rightarrow \infty$$ on both sides of (3.18). We note that $$\lim _{k\rightarrow \infty }\Vert x^{k}-x^{k+1}\Vert _{}^{}=0$$ due to Proposition 2, $$\lim _{k\rightarrow \infty }{\chi _i(x^{k+1})=\chi _i(\bar{x})}$$ due to the continuity of $$\chi _{i}$$ ($$i\in [m]$$), and $$\chi _{0}(\bar{x})\le \liminf _{k\rightarrow \infty }\chi _{0}(x^{k})$$ due to the lower semi-continuity of $$\chi _{0}(\cdot )$$. It then follows that

\begin{aligned} \big \langle \nabla f_{0}(\bar{x})+\sum _{i=1}^{m}\bar{\lambda }_i\nabla f_{i}(\bar{x}),\bar{x}-x\big \rangle +\chi _{0}(\bar{x})-\chi _{0}(x) +\langle \bar{\lambda },\chi (\bar{x})-\chi (x)\big \rangle \le 0. \end{aligned}
(3.19)

In other words, $$\bar{x}$$ is the minimizer of the convex optimization problem $$\min _x\big \langle \nabla f_0(\bar{x})+\sum _{i=1}^{m}\bar{\lambda }_i\nabla f_{i}(\bar{x}),x\big \rangle +\chi _{0}(x)+\big \langle \bar{\lambda },\chi (x)-\eta \big \rangle$$ over $${\textrm{dom}}_{\chi _{0}}$$. Hence we have

\begin{aligned} 0\in \nabla f_0(\bar{x})+\partial \chi _{0}(\bar{x})+\langle \bar{\lambda },\partial \psi (\bar{x})\rangle . \end{aligned}
(3.20)

In addition, using the complementary slackness, we have

\begin{aligned} 0&=\lim _{k\rightarrow \infty }\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ]\nonumber \\&=\lim _{k\rightarrow \infty }\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [f_{i}(x^{k}){+}\langle \nabla f_{i}(x^{k}),x^{k+1}{-}x^{k}\rangle {+}\tfrac{L_{i}}{2}\Vert x^{k+1}{-}x^{k}\Vert _{}^{2}{+}\chi _{i}(x^{k+1}){-}\eta _{i}^{k}\big ]\nonumber \\&=\sum _{i=1}^{m}\bar{\lambda }_{i}\big [f_{i}(\bar{x})+\chi _{i}(\bar{x})-\eta _{i}\big ] =\langle \bar{\lambda },\psi (\bar{x})-\eta \rangle , \end{aligned}
(3.21)

due to the convergence $$\lim _{k\rightarrow \infty }\lambda ^{k+1}=\bar{\lambda }$$, $$\lim _{k\rightarrow \infty }\eta ^{k}=\eta$$, $$\lim _{k\rightarrow \infty }\chi (x^{k+1})=\chi (\bar{x})$$ and $$\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^{k}\Vert _{}^{}=0$$. Putting (3.20) and (3.21) together, we conclude that $$(\bar{x},\bar{\lambda })$$ satisfies the KKT condition. $$\square$$

### Remark 1

To show the existence of a limit point $$\bar{x}$$, we use Assumption 1.3 to ensure that the sequence $$\{x^k\}$$ remains in a bounded domain. For the sake of conciseness, we henceforth assume the existence of a limit point $$\bar{x}$$ and do not delve into the technical assumption used to ensure this condition. Moreover, it should be noted that the boundedness property can be obtained under other assumptions; e.g., assuming the compactness of the sublevel set $$\{x: \psi _{0}(x) \le \psi _{0}(x^0)\}$$ and using the sufficient descent condition (3.5), we can immediately show the existence of $$\bar{x}$$. However, it appears to be more challenging to show convergence via this approach when the sufficient descent condition fails (e.g., in the forthcoming stochastic optimization).

### 3.1 Dependence of B on the constraint qualification

In Theorem 1, we proved the existence of a bound B on the dual multiplier. However, the size of that bound still remains unknown. Through Example 1 below, we observe that the limiting behavior of the sequence $$\lambda ^k$$ (which largely governs the size of B) is closely tied to the magnitude of the number $$c(\bar{x})$$, where

\begin{aligned} c(x):= -\min _{\Vert z\Vert _{}^{} \le 1}\max _{i \in {\mathcal {A}}(x)}\max _{v \in \partial \psi _i(x)} \langle v, z\rangle . \end{aligned}

Here, the inner max follows from the relation (2.3), and the outer min seeks the best possible z that certifies MFCQ. It is observed that if $$c(\bar{x})$$ is a large positive number, then MFCQ is satisfied with a comfortable margin and B is a reasonable bound. In contrast, if $$c(\bar{x})$$ is close to 0, then B can become quite large.

### Example 1

Consider a two-dimensional optimization problem with a SCAD constraint: $$\min _{x} \psi _0(x)$$ subject to $$\psi _1(x) \le \eta _1$$, where $$\psi _0(x) = 7-x_1$$ and $$\psi _1(x) = \beta \Vert x\Vert _{1}^{} - \sum _{i =1}^2 h_{\beta , \theta } (x_i)$$. Note that

\begin{aligned} h_{\beta ,\theta }(u) = {\left\{ \begin{array}{ll} 0 &{}\hbox { if}\ |u| \le \beta ; \\ \tfrac{(|u|-\beta )^2}{2(\theta -1)} &{}\text {if } \beta \le |u| \le \beta \theta ;\\ \beta |u| - \tfrac{(\theta +1)\beta ^2}{2} &{}\text {if } |u| \ge \beta \theta \end{array}\right. }. \end{aligned}
(3.22)

This function fits our framework with the smoothness parameter $$\tfrac{1}{\theta -1}$$. Let us consider $$\beta = 1, \theta = 5$$, the level $$\eta _1 = 3$$ and the limit point $$\bar{x} = (5,0)$$. Clearly, the constraint is active at $$\bar{x}$$ and h is a $$\tfrac{1}{4}$$-smooth function. Then, $$\partial \psi (\bar{x}) = \{\begin{bmatrix}0 \\ t \end{bmatrix}: t \in [-1,1]\}$$ as per the definition, and one can see that uniform MFCQ is violated at the limit point $$\bar{x}$$. Indeed,

\begin{aligned} \max _{v \in \partial \psi (x)} \langle v, z\rangle = \max _{t \in [-1,1]} tz_2 = |z_2| \nless 0, \end{aligned}

implying $$c(\bar{x}) = 0$$. Furthermore, no $$\lambda$$ can be found satisfying the KKT condition

\begin{aligned} \begin{bmatrix} -1 \\ 0 \end{bmatrix}+ \lambda \begin{bmatrix} 0 \\ t \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \end{aligned}

for all $$t \in [-1,1]$$. Hence, as we get close to this limit point, the bound on $$\Vert \lambda ^k\Vert _{}^{}$$ will get arbitrarily large. An easy way to see this fact is to construct a subproblem at the limit point itself. After observing the feasible region of the subproblem at (5, 0), it is clear that it has only one feasible solution, (5, 0) itself, which gives rise to degeneracy. See Fig. 1 for more details. Figure 1a shows the well-behaved subproblem at an interior point while Fig. 1b shows the degeneracy at the limit.
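The degeneracy claim can be checked numerically. The sketch below is our own reconstruction, assuming the split $$f_1(x) = -\sum _i h_{\beta ,\theta }(x_i)$$, $$\chi _1(x) = \beta \Vert x\Vert _1$$ and the smoothness constant $$L_1 = 1/4$$; it evaluates the subproblem constraint of (3.1) at $$x^k = (5,0)$$ over a grid and finds that (5, 0) is its only feasible point:

```python
def psi1_k(x, xk=(5.0, 0.0)):
    """Subproblem constraint psi_1^k at x^k = (5,0) for Example 1:
    linearization of f_1(x) = -h(x_1) - h(x_2) at x^k, a prox term with
    L_1 = 1/4, plus chi_1(x) = ||x||_1 (our assumed smooth/nonsmooth split)."""
    # f_1(x^k) = -h(5) - h(0) = -2 and grad f_1(x^k) = (-h'(5), -h'(0)) = (-1, 0)
    lin = -2.0 - (x[0] - xk[0])
    prox = (0.25 / 2.0) * ((x[0] - xk[0]) ** 2 + (x[1] - xk[1]) ** 2)
    return lin + prox + abs(x[0]) + abs(x[1])

# Scan a grid around the limit point; eta_1 = 3 is the constraint level.
feasible = [(i * 0.1, j * 0.1)
            for i in range(-80, 81) for j in range(-80, 81)
            if psi1_k((i * 0.1, j * 0.1)) <= 3.0 + 1e-12]
```

On this grid only the point (5, 0) survives, consistent with the degeneracy shown in Fig. 1b.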

However, as we change the level $$\eta _1$$ to any value either above or below 3, we do not get any violation of MFCQ. It also gives nondegenerate feasible sets at the limit point, and $$\lambda ^k$$ remains bounded for all k. See Fig. 2 below for more details. In particular, if $$\eta _1 = 2.5 < 3$$, then $$\bar{x} = (3,0)$$ is the limit point. At this point, we have $$\partial \psi (\bar{x}) = \{\begin{bmatrix}0.5 \\ t \end{bmatrix}: t \in [-1,1]\}$$. Moreover, we have

\begin{aligned} \max _{v \in \partial \psi (\bar{x})} \langle v, z\rangle = 0.5z_1 + \max _{t \in [-1,1]}tz_2 = 0.5z_1 + |z_2|. \end{aligned}

Choosing the unit vector $$z = (z_1, z_2) = (-1,0)$$, we obtain that the point $$\bar{x}$$ satisfies MFCQ with $$c(\bar{x}) = 0.5$$. Hence, even when the search point reaches the limit point $$\bar{x}$$ (i.e., $$\epsilon \rightarrow 0$$), a bounded multiplier $$\lambda ^k$$ still exists. (See, in particular, Fig. 2b, whose subproblem at $$\bar{x}$$ has a nonempty interior.)
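A quick numerical sketch (function names are ours) reproduces the quantities used in this example: the constraint values at the two limit points, and the corresponding values of $$c(\bar{x})$$ obtained by a grid search over the unit circle:

```python
import math

def h(u, beta=1.0, theta=5.0):
    """SCAD-type term h_{beta,theta} from (3.22)."""
    a = abs(u)
    if a <= beta:
        return 0.0
    if a <= beta * theta:
        return (a - beta) ** 2 / (2.0 * (theta - 1.0))
    return beta * a - (theta + 1.0) * beta ** 2 / 2.0

def psi1(x, beta=1.0):
    """psi_1(x) = beta * ||x||_1 - sum_i h(x_i)."""
    return sum(beta * abs(xi) - h(xi) for xi in x)

def c_val(g1, steps=3600):
    """c(xbar) = -min_{||z|| <= 1} [g1 * z1 + |z2|] when the subdifferential
    is {(g1, t) : t in [-1, 1]}; for these instances a minimizer lies on the
    unit circle, which we discretize."""
    vals = (g1 * math.cos(2 * math.pi * k / steps)
            + abs(math.sin(2 * math.pi * k / steps)) for k in range(steps))
    return -min(vals)
```

For $$\eta _1 = 3$$ the constraint is active at (5, 0) with $$c = 0$$, while for $$\eta _1 = 2.5$$ it is active at (3, 0) with $$c = 0.5$$, matching the discussion above.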

In view of the above example, we see that the limiting behavior of $$\Vert \lambda ^k\Vert _{}^{}$$ (and by implication the order of B) is closely related to the “strength” of the constraint qualification MFCQ at the limit point. In order to get an a priori bound B, we use a somewhat stronger yet verifiable constraint qualification called strong feasibility, which is a slight modification of the CQ proposed in [4, Assumption 3].

### Assumption 4

[Strong feasibility CQ] There exists $$\hat{x} \in {\mathcal {X}}:= \bigcap _{i\in [m]}\{x:\psi _i(x) \le \eta _i\} \cap {\textrm{dom}}_{\chi _{0}}$$ such that

\begin{aligned} \psi _i(\hat{x}) \le \eta ^0_i - 2L_iD_{\mathcal {X}}^2,\quad i \in [m], \end{aligned}
(3.23)

where $$D_{\mathcal {X}}:= \max _{x_1, x_2 \in {\mathcal {X}}} \Vert x_1-x_2\Vert _{}^{}$$ is the diameter of the set $${\mathcal {X}}$$.

In view of Assumption 1.3, we note that $${\mathcal {X}}$$ is a bounded set. Hence, $$D_{\mathcal {X}}$$ and (consequently) Assumption 4 are well-defined. See [4] for a connection between Assumption 4 and Assumption 3. Below, we show that strong feasibility CQ leads to a fixed apriori bound on $$\lambda ^k$$.

### Lemma 2

Suppose Assumption 4 is satisfied. Then, $$\Vert \lambda ^k\Vert _{1}^{} \le B:= \tfrac{\psi _0(\hat{x}) - \psi _0^* + L_0D_{\mathcal {X}}^2}{L_{\min }D_{\mathcal {X}}^2}$$ for all $$k \ge 1$$.

### Proof

Note that

\begin{aligned} \psi _i^k(\hat{x})&= f_i(x^k) + \langle \nabla f_i(x^k), \hat{x} - x^k\rangle + \tfrac{L_i}{2} \Vert \hat{x} - x^k\Vert _{}^{2} + \chi _i(\hat{x}) \nonumber \\&\le f_i(\hat{x}) + \chi _i(\hat{x}) + L_i\Vert \hat{x} -x^k\Vert _{}^{2} \nonumber \\&= \psi _i(\hat{x}) + L_i\Vert \hat{x} -x^k\Vert _{}^{2} \le \eta ^0_i-L_iD_{\mathcal {X}}^2 < \eta ^k_i - L_iD_{\mathcal {X}}^2, \end{aligned}
(3.24)

where the first inequality uses $$f_i(\hat{x}) \ge f_i(x^k) + \langle \nabla f_i(x^k), \hat{x} - x^k\rangle - \tfrac{L_i}{2} \Vert \hat{x} -x^k\Vert _{}^{2}$$ (which follows from the $$L_i$$-smoothness of $$f_i$$), and the second inequality follows from Assumption 4 along with the fact that $$x^k \in {\mathcal {X}}$$ (see Proposition 2).

In view of (3.24), we have strict feasibility of subproblem (3.1) for all k implying that $$\lambda ^{k+1}$$ exists. Hence, we have $$x^{k+1} = \mathop {\text {argmin}}\limits _{x} \psi ^k_0(x) + \langle \lambda ^{k+1}, \psi ^k(x)\rangle$$. Then, for all $$x \in {\textrm{dom}}_{\chi _{0}}$$, we have

\begin{aligned} \psi ^k_0(x^{k+1})&= \psi ^k_0(x^{k+1}) + \langle \lambda ^{k+1}, \psi ^k(x^{k+1}) -\eta ^k\rangle \\&\le \psi ^k_0(x) + \langle \lambda ^{k+1}, \psi ^k(x) - \eta ^k\rangle \end{aligned}

where the equality follows from the complementary slackness of the KKT condition, and the inequality is due to the optimality of $$x^{k+1}$$. Using $$x = \hat{x}$$ in the above relation and combining with (3.24), we obtain

\begin{aligned} \psi ^k_0(\hat{x}) - \psi ^k_0(x^{k+1}) \ge \langle \lambda ^{k+1}, \eta ^k-\psi ^k(\hat{x})\rangle \ge \langle \lambda ^{k+1}, L\rangle D_{\mathcal {X}}^2 \ge \Vert \lambda ^{k+1}\Vert _{1}^{}L_{\min }D_{\mathcal {X}}^2 \end{aligned}
(3.25)

Finally, note that

\begin{aligned} \psi ^k_0(\hat{x}) - \psi ^k_0(x^{k+1}) \le \psi _0(\hat{x}) + L_0\Vert \hat{x} -x^k\Vert _{}^{2} -\psi _0(x^{k+1}) \le \psi _0(\hat{x}) - \psi _0^* + L_0D_{\mathcal {X}}^2 \end{aligned}

where the first inequality follows from (3.24) for $$i = 0$$ and $$\psi ^k_0(x) \ge \psi _0(x)$$, and the second inequality follows from the definitions of $$\psi _0^*$$ and $$D_{\mathcal {X}}$$. Combining the above relation with (3.25), we get the result and conclude the proof. $$\square$$

The discussion above implies that the value of B is intricately related to the constraint qualification. While uniform MFCQ is unverifiable and does not allow for a priori bounds on B, it is widely used in nonlinear programming to ensure the existence of such a bound [2]. As observed in Fig. 1b and Fig. 2b, the actual value of B depends on how close the limit point is to violating MFCQ. This situation is rare, but the current assumptions do not eliminate that possibility. Problems of this nature are ill-conditioned, and to our knowledge, no algorithm can ensure bounds on the dual in such a situation. The existing literature deals with this issue in two ways: one track assumes the existence of B (similar to Theorem 1) and performs the complexity or convergence analysis; a second track assumes a stronger constraint qualification that removes the ill-conditioned problems and shows a more explicit bound on the dual (similar to Lemma 2). We perform our analysis for both cases. To conclude, we henceforth assume that the bound B is a constant and do not delve into the discussion on related constraint qualifications. To substantiate that the bound B is small, we perform detailed numerical experiments in Sect. 8.

### 3.2 Convergence rate analysis of LCPG method

Our main goal now is to develop some non-asymptotic convergence rates for Algorithm 1.

### Lemma 3

In Algorithm 1, for $$k=1,2,\dots$$, we have

\begin{aligned} \Vert \partial _{x}{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{}&\le 2\big (L_{0}+\langle \lambda ^{k+1},L\rangle \big )\Vert x^{k+1}-x^{k}\Vert _{}^{}. \end{aligned}
(3.26)

### Proof

Let $${\mathcal {L}}_{k}$$ be the Lagrangian function of subproblem (3.1):

\begin{aligned} {\mathcal {L}}_{k}(x,\lambda )&=\psi _{0}^{k}(x)+\sum _{i=1}^{m}\lambda _{i}[\psi _{i}^{k}(x)-\eta _{i}^{k}]. \end{aligned}
(3.27)

Using (2.1) and (3.27), we have

\begin{aligned}&\partial _x{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})\nonumber \\&\quad = \nabla f_{0}(x^{k})+L_{0}(x^{k+1}-x^{k})\nonumber \\&\qquad +\partial \chi _{0}(x^{k+1})+\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [\nabla f_{i}(x^{k})+L_{i}(x^{k+1}-x^{k})+\partial _x\chi _{i}(x^{k+1})\big ]\nonumber \\&\quad = \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})+\nabla f_{0}(x^{k})-\nabla f_{0}(x^{k+1})\nonumber \\&\qquad +\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [\nabla f_{i}(x^{k})-\nabla f_{i}(x^{k+1})\big ]\nonumber \\&\qquad +\big (L_{0}+\langle \lambda ^{k+1},L\rangle \big )\big (x^{k+1}-x^{k}\big ). \end{aligned}
(3.28)

Using the smoothness of $$f_i(x)$$, the optimality condition $$0\in \partial _x{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})$$ and the triangle inequality, we obtain

\begin{aligned} \Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{}&\le 2L_{0}\Vert x^{k+1}-x^{k}\Vert _{}^{}+2\langle \lambda ^{k+1},L\rangle \Vert x^{k+1}-x^{k}\Vert _{}^{}. \end{aligned}
(3.29)

Hence we conclude the proof. $$\square$$

In view of Lemma 3, we derive the complexity of LCPG to attain approximate KKT solutions in the following theorem.

### Theorem 2

Let $$\alpha _k>0$$ $$(k=0,1,\dots ,K)$$ be a non-decreasing sequence and suppose that Assumption 3 holds. Then there exists a constant $$B>0$$ such that

\begin{aligned} \sum _{k=0}^{K}\alpha _k\Vert \partial _{x}{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}&\le {8(L_{0}+B\Vert L\Vert _{}^{})^{2}} D^2\alpha _K, \end{aligned}
(3.30)
\begin{aligned} \sum _{k=0}^{K} \alpha _k \langle \lambda ^{k+1}, \vert \psi (x^{k+1})-\eta \vert \rangle&\le 2B\Vert L\Vert _{}^{}D^2\alpha _K + B \sum _{k=0}^K\alpha _k\Vert \eta -\eta ^k\Vert _{}^{}, \end{aligned}
(3.31)

where $$D=\sqrt{\tfrac{\psi _{0}(x^{0})-\psi _{0}^{*}}{L_{0}}}$$. Moreover, if we choose the index $$\hat{k}\in \{0,1,...,K\}$$ with probability $$\mathbb {P}(\hat{k}=k)=\alpha _k/(\sum _{k=0}^K\alpha _k)$$, then $$x^{\hat{k}+1}$$ is a randomized $$\epsilon _K$$ type-I KKT point with

\begin{aligned} \epsilon _K=\tfrac{1}{\sum _{k=0}^K\alpha _k} \max \big \{ {8(L_{0}+B\Vert L\Vert _{}^{})^{2}}D^2\alpha _K,\ 2B\Vert L\Vert _{}^{} D^2 \alpha _K + B \sum _{k=0}^K\alpha _k\Vert \eta -\eta ^k\Vert _{}^{}\big \} \end{aligned}
(3.32)

### Proof

From the sufficient descent property (3.5), we have

\begin{aligned} \sum _{k=0}^K {\alpha _k}\Vert x^k-x^{k+1}\Vert _{}^{2}&\le \tfrac{2}{L_0} \sum _{k=0}^K \alpha _k \big [\psi _0(x^k)-\psi _0(x^{k+1})\big ] \nonumber \\&= \tfrac{2}{L_0}\big [\alpha _0\psi _0(x^0) + \sum _{k=1}^{K}(\alpha _k-\alpha _{k-1})\psi _0(x^k) - \alpha _K\psi _0(x^{K+1}) \big ] \nonumber \\&\le \tfrac{2\alpha _K}{L_0} \big [\psi _0(x^0) - \psi _0(x^{K+1})\big ] \nonumber \\&\le \tfrac{2\alpha _K}{L_0} \big [\psi _0(x^0) - \psi _0(x^{*})\big ] = 2\alpha _K D^2, \end{aligned}
(3.33)

where the second inequality uses the monotonicity of the sequence $$\psi _0(x^k)$$. In view of Theorem 1 and the Cauchy-Schwarz inequality, we have $$\langle \lambda ^{k+1},L\rangle \le \Vert \lambda ^{k+1}\Vert _{}^{}\Vert L\Vert _{}^{} \le B\Vert L\Vert _{}^{}$$. This relation and (3.26) imply

\begin{aligned} \Vert \partial _{x}{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}\le 4{(L_{0}+B\Vert L\Vert _{}^{})^{2}} \Vert x^k-x^{k+1}\Vert _{}^{2}. \end{aligned}

Combining the above inequality with (3.33) immediately yields (3.30).
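The summation-by-parts step in (3.33) relies only on the monotonicity of $$\psi _0(x^k)$$ and the non-decreasing weights; it can be sanity-checked numerically on synthetic sequences (all values below are arbitrary):

```python
import random

random.seed(0)
K = 50

# A nonincreasing "objective" sequence psi_0(x^0), ..., psi_0(x^{K+1})
psi = [10.0]
for _ in range(K + 1):
    psi.append(psi[-1] - random.random())

# Non-decreasing positive weights alpha_0, ..., alpha_K (here alpha_k = k + 1)
alpha = [k + 1.0 for k in range(K + 1)]

# Weighted telescoping sum of (3.33) (up to the 2/L_0 factor) and its bound
weighted = sum(alpha[k] * (psi[k] - psi[k + 1]) for k in range(K + 1))
bound = alpha[K] * (psi[0] - psi[K + 1])
```

Any such pair of sequences satisfies `weighted <= bound`, which is exactly the second inequality in (3.33).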

Next, we bound the error of complementary slackness. We have

\begin{aligned}&\sum _{i=1}^m\lambda _{i}^{k+1}\vert \psi _{i}(x^{k+1})-\eta _{i}\vert \nonumber \\&\quad =\sum _{i=1}^m \lambda _{i}^{k+1}\big |\psi _i^k(x^{k+1})-\eta _i^k -( \eta _i-\eta _i^k) + \psi _{i}(x^{k+1})-\psi _{i}^k(x^{k+1})\big |\nonumber \\&\quad \le \sum _{i=1}^m\big [\lambda _{i}^{k+1}\big |\psi _{i}^{k}(x^{k+1}) -\eta _{i}^{k}\big |+\lambda _{i}^{k+1}(\eta _{i}-\eta _{i}^{k})\nonumber \\&\qquad +\lambda _{i}^{k+1}\big \vert f_{i}(x^{k+1})-f_{i}(x^{k})-\langle \nabla f_{i}(x^{k}),x^{k+1}-x^{k}\rangle -\tfrac{L_{i}}{2}\Vert x^{k+1}-x^{k}\Vert _{}^{2}\big \vert \big ]\nonumber \\&\quad \le \sum _{i=1}^m\lambda _{i}^{k+1}(\eta _{i}-\eta _{i}^{k})+\lambda _{i}^{k+1}{L}_{i}\Vert x^{k+1}-x^{k}\Vert _{}^{2}\nonumber \\&\quad \le B \Vert \eta -\eta ^k\Vert _{}^{} + B\Vert L\Vert _{}^{} \Vert x^{k+1}-x^k\Vert _{}^{2} \end{aligned}
(3.34)

where the first inequality uses the triangle inequality, the second inequality uses complementary slackness and the Lipschitz smoothness of $$f_{i}(\cdot )$$, and the last inequality follows from the Cauchy-Schwarz inequality and the boundedness of $$\lambda ^{k+1}$$. Summing up (3.34) weighted by $$\alpha _k$$ for $$k=0,\dots ,K$$, we have

\begin{aligned} \sum _{k=0}^{K}\alpha _k \langle \lambda ^{k+1}, \vert \psi (x^{k+1})-\eta \vert \rangle&\le \sum _{k=0}^{K} \alpha _k \big [{B\Vert L\Vert _{}^{}}\Vert x^{k+1}-x^k\Vert _{}^{2} + B \Vert \eta -\eta ^k\Vert _{}^{} \big ]. \end{aligned}

Combining the above result with (3.33) gives (3.31). Finally, the fact that $$x^{\hat{k}+1}$$ is a randomized $$\epsilon _K$$ type-I KKT point for $$\epsilon _K$$ defined in (3.32) follows immediately from (3.30), (3.31) and Definition 3. $$\square$$

The following corollary shows that the output of Algorithm 1 is a randomized $${\mathcal {O}}(1/K)$$ KKT point under more specific parameter selection.

### Corollary 1

In Algorithm 1, suppose that all the assumptions of Theorem 2 hold. Set $$\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}$$ and $$\alpha _k=k+1$$. Then $$x^{\hat{k}+1}$$ is a randomized $$\epsilon$$ type-I KKT point with

\begin{aligned} \epsilon = \tfrac{2}{K+2}\max \big \{8(L_{0}+B\Vert L\Vert _{}^{})^{2}D^2, 2B\Vert L\Vert _{}^{} D^2 + B \Vert \eta -\eta ^0\Vert _{}^{}\big \} \end{aligned}
(3.35)

### Proof

Notice that $$\alpha _K=K+1$$, $$\sum _{k=0}^K \alpha _k = \tfrac{(K+1)(K+2)}{2}$$. Moreover, for $$i\in [m]$$ and $$k\ge 0$$, we have

\begin{aligned} \eta _{i}^{k}=\eta _{i}^{0}+\sum _{s=0}^{k-1}\delta _{i}^{s}=\eta _{i}^{0} +(\eta _{i}-\eta _{i}^{0})\sum _{s=0}^{k-1}\tfrac{1}{(s+1)(s+2)} =\tfrac{k}{k+1}\eta _{i}+\tfrac{1}{k+1}\eta _{i}^{0}, \end{aligned}

which implies that $$\sum _{k=0}^{K}\alpha _k\Vert \eta -\eta ^{k}\Vert _{}^{}=(K+1) \Vert \eta -\eta ^0\Vert _{}^{}$$. Plugging these values into (3.32) gives us the desired conclusion. $$\square$$
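The closed form of $$\eta ^k$$ and the collapse of the weighted residual sum can be verified numerically; the scalar values of $$\eta$$ and $$\eta ^0$$ below are illustrative:

```python
K = 20
eta, eta0 = 4.0, 1.0  # illustrative scalar levels (one constraint)

# Accumulate eta^{k+1} = eta^k + delta^k with delta^k = (eta - eta0)/((k+1)(k+2))
etak = [eta0]
for k in range(K):
    etak.append(etak[-1] + (eta - eta0) / ((k + 1) * (k + 2)))

# Closed form from the proof: eta^k = k/(k+1) * eta + 1/(k+1) * eta0
closed = [k / (k + 1) * eta + eta0 / (k + 1) for k in range(K + 1)]

# Weighted residuals with alpha_k = k + 1 collapse to (K+1) * |eta - eta0|
total = sum((k + 1) * abs(eta - etak[k]) for k in range(K + 1))
```

Both identities hold to floating-point accuracy, confirming the telescoping computation in the proof.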

### Remark 2

Corollary 1 shows that the gradient complexity of LCPG for smooth composite constrained problems is on a par with that of gradient descent for unconstrained optimization problems. To the best of our knowledge, this is the first complexity result for a constrained problem where the constraint functions can be nonsmooth and nonconvex. Note that the convergence rate involves the unknown bound B on the Lagrangian multipliers. The presence of such a constant is not new in the nonlinear programming literature [1, 8, 9, 12, 13]. Fortunately, we can safely implement the LCPG method since the step-size scheme does not rely on B. On the other hand, the bound B is often a problem-dependent quantity. For example, the authors of [4] show a class of problems for which an a priori bound B can be established, and [5] gives the exact value of B for a class of nonconvex relaxations of sparse optimization problems. In such cases, our comparisons are arguably fair. Hence, throughout the paper, we make comparative statements under the assumption that B largely depends on the problem.

## 4 Stochastic optimization

The goal of this section is to extend our proposed framework to stochastic constrained optimization where the objective $$f_0$$ is an expectation function:

\begin{aligned} f_0(x):=\mathbb {E}_{\xi \in \varXi } [F(x,\xi )]. \end{aligned}
(4.1)

Here, $$F(x,\xi )$$ is differentiable and $$\xi$$ denotes a random variable supported on $$\varXi$$. Directly evaluating either the objective $$f_0$$ or its gradient can be computationally challenging due to the stochastic nature of the problem. To address this, we introduce the following additional assumptions.

### Assumption 5

The information of $$f_{0}$$ is available via a stochastic first-order oracle (SFO). Given any input x and a random sample $$\xi$$, SFO outputs a stochastic gradient $$\nabla F(x,\xi )$$ such that

\begin{aligned} \mathbb {E}\big [\nabla F(x,\xi )\big ] =\nabla f_0(x)\quad \text {and}\quad \mathbb {E}\big [\Vert \nabla F(x,\xi )-\nabla f_0(x)\Vert ^{2}\big ]\le \sigma ^{2}, \end{aligned}

for some $$\sigma \in (0,\infty )$$.

### 4.1 Level constrained stochastic proximal gradient

In Algorithm 2, we present a stochastic variant of LCPG for solving problem (1.1) with $$f_0$$ defined by (4.1). As observed in (4.2) and (4.3), the LCSPG method uses a mini-batch of random samples to estimate the true gradient in each iteration. It should be noted that the value $$f_0(x^k)$$ appears in (4.3) only for ease of description; it is not required when solving (3.1).

Note that the proximal point method of [4] does not need to account for a stochastic nonconvex problem separately, since it solves the corresponding stochastic convex subproblems using the ConEx method developed in that work. In contrast, LCSPG applies directly to stochastic nonconvex function constrained problems, and its convex subproblems are deterministic in nature. Hence, we need to develop the asymptotic convergence and convergence rates for the LCSPG method separately.

Let $$\zeta ^{k}=G^{k}-\nabla f_0(x^{k})$$ denote the error of the gradient estimate. We have

\begin{aligned} \mathbb {E}\big [\Vert \zeta ^{k}\Vert ^{2}\big ]=\tfrac{1}{b_{k}^{2}}\sum _{i=1}^{b_{k}}\mathbb {E}_{\xi }\big [\Vert \nabla F(x^{k},\xi _{i,k})-\nabla f_0(x^{k})\Vert ^{2}\big ]\le \tfrac{\sigma ^{2}}{b_{k}}. \end{aligned}
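The variance-reduction identity above is straightforward to check by simulation; the Gaussian noise model and the specific constants below are our own illustrative choices:

```python
import random
import statistics

random.seed(1)
sigma = 2.0       # per-sample noise level, as in Assumption 5
b = 16            # mini-batch size b_k
trials = 20000    # Monte Carlo repetitions

# zeta^k averages b i.i.d. zero-mean errors, so E[||zeta^k||^2] = sigma^2 / b
sq_errors = [
    (sum(random.gauss(0.0, sigma) for _ in range(b)) / b) ** 2
    for _ in range(trials)
]
emp = statistics.fmean(sq_errors)  # should be close to sigma**2 / b
```

The empirical mean squared error concentrates around $$\sigma ^2/b_k$$, i.e., averaging the mini-batch shrinks the gradient noise by the batch size.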

The following proposition summarizes some important properties of the generated solutions of LCSPG.

### Proposition 3

In Algorithm 2, for any $$\beta _{k}\in (0,2\gamma _{k}-L_{0})$$, we have

\begin{aligned} \psi _{0}(x^{k+1})\le \psi _{0}(x^{k})-\tfrac{2\gamma _{k}-\beta _{k}-L_{0}}{2} \Vert x^{k+1}-x^{k}\Vert ^{2}+\tfrac{\Vert \zeta ^{k}\Vert ^{2}}{2\beta _{k}}. \end{aligned}
(4.4)

Moreover, there exists a vector $$\lambda ^{k+1}\in \mathbb {R}^m_+$$ such that the KKT condition (3.6) (with $$\psi _0^k$$ defined in (4.3)) holds.

### Proof

By the KKT condition, $$x^{k+1}$$ is the minimizer of $${\mathcal {L}}_{k}(\cdot ,\lambda ^{k+1})$$. Therefore, we have

\begin{aligned} {\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1})+\tfrac{\gamma _{k}+\langle \lambda ^{k+1},L\rangle }{2}\Vert x^{k+1}-x\Vert ^{2}\le {\mathcal {L}}_{k}(x,\lambda ^{k+1}),\quad \forall x\in {\mathcal {X}}. \end{aligned}
(4.5)

Setting $$x=x^{k}$$ in (4.5) and using (3.27), we have

\begin{aligned}&\langle G^{k}, x^{k+1}-x^{k}\rangle + \chi _{0}(x^{k+1})-\chi _{0}(x^{k}) \nonumber \\ \le {}&\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k})-\eta _{i}^{k}\big ] -\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ] -\gamma _{k}\Vert x^{k+1}-x^{k}\Vert ^{2}\nonumber \\ \le {}&-\gamma _{k}\Vert x^{k+1}-x^{k}\Vert ^{2}, \end{aligned}
(4.6)

where the second inequality is due to the complementary slackness $$\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k+1})-\eta _{i}^{k}\big ]=0$$ and the fact that $$\lambda _{i}^{k+1}\big [\psi _{i}^{k}(x^{k})-\eta _{i}^{k}\big ]=\lambda _{i}^{k+1}\big [\psi _{i}(x^{k})-\eta _{i}^{k}\big ]\le 0,$$ which follows from $$\lambda _{i}^{k+1}\ge 0$$ and the strict feasibility $$\psi _{i}(x^{k})<\eta _{i}^{k}$$. Using (4.6) and the Lipschitz smoothness of $$f_0$$, we have

\begin{aligned} \psi _{0}(x^{k+1})&\le f_{0}(x^{k})+\langle \nabla f_{0}(x^{k}),x^{k+1}-x^{k}\rangle +\tfrac{L_{0}}{2}\big \Vert x^{k+1}-x^{k}\big \Vert ^{2}+\chi _{0}(x^{k+1})\nonumber \\&= f_{0}(x^{k}){+}\chi _{0}(x^{k+1}){+}\langle G^{k},x^{k+1}{-}x^{k}\rangle {+}\tfrac{L_{0}}{2}\big \Vert x^{k+1}{-}x^{k}\big \Vert ^{2}{-}\langle \zeta ^{k},x^{k+1}{-}x^{k}\rangle \nonumber \\&\le \psi _{0}(x^{k})-\tfrac{2\gamma _{k}-L_{0}}{2}\Vert x^{k+1} -x^{k}\Vert ^{2}-\langle \zeta ^{k},x^{k+1}-x^{k}\rangle \nonumber \\&\le \psi _{0}(x^{k}){-}\tfrac{2\gamma _{k}{-}\beta _k{-}L_{0}}{2}\Vert x^{k+1} {-}x^{k}\Vert ^{2}{+}\Vert \zeta ^{k}\Vert \cdot \Vert x^{k+1}{-}x^{k}\Vert {-}\tfrac{\beta _k}{2}\Vert x^{k+1}{-}x^{k}\Vert ^{2}\nonumber \\&\le \psi _{0}(x^{k})-\tfrac{2\gamma _{k}-\beta _{k}-L_{0}}{2}\Vert x^{k+1} -x^{k}\Vert ^{2}+\tfrac{\Vert \zeta ^{k}\Vert ^{2}}{2\beta _{k}}. \end{aligned}
(4.7)

Above, the last inequality uses the fact that $$-\tfrac{a}{2} x^2+bx\le \tfrac{b^2}{2a}$$ for any $$x,b\in \mathbb {R}, a>0$$. Showing the existence of the KKT condition follows an argument similar to that of part 2 of Proposition 2. $$\square$$

We prove a technical result in the following lemma which plays a crucial role in proving dual boundedness.

### Lemma 4

Let $$\{X_k\}_{k \ge 1}$$ be a sequence of random vectors such that $$\mathbb {E}[X_k] = {0}$$ for all $$k \ge 1$$ and $$\sum _{k=1}^\infty \sigma _k^2 \le M < \infty$$ where $$\sigma _k:= \sqrt{\mathbb {E}[\Vert X_k\Vert _{}^{2}]}$$. Then, $$\lim _{k\rightarrow \infty } X_k = 0$$ almost surely (a.s.).

### Proof

We prove this result by contradiction. If the result does not hold, then there exist $$\epsilon >0$$ and $$c > 0$$ such that

\begin{aligned} \mathbb {P}\Big (\limsup _{k} \Vert X_k\Vert _{}^{} \ge \epsilon \Big ) \ge c. \end{aligned}
(4.8)

However, due to Chebyshev’s inequality, we have $$\mathbb {P}(\Vert X_k\Vert _{}^{} \ge \epsilon ) \le \tfrac{\sigma _k^2}{\epsilon ^2}$$. Since $$\sigma _k^2$$ is summable, there exists $$k_0$$ such that $$\sum _{k =k_0}^\infty \mathbb {P}(\Vert X_k\Vert _{}^{} \ge \epsilon ) \le \sum _{k =k_0}^\infty \tfrac{\sigma _k^2}{\epsilon ^2} < c$$. Therefore, we have

\begin{aligned} \mathbb {P}\Big (\limsup _{k}\Vert X_k\Vert _{}^{} \ge \epsilon \Big )&= \mathbb {P}\Big (\limsup _{k\ge k_0} \Vert X_k\Vert _{}^{} \ge \epsilon \Big ) \le \sum _{k =k_0}^\infty \mathbb {P}(\Vert X_k\Vert _{}^{} \ge \epsilon )<c. \end{aligned}

The above relation contradicts (4.8). Hence, we have $$\lim _{k\rightarrow \infty } X_k = 0$$ a.s. $$\square$$
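Lemma 4 can be illustrated on a sample path: with $$\sigma _k = 1/k$$ the variances are summable, and the tail of the sequence collapses to zero. The Gaussian model and thresholds below are illustrative choices of ours:

```python
import random

random.seed(0)

# X_k ~ N(0, sigma_k^2) with sigma_k = 1/k; sum_k sigma_k^2 = pi^2/6 < infinity,
# so Lemma 4 predicts X_k -> 0 almost surely.
xs = [random.gauss(0.0, 1.0 / k) for k in range(1, 5001)]

sup_early = max(abs(x) for x in xs[100:1000])  # indices k in [101, 1000]
sup_late = max(abs(x) for x in xs[4000:])      # indices k in [4001, 5000]
```

On any typical path the late-tail supremum is far smaller than the early one, mirroring the almost-sure convergence asserted by the lemma.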

In the following theorem, we present the main asymptotic property of LCSPG.

### Theorem 3

Suppose that $$\sum _{k=0}^\infty b_k^{-1}<\infty$$. Then we have that $$\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^k\Vert _{}^{}=0$$ a.s. Moreover, suppose that Assumption 3 holds, $$\gamma _k<\infty$$, $$\beta _k$$ is lower bounded and $$2\gamma _k-\beta _k-L_0>0$$. Then we have that (1) $$\sup _{k} \Vert \lambda ^k\Vert _{}^{}<\infty$$ a.s., and (2) all the limit points of Algorithm 2 satisfy the KKT condition, a.s.

### Proof

First, we fix notations. Let $$(\varOmega , {\mathcal {F}}, \mathbb {P})$$ be the probability space defined over the sampling of mini-batches $$B_0,B_1,\ldots$$. Let $$\mathbb {E}_k[\cdot ]$$ be the expectation conditioned on the sub-$$\sigma$$-algebra generated by $$B_0,B_1,\ldots , B_{k-1}$$. Applying it to (4.4) gives

\begin{aligned} \mathbb {E}_k[\psi _0(x^{k+1})] \le \psi _0(x^k) + \tfrac{\sigma ^2}{2b_k\beta _k}. \end{aligned}

In view of the super-martingale convergence theorem [30], we have that

\begin{aligned} \lim _{k\rightarrow \infty }\psi _0(x^k) \ \text {exists and is finite a.s., when } \sum _{k=0}^\infty (b_k\beta _k)^{-1}<\infty . \end{aligned}
(4.9)

Let $$C_{k+1}=\sum _{s=0}^k\tfrac{2\gamma _s-\beta _s-L_0}{2}\Vert x^{s+1}-x^s\Vert _{}^{2}$$ for $$k\ge 0$$ and $$C_0=0$$. We have

\begin{aligned} \mathbb {E}_k[\psi _0(x^{k+1})+C_{k+1}] \le \psi _0(x^k) +C_k + \tfrac{\sigma ^2}{2b_k\beta _k}. \end{aligned}

Applying the super-martingale convergence theorem [30] again, we can show that the limit of $$\psi _0(x^k)+C_k$$ exists a.s. Together with (4.9) and the lower-boundedness of $$\beta _k$$ and $$2\gamma _k-\beta _k-L_0$$, we have that $$\lim _{k\rightarrow \infty }\Vert x^{k+1}-x^k\Vert _{}^{2}=0$$, a.s.

Next, we prove the boundedness of $$\Vert \lambda ^k\Vert _{}^{}$$. Let us consider the events

\begin{aligned} {\mathcal {U}}&=\Big \{\omega \in \varOmega : \sup _k \Vert \lambda ^k(\omega )\Vert _{}^{}=\infty \Big \}, \ {\mathcal {A}}=\Big \{\omega \in \varOmega : \sup _k\Vert G^{k}(\omega )\Vert _{}^{}<\infty \Big \},\\ {\mathcal {B}}&=\Big \{\omega \in \varOmega : \lim _k \Vert x^{k+1}(\omega )-x^k(\omega )\Vert _{}^{}=0\Big \}. \end{aligned}

We just argued $$\mathbb {P}({\mathcal {B}})=1$$. It is easy to see that if both conditions (i) $$\mathbb {P}({\mathcal {A}})=1$$ and (ii) $${\mathcal {U}}\subseteq {\mathcal {A}}^c\cup {\mathcal {B}}^c$$ hold, then we have $$\mathbb {P}({\mathcal {U}})\le \mathbb {P}({\mathcal {A}}^c)+\mathbb {P}({\mathcal {B}}^c)=0$$. Hence $$\{\lambda ^k\}$$ is a bounded sequence a.s.

Since $$\{b_k^{-1}\}$$ is summable, we have $$\sum _{k=1}^\infty \mathbb {E}[\Vert \zeta ^k\Vert _{}^{2}] \le \sum _{k=1}^\infty \tfrac{\sigma ^2}{b_k} < \infty$$. Hence, using Lemma 4, we have that $$\lim _{k\rightarrow \infty } \zeta ^k = 0$$ a.s. Due to the boundedness of $$\nabla f(x^k)$$, we have that $$G^k=\zeta ^k+\nabla f(x^k)$$ is bounded a.s., which establishes condition (i).

We prove condition (ii) by contradiction. Suppose that our claim fails. We take an element $$\omega \in {\mathcal {U}}\cap ({\mathcal {A}}\cap {\mathcal {B}})$$ and then pass to a subsequence $$\{j_k\}$$ such that $$\lim _{k\rightarrow \infty } \Vert \lambda ^{j_k}(\omega )\Vert _{}^{}=\infty$$. In the rest of the proof, we skip $$\omega$$ for brevity. Passing to another subsequence if necessary, let $$\bar{x}$$ be a limit point of $$\{x^{j_k}\}$$. By our presumption, $$\bar{x}$$ satisfies MFCQ. Moreover, the KKT condition implies that

\begin{aligned}&\langle G^{j_k},x^{j_k+1}\rangle +\chi _0(x^{j_k+1})+\tfrac{\gamma _{j_k}}{2}\Vert x^{j_k+1}-x^{j_k}\Vert _{}^{2} +\langle \lambda ^{j_k+1},\psi ^{j_k}(x^{j_k+1})\rangle \nonumber \\&\quad \le \langle G^{j_k},x\rangle +\chi _0(x)+\tfrac{\gamma _{j_k}}{2}\Vert x-x^{j_k}\Vert _{}^{2} +\langle \lambda ^{j_k+1},\psi ^{j_k}(x)\rangle ,\quad \forall x\in {\textrm{dom}}_{\chi _{0}}.\qquad \quad \end{aligned}
(4.10)

Dividing both sides by $$\Vert \lambda ^{j_k+1}\Vert _{}^{}$$ gives

\begin{aligned}&\big [\langle G^{j_k},x^{j_k+1}\rangle +\chi _0(x^{j_k+1})+\tfrac{\gamma _{j_k}}{2} \Vert x^{j_k+1}-x^{j_k}\Vert _{}^{2}\big ]/\Vert \lambda ^{j_k+1}\Vert _{}^{} +\langle u^{k},\psi ^{j_k}(x^{j_k+1})\rangle \nonumber \\&\quad \le \big [\langle G^{j_k},x\rangle +\chi _0(x)+\tfrac{\gamma _{j_k}}{2}\Vert x-x^{j_k}\Vert _{}^{2}\big ]/\Vert \lambda ^{j_k+1}\Vert _{}^{} +\langle u^{k},\psi ^{j_k}(x)\rangle ,\quad \forall x\in {\textrm{dom}}_{\chi _{0}}.\nonumber \\ \end{aligned}
(4.11)

Here $${u}^{k}:=\tfrac{\lambda ^{j_k+1}}{\Vert \lambda ^{j_k+1}\Vert }$$. Since $$\{u^k\}$$ is bounded, passing to a subsequence if needed, we have $$\lim _{k\rightarrow \infty }u^{k}=\bar{u}$$. Since $$\omega \in {\mathcal {A}}\cap {\mathcal {B}}$$, $$\{G^{j_k}\}$$ is bounded and $$\{\tfrac{\gamma _{j_k}}{2}\Vert x^{j_k+1}-x^{j_k}\Vert _{}^{2}\}$$ converges to 0. Therefore, taking $$k\rightarrow \infty$$ on both sides of (4.11), we have

\begin{aligned} \langle \bar{u},\chi (\bar{x})\rangle \le \big \langle \bar{u},\psi (\bar{x})+\langle \nabla \psi (\bar{x}),x-\bar{x}\rangle +\tfrac{L}{2}\Vert x-\bar{x}\Vert ^2+\chi (x)\big \rangle ,\ \forall x\in {\textrm{dom}}_{\chi _{0}}. \end{aligned}
(4.12)

Analogous to the proof of Theorem 1, it is easy to show that $$\bar{x}$$ violates MFCQ, which however, contradicts Assumption 3. As a consequence of this argument, we have $${\mathcal {U}}\subseteq {\mathcal {A}}^c \cup {\mathcal {B}}^c$$. Hence, we claim that the event $$\sup _k \Vert \lambda ^k\Vert _{}^{}<\infty$$ will happen a.s. and complete our proof of the boundedness condition.

Next, we prove asymptotic convergence to KKT solutions. For any random element $$\omega$$, let $$\bar{x}(\omega )$$ be any limit point of $$\{x^k\}$$. Passing to a subsequence if necessary, we assume that $$\lim _{k\rightarrow \infty }x^k=\bar{x}$$ and $$\lim _{k\rightarrow \infty }\lambda ^{k+1}=\bar{\lambda }$$. The optimality of $$x^{k+1}$$ in the proximal subproblem gives

\begin{aligned}&\langle G^k+ \nabla f(x^{k})\lambda ^{k+1}, x^{k+1}-x\rangle +\chi _{0}(x^{k+1})-\chi _{0}(x)+\langle \lambda ^{k+1},\chi (x^{k+1})-\chi (x)\big \rangle \nonumber \\&\quad \le \tfrac{\gamma _k+\langle \lambda ^{k+1},L\rangle }{2}\big (\Vert x-x^{k}\Vert ^{2}-\Vert x^{k+1}-x^{k}\Vert ^{2}-\Vert x-x^{k+1}\Vert ^{2}\big ). \end{aligned}

Moreover, we have

\begin{aligned} \langle G^k,x^{k+1}-x\rangle&= \langle \nabla f_0(x^k),x^{k+1}-x\rangle +\langle \zeta ^k,x^{k+1}-x\rangle \nonumber \\&= \langle \nabla f_0(x^k),x^{k+1}-x\rangle + \langle \zeta ^k,x^{k+1}-x^k\rangle + \langle \zeta ^k,x^{k}-x\rangle \nonumber \\&\ge \langle \nabla f_0(x^k),x^{k+1}-x\rangle - \Vert \zeta ^k\Vert \Vert x^{k+1}-x^k\Vert + \langle \zeta ^k,x^{k}-x\rangle . \end{aligned}

Combining the above two results, we have

\begin{aligned}&\langle \nabla f_0(x^k){+} \nabla f(x^{k})\lambda ^{k+1}, x^{k+1}-x\rangle +\chi _{0}(x^{k+1})-\chi _{0}(x)+\langle \lambda ^{k+1}, \chi (x^{k+1})-\chi (x)\rangle \nonumber \\&\quad \le \tfrac{\gamma _k+\langle \lambda ^{k+1},L\rangle }{2}\big (\Vert x-x^{k}\Vert ^{2}-\Vert x^{k+1}-x^{k}\Vert ^{2}-\Vert x-x^{k+1}\Vert ^{2}\big )\nonumber \\&\qquad + \Vert \zeta ^k\Vert \Vert x^{k+1}-x^k\Vert + \langle \zeta ^k,x-x^{k}\rangle \nonumber \\&\quad \le \tfrac{\gamma _k+\langle \lambda ^{k+1},L\rangle }{2}\big (\Vert x-x^{k}\Vert ^{2}-\Vert x-x^{k+1}\Vert ^{2}\big ) + \tfrac{\Vert \zeta ^k\Vert ^2}{2(\gamma _k+\langle \lambda ^{k+1},L\rangle )} + \langle \zeta ^k,x-x^{k}\rangle . \end{aligned}

Taking $$k\rightarrow \infty$$ in the above relation and noting that almost surely we have $$\lim _{k\rightarrow \infty }\zeta ^k = 0$$ and $$\lim _{k\rightarrow \infty } \Vert x^k-x^{k+1}\Vert =0$$, then

\begin{aligned} \langle \nabla f_0(\bar{x})+ \nabla f(\bar{x})\lambda ^{k+1}, \bar{x}-x\rangle +\chi _{0}(\bar{x})-\chi _{0}(x)+\langle \lambda ^{k+1}, \chi (\bar{x})-\chi (x)\rangle \le 0, \quad a.s. \end{aligned}

Using an argument similar to the one in Theorem 1, we can show that $$\bar{x}$$ is almost surely a KKT point. $$\square$$

Our next goal is to develop the iteration complexity of Algorithm 2. To achieve this goal, we need to assume that the dual is uniformly bounded, namely, condition (3.9) holds for all the random events. While this condition is stronger than the almost sure boundedness of $$\lambda ^{k+1}$$ shown by Theorem 3, it is indeed satisfied in many scenarios, e.g., when strong feasibility (Assumption 4) holds or other scenarios described in [4, 5]. We present the main complexity result in the following theorem.

### Theorem 4

Suppose that condition (3.9) holds. Then, the sequence $$\{(x^{k+1},\lambda ^{k+1})\}$$ satisfies that

\begin{aligned}&\sum _{k=0}^{K}\tfrac{\alpha _k(2\gamma _{k}-\beta _{k}-L_{0})}{4(\gamma _{k}+L_{0} +2B\Vert L\Vert )^{2}}\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}\nonumber \\&\quad \le L_0D^2\alpha _K + \sum _{k=0}^{K}\Big (\tfrac{\alpha _k(2\gamma _{k} -\beta _{k}-L_{0})}{2(\gamma _{k}+L_{0}+2B\Vert L\Vert )^{2}} +\tfrac{\alpha _K}{2\beta _{k}}\Big )\Vert \zeta ^{k}\Vert ^{2} \end{aligned}
(4.13)
\begin{aligned}&\sum _{k=0}^{K}\alpha _k\big (2\gamma _{k}-\beta _{k}-L_{0}\big ) \langle \lambda ^{k+1},\vert \psi (x^{k+1})-\eta \vert \rangle \nonumber \\&\le 2 BL_0\Vert L\Vert D^2\alpha _K+B\Vert L\Vert \sum _{k=0}^{K} \tfrac{\alpha _K\Vert \zeta ^{k}\Vert ^{2}}{\beta _{k}} \nonumber \\&\quad +B\sum _{k=0}^{K} \alpha _k\big (2\gamma _{k}-\beta _{k} -L_{0}\big )\Vert \eta -\eta ^{k}\Vert . \end{aligned}
(4.14)

### Proof

First, appealing to (4.3), (2.1) and (3.27), we have

\begin{aligned}&\partial _x{\mathcal {L}}_{k}(x^{k+1},\lambda ^{k+1}) \\&\quad =\partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})+\nabla f_{0}(x^{k})-\nabla f_{0}(x^{k+1}) \\&\qquad +\sum _{i=1}^{m}\lambda _{i}^{k+1}\big [\nabla f_{i}(x^{k})-\nabla f_{i}(x^{k+1})\big ]\\&\qquad +\big (\gamma _{k}+\langle \lambda ^{k+1},L\rangle \big )\big (x^{k+1}-x^{k}\big )+\zeta ^{k}. \end{aligned}

It follows that

\begin{aligned}&\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}\\&\quad \le {} \Vert \nabla f_{0}(x^{k})-\nabla f_{0}(x^{k+1})\Vert +\Vert \zeta ^{k}\Vert \\&\qquad +\sum _{i=1}^{m}\lambda _{i}^{k+1}\Vert \nabla f_{i}(x^{k})-\nabla f_{i}(x^{k+1})\Vert +(\gamma _{k}+\langle \lambda ^{k+1},L\rangle )\Vert x^{k+1}-x^{k}\Vert \\&\quad \le {} \big (\gamma _{k}+L_{0}+2\langle \lambda ^{k+1}, L\rangle \big )\Vert x^{k+1}-x^{k}\Vert +\Vert \zeta ^{k}\Vert . \end{aligned}

In view of the above result and basic inequality $$(a+b)^{2}\le 2a^{2}+2b^{2}$$, we have

\begin{aligned} \Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}\le 2\big (\gamma _{k}+L_{0}+2B\Vert L\Vert \big )^{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+2\Vert \zeta ^{k}\Vert ^{2}. \end{aligned}
(4.15)

Let us define the auxiliary sequence $$C_0=\psi _0(x^0)$$ and $$C_k=\psi _0(x^k) -\sum _{s=0}^{k-1}\tfrac{\Vert \zeta ^s\Vert ^2}{2\beta _s}$$ for $$k>0$$. Proposition 3 implies that

\begin{aligned} \tfrac{2\gamma _{k}-\beta _{k}-L_{0}}{2}\Vert x^{k+1}-x^k\Vert ^2\le C_k - C_{k+1}. \end{aligned}
(4.16)

Putting this relation and (4.15) together, we have

\begin{aligned} \tfrac{2\gamma _{k}-\beta _{k}-L_{0}}{4(\gamma _{k}+L_{0} +2B\Vert L\Vert )^{2}}\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}&\le C_k-C_{k+1}+ \tfrac{2\gamma _{k}-\beta _{k}-L_{0}}{2(\gamma _{k}+L_{0}+2B\Vert L\Vert )^{2}} \Vert \zeta ^{k}\Vert ^{2}. \end{aligned}
(4.17)

Summing up (4.17) over $$k=0, 1,\ldots ,K$$ weighted by $$\alpha _k$$ leads to

\begin{aligned}&\sum _{k=0}^{K}\tfrac{\alpha _k(2\gamma _{k}-\beta _{k}-L_{0})}{4(\gamma _{k}+L_{0}+2B\Vert L\Vert )^{2}}\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2} \\&\quad \le \sum _{k=0}^K\alpha _k(C_k-C_{k+1}) +\sum _{k=0}^{K}\tfrac{\alpha _k(2\gamma _{k}-\beta _{k}-L_{0})}{2(\gamma _{k}+L_{0}+2B\Vert L\Vert )^{2}}\Vert \zeta ^{k}\Vert ^{2}. \end{aligned}

Moreover, since $$\{C_k\}$$ is monotonically decreasing, we have

\begin{aligned} \sum _{k=0}^K\alpha _k(C_k-C_{k+1})&\le \alpha _0 C_0+\sum _{k=1}^K(\alpha _k-\alpha _{k-1})C_k-\alpha _K C_{K+1} \\&\le \alpha _K (C_0 - C_{K+1}) \le L_0D^2\alpha _K +\alpha _K\sum _{k=0}^K \tfrac{\Vert \zeta ^k\Vert ^2}{2\beta _k}. \end{aligned}

Combining the above two relations leads to our first result (4.13).

For the second part, note that (3.34) remains valid in the stochastic setting. Putting (3.34) and (4.16) together, we obtain

\begin{aligned} (2\gamma _{k}-\beta _{k}-L_{0})\langle \lambda ^{k+1},\vert \psi (x^{k+1})-\eta \vert \rangle \le 2 B\Vert {L}\Vert (C_k -C_{k+1}) + B(2\gamma _k-\beta _k-L_0)\Vert \eta -\eta ^k\Vert . \end{aligned}

Multiplying both ends by $$\alpha _k$$ and then summing up the resulting terms over $$k=0,\ldots , K$$ gives (4.14). $$\square$$

We next obtain a more specific convergence rate by choosing the parameters properly.

### Corollary 2

In Algorithm LCSPG, set $$\gamma _{k}=L_{0}$$, $$\beta _{k}=L_{0}/2$$, $$\alpha _{k}=k+1$$, $$b_{k}=K+1$$ and $$\delta ^k = \tfrac{\eta -\eta ^0}{(k+1)(k+2)}$$. Then $$x^{\hat{k}+1}$$ is a randomized Type-I $$\epsilon$$-KKT point with

\begin{aligned} \epsilon =\tfrac{4}{K+2} \max \big \{8(L_0+B\Vert L\Vert )^{2} \big (D^2 + \tfrac{17\sigma ^2}{16L_0^2}\big ), 2B\Vert L\Vert D^2 +\tfrac{2B\Vert L\Vert \sigma ^2}{L_0^2}+\tfrac{B\Vert \eta -\eta ^{0}\Vert }{2}\big \}. \end{aligned}

### Proof

Plugging the values of $$\gamma _k$$, $$\alpha _k$$, $$\beta _k$$ into relation (4.13) and taking expectation over all the randomness, we have

\begin{aligned}&\tfrac{L_0}{32(L_0+B\Vert L\Vert )^{2}} \sum _{k=0}^{K}(k+1)\mathbb {E}[\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-} ^{2}] \nonumber \\&\quad \le L_0D^2(K+1) + \sum _{k=0}^{K}\Big (\tfrac{L_0(k+1)}{16(L_{0}+B\Vert L\Vert )^{2}} +\tfrac{k+1}{L_0}\Big )\mathbb {E}[\Vert \zeta ^{k}\Vert ^{2}] \nonumber \\&\quad \le L_0D^2(K+1) + \tfrac{17\sigma ^2}{16 L_0} (K+1). \end{aligned}

Moreover, due to the random sampling of $$\hat{k}$$, we have

\begin{aligned} \mathbb {E}_{\hat{k}}\big [\Vert \partial _{x}{\mathcal {L}}(x^{\hat{k}+1},\lambda ^{\hat{k}+1})\Vert _{-}^2\big ] =\tfrac{2}{(K+1)(K+2)}\sum _{k=0}^{K}(k+1)\Vert \partial _{x}{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}. \end{aligned}

Combining the above two results, we have

\begin{aligned} \mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{\hat{k}+1},\lambda ^{\hat{k}+1})\Vert _{-}^{2}\big ] \le \tfrac{32(L_0+B\Vert L\Vert )^{2}}{K+2} \Big (D^2 + \tfrac{17\sigma ^2}{16L_0^2}\Big ). \end{aligned}

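The identity above reflects the rule for drawing the randomized output index $$\hat{k}$$: it is sampled with probability proportional to the weight $$k+1$$. A minimal sketch of this step (the function name and RNG choice are ours, not from the paper):

```python
import random

def sample_output_index(K, rng=None):
    """Draw k_hat in {0, ..., K} with P(k_hat = k) = 2(k+1) / ((K+1)(K+2))."""
    rng = rng or random.Random(0)
    # The weights k+1 sum to (K+1)(K+2)/2, which normalizes the distribution.
    return rng.choices(range(K + 1), weights=[k + 1 for k in range(K + 1)], k=1)[0]
```

Averaging $$\Vert \partial _{x}{\mathcal {L}}\Vert _{-}^{2}$$ under this distribution reproduces the weighted sum appearing in the theorem.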
Second, plugging in the values of $$\gamma _k$$, $$\beta _k$$ and $$\delta ^k$$ in (4.14), we have

\begin{aligned} \tfrac{L_{0}}{2}\sum _{k=0}^{K}(k+1)\langle \lambda ^{{k+1}},\vert \psi (x^{{k}+1})-\eta \vert \rangle&\le 2BL_0 \Vert L\Vert D^2(K+1) + \tfrac{2B\Vert L\Vert \sigma ^2(K+1)}{L_0}\nonumber \\&\quad +\tfrac{BL_0\Vert \eta -\eta ^{0}\Vert (K+1)}{2}. \end{aligned}
(4.18)

It then follows from (4.18) and the definition of $$\hat{k}$$ that

\begin{aligned} \mathbb {E}\big [\langle \lambda ^{\hat{k}+1},\vert \psi (x^{\hat{k}+1})-\eta \vert \rangle \big ] \le \tfrac{4}{K+2}\Big \{2B\Vert L\Vert D^2 +\tfrac{2B\Vert L\Vert \sigma ^2}{L_0^2}+\tfrac{B\Vert \eta -\eta ^{0}\Vert }{2}\Big \}. \end{aligned}

This completes our proof. $$\square$$

### Remark 3

In order to obtain some $$\epsilon$$-error in satisfying the type I KKT condition, LCSPG requires $${\mathcal {O}}(\varepsilon ^{-2})$$ calls to the SFO, which matches the complexity bound of stochastic gradient descent for unconstrained nonconvex optimization [15]. Moreover, due to minibatching, LCSPG obtains an even better $${\mathcal {O}}(\varepsilon ^{-1})$$ complexity in the number of evaluations of $$f_i(x)$$ and $$\nabla f_i(x)$$ ($$i\in [m]$$).

### 4.2 Level constrained stochastic variance reduced gradient descent

We consider the finite sum problem:

\begin{aligned} f_{0}(x)=\tfrac{1}{n}\sum _{i=1}^{n}F(x,\xi _{i}), \end{aligned}
(4.19)

where each $$F(x,\xi _{i})$$, $$i=1,2,\ldots , n$$, is Lipschitz smooth with parameter $$L_{0}$$. To further improve the convergence performance in this setting, we present a new variant of the stochastic gradient method by extending stochastic variance-reduced gradient descent to the constrained setting.

We present the level constrained stochastic variance-reduced gradient descent (LCSVRG) in Algorithm 3, which extends the nonconvex variance-reduced mirror descent (see [20]) to handle nonlinear constraints. Algorithm 3 can be viewed as a double-loop algorithm in which the outer loop computes the full gradient $$\nabla f(x^{k})$$ once every T iterations and the nested loop performs stochastic proximal gradient updates based on an unbiased estimator of the true gradient. In this view, we let k index the tth iteration of the rth epoch, for some values t and r, and we use k and (r, t) interchangeably throughout the rest of this section. We keep the notation $$\zeta ^{k}$$ (or $$\zeta ^{(r,j)}$$) for $$G^{k}-\nabla f(x^{k})$$ and note that $$\zeta ^{(r,0)}=0$$.

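The gradient estimator described above can be sketched as follows (a schematic illustration consistent with the recursion used in Lemma 5; the helper names and the with-replacement minibatch are our assumptions, not details fixed by the paper):

```python
import numpy as np

def vr_gradient(grad_F, x_curr, x_prev, G_prev, k, T, n, b, rng):
    """Variance-reduced estimator of nabla f_0(x_curr) at iteration k.

    grad_F(x, i) returns the component gradient nabla F(x, xi_i).
    A full gradient is taken at the start of each epoch (k % T == 0);
    otherwise a minibatch correction is added to the previous estimate.
    """
    if k % T == 0:
        return np.mean([grad_F(x_curr, i) for i in range(n)], axis=0)
    batch = rng.choice(n, size=b, replace=True)
    correction = np.mean(
        [grad_F(x_curr, i) - grad_F(x_prev, i) for i in batch], axis=0
    )
    return G_prev + correction
```

Conditioned on the past, the correction term has mean $$\nabla f(x^{k})-\nabla f(x^{k-1})$$, which is exactly the induction step used in the proof of Lemma 5 to establish unbiasedness.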
Our next goal is to develop iteration complexity results for LCSVRG. We skip the asymptotic analysis since it is similar to that of LCSPG. The following lemma (see [20, Lemma 6.10]) presents a key insight of Algorithm 3: the variance is controlled by the distances between successive iterates. We provide a proof for completeness.

### Lemma 5

In Algorithm 3, $$G^{k}$$ is an unbiased estimator of $$\nabla f_0(x^{k})$$. Moreover, let $$(r,t)$$ correspond to k. If $$t>0$$, then we have

\begin{aligned} \mathbb {E}\big [\Vert \zeta ^{(r,t)}\Vert ^{2}\big ]\le \tfrac{L_{0}^{2}}{b} \sum _{i=0}^{t-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]. \end{aligned}

### Proof

We prove the first part by induction. When $$k=0$$, we have $$G^{0}=\nabla f_0(x^{0})$$. Then for $$k>0$$, if $$k \bmod T = 0$$, we have $$G^{k}=\nabla f(x^{k})$$ by definition. Otherwise, we have

\begin{aligned} \mathbb {E}_{k}\big [G^{k}\big ]=\nabla f(x^{k})-\nabla f(x^{k-1})+G^{k-1}=\nabla f_0(x^{k}) \end{aligned}

by the induction hypothesis $$\mathbb {E}_{k-1}\big [G^{k-1}\big ]=\nabla f(x^{k-1})$$.

Next, we estimate the variance of the stochastic gradient. Appealing to (4.20), we have

\begin{aligned}&\mathbb {E}_{k}\big [\Vert \zeta ^{k}\Vert ^{2}\big ] =\mathbb {E}\big [\big \Vert \tfrac{1}{b}\sum _{i\in B_{k}}\big [\nabla F(x^{k},\xi _{i})-\nabla F(x^{k-1},\xi _{i})\big ]+G^{k-1}-\nabla f(x^{k})\big \Vert ^{2}\big ]\\&\quad =\mathbb {E}\big [\big \Vert \tfrac{1}{b}\sum _{i\in B_{k}}\big [\nabla F(x^{k},\xi _{i})-\nabla F(x^{k-1},\xi _{i})\big ]-\big [\nabla f(x^{k})-\nabla f(x^{k-1})\big ]+\zeta ^{k-1}\big \Vert ^{2}\big ]\\&\quad =\mathbb {E}\big \Vert \tfrac{1}{b}\sum _{i\in B_{k}}\big [\nabla F(x^{k},\xi _{i})-\nabla F(x^{k-1},\xi _{i})-\nabla f(x^{k})+\nabla f(x^{k-1})\big ]\big \Vert ^{2}+\big \Vert \zeta ^{k-1}\big \Vert ^{2}\\&\quad \le \tfrac{1}{b^{2}}\sum _{i\in B_{k}}\mathbb {E}_{\xi }\Vert \nabla F(x^{k},\xi )-\nabla F(x^{k-1},\xi )\Vert ^{2}+\big \Vert \zeta ^{k-1}\big \Vert ^{2}\\&\quad \le \tfrac{L_{0}^{2}}{b}\Vert x^{k}-x^{k-1}\Vert ^{2}+\big \Vert \zeta ^{k-1}\big \Vert ^{2}, \end{aligned}

where the third equality uses the independence of $$B_{k}$$ and $$\zeta ^{k-1}$$, the first inequality uses the bound $${\text {Var}}(x)\le \mathbb {E}\Vert x\Vert ^{2}$$, and the second inequality uses the Lipschitz smoothness of $$F(\cdot ,\xi )$$. Taking expectation over all the randomness generating $$B_{(r,1)},B_{(r,2)},\ldots ,B_{(r,t)}$$, we have

\begin{aligned} \mathbb {E}\big [\Vert \zeta ^{k}\Vert ^{2}\big ]\le \tfrac{L_0^{2}}{b} \sum _{i=1}^{t}\mathbb {E}\big [\Vert x^{(r,i)}-x^{(r,i-1)}\Vert ^{2}\big ]. \end{aligned}

$$\square$$

The next lemma shows that the generated solutions satisfy a sufficient-descent property in expectation.

### Lemma 6

Assume that $$\gamma _{k}=\gamma$$ and $$\beta _{k}=\beta$$ and $$\tilde{L} :=\tfrac{2\gamma - \beta - L_{0}}{2}-\tfrac{L_{0}^{2}(T-1)}{2\beta b}>0.$$ Then we have

\begin{aligned} \tilde{L}\sum _{j=0}^{t}\mathbb {E}\big [\Vert x^{(r,j+1)}-x^{(r,j)}\Vert ^{2}\big ]\le \mathbb {E}\big [\psi _{0}(x^{(r,0)})\big ]-\mathbb {E}\big [\psi _{0}(x^{(r,t+1)})\big ],\quad 0\le t<T . \end{aligned}
(4.21)

### Proof

In view of Proposition 3, at the jth iteration of the rth epoch, we have

\begin{aligned} \psi _{0}(x^{(r,j+1)})&\le \psi _{0}(x^{(r,j)})-\tfrac{2\gamma -\beta -L_{0}}{2}\Vert x^{(r,j+1)}-x^{(r,j)}\Vert ^{2}+\tfrac{\Vert \zeta ^{(r,j)}\Vert ^{2}}{2\beta }. \end{aligned}

Summing up the above result over $$j=0,1,2,...,t$$ ($$t<T$$) and using Lemma 5, we have

\begin{aligned}&\tfrac{2\gamma -\beta -L_{0}}{2}{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert x^{(r,j+1)}-x^{(r,j)}\Vert ^{2}\big ]\\&\quad \le {} \mathbb {E}\big [\psi _{0}(x^{(r,0)})\big ]-\mathbb {E}\big [\psi _{0}(x^{(r,t+1)})\big ]+\tfrac{L_{0}^{2}}{2\beta b}{\textstyle {\sum }}_{j=0}^{t}{\textstyle {\sum }}_{i=0}^{j-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]\\&\quad \le {} \mathbb {E}\big [\psi _{0}(x^{(r,0)})\big ]-\mathbb {E}\big [\psi _{0}(x^{(r,t+1)})\big ]+\tfrac{L_{0}^{2}t}{2\beta b}{\textstyle {\sum }}_{i=0}^{t-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ] \\&\quad \le {} \mathbb {E}\big [\psi _{0}(x^{(r,0)})\big ]-\mathbb {E}\big [\psi _{0}(x^{(r,t+1)})\big ]+\tfrac{L_{0}^{2}(T-1)}{2\beta b}{\textstyle {\sum }}_{i=0}^{t-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]. \end{aligned}

Here we use the convention $${\textstyle {\sum }}_{j=0}^{-1}\cdot =0$$. $$\square$$

We present the main convergence property of Algorithm 3 in the next theorem.

### Theorem 5

Suppose that condition (3.9) and the assumptions of Lemma 6 hold, $$b\ge 2T$$ and $$K=r_0T+j_0$$ for some $$r_0, j_0 \ge 0$$. Let $$\{\alpha _k\}$$ be a non-decreasing sequence and $$\{\alpha _{(r,j)}\}$$ be its equivalent form in the $$(r,j)$$ notation. Suppose that $$\alpha _{(r,j)}=\alpha _{(r,0)}$$ for $$j=1,2,\ldots ,T-1$$. Then we have

\begin{aligned}&{\textstyle {\sum }}_{k=0}^{K}\alpha _k\mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}\big ]\nonumber \\&\quad \le 8\tilde{L}^{-1}L_0(\gamma +L_{0}+B\Vert L\Vert )^{2}D^2 \alpha _{(r_0,0)}, \end{aligned}
(4.22)
\begin{aligned}&\quad {\textstyle {\sum }}_{k=0}^{K}\alpha _k\mathbb {E}\big [\langle \lambda ^{k+1},\vert \psi (x^{k+1})-\eta \vert \rangle \big ]\nonumber \\&\quad \le B{\textstyle {\sum }}_{k=0}^K \alpha _k \Vert \eta - \eta _k\Vert _{}^{} + B\tilde{L}^{-1}\Vert L\Vert _{}^{}L_0D^2 \alpha _{(r_0,0)}. \end{aligned}
(4.23)

Moreover, if we take $$T=\lceil \sqrt{n}\rceil$$, $$b=8T$$, $$\gamma =L_{0}$$, $$\beta =L_{0}/2$$, $$\alpha _k= T\,\lfloor k/T\rfloor +1$$, and set $$\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}$$, then $$x^{\hat{k}+1}$$ is a randomized Type-I $$\epsilon$$-KKT point with

\begin{aligned} \epsilon = \tfrac{K+1}{(K-T+1)^2}\max \big \{128(2L_{0}+B\Vert L\Vert )^{2}D^{2}, B\Vert \eta -\eta ^0\Vert _{}^{} + 16B\Vert L\Vert _{}^{}D^2\big \}. \end{aligned}
(4.24)

### Proof

First, using Lemma 5 and the assumption that $$b\ge 2T$$, for any $$t\le T-1$$ we have

\begin{aligned}&{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert \zeta ^{(r,j)}\Vert ^{2}\big ]\nonumber \\&\quad \le \tfrac{L_{0}^{2}}{b}{\textstyle {\sum }}_{j=0}^{t}{\textstyle {\sum }}_{i=0}^{j-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]\nonumber \\&\le \tfrac{L_{0}^{2}t}{b}{\textstyle {\sum }}_{i=0}^{t-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]\nonumber \\&\le \tfrac{L_{0}^{2}}{2}{\textstyle {\sum }}_{i=0}^{t-1}\mathbb {E}\big [\Vert x^{(r,i+1)}-x^{(r,i)}\Vert ^{2}\big ]. \end{aligned}
(4.25)

Note that (4.15) still holds. Therefore, combining (4.15) and (4.25) leads to

\begin{aligned}&{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{(r,j+1)},\lambda ^{(r,j+1)})\Vert _{-}^{2}\big ]\nonumber \\&\quad \le 2\big (\gamma +L_{0}+2B\Vert L\Vert \big )^{2}{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert x^{(r,j+1)} -x^{(r,j)}\Vert ^{2}\big ]\nonumber \\&\qquad +2{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert \zeta ^{(r,j)}\Vert ^{2}\big ] \\&\le 8(\gamma +L_{0}+B\Vert L\Vert )^{2}{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert x^{(r,j+1)}-x^{(r,j)}\Vert ^{2}\big ]. \end{aligned}

It then follows from Lemma 6 that

\begin{aligned}&{\textstyle {\sum }}_{j=0}^{t}\mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{(r,j+1)},\lambda ^{(r,j+1)})\Vert _{-}^{2}\big ]\nonumber \\&\quad \le 8\tilde{L}^{-1}(\gamma +L_{0}+B\Vert L\Vert )^{2}\mathbb {E}\big [\psi _{0}(x^{(r,0)})-\psi _{0}(x^{(r,t+1)})\big ]. \end{aligned}
(4.26)

Let $$K=r_0T+j_0$$. Summing up the above inequality weighted by $$\alpha _k$$ and exchanging the notation $$\alpha _k \leftrightarrow \alpha _{(r,j)}$$, then we have

\begin{aligned}&{\textstyle {\sum }}_{k=0}^{K}\alpha _k\mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}\big ] \nonumber \\&\quad = {\textstyle {\sum }}_{r=0}^{r_0-1}{\textstyle {\sum }}_{j=0}^{T-1}\alpha _{(r,j)}\mathbb {E}\big [\Vert \partial _x{\mathcal {L}}(x^{(r,j+1)},\lambda ^{(r,j+1)})\Vert _{-}^{2}\big ]\nonumber \\&\qquad + {\textstyle {\sum }}_{j=0}^{j_0}\alpha _{(r_0,j)}\mathbb {E}\big [\Vert \partial _x {\mathcal {L}}(x^{(r_0,j+1)},\lambda ^{(r_0,j+1)})\Vert _{-}^{2}\big ] \nonumber \\&\quad \le 8\tilde{L}^{-1}(\gamma +L_{0}+B\Vert L\Vert )^{2}\Big \{{\textstyle {\sum }}_{r=0}^{r_0-1}\alpha _{(r,0)} \mathbb {E}[\psi _0(x^{(r,0)})-\psi _0(x^{(r+1,0)})] \nonumber \\&\qquad +\alpha _{(r_0,0)} \mathbb {E}[\psi _0(x^{(r_0,0)}) - \psi _0(x^{(r_0,j_0+1)})]\Big \} \nonumber \\&\quad \le 8\tilde{L}^{-1}(\gamma +L_{0}+B\Vert L\Vert )^{2}\alpha _{(r_0,0)} \mathbb {E}[\psi _0(x^{(0,0)})- \psi _0(x^{(r_0,j_0+1)})]\nonumber \\&\quad \le 8\tilde{L}^{-1}L_0(\gamma +L_{0}+B\Vert L\Vert )^{2}D^2 \alpha _{(r_0,0)}. \end{aligned}
(4.27)

Above, the first inequality applies (4.26) and uses $$x^{(r,T)}=x^{(r+1,0)}$$ while the second inequality uses the monotonicity of $$\{\psi _0(x^k)\}$$ and an argument similar to (3.33).

The second part is similar to the argument of Theorem 4. Particularly, combining (3.34) and (4.21) gives

\begin{aligned} {\textstyle {\sum }}_{j=0}^t \mathbb {E}\langle \lambda ^{(r,j+1)},\vert \psi (x^{(r,j+1)})-\eta \vert \rangle&\le B\tilde{L}^{-1}\Vert L\Vert _{}^{}\mathbb {E}[\psi _0(x^{(r,0)}) - \psi _0 (x^{(r,t+1)}) ]\nonumber \\&\quad +B{\textstyle {\sum }}_{j=0}^t\big \Vert \eta -\eta ^{(r,j+1)}\big \Vert . \end{aligned}
(4.28)

Consequently, using the above relation and an argument similar to the one used to show (3.33), we deduce

\begin{aligned}&{\textstyle {\sum }}_{k=0}^{K} \alpha _k\mathbb {E}\,\langle \lambda ^{k+1},\vert \psi (x^{k+1})-\eta \vert \rangle \\&\quad = {\textstyle {\sum }}_{r=0}^{r_0-1}{\textstyle {\sum }}_{j=0}^{T-1} \alpha _{(r,j)} \mathbb {E}\,\langle \lambda ^{(r,j+1)},\vert \psi (x^{(r,j+1)})-\eta \vert \rangle \\&\qquad + {\textstyle {\sum }}_{j=0}^{j_0}\alpha _{(r_0,j)} \mathbb {E}\,\langle \lambda ^{(r_0,j+1)},\vert \psi (x^{(r_0,j+1)})-\eta \vert \rangle \\&\quad \le B{\textstyle {\sum }}_{k=0}^K \alpha _k \Vert \eta - \eta _k\Vert _{}^{} \\&\quad \quad \quad +B\tilde{L}^{-1}\Vert L\Vert _{}^{}\mathbb {E}\Big \{ {\textstyle {\sum }}_{r=0}^{r_0-1} \alpha _{(r,0)} [\psi _0(x^{(r,0)})- \psi _0(x^{(r+1,0)})] \\&\qquad + \alpha _{(r_0,0)}[\psi _0(x^{(r_0,0)}) - \psi _0(x^{(r_0,j_0+1)})] \Big \}\\&\quad \le B{\textstyle {\sum }}_{k=0}^K \alpha _k \Vert \eta - \eta _k\Vert _{}^{} + B\tilde{L}^{-1}\Vert L\Vert _{}^{}L_0D^2 \alpha _{(r_0,0)}. \end{aligned}

Therefore, we complete the proof of (4.22) and (4.23).

Using the provided parameter setting, we have $$\tilde{L}=\tfrac{2\gamma -L_{0}-\beta }{2}-\tfrac{L_{0}^{2}(T-1)}{2\beta b}\ge \tfrac{L_{0}}{4}-\tfrac{L_{0}}{8}=\tfrac{L_{0}}{8}$$. Moreover, since $$\alpha _k=T\,\lfloor k/T\rfloor + 1$$, we have $$\alpha _{k}\le T\cdot k / T+1\le k+1$$. It is easy to check

\begin{aligned} {\textstyle {\sum }}_{k=0}^K\alpha _k= T + {\textstyle {\sum }}_{k=T}^K \big (\big \lfloor \tfrac{k}{T}\big \rfloor T + 1\big ) \ge T+ {\textstyle {\sum }}_{k=T}^K \big (\big (\tfrac{k}{T}-1\big ) T+1\big ) \ge \tfrac{(K-T+1)^2}{2}. \end{aligned}

$$\square$$

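The weight-sum bound at the end of the proof can be verified numerically for small $$K$$ and $$T$$ (a throwaway sanity check we add, not part of the algorithm):

```python
def weight_sum(K, T):
    """Sum of alpha_k = T * floor(k/T) + 1 over k = 0, ..., K."""
    return sum(T * (k // T) + 1 for k in range(K + 1))

# The proof asserts weight_sum(K, T) >= (K - T + 1)**2 / 2 whenever K >= T.
```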
### Remark 4

It is interesting to compare the performance of LCSVRG with the other two level constrained first-order methods in the finite sum setting (4.19). Similar to LCPG, LCSVRG runs for $${\mathcal {O}}(\varepsilon ^{-1})$$ iterations to compute a Type-I $$\epsilon$$-KKT point. Moreover, LCSVRG has the appealing feature that the number of stochastic gradients $$\nabla F(x,\xi )$$ computed can be significantly reduced for a large value of n. Specifically, Algorithm 3 requires a full gradient $$\nabla f_0(x)$$ every T iterations, which contributes $$N_{1}={\mathcal {O}}\big (n\big \lceil \tfrac{K}{T}\big \rceil \big )={\mathcal {O}}(\sqrt{n}K)$$ stochastic gradient computations. During the other iterations, Algorithm 3 invokes a batch of size $$b={\mathcal {O}}(T)$$ each time, exhibiting a complexity of $$N_{2}={\mathcal {O}}\big (bK\big )={\mathcal {O}}(\sqrt{n}K)$$. Therefore, the total number of stochastic gradient computations is $$N=N_{1}+N_{2}={\mathcal {O}}\big (\sqrt{n}K\big ).$$ This is better than the $${\mathcal {O}}(nK)$$ stochastic gradients needed by LCPG. Moreover, it is better than the $${\mathcal {O}}(K^2)$$ bound of LCSPG when K is of order larger than $$\varOmega (\sqrt{n})$$, which corresponds to the higher-accuracy regime $$\epsilon \ll \tfrac{1}{\sqrt{n}}$$. The complexities of all the proposed algorithms for obtaining $$\varepsilon$$-KKT solutions are listed in Table 2.

### Remark 5

While we mainly discuss the finite-sum objective (4.19), it is possible to extend the variance reduction technique to handle the expectation-based objective (4.1) and improve the $${\mathcal {O}}(\varepsilon ^{-2})$$ bound of LCSPG to $${\mathcal {O}}(\varepsilon ^{-3/2})$$. To achieve this goal, we impose an additional assumption that $$F(x,\xi )$$ is $$L_0$$-Lipschitz smooth for each $$\xi$$ in the support set. We omit a detailed discussion of this particular extension, as the technical development can be readily derived from the arguments in Section 6.5.2 of [21] and our previous analysis.

## 5 Smooth optimization of nonsmooth constrained problems

In this section, we consider the constrained problem (1.1) with nonsmooth objective and nonsmooth constraint functions. We assume that $$f_i$$ ($$i\in \{0,1,...,m\}$$) exhibits a difference-of-convex (DC) structure $$f_i(x):= g_i(x) - h_i(x)$$: 1) $$h_i$$ is an $$L_{h_i}$$-Lipschitz-smooth convex function and 2) $$g_i$$ is a structured nonsmooth convex function:

\begin{aligned} g_i(x) = \max _{y_i \in {\mathcal {Y}}_i} \langle A_ix, y_i\rangle - p_i(y_i), \end{aligned}

where $$A_i \in \mathbb {R}^{a_i\times d}$$ is a linear mapping, $${\mathcal {Y}}_i \subset \mathbb {R}^{a_i}$$ is a convex compact set and $$p_i:{\mathcal {Y}}_i \rightarrow \mathbb {R}$$ is a convex continuous function. In view of such a nonsmooth structure, we cannot simply apply the LCPG method, as the crucial quadratic upper bound on $$f_i(x)$$ does not hold in the nonsmooth case. However, as pointed out by Nesterov [29], the nonsmooth convex function $$g_i$$ can be closely approximated by a smooth convex function. Let us denote $$\widehat{y}_i :=\mathop {\text {argmin}}\limits _{y_i \in {\mathcal {Y}}_i} \Vert y_i\Vert _{}^{}$$, $$D_{{\mathcal {Y}}_i} :=\max _{y_i \in {\mathcal {Y}}_i} \Vert y_i-\widehat{y}_i\Vert _{}^{}$$ and define the approximation function

\begin{aligned} g_i^{\beta _i}(x):= \max _{y_i \in {\mathcal {Y}}_i} \langle A_ix, y_i\rangle - p_i(y_i) - \tfrac{\beta _i}{2}\Vert y_i-\widehat{y}_i\Vert _{}^{2}, \quad f^{\beta _i}_{i}(x):= g^{\beta _i}_{i}(x) - h_i(x), \quad \text {where } \beta _i > 0. \end{aligned}

Given some properly chosen smoothing parameter $$\beta _i$$, we propose to apply LCPG to solve the following smooth approximation problem:

\begin{aligned}&\mathop {\text {min}}\limits _{x}\quad {\psi _0^{\beta _0}(x)=f_0^{\beta _0}(x)+\chi _0(x)}\nonumber \\&\quad \text {s.t.}\quad {\psi _i^{\beta _i}(x)=f_i^{\beta _i}(x)+\chi _{i}(x)}{\le \eta _i\quad i=1,\dots ,m}. \end{aligned}
(5.1)

Prior to the analysis of our algorithm, we need to develop some properties of the smooth function $$f_i^{\beta _i}$$. We first present a key Lemma which builds some important connection between the quadratic approximation of smooth function and Lipschitz smoothness. The proof is left in Appendix A.

### Lemma 7

Suppose $$p(\cdot )$$ is a continuously differentiable function satisfying

\begin{aligned} -\tfrac{\mu }{2} \Vert x-y\Vert _{}^{2} \le p(x) -p(y) -\langle \nabla p(y), x-y\rangle \le \tfrac{L}{2}\Vert x-y\Vert _{}^{2}, \end{aligned}
(5.2)

for all $$x, y$$. Then, $$p(\cdot )$$ satisfies

\begin{aligned} \Vert \nabla p(x) - \nabla p(y)\Vert _{}^{} \le \max \{L , \mu \} \Vert x-y\Vert _{}^{}. \end{aligned}
(5.3)

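As a quick numerical illustration of Lemma 7 (an example of our own, not from the paper): for the quadratic $$p(x)=\tfrac{1}{2}x^{\top }Ax$$ with the eigenvalues of a symmetric matrix $$A$$ lying in $$[-\mu , L]$$, condition (5.2) holds with these constants, and the gradient is indeed $$\max \{L,\mu \}$$-Lipschitz:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 0.5, 2.0
# Symmetric matrix with spectrum in [-mu, L]: lower curvature -mu, upper curvature L.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag([-mu, 0.3, 1.0, L]) @ Q.T

grad_p = lambda x: A @ x  # gradient of p(x) = 0.5 * x.T @ A @ x
x, y = rng.normal(size=4), rng.normal(size=4)
lipschitz_ratio = np.linalg.norm(grad_p(x) - grad_p(y)) / np.linalg.norm(x - y)
```

Here `lipschitz_ratio` is at most the spectral norm of $$A$$, i.e. $$\max \{L,\mu \}=2$$, the constant given by (5.3).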
For the smooth approximation, it is shown in [29] that $$g_i^{\beta _i}$$ is a Lipschitz smooth function and approximates the value of $$g_i$$ within an $${\mathcal {O}}(\beta _i)$$ error:

\begin{aligned}&g_i^{\beta _i}(x) \le g_i(x) \le g_i^{\beta _i}(x) + \tfrac{\beta _i D_{{\mathcal {Y}}_i}^2}{2}, \qquad \forall x, \end{aligned}
(5.4)
\begin{aligned}&\Vert \nabla g_i^{\beta _i}(x) - \nabla g_i^{\beta _i}(z)\Vert _{}^{} \le L^{\beta _i}_{g_i} \Vert x-z\Vert _{}^{},\quad \forall x, z,\quad L^{\beta _i}_{g_i} :=\tfrac{\Vert A_i\Vert _{}^{2}}{\beta _i}. \end{aligned}
(5.5)

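As a concrete one-dimensional instance of the bounds (5.4)-(5.5) (an illustration we add, not an example from the paper): take $$g(x)=\vert x\vert =\max _{y\in [-1,1]} xy$$ with $$p=0$$, so that $$\widehat{y}=0$$ and $$D_{{\mathcal {Y}}}=1$$. The smoothed function is the Huber function:

```python
def g(x):
    # g(x) = |x| = max over y in [-1, 1] of x*y  (p = 0, so y_hat = 0, D = 1)
    return abs(x)

def g_beta(x, beta):
    # The maximizer y* = clip(x / beta, -1, 1) yields the Huber function.
    return x * x / (2 * beta) if abs(x) <= beta else abs(x) - beta / 2
```

For every $$x$$, $$g^{\beta }(x)\le g(x)\le g^{\beta }(x)+\tfrac{\beta }{2}$$, matching (5.4) with $$D_{{\mathcal {Y}}}=1$$, and the smoothness constant in (5.5) is $$\Vert A\Vert ^{2}/\beta =1/\beta$$ since $$A=1$$ here.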
Similar properties of $$f_i^{\beta _i}$$ are developed in the following proposition.

### Proposition 4

We have the following properties about the approximation function $$f_i^{\beta _i}$$ $$(\beta _i>0)$$.

1. 1.

Let $$\bar{\beta }_i \in [0, \beta _i]$$, then we have

\begin{aligned} f_i^{\beta _i}(x) \le f_i^{\bar{\beta }_i}(x) \le f_i^{\beta _i}(x) + \tfrac{(\beta _i-\bar{\beta }_i) D_{{\mathcal {Y}}_i}^2}{2}. \end{aligned}
(5.6)
2. 2.

$$f_i^{\beta _i}(x)$$ has upper curvature $$L^{\beta _i}_{g_i}$$ and negative lower curvature $$-L_{h_i}$$, namely,

\begin{aligned} f_i^{\beta _i}(x)&\le f_i^{\beta _i}(y)+\langle \nabla f_i^{\beta _i}(y),x-y\rangle +\tfrac{L^{\beta _i}_{g_i}}{2}\Vert x-y\Vert ^{2}, \end{aligned}
(5.7)
\begin{aligned} f_i^{\beta _i}(x)&\ge f_i^{\beta _i}(y)+\langle \nabla f_i^{\beta _i}(y),x-y\rangle -\tfrac{L_{h_i}}{2}\Vert x-y\Vert ^{2}. \end{aligned}
(5.8)
3. 3.

$$f_i^{\beta _i}$$ is Lipschitz smooth with modulus $$L_i^{\beta _i}:= \max \{ L_{g_i}^{\beta _i}, L_{h_i} \}$$. Namely, for any $$x, y$$, we have

\begin{aligned} \Vert \nabla f_i^{\beta _i}(x) - \nabla f_i^{\beta _i}(y)\Vert _{}^{} \le L_i^{\beta _i} \Vert x-y\Vert _{}^{}. \end{aligned}
(5.9)

### Proof

Part 1. Since $$\bar{\beta }_i\le \beta _i$$, by definition we have $$f_i^{\bar{\beta }_i}(x) \ge f_i^{\beta _i}(x)$$. On the other hand, using the boundedness of $${\mathcal {Y}}_i$$, we have

\begin{aligned} f_i^{\beta _i}(x)&= \max _{y_i\in {\mathcal {Y}}_i}\, \langle A_ix, y_i\rangle - p(y_i)-\tfrac{\bar{\beta }_i}{2}\Vert y_i-\hat{y}_i\Vert _{}^{2} -\tfrac{\beta _i-\bar{\beta }_i}{2}\Vert y_i-\hat{y}_i\Vert _{}^{2} - h_i(x)\\&\ge \max _{y_i\in {\mathcal {Y}}_i}\, \langle A_ix, y_i\rangle - p(y_i)-\tfrac{\bar{\beta }_i}{2}\Vert y_i-\hat{y}_i\Vert _{}^{2} - h_i(x) - \tfrac{\beta _i-\bar{\beta }_i}{2} D_{{\mathcal {Y}}_i}^2 \\&= f_i^{\bar{\beta }_i}(x) - \tfrac{\beta _i-\bar{\beta }_i}{2} D_{{\mathcal {Y}}_i}^2. \end{aligned}

Combining the above two results gives the desired inequality.

Part 2. Since $$g^{\beta _i}_{i}$$ and $$h_i$$ are both convex and smooth, we have

\begin{aligned} g^{\beta _i}_i(x_1)&\le g^{\beta _i}_i(x_2) + \langle \nabla g^{\beta _i}_{i}(x_2), x_1-x_2\rangle + \tfrac{L^{\beta _i}_{g_i}}{2}\Vert x_1-x_2\Vert _{}^{2}, \\ -h_i(x_1)&\le -h_i(x_2) - \langle \nabla h_i(x_2), x_1-x_2\rangle . \end{aligned}

Summing up the above two inequalities and noting the definition of $$f^{\beta _i}_i, \nabla f^{\beta _i}_i$$, we conclude that $$f^{\beta _i}_{i}$$ has an upper curvature of $$L^{\beta _i}_{g_i}$$. Similarly, using convexity of $$g^{\beta _i}_{i}$$ and smoothness of $$h_i$$, we obtain that $$f^{\beta _i}_{i}$$ has a negative lower curvature $$-L_{h_i}$$.

Part 3. The Lipschitz continuity (5.9) is an immediate consequence of Part 2 and Lemma 7. $$\square$$

### Remark 6

When $$\bar{\beta }_i=0$$, relation (5.6) reads $$f_i^{\beta _i}(x) \le f_i(x) \le f_i^{\beta _i}(x) + \tfrac{\beta _i D_{{\mathcal {Y}}_i}^2}{2}.$$ Together with Assumption 2, it can be seen that $$x^0$$ is also strictly feasible for problem (5.1). This justifies that LCPG is well-defined for problem (5.1).

### Remark 7

The Lipschitz constant of $$\nabla f^{\beta _i}_i$$ can be derived in a different way. Since $$\nabla g^{\beta _i}_i$$ and $$\nabla h_i$$ are $$L_{g_i}^{\beta _i}$$ and $$L_{h_i}$$ Lipschitz continuous, respectively, we can show by triangle inequality that $$\nabla f^{\beta _i}_{i}(x)$$ is $$L_{g_i}^{\beta _i} + L_{h_i}$$-Lipschitz continuous. In contrast, by exploiting the asymmetry between lower and upper curvature, Proposition 4 derived a slightly sharper bound on the gradient Lipschitz constant.

Throughout this section, we choose specific $$\beta _i$$ to ensure that $$\beta _iD_{{\mathcal {Y}}_i}^2$$ is constant for all $$i \in [m]$$. Hence, we can define the common additive approximation factor as

\begin{aligned} \nu := \tfrac{\beta _iD_{{\mathcal {Y}}_i}^2}{2},\ i\in [m]. \end{aligned}
(5.10)

Note that (5.4) provides an approximation error for function values, i.e., for the zeroth-order oracle of the function $$g_i$$. However, convergence results for nonconvex optimization are generally stated in terms of a first-order stationarity measure, implying that we need an approximation of the first-order oracle of the function $$f_i$$ and, consequently, of the function $$g_i$$. Below we discuss a widely used approximate subdifferential for convex functions and generalize it to nonsmooth nonconvex functions.

### Definition 5

[$$\nu$$-subdifferential] We say that a vector $$v \in \mathbb {R}^d$$ is a $$\nu$$-subgradient of the convex function $$p(\cdot )$$ at x if for any z, we have

\begin{aligned} p(z) \ge p(x) + \langle v, z-x\rangle -\nu . \end{aligned}

The set of all $$\nu$$-subgradients of p at x is called the $$\nu$$-subdifferential, denoted by $$\partial ^{\nu }p(x)$$. Moreover, we define the $$\nu$$-subdifferential of the nonconvex function $$f_i$$ as $$\partial ^{\nu } f_i(x):= \partial ^{\nu } g_i(x) + \{-\nabla h_i(x)\}$$, where the addition of sets is in the Minkowski sense.

Finally, we define a generalization of the type-I KKT convergence criterion for the structured nonsmooth nonconvex function constrained optimization problem:

### Definition 6

We say that a point x is an $$(\epsilon ,\nu )$$ type-III KKT point of (1.1) if there exists $$\lambda \ge 0$$ satisfying the following conditions:

\begin{aligned} \Vert \partial ^{\nu } \psi _{0}(x) + {\textstyle {\sum }}_{i=1}^m \lambda _i\partial ^{\nu }\psi _i(x)\Vert _{-}^{2}&\le \epsilon , \end{aligned}
(5.11)
\begin{aligned} {\textstyle {\sum }}_{i=1}^m\lambda _i \vert \psi _i(x)-\eta _i\vert&\le \epsilon , \end{aligned}
(5.12)
\begin{aligned} \Vert [\psi (x)-\eta ]_+\Vert _{1}^{}&\le \epsilon . \end{aligned}
(5.13)

Moreover, we say that x is a randomized $$(\epsilon ,\nu )$$ type-III KKT point of (1.1) if (5.11), (5.12) and (5.13) are satisfied in expectation.
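To make Definition 6 concrete, the complementarity residual (5.12) and the feasibility residual (5.13) are directly computable from $$(\psi (x), \eta , \lambda )$$. The sketch below is a hypothetical helper (not from the paper); the stationarity term of (5.11) is supplied as one explicitly selected element v of $$\partial ^{\nu }\psi _0(x)+\sum _i \lambda _i\partial ^{\nu }\psi _i(x)$$, standing in for the minimal-norm element:

```python
import numpy as np

def type_iii_residuals(v, psi, eta, lam):
    """Residuals of (5.11)-(5.13). `v` is one selected element of the nu-
    subdifferential of the Lagrangian; psi = psi(x), eta, lam are vectors."""
    stat  = np.linalg.norm(v)**2                 # stationarity proxy for (5.11)
    compl = np.sum(lam * np.abs(psi - eta))      # complementarity (5.12)
    feas  = np.sum(np.maximum(psi - eta, 0.0))   # l1 constraint violation (5.13)
    return stat, compl, feas

stat, compl, feas = type_iii_residuals(
    v=np.zeros(3), psi=np.array([0.9, 1.2]), eta=np.array([1.0, 1.0]),
    lam=np.array([0.0, 0.5]))
# here: stat = 0, compl = 0.5*|0.2| = 0.1, feas = [0.2]_+ = 0.2
```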

The $$\nu$$-subdifferential and the type-III KKT point are essential for associating the smooth approximation with the original nonsmooth problem. We establish some important properties in the following proposition.

### Proposition 5

Let $$\beta _i$$ and $$\nu$$ satisfy (5.10).

1. 1.

For any $$x\in \mathbb {R}^d$$, we have $$\nabla f_i^{\beta _i}(x) \in \partial ^{\nu }f_i(x)$$, $$i=0,1,\dots ,m$$.

2. 2.

Suppose that x is a (randomized) Type-I $$\epsilon$$-KKT point of problem (5.1) and $$\lambda$$ is the associated dual variable with bound $$\Vert \lambda \Vert \le B$$, then x is a (randomized) Type-III $$(\bar{\epsilon }, \nu )$$-KKT point of problem (1.1) for $$\bar{\epsilon }=\max \{\epsilon +B\nu , m\nu \}$$.

### Proof

Part 1. It suffices to show $$\nabla g_i^{\beta _i}(x) \in \partial ^{\nu }g_i(x)$$. Due to the convexity of $$g_i^{\beta _i}$$ and (5.4), we have

\begin{aligned} g_i(z) \ge g_i^{\beta _i}(z) \ge g_i^{\beta _i}(x) + \langle \nabla g_i^{\beta _i}(x), z-x\rangle \ge g_i(x) + \langle \nabla g_i^{\beta _i}(x), z-x\rangle - \tfrac{\beta _i D_{{\mathcal {Y}}_i}^2}{2}, \end{aligned}

where the first inequality follows from the first relation in (5.4), and the third inequality follows from the second relation in (5.4). Noting the definition of the $$\nu$$-subgradient, we conclude the proof.

Part 2. It suffices to show the conversion of randomized Type-I KKT points to randomized Type-III KKT points. Suppose that x is a randomized Type-I $$\epsilon$$-KKT point and that $$\Vert \lambda \Vert _1\le B$$. Using Part 1, it is easy to see that $$\partial _{x}{\mathcal {L}}^\beta (x, \lambda ) \subseteq \partial ^\nu \psi _0(x)+{\textstyle {\sum }}_{i=1}^m \lambda _i\partial ^\nu \psi _i(x)$$; therefore, we have

\begin{aligned} \mathbb {E}\Vert \partial ^\nu \psi _0(x)+{\textstyle {\sum }}_{i=1}^m \lambda _i\partial ^\nu \psi _i(x)\Vert _{-}^{2}\le \epsilon . \end{aligned}

Using Proposition 4, we have

\begin{aligned} \lambda _i(\psi _i^{\beta _i}(x)-\eta _i) \le \lambda _i(\psi _i(x)-\eta _i)\le \lambda _i(\psi _i^{\beta _i}(x)-\eta _i) + \lambda _i{\nu } \le \lambda _i{\nu }. \end{aligned}

This implies

\begin{aligned} \vert \lambda _i(\psi _i(x)-\eta _i)\vert \le \max \{|\lambda _i(\psi _i^{\beta _i}(x)-\eta _i)|, \lambda _i{\nu }\}. \end{aligned}

Summing up this inequality over $$i=1,\dots ,m$$ and taking expectation over all the randomness, we have

\begin{aligned} {\textstyle {\sum }}_{i=1}^m \mathbb {E}\vert \lambda _i(\psi _i(x)-\eta _i)\vert \le \mathbb {E}{\textstyle {\sum }}_{i=1}^m |\lambda _i(\psi _i^{\beta _i}(x)-\eta _i)| + {\textstyle {\sum }}_{i=1}^m\lambda _i{\nu } \le \epsilon + B\nu . \end{aligned}

Moreover, we have

\begin{aligned}{} & {} {\textstyle {\sum }}_{i=1}^m [\psi _i(x)-\eta _i]_+= {\textstyle {\sum }}_{i=1}^m [\psi _i^{\beta _i}(x)-\eta _i+\psi _i(x)-\psi _i^{\beta _i}(x)]_+\\{} & {} \quad \le {\textstyle {\sum }}_{i=1}^m [\psi _i(x)-\psi _i^{\beta _i}(x)]_+ \le m \nu . \end{aligned}

$$\square$$

Now, we are ready to discuss the convergence rate of LCPG for nonsmooth nonconvex function constrained optimization.

### Theorem 6

Assume that $$\beta _i, \nu$$ satisfy (5.10) and set $$\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}$$ when running LCPG to solve problem (5.1). Denote $$c_i = \Vert A_i\Vert _{}^{2}D_{{\mathcal {Y}}_i}^2$$ $$(0\le i \le m)$$, $$c=[c_1,c_2,\dots , c_m]^\textrm{T}$$ and let $$\Vert \lambda ^k\Vert _1\le B$$. Suppose that $$\nu =o(\tfrac{c_i}{L_{h_i}})$$ for $$i=0,1,\dots ,m$$; then $$x^{\hat{k}+1}$$ is a randomized Type-III $$(\bar{\epsilon },\nu )$$-KKT point with

\begin{aligned} \bar{\epsilon } ={\mathcal {O}}\Big \{\tfrac{2}{K+2}\big [\big (\tfrac{8(c_0+B\Vert c\Vert _{}^{})^2}{c_0\nu }+\tfrac{2B\Vert c\Vert _{}^{}}{c_0}\big )(\varDelta +\nu ) +B\Vert \eta -\eta ^0\Vert _{}^{}\big ] + B \nu + m\nu \Big \}. \end{aligned}

### Proof

Our analysis resembles the proof of Theorem 2. Using an argument similar to (3.33), we have

\begin{aligned} {\textstyle {\sum }}_{k=0}^K {\alpha _k}\Vert x^k-x^{k+1}\Vert _{}^{2}&\le \tfrac{2\alpha _K}{L_0^{\beta _0}} \big [\psi _0^{\beta _0}(x^0) - \psi _0^{\beta _0}(x^{K+1})\big ] \nonumber \\&\le \tfrac{2\alpha _K}{L_0^{\beta _0}} \big [\psi _0(x^0) - \psi _0(x^{K+1})+\nu \big ]. \end{aligned}
(5.14)

Combining this result with Lemma 3 we obtain

\begin{aligned} \begin{aligned} {\textstyle {\sum }}_{k=0}^{K}\alpha _k\Vert \partial _{x}{\mathcal {L}}(x^{k+1},\lambda ^{k+1})\Vert _{-}^{2}&\le 8\tfrac{(L_{0}^{\beta _0}+B\Vert L^{\beta }\Vert _{}^{})^{2}}{L_0^{\beta _0}}\alpha _K (\varDelta +\nu ),\\ {\textstyle {\sum }}_{k=0}^{K} \alpha _k \langle \lambda ^{k+1}, \vert \psi (x^{k+1})-\eta \vert \rangle&\le 2B\tfrac{\Vert L^{\beta }\Vert _{}^{}}{L_0^{\beta _0}}\alpha _K (\varDelta +\nu ) + B {\textstyle {\sum }}_{k=0}^K\alpha _k\Vert \eta -\eta ^k\Vert _{}^{}, \end{aligned} \end{aligned}
(5.15)

where $$\varDelta =\psi _0(x^0)-\psi _0(x^*)$$, $$\alpha _k\ge 0$$ and $${\mathcal {L}}^\beta (x, \lambda ) :=\psi _{0}^{\beta _0}(x) + {\textstyle {\sum }}_{i=1}^m\lambda _{i}(\psi ^{\beta _i}_i(x) -\eta _i)$$. Taking $$\delta ^{k}=\tfrac{\eta -\eta ^{0}}{(k+1)(k+2)}$$ and $$\alpha _k=k+1$$ in (5.15), we see that $$x^{\hat{k}+1}$$ is a Type-I $$\epsilon$$-KKT point for

\begin{aligned} \epsilon = \tfrac{2}{K+2} \max \big \{\tfrac{8(L_{0}^{\beta _0}+B\Vert L^{\beta }\Vert _{}^{})^{2}}{L_0^{\beta _0}}(\varDelta +\nu ), \tfrac{2B\Vert L^{\beta }\Vert _{}^{}}{L_0^{\beta _0}} (\varDelta +\nu )+B\Vert \eta -\eta ^0\Vert _{}^{}\big \}. \end{aligned}

Noting that $$L_{g_i}^{\beta _i} = \tfrac{\Vert A_i\Vert _{}^{2}}{\beta _i} = \tfrac{\Vert A_i\Vert _{}^{2}D_{{\mathcal {Y}}_i}^2}{2\nu }=\tfrac{c_i}{2\nu }$$ and $$\nu =o(\tfrac{c_i}{L_{h_i}})$$, we have $$\tfrac{(L_{0}^{\beta _0}+B\Vert L^{\beta }\Vert _{}^{})^{2}}{L_0^{\beta _0}}={\mathcal {O}}\big (\tfrac{(c_0+B\Vert c\Vert _{}^{})^2}{c_0\nu }\big )$$ and $$\tfrac{\Vert L^{\beta }\Vert _{}^{}}{L_0^{\beta _0}}=\tfrac{\Vert c\Vert _{}^{}}{c_0}$$. Using the definition of $$\hat{k}$$ and Proposition 5 we obtain the desired result. $$\square$$

## 6 Inexact LCPG

LCPG requires the exact optimal solution of subproblem (3.1), which poses a great challenge when the subproblem is difficult to solve. To alleviate this issue, we consider an inexact variant of the LCPG method in which the update of $$x^{k+1}$$ only solves problem (3.1) approximately. This section is organized as follows. First, we present a general convergence property of inexact LCPG when the subproblem solutions satisfy a certain approximation criterion. Next, we analyze the efficiency of inexact LCPG when the subproblems are handled by different external solvers. When the subproblem is a quadratically constrained quadratic program (QCQP), we propose an efficient interior point algorithm that exploits the diagonal structure. When the subproblem has general proximal components, we propose to solve it by first-order methods. In particular, we consider solving the subproblem by the constraint extrapolation (ConEx) method and develop the total iteration complexity of ConEx-based LCPG.

### 6.1 Convergence analysis under an inexactness criterion

Throughout the rest of this section, we will denote the exact primal-dual solution of (3.1) as $$(\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})$$. We use the following criterion for measuring the accuracy of subproblem solutions.

### Definition 7

We say that a point x is an $$\epsilon$$-solution of (3.1) if

\begin{aligned} \psi _0^{{k}}(x)-\psi _0^{k}{(\widetilde{x}^{k+1})} \le \epsilon ,\quad \Vert [\psi ^{k}(x)]_+\Vert _{}^{} \le \epsilon ,\quad {\mathcal {L}}_{k}(x, \widetilde{\lambda }^{k+1}) \le {\mathcal {L}}_{k}(\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1} ) + \epsilon . \end{aligned}

The following theorem shows asymptotic convergence to stationarity for inexact LCPG method under mild assumptions. Since the proof is similar to the previous argument, we present the details in Appendix B for the sake of completeness. Note that the theorem applies to a general nonconvex problem and hence applies to convex problems as well.

### Theorem 7

Suppose that Assumption 3 holds and let $$x^{k+1}$$ be an $$\epsilon _k$$-solution of (3.1) satisfying $$\epsilon _k < \min _{i \in [m]} \delta _{i}^k$$. Then all the conclusions of Theorem 1 still hold: the dual sequence $$\{\tilde{\lambda }^{k}\}$$ is uniformly bounded by a constant $$B>0$$, and every limit point of inexact LCPG is a KKT point.

Under the inexactness condition in Definition 7, we establish the complexity of inexact LCPG in the following theorem.

### Theorem 8

Under the assumptions of Theorem 7, we have

\begin{aligned} {\textstyle {\sum }}_{k=0}^K \alpha _k\Vert \partial _{x}{\mathcal {L}}(\tilde{x}^{k+1},\tilde{\lambda }^{k+1})\Vert _{-}^{2}&\le \tfrac{8(L_0+B\Vert L\Vert _{}^{})^2}{L_0} \tilde{\varDelta }, \end{aligned}
(6.1)
\begin{aligned} {\textstyle {\sum }}_{k=0}^K \alpha _k\langle \tilde{\lambda }^{k+1}, \vert \psi (\tilde{x}^{k+1})-\eta \vert \rangle&\le B{\textstyle {\sum }}_{k=0}^K \alpha _k\Vert \eta -\eta ^k\Vert _{}^{} + \tfrac{2B\Vert L\Vert _{}^{}}{L_0}\tilde{\varDelta }, \end{aligned}
(6.2)
\begin{aligned} {\textstyle {\sum }}_{k=0}^K\alpha _{k}\Vert x^k-\widetilde{x}^{k+1}\Vert _{}^{2}&\le \tfrac{2}{L_0}\tilde{\varDelta }, \end{aligned}
(6.3)

where $$\tilde{\varDelta }={\textstyle {\sum }}_{k=0}^K\alpha _{k}[\psi _0(x^k)-\psi _0(x^{k+1})+{\varepsilon }_k]$$. Moreover, if we choose the index $$\hat{k} \in \{0,1, \dots , K\}$$ with probability $$\mathbb {P}(\hat{k} = k) = \alpha _k/({\textstyle {\sum }}_{i=0}^K\alpha _{i})$$, then $$x^{\hat{k}}$$ is a randomized $$(\epsilon ,\delta )$$ type-II KKT point with

\begin{aligned} \begin{aligned} \epsilon&= 1\big /({\textstyle {\sum }}_{i=0}^K\alpha _{i})\max \big \lbrace \tfrac{8(L_0+B\Vert L\Vert _{}^{})^2}{L_0} \tilde{\varDelta }, B{\textstyle {\sum }}_{k=0}^K\alpha _k \Vert \eta -\eta ^k\Vert _{}^{} + \tfrac{2B\Vert L\Vert _{}^{}}{L_0}\tilde{\varDelta } \big \rbrace ,\\ \delta&= 2\tilde{\varDelta }\big /{(L_0{\textstyle {\sum }}_{i=0}^K\alpha _{i})}. \end{aligned} \end{aligned}
(6.4)

In particular, using $$\alpha _k = k+1$$, $$\epsilon _k = \min _{i \in [m]} \tfrac{\delta ^k_i}{2}$$ and $$\delta _{i}^k = \tfrac{\eta _i - \eta ^k_i}{(k+1)(k+2)}$$, we have that $$x^{\hat{k}}$$ is an $$(\epsilon , \delta )$$ type-II KKT point of (1.1), where

\begin{aligned} \begin{aligned} \epsilon&= \tfrac{2}{K{+}2}\max \big \{ 4(L_0{+}B\Vert L\Vert _{}^{})^2 [ 2D^2 {+} \tfrac{\Vert \eta {-}\eta ^0\Vert _{}^{}}{L_0} ] , B\Vert \eta -\eta ^0\Vert _{}^{} {+} B\Vert L\Vert _{}^{} [ 2D^2 {+} \tfrac{\Vert \eta {-}\eta ^0\Vert _{}^{}}{L_0} ] \big \},\\ \delta&= \tfrac{2}{K+2}\big ( 2D^2 + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}\big ). \end{aligned} \end{aligned}
(6.5)

### Proof

Using (3.8) with $$x^{k+1}$$ replaced by $$\widetilde{x}^{k+1}$$ (the optimal solution of problem (3.1)) and adding $$f(x^k) + \tfrac{L_0}{2}\Vert x^k-\widetilde{x}^{k+1}\Vert _{}^{2}$$ on both sides, we have

\begin{aligned} \tfrac{L_0}{2}\Vert x^k-\widetilde{x}^{k+1}\Vert _{}^{2}&\le \psi _{0}^k(x^k)-\psi _{0}^k(\widetilde{x}^{k+1}) \nonumber \\&\le \psi _{0}(x^k) - \psi _{0}^k(x^{k+1}) + \epsilon _k \nonumber \\&\le \psi _{0}(x^k) - \psi _{0}(x^{k+1}) + \epsilon _k, \end{aligned}
(6.6)

where the second inequality follows from $$\psi _{0}^k(x^k) = \psi _{0}(x^k)$$ as well as $$x^{k+1}$$ being an $$\epsilon _k$$-solution (see Definition 7) of subproblem (3.1), and the third inequality follows from the fact that $$\psi _{0}^k(x) \ge \psi _{0}(x)$$ for all $$x \in {\textrm{dom}}{\chi _{0}}$$.

Using Lemma 3 (again $$x^{k+1}$$ is replaced by $$\widetilde{x}^{k+1}$$), noting that $$\epsilon _{k}$$ satisfies the requirements of Theorem 7 implying that $$\Vert \widetilde{\lambda }^k\Vert _{}^{} \le B$$ and using (6.6), we have

\begin{aligned} \Vert \partial _{x}{\mathcal {L}}(\tilde{x}^{k+1},\tilde{\lambda }^{k+1})\Vert _{-}^{2}&\le 4(L_0+B\Vert L\Vert _{}^{})^2 \Vert \tilde{x}^{k+1}-x^k\Vert _{}^{2} \nonumber \\&\le \tfrac{8(L_0+B\Vert L\Vert _{}^{})^2}{L_0}\big [\psi _0(x^k)-\psi _0(x^{k+1})+\varepsilon _k\big ]. \end{aligned}
(6.7)

Similar to the argument of (3.34), we have

\begin{aligned} {\textstyle {\sum }}_{i=1}^m\tilde{\lambda }_{i}^{k+1}\vert \psi _{i}(\tilde{x}^{k+1})-\eta _{i}\vert&\le B \Vert \eta -\eta ^k\Vert _{}^{} + B\Vert L\Vert _{}^{}\Vert \tilde{x}^{k+1}-x^k\Vert _{}^{2} \nonumber \\&\le B \Vert \eta -\eta ^k\Vert _{}^{} + \tfrac{2B\Vert L\Vert _{}^{}}{L_0}\big [\psi _0(x^k)-\psi _0(x^{k+1})+\varepsilon _k\big ]. \end{aligned}
(6.8)

Multiplying (6.6), (6.7) and (6.8) by $$\alpha _{k}$$ and summing over $$k=0, 1, \ldots , K$$ give (6.3), (6.1) and (6.2), respectively.

We derive a convergence rate based on the specified parameters. First, from relation (6.6), we note that $$\psi _{0}(x^{k+1}) \le \psi _{0}(x^k) + \epsilon _k$$. Hence, we have by induction that

\begin{aligned} \psi _{0}(x^{k+1}) \le \psi _{0}(x^0) + {\textstyle {\sum }}_{i=0}^k\epsilon _i. \end{aligned}
(6.9)

By setting $$\alpha _k = (k+1)$$ and $$\epsilon _{k} = \min _{i \in [m]} \tfrac{\delta _{i}^k}{2} = \min _{i \in [m]} \tfrac{\eta _i-\eta ^k_i}{2(k+1)(k+2)}$$ for all $$k \ge 0$$ (note that $$\epsilon _k$$ satisfies the requirement of Theorem 7), we have

\begin{aligned} \widetilde{\varDelta }&= {\textstyle {\sum }}_{k =0}^K\alpha _k[\psi _{0}(x^k) - \psi _{0}(x^{k+1})] + {\textstyle {\sum }}_{k =0}^K\alpha _{k}\epsilon _k \nonumber \\&= \alpha _0\psi _{0}(x^0) + {\textstyle {\sum }}_{k =0}^{K-1}(\alpha _{k+1}-\alpha _{k})\psi _{0}(x^{k+1}) -\alpha _{K}\psi _{0}(x^{K+1}) + {\textstyle {\sum }}_{k =0}^K \alpha _k\epsilon _k \nonumber \\&\mathrel {\overset{{\tiny \mathsf{(i)}}}{\le }} \alpha _0\psi _{0}(x^0) {+} {\textstyle {\sum }}_{k =0}^{K-1}(\alpha _{k+1}{-}\alpha _{k})[\psi _{0}(x^0) {+} {\textstyle {\sum }}_{i=0}^k\epsilon _i]- \alpha _K\psi _{0}(x^{K+1}) + {\textstyle {\sum }}_{k =0}^K\alpha _{k}\epsilon _k \nonumber \\&\mathrel {\overset{{\tiny \mathsf{(ii)}}}{=}} \alpha _K[\psi _{0}(x^0) - \psi _{0}(x^{K+1})] + {\textstyle {\sum }}_{k =0}^{K-1}{\textstyle {\sum }}_{i=0}^k\epsilon _i + {\textstyle {\sum }}_{k =0}^K \alpha _k\epsilon _k \nonumber \\&=\alpha _K[\psi _{0}(x^0) - \psi _{0}(x^{K+1})] + {\textstyle {\sum }}_{i=0}^{K-1}{\textstyle {\sum }}_{k =i}^{K-1}\epsilon _i + {\textstyle {\sum }}_{k =0}^K \alpha _k\epsilon _k \nonumber \\&\mathrel {\overset{{\tiny \mathsf{(iii)}}}{=}}\alpha _K[\psi _{0}(x^0) - \psi _{0}(x^{K+1})] + {\textstyle {\sum }}_{i=0}^{K}(K-i)\epsilon _i + {\textstyle {\sum }}_{k =0}^K \alpha _k\epsilon _k \nonumber \\&\mathrel {\overset{{\tiny \mathsf{(iv)}}}{\le }} \alpha _K[\psi _{0}(x^0) - \psi _{0}(x^{K+1})] + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2}{\textstyle {\sum }}_{k=0}^{K}\big [\tfrac{K-k}{(k+1)(k+2)} + \tfrac{1}{k+2}\big ] \nonumber \\&=\alpha _K[\psi _{0}(x^0) - \psi _{0}(x^{K+1})] + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2}{\textstyle {\sum }}_{k=0}^{K}\tfrac{K+1}{(k+1)(k+2)} \nonumber \\&\le \alpha _K\big [ \psi _{0}(x^0) - \psi _{0}(x^{K+1}) + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2}\big ] . \end{aligned}
(6.10)

Here, (i) and (ii) follow from (6.9) and $$\alpha _{k+1}-\alpha _k = 1 (> 0)$$, (iii) follows since the term $$(K-i)$$ vanishes at $$i = K$$, (iv) follows by observing $$\epsilon _k \le \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2(k+1)(k+2)}$$, and the last inequality follows since $${\textstyle {\sum }}_{k =0}^\infty \tfrac{1}{(k+1)(k+2)} = 1$$ and $$\alpha _K = K+1$$.

Applying the same arguments as those in Corollary 1, we have $${\textstyle {\sum }}_{k =0}^K\alpha _k\Vert \eta -\eta ^k\Vert _{}^{} = \alpha _K\Vert \eta -\eta ^0\Vert _{}^{}$$. Using this relation along with $${\textstyle {\sum }}_{k =0}^K\alpha _k = \tfrac{(K+1)(K+2)}{2}$$ and (6.10) inside (6.4), we obtain (6.5). Hence, we conclude the proof. $$\square$$

### Remark 8

Compared to the convergence result (3.31) for exact LCPG, we have to control the accumulated error in $$\tilde{\varDelta }$$ for the inexact case (6.4). Moreover, we need an even more stringent condition on the error to ensure asymptotic convergence. Specifically, we require $${\varepsilon }_k$$ to be smaller than the level increments $$\delta _{i}^k$$ to ensure that each subsequent subproblem is strictly feasible. As long as the subproblems are solved deterministically with sufficient accuracy, we can ensure such feasibility as well as the boundedness of the dual.

### Remark 9

Note that the convergence analysis of the inexact method for the stochastic case will go through in a similar fashion. In particular, the subproblems of LCSPG are still deterministic in nature. Hence, a deterministic error can be easily incorporated into the analysis of the stochastic outer loop. In particular, Proposition 3 will have an additional $$\epsilon _k$$ in the RHS. We can use $$\epsilon _k = \min _{i \in [m]} \tfrac{\delta ^k_i}{2}$$ to ensure the strict feasibility. Following the analysis in Theorem 4, we will get the additional term $${\textstyle {\sum }}_{k=0}^K \alpha _k\epsilon _k$$. Note that we have identical policies for $$\alpha _k$$ in the above analysis and Corollary 2. Furthermore, since $$\delta _k$$ used above and in Corollary 2 are the same, we have identical values of $$\epsilon _k$$ as well. Following the above development, we can easily bound the additional $${\textstyle {\sum }}_{k=0}^K \alpha _k\epsilon _k$$ term.

### 6.2 Solving the subproblem with the interior point method

Our goal is to develop an efficient interior point algorithm to solve problem (3.1) when $$\chi _i(x)=0$$, $$i\in [m]$$. Without loss of generality, we express the subproblem as the following QCQP:

\begin{aligned} \begin{aligned} \min _{x\in \mathbb {R}^d}&\quad g_0(x):=\tfrac{L_0}{2}\Vert x-a_0\Vert ^2 \\ \text {s.t.}&\quad g_i(x):=\tfrac{L_i}{2}\Vert x-a_i\Vert ^2 - b_i \le 0, \quad i\in [m]. \end{aligned} \end{aligned}
(6.11)
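For intuition, in the single-constraint case $$m=1$$ (with the normalization $$L_0=L_1=1$$, adopted only for this illustration) the subproblem reduces to the Euclidean projection of $$a_0$$ onto the ball $$\{x: \tfrac{1}{2}\Vert x-a_1\Vert ^2 \le b_1\}$$ and admits a closed form:

```python
import numpy as np

def single_ball_qcqp(a0, a1, b1):
    """min 0.5*||x-a0||^2  s.t.  0.5*||x-a1||^2 <= b1  (m = 1, L_0 = L_1 = 1).
    Equivalently: project a0 onto the ball of radius sqrt(2*b1) around a1."""
    r = np.sqrt(2.0*b1)
    d = np.linalg.norm(a0 - a1)
    if d <= r:
        return np.array(a0, dtype=float)   # a0 is already feasible
    return a1 + (r/d)*(a0 - a1)            # boundary point on the segment

x = single_ball_qcqp(np.array([3.0, 0.0]), np.array([0.0, 0.0]), 0.5)
# projection of (3, 0) onto the unit ball around the origin is (1, 0)
```

With several balls the problem no longer has a closed form, which motivates the interior point treatment that follows.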

We assume that the initial solution $$\hat{x}$$ of this problem is strictly feasible; namely, there exists $$\delta >0$$ such that

\begin{aligned} g_i(\hat{x})\le -\delta ,\quad i=1,2,\ldots ,m. \end{aligned}
(6.12)

Let $$e_1=[1,0,\ldots ,0]^\textrm{T}\in \mathbb {R}^{d+1}$$. With a slight abuse of notation, we can formulate (6.11) as the following problem

\begin{aligned} \begin{aligned} \min&\qquad \quad e_1^\textrm{T}u \\ \text {s.t.}&\quad \tilde{g}_0(u) = g_0(x)-\eta \le 0, \\&\quad \tilde{g}_i(u) = g_i(x)\le 0,\ i\in [m], \\&\quad \tilde{g}_{m+1}(u)=\tfrac{L_{m+1}}{2}\Vert u-(0, a_{m+1})^\textrm{T}\Vert ^2-b_{m+1} \le 0, \\&\quad u = (\eta , x) \in \mathbb {R}\times \mathbb {R}^d. \end{aligned} \end{aligned}
(6.13)

Here we set the artificial parameters $$L_{m+1}=1, a_{m+1}=0$$ and $$b_{m+1}=\tfrac{1}{2}R^2$$ for some sufficiently large R. We explicitly add this ball constraint to ensure the boundedness of $$(\eta , x)$$. Note that such a bound R always exists since our domain is compact and the objective is Lipschitz continuous. Our goal is to apply the path-following method to solve (6.13). We denote

\begin{aligned} \phi (u)= -{\textstyle {\sum }}_{i=0}^{m+1}\log \big (-\tilde{g}_i(u)\big ). \end{aligned}
(6.14)

Since each $$\tilde{g}_i(u)$$ is convex quadratic in u, $$\phi (u)$$ is a self-concordant barrier with parameter $$\upsilon =m+2$$. The key idea of the path-following algorithm is to approximately solve a sequence of penalized problems

\begin{aligned} \min _{u}\ \phi _\tau (u) := \tau \eta +\phi (u) \end{aligned}
(6.15)

with increasing values of $$\tau$$, and to generate a sequence of strictly feasible solutions $$u_\tau$$ close to the central path, a trajectory composed of the minimizers $$u_\tau ^*=\mathop {\text {argmin}}\limits _u \phi _\tau (u)$$.

We apply a standard path-following algorithm (see [28, Chapter 4]) to solve (6.13), and outline the overall procedure in Algorithm 4. This algorithm consists of two main steps:

1. 1.

Initialization: We seek a solution $$u^0$$ near the analytic center (i.e. the minimizer of $$\phi (u)$$). To this end, we solve a sequence of auxiliary problems $$\hat{\phi }_\tau (u)=\tau w^\textrm{T}u + \phi (u)$$, where $$w=-\nabla \phi (\hat{u})$$. It can be readily seen that $$\hat{u}$$ is on the central path of this auxiliary problem with $$\tau =1$$. Performing a reverse path-following scheme (decreasing rather than increasing $$\tau$$), we gradually converge to the analytic center.

2. 2.

Path-following: We solve a sequence of penalized problems with an increasing value of $$\tau$$ by a damped version of Newton’s method, which ensures the solutions in the proximity of the central path.
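The path-following step can be illustrated on a toy instance of (6.15), minimizing $$u_1$$ over a single ball (so only the constraint $$\tilde{g}_{m+1}$$ is present). The schedule below (growth factor, iteration counts) is illustrative rather than the exact parameters of Algorithm 4; damped Newton steps keep the iterate strictly feasible while $$\tau$$ grows multiplicatively:

```python
import numpy as np

# Toy instance of (6.15): minimize tau*e1^T u + phi(u) with one ball constraint
# g(u) = 0.5*||u||^2 - 0.5*R^2 <= 0 and barrier phi(u) = -log(-g(u)).
def path_following(R=1.0, tau0=1.0, growth=1.25, outer=60):
    u, e1, tau = np.zeros(2), np.array([1.0, 0.0]), tau0
    for _ in range(outer):
        for _ in range(20):                       # damped Newton re-centering
            g = 0.5*(u @ u) - 0.5*R**2
            theta = -1.0/g
            grad = tau*e1 + theta*u               # gradient of phi_tau
            hess = theta**2*np.outer(u, u) + theta*np.eye(2)
            step = np.linalg.solve(hess, -grad)
            dec = np.sqrt(step @ hess @ step)     # Newton decrement
            u = u + step/(1.0 + dec)              # damping keeps u feasible
            if dec < 1e-8:
                break
        tau *= growth                             # follow the central path
    return u

u = path_following()
# the minimizer of u_1 over the unit ball is (-1, 0); u approaches it as tau grows
```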

#### 6.2.1 Solving the Newton equation

First, we calculate the gradient and Hessian of $$\phi _\tau (\cdot )$$:

\begin{aligned} \nabla \phi _\tau (u)&= \tau e_1 +{\textstyle {\sum }}_{i=0}^{m+1} \theta _i \nabla \tilde{g}_i(u), \\ \nabla ^2 \phi _\tau (u)&= {\textstyle {\sum }}_{i=0}^{m+1} \theta _i^2 \nabla \tilde{g}_i(u)\nabla \tilde{g}_i(u)^\textrm{T}+{\textstyle {\sum }}_{i=0}^{m+1}\theta _i \nabla ^2\tilde{g}_i(u) = NN^\textrm{T}+\varGamma , \end{aligned}

where $$\theta _i = -\tilde{g}_i(u)^{-1}$$, and

\begin{aligned} \begin{aligned} N&= \begin{bmatrix} \theta _0 \nabla \tilde{g}_0(u),\ldots , \theta _{m+1} \nabla \tilde{g}_{m+1}(u) \end{bmatrix} \in \mathbb {R}^{(d+1)\times (m+2)}, \\ \varGamma&= \begin{bmatrix} \theta _{m+1}L_{m+1} &{} 0 \\ 0 &{} {\textstyle {\sum }}_{i=0}^{m+1}\theta _iL_i I_d \end{bmatrix} \in \mathbb {R}^{(d+1)\times (d+1)}. \end{aligned} \end{aligned}

Note that computing the gradient $$\nabla \phi _\tau (u)$$ takes $${\mathcal {O}}(md)$$ operations; hence the main computational burden comes from forming and solving the Newton system. We consider two cases.

1. 1.

$$m < d$$. In this case, the Hessian is the sum of a low-rank matrix and a diagonal matrix. By the Sherman-Morrison-Woodbury formula, we have

\begin{aligned}{}[\nabla ^2\phi _\tau (u)]^{-1} = \varGamma ^{-1}-\varGamma ^{-1}N\big (I+N^\textrm{T}\varGamma ^{-1} N\big )^{-1}N^\textrm{T}\varGamma ^{-1}. \end{aligned}
(6.16)

Computing the product $$N^\textrm{T}\varGamma ^{-1} N$$ takes $${\mathcal {O}}(m^2d)$$ operations, while performing the Cholesky factorization takes $${\mathcal {O}}({m^3})$$. Therefore, the overall complexity of each Newton step is $${\mathcal {O}}(m^3+m^2d)={\mathcal {O}}(m^2d)$$.

2. 2.

$$m \ge d$$. In this case, we can directly compute $$NN^\textrm{T}$$ in $${\mathcal {O}}(md^{2})$$ operations and then perform the Cholesky factorization $$\nabla ^2\phi _\tau (u)=LL^\textrm{T}$$ in $${\mathcal {O}}(d^{3})$$, followed by solving two triangular systems. Hence the overall complexity of a Newton step is $${\mathcal {O}}(d^{3}+md^2)={\mathcal {O}}(md^2)$$.

In view of the above discussion, the cost of solving each Newton system is

\begin{aligned} {\mathcal {O}}\big (\min \{d, m\}\cdot md\big ). \end{aligned}
(6.17)
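The low-rank case $$m<d$$ can be sketched as follows, a minimal numerical illustration of applying (6.16), with random data standing in for the actual barrier quantities N and $$\varGamma$$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5
N = rng.standard_normal((d + 1, m + 2))          # scaled constraint gradients
gamma = rng.uniform(1.0, 2.0, d + 1)             # diagonal of Gamma
H = N @ N.T + np.diag(gamma)                     # Hessian nabla^2 phi_tau(u)
rhs = rng.standard_normal(d + 1)

# Woodbury formula (6.16): apply H^{-1} to rhs using only an
# (m+2) x (m+2) factorization instead of a (d+1) x (d+1) one.
Ginv_rhs = rhs/gamma
GinvN = N/gamma[:, None]
small = np.eye(m + 2) + N.T @ GinvN              # I + N^T Gamma^{-1} N
step = Ginv_rhs - GinvN @ np.linalg.solve(small, N.T @ Ginv_rhs)

assert np.allclose(step, np.linalg.solve(H, rhs))  # matches the direct solve
```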

#### 6.2.2 Complexity

Before deriving the complexity of solving the subproblems, we require some additional assumptions. We assume that $$M=\max _{x}\max _{0\le i\le m}\Vert \nabla g_i(x)\Vert$$ and $$\max _x g_0({x})-\min _x g_0(x)\le V$$. Note that these assumptions are easily satisfied if the functions in the original problem have bounded level sets.

According to [28, Theorem 4.5.1], the complexity of interior point methods depends not only on the time to follow the central path, but also on the time to arrive near the analytic center from an arbitrary initial point. Let us put it in the context of Algorithm 1. Despite the strict feasibility guarantee, we do not know whether $$x^k$$ is near the analytic center of each subproblem. It remains to show how to control the complexity of approximating the analytic center.

To measure the strict feasibility of the initial point, we use the Minkowski function of the domain D, which is defined by $$\pi _x(y)=\inf \{t>0: \, x+t^{-1}(y-x)\in D\}$$ for any given x in the interior of the domain. With the help of the Minkowski function, we bound the distance between the initial point and the boundary in the following proposition.

### Proposition 6

Let $$\hat{u}=(\hat{\eta }, \hat{x})$$ where $$\hat{\eta }=g_0(\hat{x})+\delta$$. If $$\Vert u-\hat{u}\Vert \le \tfrac{\delta }{M+1}$$, then u is feasible for problem (6.13). Moreover, we have

\begin{aligned} \pi _{u^*}(\hat{u})\le \tfrac{(M+1)R}{(M+1)R+\delta }, \end{aligned}
(6.18)

where $$u^*$$ is defined in phase zero of Algorithm 4.

### Proof

We have

\begin{aligned} \vert \tilde{g}_0(u)-\tilde{g}_0(\hat{u})\vert \le \vert g_0(x)-g_0(\hat{x}) \vert +\vert \eta -\hat{\eta }\vert \le \tfrac{M}{M+1}\delta + \tfrac{\delta }{M+1} = \delta . \end{aligned}

Analogously, for $$i=1,\ldots ,m$$, we have

\begin{aligned} | g_i(x)-g_i(\hat{x})| \le M \Vert x-\hat{x}\Vert \le \tfrac{M}{M+1}\delta \le \delta . \end{aligned}

Using the triangle inequality, we have $$\tilde{g}_i(u)=g_i(x)\le g_i(\hat{x})+\delta \le 0$$. The last constraint in (6.13) is trivially satisfied for sufficiently large R. Therefore, u is a feasible point of (6.13).

Let $$t^+=\tfrac{(M+1)\Vert \hat{u}-u^* \Vert }{(M+1)\Vert \hat{u}-u^*\Vert +\delta }$$, then from the above analysis, we know that the point

\begin{aligned} u^+ =u^*+\tfrac{1}{t^+} (\hat{u}-u^*) = \hat{u}+\tfrac{\delta (\hat{u}-u^*)}{(M+1)\Vert \hat{u}-u^*\Vert } \end{aligned}

must be a feasible solution. Using the last constraint $$\Vert u\Vert \le R$$, we immediately obtain the bound (6.18). $$\square$$

Using [28, Theorem 4.5.1] and Proposition 6, we can derive the total complexity of solving the diagonal QCQP.

### Theorem 9

Under the assumptions of Proposition 6, the total number of Newton steps to get an $$\varepsilon$$ solution is

\begin{aligned} N_{\varepsilon } = {\mathcal {O}}(1)\sqrt{m+2}\ln \left( \tfrac{(m+2) V((M+1)R+\delta )}{\delta \varepsilon } +1\right) . \end{aligned}

### Corollary 3

In the inexact LCPG method, assume that the subproblems are solved by Algorithm 4 and the returned solutions satisfy the inexactness requirement in Theorem 8. Then, to get an $$(\epsilon , \epsilon )$$ Type-II KKT point, the overall arithmetic cost of Algorithm 4 is

\begin{aligned} {\mathcal {T}}= {\mathcal {O}}\big (\min \{m, d\}\cdot m^{1.5}d \cdot \tfrac{1}{\varepsilon } \ln \big (\tfrac{1}{\varepsilon }\big )\big ). \end{aligned}

### Proof

According to Theorem 8, the total number of LCPG iterations is $$K={\mathcal {O}}(1/\varepsilon )$$. In the kth iteration of LCPG, we set the error criteria to $$\nu ={\mathcal {O}}(\tfrac{1}{k^2})$$ and $$\varepsilon ={\mathcal {O}}(\tfrac{1}{k^2})$$. Theorem 9 implies that the number of Newton steps is $$N_k={\mathcal {O}}(\sqrt{m}\ln (k))$$. Therefore, the total number of Newton steps in LCPG is $$T_K = {\textstyle {\sum }}_{k=0}^K N_k = {\mathcal {O}}\big (\sqrt{m}\tfrac{1}{\varepsilon }\ln \big (\tfrac{1}{\varepsilon }\big )\big ).$$ Combining this result with (6.17) gives the desired bound. $$\square$$

### Remark 10

First, at the kth step of LCPG, we need $${\mathcal {O}}(\ln (k))$$ iterations of the interior point method, whose complexity is contributed equally by the two phases of the IPM. Specifically, we first require $${\mathcal {O}}(\ln (k))$$ Newton steps to pull the iterates from near the boundary to the proximity of the central path, and then require $${\mathcal {O}}(\ln (k))$$ further Newton steps to obtain an $${\mathcal {O}}(1/k^2)$$-accurate solution. Second, it is interesting to consider the case when the number of constraints is far smaller than the feature dimension, namely, $$m \ll d$$. We observe that the total computation

\begin{aligned} {\mathcal {O}}\big (d m^{2.5} K\ln K\big ) \end{aligned}

is linear in the dimensionality. Third, despite its simplicity, the basic barrier method provides a stronger approximate solution than what is needed in Theorem 8: the feasibility of the solution path allows us to weaken the assumption to $$\hat{\varepsilon }_k=0$$. Nevertheless, besides our approach, it is possible to employ long-step and infeasible primal-dual interior point methods, which may give better empirical performance.

### 6.3 Solving subproblems with the first-order method

In this section, we use the previously proposed ConEx method [4] to solve the subproblem (3.1) when general proximal functions $$\chi _{i}$$ are present. We then analyze the overall complexity of the LCPG method with ConEx as the subproblem solver. First, we formally state the extended version of problem (6.11) as follows:

\begin{aligned} \begin{aligned} \min _{x \in X} \quad&\phi _0(x):=g_0(x) + \chi _{0}(x)\\ \text {s.t.} \quad&\phi _i(x) :=g_i(x) + \chi _{i}(x) \le 0, \quad i = 1, \dots , m. \end{aligned} \end{aligned}
(6.19)

For the application of ConEx to the subproblem, we need access to a convex compact set X such that $$\cap _i{\textrm{dom}}{\chi _{i}} \subseteq X$$. Moreover, X is a “simple” set in the sense that it allows easy computation of the proximal operator of $$\chi _{0}(x) + {\textstyle {\sum }}_{i=1}^{m}w_i\chi _{i}(x)$$ for any given weights $$w_i, i =1, \dots , m$$. Such assumptions are not very restrictive, as many machine learning and engineering problems explicitly seek the optimal solution from a bounded set. Under these assumptions, we apply ConEx to solve the subproblem (3.1) of LCPG. We now reproduce a simplified optimality guarantee of the ConEx method below without going into the details of the algorithm.
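As a concrete illustration of such a "simple" prox (this specific instance is our own example, not a construction from the paper): when $$\chi _{0}$$ and every $$\chi _{i}$$ are multiples of the $$\ell _1$$ norm and X is a Euclidean ball of radius r, the weighted proximal operator has a closed form, namely soft-thresholding followed by radial scaling onto the ball, which follows from the KKT conditions of the prox problem.

```python
import math

def prox_l1_ball(v, lam, r):
    """Minimize lam*||x||_1 + 0.5*||x - v||^2 subject to ||x|| <= r.

    The KKT conditions give x = soft(v, lam)/(1 + mu) with mu >= 0 chosen
    so that ||x|| <= r, i.e., the soft-thresholded point radially scaled
    onto the ball when it falls outside.
    """
    s = [math.copysign(max(abs(vj) - lam, 0.0), vj) for vj in v]  # soft-threshold
    norm = math.sqrt(sum(x * x for x in s))
    if norm > r:
        s = [x * r / norm for x in s]  # radial projection onto the ball
    return s
```

With weights $$w_i$$ present, the same operator applies with the aggregated threshold `lam` set to the weighted sum of the individual $$\ell _1$$ coefficients (hypothetical coefficients, for illustration only).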

### Theorem 10

[4] Let x be the output of ConEx after T iterations for problem (6.19). Assume that $$\phi _0$$ is a strongly convex function and $$(\widetilde{x}, \widetilde{\lambda })$$ is the optimal primal-dual solution. Moreover, let B be a parameter of the ConEx method which satisfies $$B> \Vert \widetilde{\lambda }\Vert _{}^{}$$. Then, the solution x satisfies

\begin{aligned} \phi _0(x) - \phi _0(\widetilde{x})&\le O\big (\tfrac{1}{T^2}( B^2 + \Vert \widetilde{\lambda }\Vert _{}^{2} )\big ),\\ \Vert [\phi (x)]_+ \Vert _{}^{}&\le O\big (\tfrac{1}{T^2}( B^2 + \Vert \widetilde{\lambda }\Vert _{}^{2} )\big ). \end{aligned}

Even though ConEx can be applied to a wider variety of convex function constrained problems, it has two vital and intricate issues that need to be addressed in our context:

1. The solution path of ConEx can be arbitrarily infeasible in the early iterations, while successive iterations make the solution's infeasibility smaller. Note that the approximation criterion in Definition 7 requires guarantees on the amount of infeasibility. This implies that ConEx has to run a significant number of iterations before getting sufficiently close to the feasible set.

2. Since ConEx is a primal-dual method, its convergence guarantees depend on the optimal dual solution $$\lambda ^*$$. Moreover, a bound on the dual, $$B (> \Vert \lambda ^*\Vert _{}^{})$$, is required to implement the algorithm and achieve the accelerated convergence rate of $$O(1/T^2)$$ for strongly convex problems.

From Theorem 10, it is clear that ConEx requires a bound B. This requirement naturally leads to two cases: (1) the bound B can be estimated a priori, e.g., see Lemma 2; and (2) the bound B is known to exist but cannot be estimated, e.g., see Theorem 1. The two cases yield different convergence rates for the subproblem, which leads to different overall computational complexities.

Case 1: B can be estimated a priori. In this case, we do not need to estimate $$B^k$$ as in (6.20). Using the bound B, we get the accelerated convergence of ConEx in accordance with Theorem 10, which leads to better performance of the LCPG method. The corollary below formally states the total computational complexity of the LCPG method for this case.

### Corollary 4

If an explicit value of B is known, the LCPG method with ConEx as the subproblem solver obtains an $$O(\tfrac{1}{K},\tfrac{1}{K})$$ type-II KKT point in $$O(K^2)$$ computations.

### Proof

According to Theorem 10, the required ConEx iterations for each subproblem can be bounded by

\begin{aligned} T^k = O(\tfrac{B}{\sqrt{\epsilon _k}}). \end{aligned}

Since B is a constant, we have $$T^k = O(\epsilon _k^{-1/2}) = O(k)$$. Finally, we have total computations $${\textstyle {\sum }}_{k=1}^K T^k = O(K^2)$$. Hence, we conclude the proof. $$\square$$

Case 2: B is known to exist but cannot be estimated. For the subproblem (3.1), we can easily find $$B^k > \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{}$$ by using the difference in levels of successive iterations. This bound can be weak, especially in the limiting case, since the level differences $$\delta ^k$$ vanish as k increases.

### Proposition 7

For subproblem (3.1), we have

\begin{aligned} \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{} \le \tfrac{\psi _{0}(x^k) - \psi _{0}^*}{\min _{i \in [m]}\delta ^k_i}. \end{aligned}
(6.20)

### Proof

By Slater’s condition, we know that $$\widetilde{\lambda }^{k+1}$$ exists. Then, due to the saddle point property of $$(\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})$$, we have for all $$x \in X$$

\begin{aligned} \psi ^k_{0}(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) -\eta ^{k+1}\rangle&\ge \psi ^k_{0}(\widetilde{x}^{k+1}) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(\widetilde{x}^{k+1}) -\eta ^{k+1}\rangle \\&=\psi ^k_{0}(\widetilde{x}^{k+1}) , \end{aligned}

where the equality follows by complementary slackness. Using $$x = x^k$$ in the above relation and noting that $$x^k$$ satisfies $$\psi (x^k) \le \psi ^{k-1}(x^k) \le \eta ^k$$, we have $$\eta ^{k+1} - \psi ^{k}(x^k) = \eta ^{k+1} - \psi (x^k) \ge \eta ^{k+1} - \eta ^k = \delta ^k$$ implying that

\begin{aligned} \psi _{0}^k(x^k) - \psi _{0}^k(\widetilde{x}^{k+1})&\ge \langle \widetilde{\lambda }^{k+1}, \delta ^k\rangle \ge \Vert \widetilde{\lambda }^{k+1}\Vert _{1}^{} \cdot \min _{i \in [m]}\delta _{i}^k \ge \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{} \cdot \min _{i \in [m]}\delta _{i}^k, \end{aligned}

where the second inequality follows from $$\widetilde{\lambda }^{k+1} \ge 0$$ and $$\delta ^k > 0$$, and the last inequality follows from the fact that $$\Vert \widetilde{\lambda }^{k+1}\Vert _{1}^{} \ge \Vert \widetilde{\lambda }^{k+1}\Vert _{}^{}$$. We can further upper bound the LHS of the above relation as follows:

\begin{aligned} \psi ^k_{0}(x^k) - \psi _{0}^k(\widetilde{x}^{k+1})= \psi _{0}(x^k) - \psi _{0}^k(\widetilde{x}^{k+1}) \le \psi _{0}(x^k) - \psi _{0}(\widetilde{x}^{k+1}) \le \psi _{0}(x^k) - \psi _{0}^*, \end{aligned}

where the last inequality follows since $$\widetilde{x}^{k+1}$$ is feasible for the original problem (1.1). Combining the above two relations, we obtain (6.20). Hence, we conclude the proof. $$\square$$

We now state the final computational complexity of LCPG with ConEx, which uses the bound in (6.20).

### Corollary 5

If an explicit value of B is not known, the LCPG method with ConEx as the subproblem solver obtains an $$O(\tfrac{1}{K},\tfrac{1}{K})$$ type-II KKT point in $$O(K^4)$$ computations.

### Proof

Using Proposition 7, we can set $$B^k:= \tfrac{\psi _{0}(x^k)-\psi _{0}^*}{\delta _{i_*}^k}$$ where $$i_*:= \mathop {\text {argmin}}\limits _{i \in [m]} \eta _i - \eta ^0_i$$. Then, the required number of ConEx iterations $$T^k$$ can be bounded by

\begin{aligned} T^k = O\big (\tfrac{B^k}{\sqrt{\epsilon _{k}}}\big ). \end{aligned}

Finally, in view of (6.9), the fact that $${\textstyle {\sum }}_{i=0}^\infty \epsilon _i \le \Vert \eta - \eta ^0\Vert _{}^{}$$ implies that $$B^k \le \tfrac{1}{\delta _{i_*}^k}[\psi _{0}(x^0) - \psi _{0}^*+ \Vert \eta -\eta ^0\Vert _{}^{}]$$ for all k. Moreover, for all $$k \le K$$, we have $$\epsilon _k = \tfrac{\delta ^k_{i_*}}{2}$$. Hence, we get $$T^k = O({\epsilon _k^{-3/2}}) = O(k^3)$$. Finally, we have $$\sum _{k= 1}^{K} T^k = O(K^4)$$, which is the overall computational complexity of the LCPG method with ConEx as the subproblem solver to obtain an $$(O(\tfrac{1}{K}), O(\tfrac{1}{K}))$$ type-II KKT point. $$\square$$
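The growth rates $$T^k = O(k^3)$$ and $$\sum _k T^k = O(K^4)$$ can be sanity-checked numerically with the schedule $$\delta ^k \propto \tfrac{1}{(k+1)(k+2)}$$ used in the convex analysis; all constants below are illustrative, not quantities from the paper:

```python
def conex_iteration_budget(K):
    # Illustrative constants: delta_k = 1/((k+1)(k+2)), eps_k = delta_k/2,
    # B_k = 1/delta_k, so T_k = B_k/sqrt(eps_k) = O(eps_k^{-3/2}) = O(k^3).
    total = 0.0
    for k in range(1, K + 1):
        delta = 1.0 / ((k + 1) * (k + 2))
        eps = delta / 2.0
        B = 1.0 / delta
        total += B / eps ** 0.5
    return total
```

The accumulated budget grows like $$K^4$$ up to a constant, matching Corollary 5.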

### Remark 11

[Gradient complexity vs. computational complexity] Note that evaluating the gradient of $$\psi _{i}^k(x)$$ is relatively cheap since it does not involve any new computation of $$\nabla f_i(x)$$. In that sense, the entire inner loop requires only one computation of $$\nabla f_i$$; hence, the total gradient complexity in terms of $$\nabla f_i$$ equals the total number of outer loops of inexact LCPG. On the other hand, the inner-loop computation does contribute to the problem’s computational complexity. However, such iterations are expected to be very cheap given the ease of obtaining gradients for the QP subproblem (3.1) with identity Hessian matrices.

## 7 LCPG for convex optimization

In this section, we establish the complexity of LCPG (i.e., Algorithm 1) when the objective $$f_0$$ and the constraints $$f_i, i \in [m]$$, are convex. In particular, we consider two convex problems, depending on whether $$f_0$$ is convex or strongly convex. To provide a combined analysis of the two cases, we assume the following:

### Assumption 6

$$f_0(x)$$ is a $$\mu _0$$-convex function for some $$\mu _0\ge 0$$. Namely,

\begin{aligned} f_0(x)\ge f_0(y)+\langle \nabla f_0(y),x-y\rangle +\tfrac{\mu _0}{2}\Vert x-y\Vert _{}^{2}, \hspace{1em}\text {for any }x,y\in \mathbb {R}^d. \end{aligned}

Note that if $$\mu _0= 0$$ then $$f_0$$ is simply a convex function. Now we provide the convergence rate of LCPG to optimality.

For greater generality, we consider an inexact variant of LCPG for which an approximate solution in the sense of Definition 7 is returned in each iteration. Let $$(\widetilde{x}^{k+1}, \widetilde{\lambda }^{k+1})$$ be the saddle point of $${\mathcal {L}}_k(x, \lambda )$$, i.e., $$\widetilde{x}^{k+1}$$ is an exact solution of the subproblem (3.1). First, we extend the three-point inequality in Lemma 1 to an inexact solution.

### Lemma 8

Let $$z^+$$ be an $$\epsilon$$-approximate solution of the problem $$\min _{x \in \mathbb {R}^d}\{g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2}\}$$, where g(x) is a proper, lsc, and convex function. Then,

\begin{aligned} g(z^+) -g(x) \le \tfrac{\gamma }{2}\big [ \Vert z-x\Vert _{}^{2} - \Vert z^+-x\Vert _{}^{2} - \Vert z^+-z\Vert _{}^{2}\big ] + \epsilon + {\sqrt{2\gamma \epsilon }}\, \Vert z^+-x\Vert _{}^{}. \end{aligned}
(7.1)

### Proof

First, let $$x^+$$ be the optimal solution of $$\min _{x \in \mathbb {R}^d}\{g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2}\}$$. In view of Lemma 1, for any x, we have

\begin{aligned} g(x^+)+\tfrac{\gamma }{2}\Vert x^+-z\Vert ^2 + \tfrac{\gamma }{2}\Vert x-x^+\Vert ^2 \le g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert ^2. \end{aligned}
(7.2)

Placing $$x=z^+$$ above, we have

\begin{aligned} g(x^+)+\tfrac{\gamma }{2}\Vert x^+-z\Vert ^2 + \tfrac{\gamma }{2}\Vert z^+-x^+\Vert ^2 \le g(z^+) + \tfrac{\gamma }{2}\Vert z^+-z\Vert ^2. \end{aligned}
(7.3)

On the other hand, by the definition of $$\epsilon$$-solution, we have

\begin{aligned} g(z^+)+\tfrac{\gamma }{2}\Vert z^+-z\Vert ^2 \le g(x^+)+\tfrac{\gamma }{2}\Vert x^+-z\Vert ^2 + \epsilon . \end{aligned}
(7.4)

Combining the above two inequalities gives

\begin{aligned} \tfrac{\gamma }{2}\Vert z^+-x^+\Vert ^2 \le \epsilon . \end{aligned}
(7.5)

Summing up (7.2) and (7.4) and then rearranging the terms, we get

\begin{aligned} g(z^+) - g(x)&\le \tfrac{\gamma }{2}\Vert x-z\Vert ^2 - \tfrac{\gamma }{2}\Vert z^+-z\Vert ^2 - \tfrac{\gamma }{2}\Vert x-x^+\Vert ^2 +\epsilon \nonumber \\&\le \tfrac{\gamma }{2}\Vert x-z\Vert ^2 {-} \tfrac{\gamma }{2}\Vert z^+-z\Vert ^2 {-} \tfrac{\gamma }{2}\Vert x-z^+\Vert ^2 {+}\gamma \Vert x-z^+\Vert \Vert z^+{-}x^+\Vert {+}\epsilon , \end{aligned}

where the last inequality uses the fact that $$-\tfrac{1}{2}\Vert a+b\Vert ^2\le -\tfrac{1}{2}\Vert a\Vert ^2-\langle a, b\rangle \le -\tfrac{1}{2}\Vert a\Vert ^2+\Vert a\Vert \Vert b\Vert$$ with $$a=x-z^+$$ and $$b=z^+-x^+$$. Finally, combining the above two results gives the desired inequality (7.1). $$\square$$
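Lemma 8 can be exercised numerically on a one-dimensional instance with $$g(x)=|x|$$, whose exact prox is soft-thresholding; the specific numbers below are our own illustrative choices:

```python
def lemma8_gap(x, z, z_plus, gamma, eps):
    # RHS minus LHS of (7.1) for g = |.|; nonnegative whenever z_plus is an
    # eps-approximate minimizer of |x| + (gamma/2)*(x - z)^2.
    lhs = abs(z_plus) - abs(x)
    rhs = (gamma / 2) * ((z - x) ** 2 - (z_plus - x) ** 2 - (z_plus - z) ** 2)
    rhs += eps + (2 * gamma * eps) ** 0.5 * abs(z_plus - x)
    return rhs - lhs

# With gamma = 2 and z = 1 the exact prox is x+ = 0.5 (soft-threshold by
# 1/gamma); z+ = 0.6 is then an eps-approximate solution with eps = 0.01.
```

Scanning x over a grid confirms that the gap stays nonnegative, i.e., (7.1) holds for this inexact prox point.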

Using the above lemma, we provide the main convergence property of LCPG for convex optimization.

### Lemma 9

Let x be a feasible solution. Then, we have

\begin{aligned} \psi _{0}(x^{k+1}) - \psi _{0}(x)&\le \langle \widetilde{\lambda }^{k+1}, \psi (x) - \eta ^k\rangle + \tfrac{L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle -\mu _0}{2} \Vert x^k-x\Vert _{}^{2} \nonumber \\&\quad - \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2} \Vert x^{k+1}-x\Vert _{}^{2} + {2}\epsilon _k + {\sqrt{2(L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle ) \epsilon _k}} \Vert x^{k+1}-x\Vert _{}^{}. \end{aligned}
(7.6)

### Proof

Note that

\begin{aligned} {\psi _{0}( {x}^{k+1})}&\ {\mathrel {\overset{{\tiny \mathsf{(i)}}}{\le }}}\ \psi _{0}^k( {x}^{k+1}) \mathrel {\overset{{\tiny \mathsf{(ii)}}}{\le }} \psi _{0}^k(\widetilde{x}^{k+1}) + \epsilon _k \nonumber \\&\mathrel {\overset{{\tiny \mathsf{(iii)}}}{=}} \psi _{0}^k(\widetilde{x}^{k+1}) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(\widetilde{x}^{k+1}) - \eta ^k\rangle +\ \epsilon _k \nonumber \\&{\le } \psi _{0}^k(x^{k+1}) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x^{k+1}) - \eta ^k\rangle {+\ \epsilon _k}\nonumber \\&\mathrel {\overset{{\tiny \mathsf{(iv)}}}{\le }} \psi _{0}^k(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) - \eta ^k\rangle - \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2} \Vert x^{k+1}-x\Vert _{}^{2} \nonumber \\&\quad + 2\epsilon _k +{\sqrt{2(L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle ) \epsilon _k}} \Vert x^{k+1}-x\Vert _{}^{}, \end{aligned}
(7.7)

where (i) follows from the definition of $$\psi _{0}^k$$, (ii) follows since $$x^{k+1}$$ is an $$\epsilon _k$$ solution of (3.1), (iii) follows by complementary slackness for the optimal primal-dual solution for (3.1) and (iv) follows from Lemma 8. In particular, we use $$g(x) + \tfrac{\gamma }{2}\Vert x-z\Vert _{}^{2} = \psi _0^k(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) - \eta ^k\rangle$$ with $$z = x^k$$, $$z^+ = x^{k+1}$$, $$\epsilon = \epsilon _k$$ and $$\gamma = L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle$$. Note that $$x^{k+1}$$ is an $$\epsilon _k$$-approximate solution for $$\min _{x \in \mathbb {R}^d} \psi _{0}^k(x) + \langle \widetilde{\lambda }^{k+1}, \psi ^k(x) -\eta ^k\rangle$$ due to Definition 7.

Finally, note that

\begin{aligned} \psi _{0}^k(x) + \tfrac{\mu _0}{2} \Vert x-x^k\Vert _{}^{2}&\le \psi _{0}(x) + \tfrac{L_0}{2}\Vert x-x^k\Vert _{}^{2},\\ \psi _{i}^k(x)&\le \psi _i(x) + \tfrac{L_i}{2} \Vert x-x^k\Vert _{}^{2} . \end{aligned}

Using the above two relations in (7.7), we obtain (7.6). Hence, we conclude the proof. $$\square$$

Let $$x^*$$ be an optimal solution of (1.1) and $$\widetilde{D}:= \max \{\Vert x-y\Vert _{}^{}:x, y \in {\textrm{dom}}{\chi _{0}}, \psi _i(x) \le \eta _i, \psi _{i}(y) \le \eta _{i}, \text { for all } i\in [m] \}$$. Now, we show convergence rate guarantees.

### Theorem 11

Consider general convex optimization problems with $$\mu _0 = 0$$. Suppose Assumption 3 is satisfied and set $$\delta ^k = \tfrac{(\eta -\eta ^0)}{(k+1)(k+2)}$$. Then we have

\begin{aligned} \psi _{0}(\bar{x}_K) - \psi _{0}(x^*) \le \tfrac{L_0+ B\Vert L\Vert _{}^{}}{(K+1)}\big [ \widetilde{D}^2 + \tfrac{(4B+2)\Vert \eta -\eta ^0\Vert _{}^{}}{L_0} + \widetilde{D}\sqrt{\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}} +\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}\tfrac{\log {K}}{K} \big ] \end{aligned}
(7.8)

### Proof

From Lemma 9 with $$\mu _0 = 0$$ (the convex case) and using $$\psi (x^*) \le \eta$$, we have

\begin{aligned}&\psi _{0}(x^{k+1}) - \psi _{0}(x^*)\\&\quad \le \langle \widetilde{\lambda }^{k+1}, \eta -\eta ^k\rangle + \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2}\Vert x^k-x^*\Vert _{}^{2} - \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2}\Vert x^{k+1}-x^*\Vert _{}^{2} \\&\qquad + 2\epsilon _k + {\sqrt{2(L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle ) \epsilon _k}} \Vert x^{k+1}-x^*\Vert _{}^{}. \end{aligned}

Dividing both sides by $$\tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2}$$, we have

\begin{aligned}{} & {} \tfrac{2[\psi _{0}(x^{k+1}) - \psi _{0}(x^*) ]}{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle } \le \Vert x^k-x^*\Vert _{}^{2} - \Vert x^{k+1}-x^*\Vert _{}^{2} \nonumber \\{} & {} \qquad + \tfrac{2\langle \widetilde{\lambda }^{k+1}, \eta - \eta ^k\rangle }{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle } + \tfrac{4\epsilon _k}{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle } + \sqrt{\tfrac{8\epsilon _k}{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle } }\Vert x^{k+1}-x^*\Vert _{}^{} \end{aligned}

Note that the sequence $$\{\widetilde{\lambda }^{k+1}\}$$ is uniformly bounded, i.e., $$\Vert \widetilde{\lambda }^{k+1}\Vert _{}^{} \le B$$ for all $$k \ge 0$$. Using this fact and the above relation, we have

\begin{aligned} \tfrac{2[\psi _{0}(x^{k+1}) {-} \psi _{0}(x^*) ]}{L_0 + B\Vert L\Vert _{}^{}} {\le } \Vert x^k{-}x^*\Vert _{}^{2} {-} \Vert x^{k+1}{-}x^*\Vert _{}^{2} {+} \tfrac{2B\Vert \eta {-}\eta ^k\Vert _{}^{}}{L_0} {+} \tfrac{4\epsilon _k}{L_0} {+} \sqrt{\tfrac{8\epsilon _k}{L_0}}\Vert x^{k+1}{-}x^*\Vert _{}^{}. \end{aligned}
(7.9)

Using $$\delta ^k = \tfrac{\eta -\eta ^0}{(k+1)(k+2)}$$ and $$\epsilon _k = \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2(k+1)(k+2)}$$, the iterates $$x^{k}$$ are strictly feasible solutions of (3.1) for all k. Hence, under Assumption 3, we can follow the steps of Theorem 1 to show a uniform bound B on the sequence $$\{\Vert \widetilde{\lambda }^k\Vert _{}^{}\}$$. Using these values in (7.9), we have

\begin{aligned}{} & {} \tfrac{2[\psi _{0}(x^{k+1}) - \psi _{0}(x^*) ]}{L_0 + B\Vert L\Vert _{}^{}} \le \Vert x^k-x^*\Vert _{}^{2} - \Vert x^{k+1}-x^*\Vert _{}^{2} + \tfrac{2B\Vert \eta -\eta ^0\Vert _{}^{}}{L_0(k+1)}\nonumber \\{} & {} \quad + \tfrac{2\Vert \eta -\eta ^0\Vert _{}^{}}{L_0(k+1)(k+2)} + \sqrt{\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}}\tfrac{1}{k+1}\Vert x^{k+1}-x^*\Vert _{}^{}. \end{aligned}
(7.10)

Due to the optimality of the exact solution $$\widetilde{x}^{k+1}$$, we have $$\psi ^k_{0}(\widetilde{x}^{k+1}) \le \psi ^k_{0}(x^k) = \psi _{0}(x^k)$$. We also have $$\psi _0(x^{k+1}) \le \psi _0^k(x^{k+1}) \le \psi _{0}^k(\widetilde{x}^{k+1}) + \epsilon _k$$. Combining these two relations, we get:

\begin{aligned} \psi _{0}(x^{k+1}) \le \psi _{0}(x^k) + \epsilon _k. \end{aligned}

Effectively, the inexact LCPG method is almost a descent method, up to an additive error of $$\epsilon _k$$. Using this relation recursively, we have

\begin{aligned} \psi _{0}(x^K)&\le \psi _{0}(x^{k+1}) + {\textstyle {\sum }}_{i=k+1}^{K-1}\epsilon _i \\&\le \psi _0(x^{k+1}) + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2}{\textstyle {\sum }}_{i=k+1}^{K-1}\tfrac{1}{(i+1)(i+2)}\\&= \psi _0(x^{k+1}) + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{2}\tfrac{(K-k-1)}{(k+2)(K+1)}. \end{aligned}

Using the above relation in (7.10), we have

\begin{aligned}&\tfrac{2[\psi _{0}(x^K) - \psi _{0}(x^*)]}{L_0 + B\Vert L\Vert _{}^{}} \\&\quad \le \Vert x^k-x^*\Vert _{}^{2} - \Vert x^{k+1}-x^*\Vert _{}^{2} + \tfrac{2B\Vert \eta -\eta ^0\Vert _{}^{}}{L_0(k+1)} \\&\qquad + \tfrac{{2}\Vert \eta -\eta ^0\Vert _{}^{}}{L_0(k+1)(k+2)} + \sqrt{\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}}\tfrac{1}{k+1}\Vert x^{k+1}-x^*\Vert _{}^{}\\&\qquad +\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0+B\Vert L\Vert _{}^{}}\tfrac{K-k-1}{(k+2)(K+1)}. \end{aligned}

Multiplying the above relation by $$k+1$$ and summing from $$k = 0$$ to $$K-1$$, we have

\begin{aligned} \tfrac{K(K+1)[\psi _{0}({x_K}) -\psi _{0}(x^*)]}{L_0 + B\Vert L\Vert _{}^{}}&\le {\textstyle {\sum }}_{k =0}^{K-1}\Vert x^k-x^*\Vert _{}^{2} + \tfrac{2B\Vert \eta -\eta ^0\Vert _{}^{}K}{L_0} + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}\log (K+2)}{L_0} \\&\qquad + \sqrt{\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}}{\textstyle {\sum }}_{k =0}^{K-1}\Vert x^{k+1}-x^*\Vert _{}^{} {+ \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}K}{L_0 + B\Vert L\Vert _{}^{}}}\\&\le K\widetilde{D}^2 + \tfrac{2(B{+1})\Vert \eta -\eta ^0\Vert _{}^{}K}{L_0} +\sqrt{\tfrac{\Vert \eta -\eta ^0\Vert _{}^{}}{L_0}}K\widetilde{D}. \end{aligned}

After rearranging, this relation implies (7.8). Hence, we conclude the proof. $$\square$$
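The accumulated-error computation in this proof uses the telescoping identity $$\sum _{i=k+1}^{K-1}\tfrac{1}{(i+1)(i+2)} = \tfrac{K-k-1}{(k+2)(K+1)}$$, which follows from $$\tfrac{1}{(i+1)(i+2)} = \tfrac{1}{i+1} - \tfrac{1}{i+2}$$ and can be verified in exact arithmetic:

```python
from fractions import Fraction

def error_partial_sum(k, K):
    # sum_{i=k+1}^{K-1} 1/((i+1)(i+2)), computed exactly with rationals
    return sum(Fraction(1, (i + 1) * (i + 2)) for i in range(k + 1, K))

def closed_form(k, K):
    # telescoped value: 1/(k+2) - 1/(K+1)
    return Fraction(K - k - 1, (k + 2) * (K + 1))
```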

### Theorem 12

Consider strongly convex problems ($$\mu _0>0$$) and suppose that Assumption 3 is satisfied. Set $$\delta ^k = \rho ^k(1-\rho )(\eta -\eta ^0)$$ where $$\rho =\tfrac{L_0 - \mu _0}{2(L_0-a\mu _0)}$$, $$2\epsilon _k \le a (1-\rho )\rho ^k\Vert \eta -\eta ^0\Vert _{}^{}$$ and $$a\in (0,1)$$. Then we have

\begin{aligned} \psi _{0}(x^K) - \psi _{0}(x^*)&\le \exp \big (-\tfrac{(1-a)\mu _0 K}{L_0+B\Vert L\Vert _{}^{}-a\mu _0}\big ) (L_0+B\Vert L\Vert _{}^{}-\mu _0)\nonumber \\&\quad \big \{\big [\tfrac{(4B+1)}{2(L_0-\mu _0)}+\tfrac{L_0+B\Vert L\Vert +2a\mu _0}{\mu _0(L_0 - \mu _0)}(1-\rho )\big ] \Vert \eta -\eta ^0\Vert _{}^{} + \tfrac{1}{2}\Vert x^0-x^*\Vert ^2\big \}. \end{aligned}
(7.11)

Moreover, if $$\epsilon _k=0$$, we have

\begin{aligned} \psi _{0}(x^K) - \psi _{0}(x^*)\le & {} \exp \big (-\tfrac{\mu _0 K}{L_0+B\Vert L\Vert _{}^{}}\big ) (L_0+B\Vert L\Vert _{}^{}-\mu _0)\nonumber \\{} & {} \Big \{\big [\tfrac{(4B+1)}{2(L_0-\mu _0)}\big ] \Vert \eta -\eta ^0\Vert _{}^{} + \tfrac{1}{2}\Vert x^0-x^*\Vert ^2\Big \}. \end{aligned}
(7.12)

### Proof

Proceeding similarly to the convex case, using Lemma 9, we obtain

\begin{aligned}&\psi _{0}({x^{k+1}}) - \psi _{0}(x^*)\\&\quad \le \langle \widetilde{\lambda }^{k+1}, \eta -\eta ^k\rangle + \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle - \mu _0}{2}\Vert x^k-x^*\Vert _{}^{2} \\&\qquad - \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle }{2}\Vert x^{k+1}-x^*\Vert _{}^{2} \\&\qquad + 2\epsilon _k + {\sqrt{2(L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle ) \epsilon _k}} \Vert x^{k+1}-x^*\Vert _{}^{}. \end{aligned}

For $$0< a < 1$$, we have

\begin{aligned} {\sqrt{2(L_0+ \langle \widetilde{\lambda }^{k+1}, L\rangle ) \epsilon _k}} \Vert x^{k+1}-x^*\Vert _{}^{} \le \tfrac{L_0+\langle \widetilde{\lambda }^{k+1}, L\rangle }{a\mu _0}\epsilon _k + \tfrac{a\mu _0}{2} \Vert x^{k+1}-x^*\Vert ^2. \end{aligned}

Combining the above two results, we have

\begin{aligned}&\psi _{0}({x^{k+1}}) - \psi _{0}(x^*)\\&\quad \le \langle \widetilde{\lambda }^{k+1}, \eta -\eta ^k\rangle + \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle - \mu _0}{2}\Vert x^k-x^*\Vert _{}^{2}\\&\qquad - \tfrac{L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle -a\mu _0}{2}\Vert x^{k+1}-x^*\Vert _{}^{2} \\&\qquad + \big (2+\tfrac{L_0+\langle \widetilde{\lambda }^{k+1}, L\rangle }{a\mu _0}\big ) \epsilon _k. \end{aligned}

Let us denote

\begin{aligned} \varGamma _k= {\left\{ \begin{array}{ll} 1 &{} \text { if } k = 0;\\ \tfrac{L_0+\langle \widetilde{\lambda }^k,L\rangle -a\mu _0}{L_0+\langle \widetilde{\lambda }^k,L\rangle -\mu _0} \varGamma _{k-1} &{}\text { if } k \ge 1. \end{array}\right. } \end{aligned}

Multiplying both sides of the above inequality by $$\tfrac{\varGamma _{k}}{L_0+\langle \widetilde{\lambda }^{k+1},L\rangle -\mu _0}$$ and noting that $$\eta -\eta ^k = \rho ^k (\eta -\eta ^0)$$ (follows by the choice of $$\delta ^k$$), we obtain

\begin{aligned}&\tfrac{\varGamma _{k}}{L_0+\langle \widetilde{\lambda }^{k+1},L\rangle -\mu _0} \big [\psi _0(x^{k+1})- \psi _0(x^*)\big ] \nonumber \\&\quad \le \tfrac{\varGamma _{k}}{L_0+\langle \widetilde{\lambda }^{k+1},L\rangle -\mu _0} \rho ^k\langle \widetilde{\lambda }^{k+1},\eta -\eta ^0\rangle \nonumber \\&\qquad +\tfrac{\varGamma _k}{2} \Vert x^*-x^{k}\Vert _{}^{2} - \tfrac{\varGamma _{k+1}}{2} \Vert x^*-x^{k+1}\Vert _{}^{2}\nonumber \\&\qquad + \tfrac{ L_0+\langle \widetilde{\lambda }^{k+1}, L\rangle +2a\mu _0}{a\mu _0(L_0 + \langle \widetilde{\lambda }^{k+1}, L\rangle - \mu _0)} \varGamma _{k}\epsilon _k. \end{aligned}
(7.13)

Since $$\Vert \widetilde{\lambda }^k\Vert _{}^{} \le B$$, we have $$(\tfrac{L_0 + B\Vert L\Vert _{}^{}-a\mu _0}{L_0 + B\Vert L\Vert _{}^{} - \mu _0})^k \le \varGamma _k \le (\tfrac{L_0-a\mu _0}{L_0 - \mu _0})^k$$. Moreover, we have $$2\epsilon _k \le a (1-\rho )\rho ^k\Vert \eta -\eta ^0\Vert _{}^{}$$ and $$\rho =\tfrac{L_0 - \mu _0}{2(L_0-a\mu _0)}$$. Using these relations in (7.13), we have

\begin{aligned}&\tfrac{\varGamma _{k}}{L_0+B\Vert L\Vert _{}^{}-\mu _0} \big [\psi _0(x^{k+1})- \psi _0(x^*)\big ] \nonumber \\&\quad \le \tfrac{B\Vert \eta -\eta ^0\Vert _{}^{}}{L_0-\mu _0}\varGamma _{k}\rho ^k + \tfrac{L_0+B\Vert L\Vert +2a\mu _0}{a\mu _0(L_0 - \mu _0)}\varGamma _k\epsilon _k +\tfrac{\varGamma _k}{2} \Vert x^k-x^*\Vert _{}^{2} - \tfrac{\varGamma _{k+1}}{2} \Vert x^{k+1}-x^*\Vert _{}^{2} \nonumber \\&\quad \le \tfrac{B\Vert \eta -\eta ^0\Vert _{}^{}}{L_0-\mu _0}\tfrac{1}{2^k} {+} \tfrac{L_0+B\Vert L\Vert +2a\mu _0}{\mu _0(L_0 {-} \mu _0)} \tfrac{1-\rho }{2^{k+1}}\Vert \eta -\eta ^0\Vert +\tfrac{\varGamma _k}{2} \Vert x^k-x^*\Vert _{}^{2} { -} \tfrac{\varGamma _{k+1}}{2} \Vert x^{k+1}-x^*\Vert _{}^{2}. \end{aligned}
(7.14)

Similar to the convex part, we also have

\begin{aligned} \psi _{0}(x^K)&\le \psi _{0}(x^{k+1}) + {\textstyle {\sum }}_{i=k+1}^{K-1}\epsilon _i \le \psi _0(x^{k+1}) + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}(1-\rho )}{2}{\textstyle {\sum }}_{i=k+1}^{K-1}\rho ^i \\&\le \psi _0(x^{k+1}) + \tfrac{\Vert \eta -\eta ^0\Vert _{}^{}\rho ^{k+1}}{2}. \end{aligned}

Using the above relation into (7.14), we have

\begin{aligned} \tfrac{\varGamma _{k}}{L_0+B\Vert L\Vert _{}^{}-\mu _0} \big [\psi _0(x^K)- \psi _0(x^*)\big ]&\le \tfrac{(4B+1)\Vert \eta -\eta ^0\Vert _{}^{}}{L_0-\mu _0}\tfrac{1}{2^{k+2}} + \tfrac{L_0+B\Vert L\Vert +2a\mu _0}{\mu _0(L_0 - \mu _0)} \tfrac{1-\rho }{2^{k+1}}\Vert \eta -\eta ^0\Vert \nonumber \\&\quad +\tfrac{\varGamma _k}{2} \Vert x^k-x^*\Vert _{}^{2} - \tfrac{\varGamma _{k+1}}{2} \Vert x^{k+1}-x^*\Vert _{}^{2}. \end{aligned}

Summing the above relation from $$k = 0$$ to $$K-1$$, we have

\begin{aligned} \tfrac{\varGamma _{K-1}}{L_0+B\Vert L\Vert _{}^{}-\mu _0} \big [\psi _0(x^K)- \psi _0(x^*)\big ]&\le {\textstyle {\sum }}_{k=0}^{K-1}\tfrac{\varGamma _{k}}{L_0+B\Vert L\Vert _{}^{}-\mu _0} \big [\psi _0(x^K)- \psi _0(x^*)\big ] \\&\le \big [\tfrac{(4B+1)}{2(L_0-\mu _0)} +\tfrac{L_0+B\Vert L\Vert +2a\mu _0}{\mu _0(L_0 - \mu _0)}(1-\rho )\big ] \Vert \eta -\eta ^0\Vert _{}^{} + \tfrac{1}{2}\Vert x^0-x^*\Vert ^2. \end{aligned}

Note that

\begin{aligned} \varGamma _{K-1}^{-1}\le \big (1-\tfrac{(1-a)\mu _0}{L_0+B\Vert L\Vert _{}^{}-a\mu _0}\big )^{K}\le \exp \big (-\tfrac{(1-a)\mu _0 K}{L_0+B\Vert L\Vert _{}^{}-a\mu _0}\big ). \end{aligned}

Combining the above two relations, we obtain the desired result (7.11). $$\square$$
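The product bound on $$\varGamma _{K-1}^{-1}$$ used above can be checked numerically: each factor of the inverse product is at most $$1-\tfrac{(1-a)\mu _0}{L_0+B\Vert L\Vert -a\mu _0} \le \exp \big (-\tfrac{(1-a)\mu _0}{L_0+B\Vert L\Vert -a\mu _0}\big )$$. A sketch with illustrative values, where the list `vs` stands in for the quantities $$\langle \widetilde{\lambda }^k, L\rangle$$ (our own stand-in, sampled at random):

```python
import math
import random

def inv_gamma(vs, L0, mu0, a):
    # Gamma_{K-1}^{-1} = prod_j (L0 + v_j - mu0) / (L0 + v_j - a*mu0),
    # with v_j playing the role of <lambda^j, L> in the proof.
    out = 1.0
    for v in vs:
        out *= (L0 + v - mu0) / (L0 + v - a * mu0)
    return out
```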

## 8 Numerical study

In this section, we conduct some preliminary studies to examine our theoretical results and the performance of the LCPG method. The experiments are run on CentOS with Intel Xeon (2.60 GHz) and 128 GB memory.

### 8.1 A simulated study on the QCQP

In the first experiment, we compare LCPG with established open-source solvers, namely CVXPY [11] and DCCP [32]. We consider the penalized quadratically constrained quadratic program (QCQP) described as follows:

\begin{aligned} \begin{aligned} \min _{x\in \mathbb {R}^n}&\quad \tfrac{1}{2}x^TQ_0x + b_0^T x +\alpha \Vert x\Vert _1 \\ \text {s.t.}&\quad \tfrac{1}{2}x^TQ_i x + b_i^T x +c_i\le 0, \quad i=1,2,\ldots , m-1 \\ &\quad \Vert x\Vert \le r \end{aligned} \end{aligned}
(8.1)

where each $$Q_i$$ ($$0\le i \le m$$) is an $$n\times n$$ matrix, $$b_0,b_1,\ldots , b_m$$ are n-dimensional real vectors, and $$\alpha$$ is a positive weight on the $$\ell _1$$-norm penalty, which helps to promote a sparse solution. In the first setting, we consider a convex constrained problem where each $$Q_i$$ is positive semidefinite. We set $$Q_i=VDV^\textrm{T}$$, where V is an $$n\times n$$ random sparse matrix with density 0.01 whose nonzero entries are uniformly distributed in [0, 1], and D is a diagonal matrix whose diagonal elements are uniformly distributed in [0, 100]. We set $$b_i=10e+v$$, where e is the vector of all ones and $$v\sim {\mathcal {N}}(0, I_{n\times n})$$ is sampled from the standard Gaussian distribution. We set $$c_i=-10$$ so that $$x=0$$ is a strictly feasible initial solution. Furthermore, we add a ball constraint to ensure that the domain is a compact set. We set $$r=\sqrt{20}$$ and $$\alpha =1$$. We fix $$m=10$$ and explore dimensions n from the set $$\{500, 1000, 2000, 3000, 4000\}$$.
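A sketch of this instance generation (a dense pure-Python stand-in for illustration; the actual experiments use sparse matrices, larger n, and possibly different sampling conventions):

```python
import random

def make_qcqp_instance(n, m, density=0.01, seed=0):
    """Illustrative generator for the convex QCQP instance of Sect. 8.1."""
    rng = random.Random(seed)

    def sparse_matrix():
        # entries uniform in [0, 1], kept with the given density
        return [[rng.uniform(0.0, 1.0) if rng.random() < density else 0.0
                 for _ in range(n)] for _ in range(n)]

    Qs, bs = [], []
    for _ in range(m):
        V = sparse_matrix()
        D = [rng.uniform(0.0, 100.0) for _ in range(n)]
        # Q = V D V^T is positive semidefinite since D >= 0
        Q = [[sum(V[i][k] * D[k] * V[j][k] for k in range(n))
              for j in range(n)] for i in range(n)]
        Qs.append(Q)
        bs.append([10.0 + rng.gauss(0.0, 1.0) for _ in range(n)])
    cs = [-10.0] * m   # constraint value at x = 0 is -10 < 0: strictly feasible
    return Qs, bs, cs
```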

We solve Problem (8.1) by both CVXPY and LCPG, with the same initial solution $$x=0$$. For CVXPY, we use MOSEK as the internal solver due to its superior performance in quadratic optimization. In LCPG, for simplicity, we also solve the diagonal quadratic subproblem by MOSEK through CVXPY. Note that calling the external API repeatedly for each LCPG subproblem only adds overhead to LCPG. Nonetheless, as we shall see, the standard IPM solvers can still fully leverage the diagonal structure and exhibit fast convergence.

In Table 3, we present the experimental results of the compared algorithms. We report the final objective, the norm of the dual solution (DNorm), and, for LCPG, the maximum dual norm along the solution path (Max DNorm). All values are averaged over 5 independent runs. From the results, we observe that while LCPG does not outperform CVXPY on the small problem ($$n=500$$), it becomes increasingly favorable as the problem dimension increases. This demonstrates the empirical advantage of our proposed approach, as we do not need to construct a full Hessian matrix. Moreover, interestingly, we observe that the dual solution norms $$\{\Vert \lambda ^k\Vert \}$$ are increasing, reaching the maximum at the last iteration; this accounts for the equal values of DNorm and Max DNorm. Meanwhile, in all cases the dual remains bounded, and the reported dual norm closely aligns with the solution returned by CVXPY. This result confirms our intuition that the dual bound is intricately tied to the nature of the problem.

In the second setting of this experiment, we examine the performance of LCPG on nonconvex constrained optimization. Specifically, we express $$Q_i$$ as the difference of two matrices: $${Q}_i=P_i-S_i$$, where $$P_i$$ is generated in the same manner as $$Q_i$$ in the first setting and $$S_i = 10 I_{n\times n}$$. Given this construction of the quadratic components, it is natural to view the function $$\tfrac{1}{2}x^TQ_i x + b_i^T x +c_i$$ as a difference of two convex quadratic functions: $$\tfrac{1}{2}x^TP_i x + b_i^T x +c_i-\tfrac{1}{2}x^T S_i x$$. Leveraging this decomposition, we apply DC programming, and more specifically the DCCP framework, to solve (8.1). Each convex subproblem of DCCP is solved by MOSEK through the CVXPY interface. In Table 4, we report the performance of LCPG and the DCCP algorithm. It can be observed that LCPG compares favorably against the DCCP solver. Furthermore, the boundedness of the dual is observed for both algorithms, which is consistent with our intuition.

### 8.2 Study of gradient complexities

In the next experiment, our primary goal is to examine the main theoretical results, namely the gradient complexities of LCPG and its stochastic variants LCSPG and LCSVRG. We apply these algorithms to a sparsity-inducing finite-sum problem, wherein a nonconvex constraint is incorporated into the supervised learning framework to actively enforce a sparse solution. The optimization problem is as follows:

\begin{aligned} \begin{aligned} \min _{x\in \mathbb {R}^d}&\quad \psi _0(x):=\tfrac{1}{n}\sum \limits _{i=1}^n f_i(x) \\ \text {s.t.}&\quad \psi _1(x)= \beta \Vert x\Vert _1 - g(x) \le \eta _1, \end{aligned} \end{aligned}
(8.2)

where $$f_i(x)$$ is a smooth loss function associated with the ith sample and $$\psi _1(x)$$ is the difference between the $$\ell _1$$-penalty and a convex smooth function g(x). Employing a difference-of-convex constraint yields a tighter relaxation of the cardinality constraint $$\Vert x\Vert _0\le \kappa$$ than the $$\ell _1$$ relaxation. The appealing properties of difference-of-convex penalties have been demonstrated in various studies [5, 14, 16, 17, 36, 37].

In view of the concavity of $$-g(x)$$, there is a strong asymmetry between the lower and upper curvatures of $$-g(x)$$; namely, the following

\begin{aligned} -\tfrac{L_g}{2}\Vert y-x\Vert ^2 -\nabla g(y)^T(x-y)-g(y)\le -g(x)\le -g(y) -\nabla g(y)^T(x-y) \end{aligned}
(8.3)

holds for some $$L_g>0$$. Note that this is much stronger than the $$L_g$$-smoothness condition, which adds an extra $$\tfrac{L_g}{2}\Vert y-x\Vert ^2$$ to the right-hand side of (8.3). Due to this feature, one can impose a tighter piecewise-linear surrogate constraint

\begin{aligned} \beta \Vert x\Vert _1 - g(x^k)-\nabla g(x^k)^T(x-x^k) \le \eta ^k_1 \end{aligned}

in the LCPG subproblem. It should be noted that our analysis readily adapts to this scenario, since it is smoothness, rather than concavity or convexity, that plays the central role in our convergence analysis, and smoothness remains valid here. An empirical advantage of this approach is that the resulting subproblem is tractable and solvable in nearly linear time. See [5] for further discussion.
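As a sanity check, the two-sided bound (8.3) can be verified numerically for a concrete convex smooth g. The sketch below uses an illustrative quadratic $$g(x)=\tfrac{1}{2}\Vert Ax\Vert ^2$$, for which $$L_g$$ is the largest eigenvalue of $$A^TA$$; the data is synthetic and only serves to exercise the inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 15
A = rng.standard_normal((30, d))

def g(x):
    return 0.5 * np.linalg.norm(A @ x) ** 2

def grad_g(x):
    return A.T @ (A @ x)

# Smoothness constant of g: largest eigenvalue of A'A.
L_g = np.linalg.eigvalsh(A.T @ A).max()

for _ in range(200):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    # Upper bound: linearization of -g at y (concavity of -g).
    linear = -g(y) - grad_g(y) @ (x - y)
    # Lower bound: subtract the quadratic term, as in (8.3).
    lower = linear - 0.5 * L_g * np.linalg.norm(y - x) ** 2
    assert lower - 1e-9 <= -g(x) <= linear + 1e-9
```

The upper bound is exactly the piecewise-linear surrogate exploited in the LCPG subproblem: replacing $$-g(x)$$ by its linearization at $$x^k$$ only tightens the constraint, so feasibility of the surrogate implies feasibility of the original.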

Our experiment considers the task of binary classification with the logistic loss $$f_i(x)=\log (1+\exp (-b_i a_i^Tx))$$, where $$a_i\in \mathbb {R}^d$$, $$b_i\in \{-1,1\}$$, $$1\le i \le n$$. We use the SCAD penalty $$g(x)=\sum _{j=1}^d h_{\beta ,\theta }(x_j)$$, where $$h_{\beta ,\theta }(\cdot )$$ is defined in (3.22). We use the real-sim dataset from the LibSVM repository [10] and the covtype dataset from the UCI repository [19]. For the latter, we formulate a binary classification task by distinguishing class “3” from the other classes. We set $$\beta =2$$, $$\theta =5$$, and $$\eta _1=\sigma d$$, with $$\sigma \in \{0.4, 0.6\}$$ for covtype and $$\sigma \in \{0.1, 0.2\}$$ for real-sim. For each algorithm, we use its theoretically suggested batch size and stepsize. For a fair comparison, we count n evaluations of the stochastic gradient as one effective pass over the dataset and plot the objective value against the number of effective passes. Figure 3 plots the convergence of the compared algorithms. It can be readily seen that LCSPG outperforms LCPG in terms of the number of gradient samples, and LCSVRG achieves the best performance among the three algorithms. These empirical findings further corroborate our theoretical complexity analysis.
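The mini-batch stochastic gradient of the logistic loss, the basic oracle consumed by LCSPG and LCSVRG, can be sketched as below. The data here is synthetic rather than real-sim or covtype, and the batch size is arbitrary; the analytic gradient is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 30
A = rng.standard_normal((n, d))          # rows a_i
b = rng.choice([-1.0, 1.0], size=n)      # labels b_i
x = rng.standard_normal(d)

def loss(x, idx):
    """Average logistic loss log(1+exp(-b_i a_i'x)) over a mini-batch idx."""
    z = -b[idx] * (A[idx] @ x)
    return np.mean(np.log1p(np.exp(z)))

def grad(x, idx):
    """Mini-batch stochastic gradient of the logistic loss."""
    z = -b[idx] * (A[idx] @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))     # = exp(z)/(1+exp(z))
    return (A[idx].T @ (-b[idx] * sigma)) / len(idx)

# Verify the analytic gradient with central finite differences.
idx = rng.choice(n, size=32, replace=False)
g = grad(x, idx)
eps = 1e-6
for j in range(5):
    e = np.zeros(d); e[j] = eps
    fd = (loss(x + e, idx) - loss(x - e, idx)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-5
```

Counting n such stochastic gradient evaluations as one effective pass gives the x-axis used in Figure 3.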

## 9 Conclusion

In this work, we presented a new LCPG method for nonconvex function constrained optimization that achieves gradient complexity of the same order as that for unconstrained nonconvex problems. The key ingredient in our algorithm design is the use of constraint levels to ensure subproblem feasibility, which allows us to overcome a well-known difficulty in bounding the Lagrange multipliers in the presence of nonsmooth constraints. Moreover, a merit of our convergence analysis is its striking similarity to that of gradient descent methods for unconstrained problems. Therefore, we can easily extend our method to minimizing stochastic, finite-sum, and structured nonsmooth functions with nonconvex function constraints; many of these complexity results were not known before. Another important feature of our work is that the method can handle complex scenarios where the subproblems are not exactly solvable. To the best of our knowledge, existing work on sequential convex optimization (SQP, MBA) assumes that the subproblems are solved exactly. We provided a detailed complexity analysis of LCPG when the subproblems are solved inexactly by a customized interior-point method or a first-order method. Finally, we clearly distinguished the notion of gradient complexity from that of computational complexity. In terms of gradient complexity, all of our proposed methods are state-of-the-art and easy to implement. Whether the computational complexity can be further improved in the composite case remains an open problem that we leave as a future direction.