Abstract
We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. Global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al. (Math Program, 2022. https://doi.org/10.1007/s10107-022-01846-z). Specifically, we allow nonlinear inequality constraints without requiring the strict complementarity condition; refine some of the designs in Na et al. (2022), such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of the nonlinear problems in the CUTEst test set and on constrained logistic regression problems.
1 Introduction
We study stochastic nonlinear optimization problems with deterministic equality and inequality constraints:

\(\min _{{{\varvec{x}}}\in \mathbb {R}^d}\; f({{\varvec{x}}}) = \mathbb {E}[F({{\varvec{x}}}; \xi )], \quad \text {s.t.}\;\; c({{\varvec{x}}}) = {{\varvec{0}}}, \;\; g({{\varvec{x}}}) \le {{\varvec{0}}}, \qquad \qquad (1)\)
where \(f: \mathbb {R}^d\rightarrow \mathbb {R}\) is an expected objective, \(c: \mathbb {R}^d\rightarrow \mathbb {R}^{m}\) defines deterministic equality constraints, \(g: \mathbb {R}^d\rightarrow \mathbb {R}^{r}\) defines deterministic inequality constraints, \(\xi \sim {{\mathcal {P}}}\) is a random variable following the distribution \({{\mathcal {P}}}\), and \(F(\cdot ; \xi ):\mathbb {R}^d\rightarrow \mathbb {R}\) is a realized objective. In the stochastic optimization regime, direct evaluation of f and its derivatives is not accessible. Instead, it is assumed that one can generate independent and identically distributed samples \(\{\xi _i\}_{i}\) from \({{\mathcal {P}}}\), and estimate f and its derivatives based on the realizations \(\{F(\cdot \;; \xi _i )\}_{i}\).
Problem (1) arises widely in a variety of industrial applications including finance, transportation, manufacturing, and power systems [8, 56]. It includes constrained empirical risk minimization (ERM) as a special case, where \({{\mathcal {P}}}\) can be regarded as a uniform distribution over n data points \(\{\xi _i = ({{\varvec{y}}}_i, {{\varvec{z}}}_i)\}_{i=1}^n\), with \(({{\varvec{y}}}_i, {{\varvec{z}}}_i)\) being the feature-outcome pairs. Thus, the objective has the finite-sum form

\(f({{\varvec{x}}}) = \frac{1}{n}\sum _{i=1}^{n} F({{\varvec{x}}}; \xi _i).\)
The goal of (1) is to find the optimal parameter \({{\varvec{x}}}^\star \) that best fits the data. One of the most common choices of F is the negative log-likelihood of the underlying distribution of \(({{\varvec{y}}}_i, {{\varvec{z}}}_i)\); in this case, the optimizer \({{\varvec{x}}}^\star \) is called the maximum likelihood estimator (MLE). Constraints on parameters are also common in practice, where they encode prior model knowledge or restrict model complexity. For example, [30, 31] studied inequality constrained least-squares problems, where inequality constraints maintain structural consistency such as non-negativity of the elasticities. [42, 45] studied statistical properties of constrained MLEs, where constraints characterize the parameter space of interest. More recently, a growing body of literature on training constrained neural networks has been reported [15, 25, 32, 33], where constraints are imposed to prevent weights from either vanishing or exploding, and objectives take the above finite-sum form.
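To make the finite-sum ERM setting concrete, the sketch below estimates f and \(\nabla f\) for a logistic (negative log-likelihood) realized objective from i.i.d. minibatch samples. This is an illustrative toy only: the dataset, dimension, and batch size are hypothetical, not from the paper.

```python
import numpy as np

def F(x, xi):
    """Realized logistic loss F(x; xi) for one sample xi = (y, z), z in {0, 1}."""
    y, z = xi
    p = 1.0 / (1.0 + np.exp(-y @ x))
    return -(z * np.log(p) + (1 - z) * np.log(1 - p))

def grad_F(x, xi):
    """Gradient of the realized loss with respect to x."""
    y, z = xi
    p = 1.0 / (1.0 + np.exp(-y @ x))
    return (p - z) * y

def estimate(x, samples):
    """Unbiased estimates of f(x) and grad f(x) by averaging over i.i.d. samples."""
    fbar = np.mean([F(x, xi) for xi in samples])
    gbar = np.mean([grad_F(x, xi) for xi in samples], axis=0)
    return fbar, gbar

# hypothetical data: 100 feature-outcome pairs (y_i, z_i) in R^3 x {0, 1}
rng = np.random.default_rng(0)
data = [(rng.standard_normal(3), rng.integers(0, 2)) for _ in range(100)]
x = np.zeros(3)
fbar, gbar = estimate(x, data[:16])  # estimate from a minibatch of 16
```

At \({{\varvec{x}}} = {{\varvec{0}}}\) every predicted probability is 1/2, so the estimated loss equals \(\log 2\) regardless of the minibatch, which gives a quick sanity check on the estimator.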
This paper aims to develop a numerical procedure to solve (1) with a global convergence guarantee. When the objective f is deterministic, numerous nonlinear optimization methods with well-understood convergence results are applicable, such as exact penalty methods, augmented Lagrangian methods, sequential quadratic programming (SQP) methods, and interior-point methods [41]. However, methods to solve constrained stochastic nonlinear problems with satisfactory convergence guarantees have been developed only recently. In particular, with only equality constraints, [4] designed one of the first stochastic SQP (StoSQP) schemes, using an \(\ell _1\)-penalized merit function, and showed that for any initialization, the KKT residuals \(\{R_t\}_t\) converge in two different regimes, determined by a prespecified deterministic stepsize-related sequence \(\{\alpha _t\}_t\):
(a) (constant sequence) if \(\alpha _t = \alpha \) for some small \(\alpha >0\), then \( \sum _{i=0}^{t-1}\mathbb {E}[R_i^2]/t \le \varUpsilon /(\alpha t) + \varUpsilon \alpha \) for some \(\varUpsilon >0\);
(b) (decaying sequence) if \(\alpha _t\) satisfies \(\sum _{t=0}^{\infty }\alpha _t = \infty \) and \(\sum _{t=0}^{\infty }\alpha _t^2 <\infty \), then \(\liminf _{t\rightarrow \infty }\mathbb {E}[R_t^2] = 0\).
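For instance, the decaying choice \(\alpha _t = \alpha /(t+1)\) satisfies both conditions in (b): the partial sums diverge (harmonically) while the squared sums stay bounded by \(\alpha ^2\pi ^2/6\). A quick numerical illustration (the value of \(\alpha \) is arbitrary):

```python
import numpy as np

alpha = 0.5
t = np.arange(0, 100_000)
steps = alpha / (t + 1.0)        # alpha_t = alpha / (t + 1)

partial_sum = np.cumsum(steps)   # grows without bound, ~ alpha * log(t)
sum_sq = np.sum(steps**2)        # bounded above by alpha^2 * pi^2 / 6 ~= 0.411
```

The partial sums keep growing with t, while the squared sum stays below its analytic limit, exactly the behavior the two conditions in (b) require.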
Both convergence regimes are well known for unconstrained stochastic problems, where \(R_t = \Vert \nabla f({{\varvec{x}}}_t)\Vert \) (see [12] for a recent review), and [4] generalized the results to equality constrained problems. Within the algorithm of [4], the authors designed a stepsize selection scheme (based on the prespecified deterministic sequence) to bring a degree of adaptivity into the algorithm. However, it turns out that the prespecified sequence, which can be either aggressive or conservative, still strongly affects the performance. To address this adaptivity issue, [40] proposed an alternative StoSQP, which exploits a differentiable exact augmented Lagrangian merit function and enables a stochastic line search procedure to adaptively select the stepsize. Under a different setup (where the model is precisely estimated with high probability), [40] proved a different guarantee: for any initialization, \(\liminf _{t\rightarrow \infty } R_t = 0\) almost surely. Subsequently, a series of extensions have been reported. [3] designed a StoSQP scheme to deal with rank-deficient constraints. [18] designed a StoSQP that exploits inexact Newton directions. [6] designed an accelerated StoSQP via variance reduction for finite-sum problems. [5] further developed [4] to achieve adaptive sampling. [17] established the worst-case iteration complexity of StoSQP, and [39] established the asymptotic local rate of StoSQP and performed statistical inference. In addition, [43] investigated a deterministic SQP where the objective and constraints are evaluated with noise. However, none of the aforementioned works handles inequality constraints.
Our paper develops this line of research by designing a StoSQP method that works with nonlinear inequality constraints. To do so, we have to overcome a number of intrinsic difficulties that arise in dealing with inequality constraints, which were already noted in the classical nonlinear optimization literature [7, 41]. Our work builds upon [40], which exploited an augmented Lagrangian merit function under the SQP framework. We enhance some of the designs in [40] (e.g., the feasibility error condition, the increasing batch size, and the complexity of Hessian sampling; more on these later), and the analysis of this paper is more involved. To generalize [40], we address the following two subtleties.
(a) With inequalities, SQP subproblems are inequality constrained (nonconvex) quadratic programs (IQPs), which are themselves difficult to solve in most cases. Some SQP literature (e.g., [10]) assumes that a QP solver is applied to solve IQPs exactly; however, a practical scheme should embed a finite number of inner-loop iterations of active-set methods or interior-point methods into the main SQP loop to solve IQPs approximately. The inner loop may then introduce an approximation error into the search direction at each iteration, which complicates the analysis.
(b) When applied to deterministic objectives with inequalities, the SQP search direction is a descent direction of the augmented Lagrangian only in a neighborhood of a KKT point [50, Propositions 8.3, 8.4]. This is in contrast to equality constrained problems, where the descent property of the SQP direction holds globally, provided the penalty parameters of the augmented Lagrangian are suitably chosen. This difference is indeed caused by inequality constraints: to make the (active-set) SQP direction informative, the estimated active set has to be close to the optimal active set (see Lemma 3 for details). Thus, simply changing the merit function in [40] does not work for Problem (1).
The existing literature on inequality constrained SQP has addressed (a) and (b) via various tools for deterministic objectives, while we provide new insights for stochastic objectives. To resolve (a), we design an active-set StoSQP scheme: given the current iterate, we first identify an active set that includes all inequality constraints likely to hold as equalities. We then obtain the search direction by solving an SQP subproblem in which all inequality constraints in the identified active set are included but regarded as equalities. In this case, the subproblem is an equality constrained QP (EQP), and can be solved exactly provided the matrix factorization is within the computational budget. To resolve (b), we provide a safeguarding direction to the scheme. In each step, we check whether the SQP subproblem is solvable and generates a descent direction of the augmented Lagrangian merit function. If so, we retain the SQP direction, as it typically enjoys a fast local rate; if not, we switch to the safeguarding direction (e.g., one gradient/Newton step of the augmented Lagrangian), along which the iterates still decrease the augmented Lagrangian, although the convergence may not be as fast as that of SQP.
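The solve-then-safeguard logic above can be sketched in a few lines of numpy. This is a minimal illustration only: the EQP is solved through its plain KKT linear system, and the descent test is a simple sign check, whereas the paper's actual scheme uses the identified active set, the coupled system (12), and a sufficient-decrease condition.

```python
import numpy as np

def eqp_direction(grad_f, B, J, G_a, c, g_a):
    """Solve an EQP subproblem exactly: the inequalities in the identified
    active set (Jacobian G_a, values g_a) are treated as equalities alongside
    c, giving one (d + m + |A|) x (d + m + |A|) KKT linear system."""
    m, d = J.shape
    na = G_a.shape[0]
    K = np.block([[B,   J.T,                G_a.T],
                  [J,   np.zeros((m, m)),   np.zeros((m, na))],
                  [G_a, np.zeros((na, m)),  np.zeros((na, na))]])
    rhs = -np.concatenate([grad_f, c, g_a])
    sol = np.linalg.solve(K, rhs)   # raises LinAlgError if K is singular
    return sol[:d]                  # primal direction dx

def choose_direction(dx_sqp, merit_grad, dx_safeguard):
    """Safeguarded switch: keep the SQP direction if it is a descent direction
    of the merit function, otherwise fall back to the safeguarding step."""
    if dx_sqp is not None and merit_grad @ dx_sqp < 0:
        return dx_sqp
    return dx_safeguard
```

In a full scheme, a singular KKT matrix (e.g., from an imprecisely identified active set) would also trigger the safeguarding branch.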
Furthermore, to design a scheme that adaptively selects the penalty parameters and stepsizes for Problem (1), additional challenges have to be resolved. In particular, we know that there are unknown deterministic thresholds for the penalty parameters that ensure a one-to-one correspondence between stationary points of the merit function and KKT points of Problem (1). However, due to the stochasticity of the scheme, the stabilized penalty parameters are random, and we cannot be sure whether the stabilized values are above (or below, depending on the context) the thresholds. Thus, we cannot directly conclude that the iterates converge to a KKT point, even if we ensure a sufficient decrease of the merit function in each step and enforce the iterates to converge to one of its stationary points.
The above difficulty has been resolved for the \(\ell _1\)-penalized merit function in [4], where the authors imposed a probability condition on the noise (satisfied by symmetric noise; see [4, Proposition 3.16]). [40] resolved this difficulty for the augmented Lagrangian merit function by modifying the SQP scheme when selecting the penalty parameters. In particular, [40] required the feasibility error to be bounded by the gradient magnitude of the augmented Lagrangian in each step, and generated monotonically increasing samples to estimate the gradient. Although that analysis does not require noise conditions, adjusting the penalty parameters to enforce the feasibility error condition may be unnecessary for iterates that are far from stationarity. Also, generating increasing samples is not satisfactory, since the sample size should be adaptively chosen based on the iterates. In this paper, we refine the techniques of [40] and generalize them to inequality constraints. We weaken the feasibility error condition by using a (large) multiplier to rescale the augmented Lagrangian gradient and, more significantly, by enforcing the condition only when the magnitude of the rescaled augmented Lagrangian gradient is smaller than the estimated KKT residual. In other words, the feasibility error condition is imposed only when we have stronger evidence that the iterate is approaching a stationary point than a KKT point. Such a relaxation matches the motivation of the feasibility error condition, i.e., bridging the gap between stationary points and KKT points. We also remove the increasing sample size requirement by adaptively controlling the absolute deviation of the augmented Lagrangian gradient for new iterates only (i.e., when the previous step is a successful step; see Sect. 3). Following [40], we perform a stochastic line search procedure.
However, instead of using the same sample set to estimate the gradient \(\nabla f\) and the Hessian \(\nabla ^2 f\) as in [40], we sharpen the analysis and show that significantly fewer samples are needed for \(\nabla ^2 f\) than for \(\nabla f\).
With all above extensions from [40], we finally prove that the KKT residual \(R_t\) satisfies \(\lim _{t\rightarrow \infty } R_t = 0\) almost surely for any initialization. Such a result is stronger than [44, Theorem 4.10] for unconstrained problems and [40, Theorem 4] for equality constrained problems, which only showed the “liminf” type of convergence. Our result also differs from the (liminf) convergence of the expected KKT residual \(\mathbb {E}[R_t^2]\) established in [3,4,5,6, 18] (under a different setup).
Related work
A number of methods have been proposed to optimize stochastic objectives without constraints, ranging from first-order to second-order methods [12]. For all of these methods, adaptively choosing the stepsize is particularly important for practical deployment. A line of literature selects the stepsize by adaptively controlling the batch size and embedding natural (stochastic) line search into the schemes [11, 13, 20, 22, 29]. Although empirical experiments suggest the validity of stochastic line search, a rigorous analysis was missing until recently, when researchers revisited unconstrained stochastic optimization through the lens of classical nonlinear optimization methods and were able to show promising convergence guarantees. In particular, [1, 9, 16, 28, 57] studied stochastic trust-region methods, and [2, 14, 19, 44] studied stochastic line search methods. Moreover, [3,4,5,6, 18, 40] designed a variety of StoSQP schemes to solve equality constrained stochastic problems. Our paper contributes to this line of work by proposing an active-set StoSQP scheme to handle inequality constraints.
There are numerous methods for solving deterministic problems with nonlinear constraints, ranging from exact penalty methods and augmented Lagrangian methods to interior-point methods and sequential quadratic programming (SQP) methods [41]. Our paper is based on SQP, which is a very effective (or at least competitive) approach for both small and large problems. When inequality constraints are present, SQP methods can be classified into IQP and EQP approaches. The former solves inequality constrained subproblems; the latter, to which our method belongs, solves equality constrained subproblems. A clear advantage of EQP over IQP is that the subproblems are less expensive to solve, especially when the quadratic matrix is indefinite. See [41, Chapter 18.2] for a comparison. Within SQP schemes, an exact penalty function is used as the merit function to monitor the progress of the iterates towards a KKT point. The \(\ell _1\)-penalized merit function, \(f({{\varvec{x}}}) + \mu \left( \Vert c({{\varvec{x}}})\Vert _1 + \Vert \max \{g({{\varvec{x}}}),{{\varvec{0}}}\}\Vert _1\right) \), is always a plausible choice because of its simplicity. However, a disadvantage of such non-differentiable merit functions is that they impede fast local convergence rates; a nontrivial local modification of SQP has to be employed to alleviate this issue [10]. As a resolution, multiple differentiable merit functions have been proposed [7]. We exploit an augmented Lagrangian merit function, which was first proposed for equality constrained problems by [46, 51], and then extended to inequality constrained problems by [47, 48]. [50] further improved this series of works by designing a new augmented Lagrangian and established the exactness property under weaker conditions. Although not crucial for the exactness analysis, [50] did not include equality constraints.
In this paper, we enhance the augmented Lagrangian in [50] to accommodate both equality and inequality constraints, and study the case where the objective is stochastic. When inequality constraints are suppressed, our algorithm and analysis naturally reduce to [40] (with refinements). We should mention that differentiable merit functions are often more expensive to evaluate, and their benefits are mostly revealed in local rates (see [38, Figure 1] for a comparison between the augmented Lagrangian and \(\ell _1\) merit functions on an optimal control problem). Thus, having established only a global analysis, we do not claim benefits of the augmented Lagrangian over the popular \(\ell _1\) merit function. On the other hand, the augmented Lagrangian is a very common alternative to non-differentiable penalty functions, and has been widely utilized for inequality constrained problems with promising performance [52,53,54,55, 60]. Also, our global analysis is the first step towards understanding the local rate of StoSQP when differentiable merit functions are employed.
Structure of the paper
We introduce the exploited augmented Lagrangian merit function and active-set SQP subproblems in Sect. 2. We propose our StoSQP scheme and analyze it in Sect. 3. The experiments and conclusions are in Sects. 4 and 5. Due to the space limit, we defer all proofs to the Appendix.
Notation We use \(\Vert \cdot \Vert \) to denote the \(\ell _2\) norm for vectors and the spectral norm for matrices. For two scalars a and b, \(a\wedge b = \min \{a, b\}\) and \(a\vee b = \max \{a, b\}\). For two vectors \({{\varvec{a}}}\) and \({{\varvec{b}}}\) with the same dimension, \(\min \{{{\varvec{a}}}, {{\varvec{b}}}\}\) and \(\max \{{{\varvec{a}}}, {{\varvec{b}}}\}\) are vectors obtained by taking the entrywise minimum and maximum, respectively. For \({{\varvec{a}}}\in \mathbb {R}^r\), \(\textrm{diag}({{\varvec{a}}}) \in \mathbb {R}^{r\times r}\) is a diagonal matrix whose diagonal entries are specified by \({{\varvec{a}}}\) sequentially. I denotes the identity matrix whose dimension is clear from the context. For a set \(\mathcal {A}\subseteq \{1,2,\ldots , r\}\) and a vector \({{\varvec{a}}}\in \mathbb {R}^r\) (or a matrix \(A\in \mathbb {R}^{r\times d}\)), \({{\varvec{a}}}_{\mathcal {A}} \in \mathbb {R}^{|\mathcal {A}|}\) (or \(A_{\mathcal {A}}\in \mathbb {R}^{|\mathcal {A}|\times d}\)) is a sub-vector (or a sub-matrix) including only the indices in \(\mathcal {A}\); \(\varPi _{\mathcal {A}}(\cdot ): \mathbb {R}^r\rightarrow \mathbb {R}^r\) (or \(\mathbb {R}^{r\times d} \rightarrow \mathbb {R}^{r\times d}\)) is a projection operator with \([\varPi _{\mathcal {A}}({{\varvec{a}}})]_i = {{\varvec{a}}}_i\) if \(i\in \mathcal {A}\) and \([\varPi _{\mathcal {A}}({{\varvec{a}}})]_i = 0\) if \(i\notin \mathcal {A}\) (for \(A\in \mathbb {R}^{r\times d}\), \(\varPi _{\mathcal {A}}(A)\) is applied column-wise); \(\mathcal {A}^c = \{1,2,\ldots , r\}\backslash \mathcal {A}\). Finally, we reserve the notation for the Jacobian matrices of constraints: \(J({{\varvec{x}}}) = \nabla ^T c({{\varvec{x}}}) = (\nabla c_1({{\varvec{x}}}), \ldots , \nabla c_m({{\varvec{x}}}))^T \in \mathbb {R}^{m\times d}\) and \(G({{\varvec{x}}}) = \nabla ^T g({{\varvec{x}}}) = (\nabla g_1({{\varvec{x}}}), \ldots , \nabla g_r({{\varvec{x}}}))^T \in \mathbb {R}^{r\times d}\).
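For concreteness, the entrywise min/max and the projection operator \(\varPi _{\mathcal {A}}\) defined above can be realized as follows (an illustrative numpy sketch; the helper name `project` is ours):

```python
import numpy as np

def project(A_set, a, r):
    """Projection Pi_A on R^r: keep the entries with index in A_set, zero the rest."""
    out = np.zeros(r)
    idx = sorted(A_set)
    out[idx] = np.asarray(a, dtype=float)[idx]
    return out

a = np.array([3.0, -1.0, 2.0])
b = np.array([1.0, 0.0, 5.0])
np.minimum(a, b)        # entrywise min{a, b} -> [1., -1., 2.]
np.maximum(a, b)        # entrywise max{a, b} -> [3., 0., 5.]
project({0, 2}, a, 3)   # Pi_A(a) with A = {0, 2} -> [3., 0., 2.]
```

(Python indices here are 0-based, whereas the paper indexes constraints from 1.)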
2 Preliminaries
Throughout this section, we suppose f, c, g are twice continuously differentiable (i.e., \(f,g,c\in C^2\)). The Lagrangian function of Problem (1) is

\(\mathcal {L}({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }}) = f({{\varvec{x}}}) + \varvec{\mu }^T c({{\varvec{x}}}) + {\varvec{\lambda }}^T g({{\varvec{x}}}).\)
We denote by \(\varOmega = \{{{\varvec{x}}}\in \mathbb {R}^d: c({{\varvec{x}}}) = {{\varvec{0}}},\; g({{\varvec{x}}}) \le {{\varvec{0}}}\}\) the feasible set, and by \(\{i: g_i({{\varvec{x}}}) = 0\}\) the active set at a feasible point \({{\varvec{x}}}\). We aim to find a KKT point \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) of (1) satisfying

\(\nabla _{{{\varvec{x}}}} \mathcal {L}({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star ) = {{\varvec{0}}}, \quad c({{\varvec{x}}}^\star ) = {{\varvec{0}}}, \quad g({{\varvec{x}}}^\star ) \le {{\varvec{0}}}, \quad {\varvec{\lambda }}^\star \ge {{\varvec{0}}}, \quad ({\varvec{\lambda }}^\star )^T g({{\varvec{x}}}^\star ) = 0. \qquad (4)\)
When a constraint qualification holds, the existence of a dual pair \(({\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) satisfying (4) is a first-order necessary condition for \({{\varvec{x}}}^\star \) to be a local solution of (1). In most cases, it is difficult to have an initial iterate that satisfies all inequality constraints, and to enforce the inequality constraints to hold as the iteration proceeds. This motivates us to consider a perturbed set. For \(\nu >0\), we let
Here, the perturbation radius \(\nu /2\) is not essential and can be replaced by \(\nu /\kappa \) for any \(\kappa >1\). Also, the cubic power in \(a({{\varvec{x}}})\) can be replaced by any power s with \(s>2\), which ensures that \(a({{\varvec{x}}})\in C^2\) provided \(g_i({{\varvec{x}}})\in C^2\), \(\forall i\). We also define a scaling function
where \(a_{\nu }({{\varvec{x}}})\) measures the distance of \(a({{\varvec{x}}})\) to the boundary \(\nu \), and \(q_{\nu }({{\varvec{x}}}, {\varvec{\lambda }})\) rescales \(a_{\nu }({{\varvec{x}}})\) by penalizing \({\varvec{\lambda }}\) that has a large magnitude. In the definitions of (5) and (6), \(\nu >0\) is a parameter to be chosen: given the current primal iterate \({{\varvec{x}}}_t\), we choose \(\nu = \nu _t\) large enough so that \({{\varvec{x}}}_t \in \mathcal {T}_{\nu }\). Note that while it is difficult to have \({{\varvec{x}}}_t\in \varOmega \), it is easy to choose \(\nu \) to have \({{\varvec{x}}}_t\in \mathcal {T}_{\nu }\). We also note that
With (6) and a parameter \(\epsilon >0\), we define a function to measure the dual feasibility of inequality constraints:
The following lemma justifies the definition (7). The proof is immediate and omitted.
Lemma 1
Let \(\epsilon , \nu >0\). For any \(({{\varvec{x}}}, {\varvec{\lambda }}) \in \mathcal {T}_{\nu }\times \mathbb {R}^r\), \(\varvec{w}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }}) = {{\varvec{0}}} \Leftrightarrow g({{\varvec{x}}}) \le ~{{\varvec{0}}}, {\varvec{\lambda }}\ge {{\varvec{0}}}, {\varvec{\lambda }}^Tg({{\varvec{x}}}) = 0\).
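The right-hand side of Lemma 1 is the standard complementarity condition for inequality constraints. A small numerical check of that condition might look like the following (the tolerance is an illustrative choice, not from the paper):

```python
import numpy as np

def kkt_complementarity(g, lam, tol=1e-10):
    """Check the right-hand side of Lemma 1: g(x) <= 0, lambda >= 0,
    and lambda^T g(x) = 0, up to a numerical tolerance."""
    g, lam = np.asarray(g, float), np.asarray(lam, float)
    return bool(np.all(g <= tol) and np.all(lam >= -tol) and abs(lam @ g) <= tol)
```

For example, `kkt_complementarity([-1.0, 0.0], [0.0, 2.0])` holds, since the strictly inactive constraint carries a zero multiplier and the active one may carry a positive multiplier.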
An implication of Lemma 1 is that, when the iteration sequence converges to a KKT point, \(\varvec{w}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\) converges to 0, i.e., \(g({{\varvec{x}}}) = {{\varvec{b}}}_{\epsilon , \nu }({{\varvec{x}}},{\varvec{\lambda }})\). This motivates us to define the following augmented Lagrangian function:
where \(\eta >0\) is a prespecified parameter, which can be any positive number throughout the paper. The augmented Lagrangian (8) generalizes the one in [50] by including equality constraints and introducing \(\eta \) to enhance flexibility (\(\eta =2\) in [50]). Without inequalities, (8) reduces to the augmented Lagrangian studied in [40]. The penalty in (8) consists of two parts. The first part characterizes the feasibility error and consists of \(\Vert c({{\varvec{x}}})\Vert ^2\) and \(\Vert g({{\varvec{x}}})\Vert ^2 - \Vert {{\varvec{b}}}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\Vert ^2\). The latter term is rescaled by \(1/q_{\nu }({{\varvec{x}}}, {\varvec{\lambda }})\) to penalize \({\varvec{\lambda }}\) with a large magnitude. In fact, if \(\Vert {\varvec{\lambda }}\Vert \rightarrow \infty \), then \(q_{\nu }({{\varvec{x}}}, {\varvec{\lambda }}){\varvec{\lambda }}\rightarrow {{\varvec{0}}}\), so that \({{\varvec{b}}}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }}) \rightarrow \min \{{{\varvec{0}}}, g({{\varvec{x}}})\}\) (cf. (7)). Thus, the penalty term \((\Vert g({{\varvec{x}}})\Vert ^2 - \Vert {{\varvec{b}}}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\Vert ^2)/q_{\nu }({{\varvec{x}}},{\varvec{\lambda }})\rightarrow \infty \), which is impossible when the iterates decrease \(\mathcal {L}_{\epsilon , \nu , \eta }\). The second part characterizes the optimality error and does not depend on the parameters \(\epsilon \) and \(\nu \). We mention that there are alternative forms of the augmented Lagrangian, some of which transform nonlinear inequalities using (squared) slack variables [7, 60]. In that case, additional variables are involved, and the strict complementarity condition is often needed to ensure the equivalence between the original and transformed problems [23].
The exactness property of (8) can be studied similarly to [50]; however, doing so is incremental and not crucial for our analysis. We will only use (a stochastic version of) (8) to monitor the progress of the iterates. By direct calculation, we obtain the gradient \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }\). We first suppress the evaluation point for conciseness, and define the following matrices
where \(\varvec{e}_{i, m}\in \mathbb {R}^{m}\) is the i-th canonical basis of \(\mathbb {R}^m\) (similar for \(\varvec{e}_{i, r}\in \mathbb {R}^{r}\)). Then,
where \({{\varvec{l}}}= {{\varvec{l}}}({{\varvec{x}}}) = \textrm{diag}(\max \{g({{\varvec{x}}}), {{\varvec{0}}}\})\max \{g({{\varvec{x}}}),{{\varvec{0}}}\}\). Clearly, the evaluation of \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }\) requires \(\nabla f\) and \(\nabla ^2 f\), which have to be replaced by their stochastic counterparts \({\bar{\nabla }}f\) and \({\bar{\nabla }}^2 f\) for Problem (1). Based on (10), we note that, if the feasibility error vanishes, then \(\nabla \mathcal {L}_{\epsilon , \nu , \eta } = {{\varvec{0}}}\) implies the KKT conditions (4) hold for any \(\epsilon ,\nu ,\eta >0\). We summarize this observation in the next lemma. The result holds without any constraint qualifications.
Lemma 2
Let \(\epsilon , \nu , \eta >0\) and let \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star ) \in \mathcal {T}_{\nu } \times \mathbb {R}^{m}\times \mathbb {R}^{r}\) be a primal-dual triple. If \(\Vert c({{\varvec{x}}}^\star )\Vert = \Vert \varvec{w}_{\epsilon , \nu }({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\Vert = \Vert \nabla \mathcal {L}_{\epsilon , \nu , \eta }({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\Vert = 0\), then \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) satisfies (4) and, hence, is a KKT point of Problem (1).
Proof
See Appendix A.1. \(\square \)
In the next subsection, we introduce an active-set SQP direction that is motivated by the augmented Lagrangian (8).
2.1 An active-set SQP direction via EQP
Let \(\epsilon , \nu , \eta >0\) be fixed parameters. Suppose we have the t-th iterate \(({{\varvec{x}}}_t,\varvec{\mu }_t,{\varvec{\lambda }}_t)\in \mathcal {T}_{\nu }\times \mathbb {R}^m\times \mathbb {R}^r\), and let us denote by \(J_t = J({{\varvec{x}}}_t)\), \(G_t = G({{\varvec{x}}}_t)\) (similarly for \(\nabla f_t, c_t, g_t\), \(q_{\nu }^t\), etc.) the quantities evaluated at the t-th iterate. We generally use the index t as a subscript, except for quantities (e.g., \(q_{\nu }^t\)) whose subscript is already occupied by \(\epsilon \), \(\nu \), or \(\eta \); for these, t appears as a superscript. For an active set \(\mathcal {A}\subseteq \{1,\ldots , r\}\), we denote by \({\varvec{\lambda }}_{t_a} = ({\varvec{\lambda }}_t)_{\mathcal {A}}\), \({\varvec{\lambda }}_{t_c} = ({\varvec{\lambda }}_t)_{\mathcal {A}^c}\) (similarly for \(g_{t_a}\), \(g_{t_c}\), \(G_{t_a}\), \(G_{t_c}\), etc.) the sub-vectors (or sub-matrices), and write \(\varPi _a(\cdot ) = \varPi _{\mathcal {A}}(\cdot )\), \(\varPi _c(\cdot ) = \varPi _{\mathcal {A}^c}(\cdot )\) for shorthand.
With the t-th iterate \(({{\varvec{x}}}_t,\varvec{\mu }_t,{\varvec{\lambda }}_t)\) and the above notation, we first define the identified active set as
We then solve the following coupled linear system
for some \(B_t\) that approximates the Hessian \(\nabla _{{{\varvec{x}}}}^2\mathcal {L}_t\). Our active-set SQP direction is then \(\varDelta _t:=(\varDelta {{\varvec{x}}}_t, \varDelta \varvec{\mu }_t, \varDelta {\varvec{\lambda }}_t)\). Finally, we update the iterate as
with \(\alpha _t\) chosen to ensure a certain sufficient decrease on the merit function (8).
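The stepsize rule just described can be illustrated with a deterministic backtracking (Armijo) sketch. The constants `rho` and `c1` are hypothetical, and the paper's actual procedure is a *stochastic* line search on estimates of the merit function (8); this sketch only conveys the sufficient-decrease idea.

```python
import numpy as np

def backtrack(merit, dmerit, x, dx, alpha0=1.0, rho=0.5, c1=1e-4):
    """Backtracking (Armijo) line search on a merit function: shrink alpha
    until merit(x + alpha*dx) <= merit(x) + c1*alpha*dmerit, where dmerit
    is the directional derivative of the merit function along dx (< 0)."""
    alpha, m0 = alpha0, merit(x)
    while merit(x + alpha * dx) > m0 + c1 * alpha * dmerit:
        alpha *= rho
    return alpha
```

For instance, with the quadratic merit \(\tfrac{1}{2}\Vert {{\varvec{x}}}\Vert ^2\) and an overly long descent direction, the loop halves the stepsize until the Armijo condition holds.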
This definition of the identified active set was introduced in [50, (8.5)] and has been utilized, e.g., in [53]. Intuitively, for the i-th inequality constraint, if \(g_i^\star = (g({{\varvec{x}}}^\star ))_i = 0\) and \({\varvec{\lambda }}_i^\star >0\), then i will be identified when \(({{\varvec{x}}}_t, {\varvec{\lambda }}_t)\) is close to \(({{\varvec{x}}}^\star ,{\varvec{\lambda }}^\star )\); if \(g_i^\star <0\) and \({\varvec{\lambda }}^\star _i = 0\), then i will not be identified. The stepsize \(\alpha _t\) is usually chosen by line search; in Sect. 3, we will design a stochastic line search scheme to select \(\alpha _t\) adaptively. Compared to fully stochastic SQP schemes [3, 4, 18], we require more precise model estimates. We explain the SQP direction (12) in the next remark.
Remark 1
Our dual direction \((\varDelta \varvec{\mu }_t, \varDelta {\varvec{\lambda }}_t)\) differs from the usual SQP direction introduced, for example, in [50, (8.9)]. In particular, the system (12a) is nothing but the KKT conditions of the EQP:
Thus, \((\varDelta {{\varvec{x}}}_t, \varvec{\mu }_t + {\tilde{\varDelta }}\varvec{\mu }_t, {\varvec{\lambda }}_{t_a} + {\tilde{\varDelta }}{\varvec{\lambda }}_{t_a})\) solved from (12a) is also the primal-dual solution of the above EQP. However, instead of using \(({\tilde{\varDelta }}\varvec{\mu }_t, {\tilde{\varDelta }}{\varvec{\lambda }}_{t_a}, -{\varvec{\lambda }}_{t_c})\), we solve the dual direction \((\varDelta \varvec{\mu }_t, \varDelta {\varvec{\lambda }}_t)\) for both active and inactive constraints from (12b). As \(B_t\) converges to \(\nabla _{{{\varvec{x}}}}^2\mathcal {L}_t\) and \(({{\varvec{x}}}_t,\varvec{\mu }_t,{\varvec{\lambda }}_t)\) converges to a KKT point \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\), it is fairly easy to see that \((\varDelta \varvec{\mu }_t, \varDelta {\varvec{\lambda }}_t)\) converges to \(({\tilde{\varDelta }}\varvec{\mu }_t, {\tilde{\varDelta }}{\varvec{\lambda }}_t)\) (where we denote \({\tilde{\varDelta }}{\varvec{\lambda }}_{t_c} = -{\varvec{\lambda }}_{t_c}\)) at a higher order, by noting that
Thus, the fast local rate of the SQP direction \((\varDelta {{\varvec{x}}}_t, {\tilde{\varDelta }}\varvec{\mu }_t, {\tilde{\varDelta }}{\varvec{\lambda }}_t)\) is preserved by \(\varDelta _t\). However, it turns out that the adjustment in \(\varDelta _t\) is crucial for the merit function (8) when \(B_t\) is far from \(\nabla _{{{\varvec{x}}}}^2\mathcal {L}_t\). A similar coupled SQP system was employed for equality constrained problems [35, 40]; we extend it to inequality constraints here. In fact, [50, Proposition 8.2] showed that \((\varDelta {{\varvec{x}}}_t, {\tilde{\varDelta }}\varvec{\mu }_t, {\tilde{\varDelta }}{\varvec{\lambda }}_t)\) is a descent direction of \(\mathcal {L}_{\epsilon , \nu , \eta }^t\) if \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\) is near a KKT point and \(B_t = \nabla ^2_{{{\varvec{x}}}}\mathcal {L}_t\). However, requiring \(B_t = \nabla ^2_{{{\varvec{x}}}}\mathcal {L}_t\) (i.e., no Hessian modification) is restrictive even for a deterministic line search, and that descent result does not hold if \(B_t \ne \nabla ^2_{{{\varvec{x}}}}\mathcal {L}_t\). In contrast, as shown in Lemma 3, \(\varDelta _t\) is a descent direction even if \(B_t\) is not close to \(\nabla _{{{\varvec{x}}}}^2\mathcal {L}_t\).
2.2 The descent property of \(\varDelta _t\)
In this subsection, we present a descent property of \(\varDelta _t\). We focus on the term \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^t)^T\varDelta _t\). Different from SQP for equality constrained problems, \(\varDelta _t\) may not be a descent direction of \(\mathcal {L}_{\epsilon , \nu , \eta }^t\) at some points, even if \(\epsilon \) is chosen small enough. To see this clearly, we suppress the iteration index, denote \(g_a = g_{t_a}\) (similarly for \({\varvec{\lambda }}_a\), \({\varvec{\lambda }}_c\), etc.), and divide \(\nabla \mathcal {L}_{\epsilon , \nu , \eta } \) (cf. (10)) into two terms: a dominating term that depends on \((g_a, {\varvec{\lambda }}_c)\) linearly, and a higher-order term that depends on \((g_a, {\varvec{\lambda }}_c)\) at least quadratically. In particular, we write \(\nabla \mathcal {L}_{\epsilon , \nu , \eta } = \nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(1)} + \nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(2)}\) where
Loosely speaking (see Lemma 3 for a rigorous result), \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(1)})^T\varDelta \) provides a sufficient decrease provided the penalty parameters are suitably chosen, while \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(2)})^T\varDelta \) has no such guarantee in general. Since \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(2)}\) depends on \((g_a, {\varvec{\lambda }}_c)\) quadratically, to ensure \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }^T\varDelta <0\), we require \(\Vert g_a\Vert \vee \Vert {\varvec{\lambda }}_c\Vert \) to be small enough to let the linear term \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(1)})^T\varDelta \) dominate. This essentially requires the iterate to be close to a KKT point, since \(\Vert g_a\Vert = \Vert {\varvec{\lambda }}_c\Vert = 0\) at a KKT point. With this discussion in mind, if the iterate is far from a KKT point, \(\varDelta \) may not be a descent direction of \(\mathcal {L}_{\epsilon , \nu , \eta }\). In fact, for an iterate that is far from a KKT point, the KKT matrix \(K_a\) (and its component \(G_a\)) is likely to be singular due to the imprecisely identified active set. Thus, the Newton system (12) is not even solvable at such an iterate, let alone able to generate a descent direction. Without inequalities, the quadratic term \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(2)}\) disappears and our analysis reduces to the one in [40]. The presence of \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(2)}\) results in a very different augmented Lagrangian from the one in [40], and brings difficulties in designing a global algorithm to deal with inequality constraints.
We point out that requiring a local iterate is not an artifact of the proof technique. Such a requirement is imposed for different search directions in related literature. For example, [50] showed that the SQP direction obtained by either EQP or IQP is a descent direction of \(\mathcal {L}_{\epsilon , \nu , \eta }\) in a neighborhood of a KKT point (cf. Propositions 8.2 and 8.4). That work also required \(B_t = \nabla _{{{\varvec{x}}}}^2\mathcal {L}_t\), which we relax by considering a coupled Newton system. Subsequently, [53, 55] studied truncated Newton directions, whose descent properties hold only locally as well (cf. [53, Proposition 3.7], [55, Proposition 10]).
Now, we introduce two assumptions and formalize the descent property.
Assumption 1
(LICQ) We assume at \({{\varvec{x}}}^\star \) that \((J^T({{\varvec{x}}}^\star )\;\; G^T_{\mathcal {I}({{\varvec{x}}}^\star )}({{\varvec{x}}}^\star ))\) has full column rank, where \(\mathcal {I}({{\varvec{x}}}^\star )\) is the active inequality set defined in (3).
Assumption 2
For \({{\varvec{z}}}\in \{{{\varvec{z}}}\in \mathbb {R}^d: J_t{{\varvec{z}}}= {{\varvec{0}}}, G_{t_a}{{\varvec{z}}}= {{\varvec{0}}}\}\), we have \({{\varvec{z}}}^TB_t{{\varvec{z}}}\ge \gamma _{B}\Vert {{\varvec{z}}}\Vert ^2\) and \(\Vert B_t\Vert \le \varUpsilon _{B}\) for constants \(\varUpsilon _{B}\ge 1\ge \gamma _{B}>0\).
The above condition on \(B_t\) is standard in the nonlinear optimization literature [7]. In fact, \(B_t = I\) with \(\gamma _{B} = \varUpsilon _{B} = 1\) is sufficient for the analysis in this paper. The condition \(\varUpsilon _{B}\ge 1\ge \gamma _{B}>0\) (and similar conditions on other constants defined later) is inessential and serves only to simplify the presentation. Without such a requirement, our analyses hold after replacing \(\gamma _{B}\) with \(\gamma _{B}\wedge 1\) and \(\varUpsilon _{B}\) with \(\varUpsilon _{B}\vee 1\).
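To make the assumption concrete, the following sketch (with hypothetical random data, and \(B_t = I\) as suggested above) numerically verifies the reduced-Hessian condition by computing an orthonormal basis of the null space of the stacked Jacobian \((J_t^T\;\; G_{t_a}^T)^T\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ra = 6, 2, 1                       # variables, equalities, active inequalities
J = rng.standard_normal((m, d))          # stand-in for the Jacobian J_t
G_a = rng.standard_normal((ra, d))       # stand-in for the active Jacobian G_{t_a}
B = np.eye(d)                            # B_t = I meets Assumption 2 with gamma_B = Upsilon_B = 1

# Orthonormal basis Z of the null space {z : J z = 0, G_a z = 0} via a full SVD.
A = np.vstack([J, G_a])
_, s, Vt = np.linalg.svd(A)
rank = int((s > 1e-12).sum())
Z = Vt[rank:].T                          # columns span the null space

gamma_B = np.linalg.eigvalsh(Z.T @ B @ Z).min()   # min eigenvalue of the reduced Hessian
Upsilon_B = np.linalg.norm(B, 2)                  # spectral norm of B_t
```

For this choice the reduced Hessian \(Z^TB_tZ\) is the identity, so the computed \(\gamma _{B}\) and \(\varUpsilon _{B}\) both equal one up to rounding.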
Lemma 3
Let \(\nu , \eta >0\) and suppose Assumptions 1 and 2 hold. There exist a constant \(\varUpsilon >0\) depending on \(\varUpsilon _{B}\) but not on \((\nu ,\eta ,\gamma _{B})\), and a compact set \(\mathcal {X}_{\epsilon ,\nu }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu }\) around \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) depending on \((\epsilon , \nu )\) but not on \(\eta \), such that if \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t) \in \mathcal {X}_{\epsilon ,\nu }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu }\) with \(\epsilon \) satisfying \(\epsilon \le \gamma _{B}^2(\gamma _{B}\wedge \eta )/\left\{ (1\vee \nu )\varUpsilon \right\} \), then
Furthermore, there exists a compact subset \(\mathcal {X}_{\epsilon ,\nu ,\eta }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu ,\eta }\subseteq \mathcal {X}_{\epsilon ,\nu }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu }\) depending additionally on \(\eta \), such that if \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t) \in \mathcal {X}_{\epsilon ,\nu ,\eta }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu ,\eta }\), then
Proof
See Appendix A.2. \(\square \)
Similar arguments for other directions can be found in [53, Proposition 3.5] and [55, Proposition 9]. By the proof of Lemma 3, we know that as long as \(M_t\) and \((J_t^T\;\; G_{t_a}^T)\) in the SQP system (12) have full (column) rank, \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{t\; (1)})^T\varDelta _t\) ensures a sufficient decrease provided \(\epsilon \) is small enough. However, from (A.11) in the proof, we also see that \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{t\; (2)})^T\varDelta _t\) is only bounded by
where \(\varUpsilon '>0\) is a constant independent of \((\epsilon , \nu , \eta )\). Thus, to ensure that \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^t)^T\varDelta _t\) is negative, we have to restrict to a neighborhood in which \(\Vert g_{t_a}\Vert \vee \Vert {\varvec{\lambda }}_{t_c}\Vert \) is small enough so that \(\varUpsilon '(\frac{1\vee \nu }{\epsilon (1\wedge \nu ^2)}\vee \eta )(\Vert g_{t_a}\Vert + \Vert {\varvec{\lambda }}_{t_c}\Vert ) \le (\gamma _{B}\wedge \eta )/4\). This requirement is achievable near a KKT pair \(({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\), where the active set is correctly identified (implying that \(\Vert g_{t_a}\Vert \le \Vert (g_t)_{\mathcal {I}({{\varvec{x}}}^\star )}\Vert \) and \(\Vert {\varvec{\lambda }}_{t_c}\Vert \le \Vert ({\varvec{\lambda }}_t)_{\{i: 1\le i\le r, {\varvec{\lambda }}^\star _i=0\}}\Vert \)); the radius of the neighborhood clearly depends on \((\epsilon , \nu , \eta )\).
In the next section, we exploit the introduced augmented Lagrangian merit function (8) and the active-set SQP direction (12) to design a StoSQP scheme for Problem (1). We will adaptively choose proper \(\epsilon \) and \(\nu \) (recall that \(\eta >0\) can be any positive number in this paper), incorporate stochastic line search to select the stepsize, and globalize the scheme by utilizing a safeguarding direction (e.g., Newton or steepest descent step) of the merit function \(\mathcal {L}_{\epsilon ,\nu ,\eta }\). If the system (12) is not solvable, or is solvable but does not generate a descent direction, we search along the alternative direction to decrease the merit function. However, since \(\varDelta _t\) usually enjoys a fast local rate (see [50, Proposition 8.3] for a local analysis of \((\varDelta {{\varvec{x}}}_t, {\tilde{\varDelta }}\varvec{\mu }_t, {\tilde{\varDelta }}{\varvec{\lambda }}_t)\) and Remark 1), we prefer to preserve \(\varDelta _t\) as much as possible.
3 An adaptive active-set StoSQP scheme
We design an adaptive scheme for Problem (1) that embeds stochastic line search, originally designed and analyzed for unconstrained problems in [14, 44], into an active-set StoSQP. There are two challenges in designing adaptive schemes for constrained problems. First, the merit function has penalty parameters that are random and adaptively specified, whereas for unconstrained problems one simply performs the line search on the objective function. To show global convergence, it is crucial that the stochastic penalty parameters stabilize almost surely. Thus, for each run, after a few iterations we always target a stabilized merit function. Otherwise, if each iteration decreases a different merit function, the decreases across iterations may not accumulate. Second, since the stabilized parameters are random, they may not fall below the unknown deterministic thresholds. Such a condition is critical to ensure the equivalence between the stationary points of the merit function and the KKT points of Problem (1). Thus, even if we converge to a stationary point of the (stabilized) merit function, that stationary point is not necessarily a KKT point of Problem (1).
With only equality constraints, [4, 40] addressed the first challenge under a boundedness condition, and our paper follows the same type of analysis. A similar boundedness condition is also required in deterministic analyses to have the penalty parameters stabilized [7, Chapter 4.3.3]. [4] resolved the second challenge by introducing a noise condition (satisfied by symmetric noise), while [40] resolved it by adjusting the SQP scheme when selecting the penalty parameters. As introduced in Sect. 1, the technique of [40] has multiple flaws: (i) it requires generating increasingly many samples to estimate the gradient of the augmented Lagrangian (cf. [40, Step 1]); (ii) it imposes a feasibility error condition at each step (cf. [40, (19)]). In this paper, we refine the technique of [40] and enable inequality constraints. As revealed by Sect. 2, the present analysis of inequality constraints is much more involved; more importantly, our “lim” convergence guarantee strengthens the existing “liminf” convergence of the stochastic line search in [40, 44]. In what follows, we use \(\bar{(\cdot )}\) to denote random quantities, except for the iterate \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\). For example, \({\bar{\alpha }}_t\) denotes a random stepsize.
3.1 The proposed scheme
Let \(\eta , \alpha _{max}, \kappa _{grad}, \chi _{grad}, \chi _{f},\chi _{err}>0; \rho >1\); \(\gamma _{B}\in (0,1]\); \(\beta , p_{grad}, p_f\in (0, 1)\); \(\kappa _f\in (0, \beta /(4\alpha _{max})]\) be fixed tuning parameters. Given quantities \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t, {\bar{\nu }}_t, {\bar{\epsilon }}_t, {\bar{\alpha }}_t,\bar{\delta }_t)\) at the t-th iteration with \({{\varvec{x}}}_t\in \mathcal {T}_{{\bar{\nu }}_t}\), we perform the following five steps to derive quantities at the \((t+1)\)-th iteration.
Step 1: Estimate objective derivatives
We generate a batch of independent samples \(\xi _1^t\) to estimate the gradient \(\nabla f_t\) and Hessian \(\nabla ^2f_t\). The estimators \({\bar{\nabla }}f_t\) and \({\bar{\nabla }}^2f_t\) need not be computed with the same number of samples, since they have different sample complexities. For example, we can compute \({\bar{\nabla }}f_t\) using \(\xi _1^t\) while computing \({\bar{\nabla }}^2 f_t\) using a fraction of \(\xi _1^t\) (more on this in Sect. 3.4). With \({\bar{\nabla }}f_t\), \({\bar{\nabla }}^2f_t\), we then compute \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\), \({\bar{Q}}_{1,t}\), and \({\bar{Q}}_{2,t}\) used in the system (12).
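Purely as an illustration of Step 1 (the quadratic realization below is a stand-in, not one of our test problems), the two estimators can be formed from a single batch, with the Hessian averaged over only a fraction of it:

```python
import numpy as np

def estimate_derivatives(x, xi_batch, hess_fraction=0.25):
    """Average realized gradients over the whole batch, realized Hessians over a fraction."""
    grads = np.stack([grad_F(x, xi) for xi in xi_batch])
    grad_bar = grads.mean(axis=0)                    # estimator of the gradient from all of xi_1^t
    n_hess = max(1, int(hess_fraction * len(xi_batch)))
    hesss = np.stack([hess_F(x, xi) for xi in xi_batch[:n_hess]])
    hess_bar = hesss.mean(axis=0)                    # estimator of the Hessian from a fraction
    return grad_bar, hess_bar

# Stand-in realization F(x; xi) = (1 + xi) * ||x||^2 / 2,
# so the expected gradient is x and the expected Hessian is I when E[xi] = 0.
def grad_F(x, xi):
    return (1.0 + xi) * x

def hess_F(x, xi):
    return (1.0 + xi) * np.eye(len(x))

rng = np.random.default_rng(1)
x = np.array([1.0, -2.0])
xi_batch = rng.standard_normal(400)                  # i.i.d. draws standing in for xi_1^t
g_bar, H_bar = estimate_derivatives(x, xi_batch)
```

The `hess_fraction` parameter is illustrative; the fraction actually needed is dictated by the sample complexities in Sect. 3.4.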
We require the batch size \(|\xi _1^t|\) to be large enough to make the gradient error of the merit function small. In particular, we define
A simple observation from (10) is that \({\bar{\varDelta }}(\nabla \mathcal {L}_\eta ^t)\) is independent of \({\bar{\epsilon }}_t\) (and \({\bar{\nu }}_t\)), which will be selected later (Step 2). We require \(|\xi _1^t|\) to satisfy two conditions:
(a) the event \(\mathcal {E}_1^t\),
satisfies
(b) if \(t-1\) is a successful step (see Step 5 for the meaning), then
The sample complexities to ensure (15) and (16) will be discussed in Sect. 3.4. Compared to [40], we do not let \(|\xi _1^t|\) increase monotonically; instead, we impose an expectation condition (16) whenever we arrive at a new iterate. By our analysis, it is easy to see that (16) can also be replaced by requiring the subsequence \(\{|\xi _1^t|: t-1 \text { is a successful step}\}\) to increase to infinity (e.g., increase by at least one each time), which is still weaker than [40]. The right-hand side of (16) will become clear when we utilize \(\bar{\delta }_t\) later in Step 5 (cf. (27)). We use \(P_{\xi _1^t}(\cdot )\) and \(\mathbb {E}_{\xi _1^t}[\cdot ]\) to denote the probability and expectation evaluated over the randomness of sampling \(\xi _1^t\) only, while other random quantities, such as \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\) and \({\bar{\alpha }}_t\), are conditioned on. More precisely, we mean \(P_{\xi _1^t}(\mathcal {E}_1^t) = P(\mathcal {E}_1^t\mid \mathcal {F}_{t-1})\) (similarly for \(\mathbb {E}_{\xi _1^t}[\cdot ]\)), where the \(\sigma \)-algebra \(\mathcal {F}_{t-1}\) is defined in (28) below.
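As a rough illustration of how accuracy conditions of type (15) translate into batch sizes, a Chebyshev/Markov-type heuristic (not the exact complexity bound of Sect. 3.4) suffices: if the per-sample variance is \(\sigma ^2\), a sample average over \(n\) draws deviates from its mean by more than a tolerance \(\tau \) with probability at most \(\sigma ^2/(n\tau ^2)\), so \(n\ge \sigma ^2/(p_{grad}\,\tau ^2)\) makes the accuracy event hold with probability at least \(1-p_{grad}\):

```python
import math

def batch_size_for_accuracy(sigma2, tol, p_grad):
    """Smallest n with sigma2 / (n * tol**2) <= p_grad (Chebyshev/Markov bound)."""
    return max(1, math.ceil(sigma2 / (p_grad * tol**2)))

# E.g., variance 4, tolerance 0.5, failure probability 0.1 -> n = 160 samples.
n = batch_size_for_accuracy(sigma2=4.0, tol=0.5, p_grad=0.1)
```

In the algorithm the tolerance itself shrinks with \({\bar{\alpha }}_t\) and the estimated gradient norm, which is why the required batch size varies across iterations.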
Step 2: Set parameter \({\bar{\epsilon }}_t\). With current \({\bar{\nu }}_t\), we decrease \({\bar{\epsilon }}_t \leftarrow {\bar{\epsilon }}_t/\rho \) until \({\bar{\epsilon }}_t\) is small enough to satisfy the following two conditions simultaneously:
(a) the feasibility error is proportionally bounded by the gradient of the merit function, whenever the iterate is closer to a stationary point than a KKT point:
(we use the same multiplier \(\chi _{err}\) only for simplifying the notation.)
(b) if the SQP system (12) with \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\), \({\bar{Q}}_{1,t}\), and \({\bar{Q}}_{2,t}\) is solvable, then we obtain \({\bar{\varDelta }}_t = ({\bar{\varDelta }}{{\varvec{x}}}_t, {\bar{\varDelta }}\varvec{\mu }_t, {\bar{\varDelta }}{\varvec{\lambda }}_t)\) and require
We prove in Lemma 4 and Lemma 5 that both (17) and (18) can be satisfied for sufficiently small \({\bar{\epsilon }}_t\). In fact, Lemma 3 has already established (18) for the deterministic case. Even though \({\bar{\varDelta }}_t\) is not always used as the search direction, we still enforce (18) to hold for \(({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{t\;(1)})^T{\bar{\varDelta }}_t\). The reason for this is to avoid ruling out \({\bar{\varDelta }}_t\) just because \({\bar{\epsilon }}_t\) is not small enough, which would result in a positive dominating term \(({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{t\;(1)})^T{\bar{\varDelta }}_t\). If (12) is not solvable (e.g., the active set is imprecisely identified so that \(K_{t_a}\) is singular), then (18) is not needed.
The condition (17) is the key to ensuring that the stationary point of the merit function that we converge to is a KKT point of (1). Motivated by Lemma 2, we know that “the stationarity of the merit function plus vanishing feasibility error” implies vanishing KKT residual; and (17) states that the feasibility error is roughly controlled by the gradient of the merit function. (17) relaxes [40, (19)] in two aspects. First, [40] had no multiplier, while we allow any (large) multiplier \(\chi _{err}\). Second, [40] enforced (17) at each step, while we enforce it only when we observe stronger evidence that the scheme is approaching a stationary point than a KKT point. The above relaxations are driven by the purpose of imposing the condition. When adjusting \({\bar{\epsilon }}_t\), if \(\Vert {\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t,{\bar{\nu }}_t,\eta }^t\Vert \) exceeds \({\bar{R}}_t\) before \(\Vert (c_t, \varvec{w}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t}^t)\Vert \) does (which easily happens for a large \({\bar{\nu }}_t\)), then one can immediately stop the adjustment of \({\bar{\epsilon }}_t\). Compared to [40], where the SQP system is supposed to be always solvable, (17) has extra usefulness: when \({\bar{\varDelta }}_t\) is not available, (17) ensures that the safeguarding direction can be computed using the samples in Step 1. Such a property is not easily achieved, and further relaxations of (17) can be designed if we generate new samples for the safeguarding direction (in Step 3). The subtlety lies in the fact that no penalty parameters are involved when we generate \(\xi _1^t\) in Step 1, while (17) builds a connection between \(\xi _1^t\) and the penalty parameters. It implies that the set \(\xi _1^t\) satisfying (15) and (16) also satisfies the corresponding conditions for the safeguarding direction.
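The adjustment of \({\bar{\epsilon }}_t\) in Step 2 amounts to the following loop, where the two predicates are placeholders for the actual checks (17) and (18) (the toy thresholds below are arbitrary):

```python
def adjust_epsilon(eps, rho, cond17, cond18, eps_min=1e-12):
    """Shrink eps by rho until both predicates hold; cond17/cond18 take eps, return bool."""
    while not (cond17(eps) and cond18(eps)):
        eps /= rho                        # the update: epsilon <- epsilon / rho
        if eps < eps_min:                 # guard against an infinite loop in this sketch
            raise RuntimeError("epsilon underflow")
    return eps

# Toy predicates standing in for (17) and (18): both hold once eps is small enough.
eps_out = adjust_epsilon(1.0, rho=10.0,
                         cond17=lambda e: e <= 2e-3,
                         cond18=lambda e: e <= 5e-2)
```

In the algorithm, both predicates are guaranteed to hold for sufficiently small \({\bar{\epsilon }}_t\) (Lemmas 4 and 5), so the loop terminates; the underflow guard is only a safeguard for this sketch.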
Step 3: Decide the search direction.
We may obtain a stochastic SQP direction \({\bar{\varDelta }}_t\) from Step 2. However, if (12) is not solvable, or it is solvable but \({\bar{\varDelta }}_t\) is not a sufficient descent direction because
then an alternative safeguarding direction \({\hat{\varDelta }}_t\) must be employed to ensure the decrease of the merit function. In that case, we follow [53, 55] and regard \(\mathcal {L}_{{\bar{\epsilon }}_t,{\bar{\nu }}_t,\eta }\) as a penalized objective. We require \({\hat{\varDelta }}_t\) to satisfy
for a constant \(\chi _{u}\ge 1\). Similar to (17), we use the same constant \(\chi _{u}\) for the two multipliers to simplify the notation. When using two different constants \(\chi _{1,u}\) and \(\chi _{2,u}\), we can always set \(\chi _{u} = 1/\chi _{1,u}\vee \chi _{2,u}\) to let (20) hold. The condition (20) is standard in the literature [53, (60a,b)] [55, (52a,b)]. One example that satisfies (20) and is computationally cheap is the steepest descent direction \({\hat{\varDelta }}_t = - {\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\) with \(\chi _{u} = 1\). Such a direction can be computed (almost) without any extra cost, since the two components of \({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\), \({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{t\; (1)}\) and \({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{t\; (2)}\), have been computed when checking (18) and (19). Another example that is more computationally expensive is the regularized Newton step \({\hat{H}}_t{\hat{\varDelta }}_t = -{\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\), where \({\hat{H}}_t\) captures second-order information of \(\mathcal {L}_{{\bar{\epsilon }}_t,{\bar{\nu }}_t,\eta }^t\) and satisfies \(1/\chi _{u}I \preceq {\hat{H}}_t\preceq \chi _{u}I\). In particular, \({\hat{H}}_t\) can be obtained by regularizing the (generalized) Hessian matrix \(H_t\), which is provided and discussed in [50, 53], and has the form
Here, \({{\varvec{1}}} = (1,\ldots , 1)\in \mathbb {R}^r\) is the all one vector. Other examples that improve upon the regularized Newton step include the choices in [21, 54], where a truncated conjugate gradient method is applied to an indefinite Newton system [54, Proposition 3.3, (14)]. We will numerically implement the regularized Newton and the steepest descent steps in Sect. 4.
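One concrete (though not the only) way to regularize \(H_t\) is to clip its spectrum into \([1/\chi _{u}, \chi _{u}]\), which guarantees the required bounds \(1/\chi _{u}I \preceq {\hat{H}}_t\preceq \chi _{u}I\); the sketch below applies this to a toy indefinite Hessian:

```python
import numpy as np

def regularized_newton_step(H, grad, chi_u=1e4):
    """Clip the spectrum of H into [1/chi_u, chi_u] and solve H_hat @ step = -grad."""
    w, V = np.linalg.eigh((H + H.T) / 2.0)     # symmetrize, then eigendecompose
    w = np.clip(w, 1.0 / chi_u, chi_u)         # enforce 1/chi_u I <= H_hat <= chi_u I
    H_hat = (V * w) @ V.T                      # reconstruct V diag(w) V^T
    return np.linalg.solve(H_hat, -grad)

H = np.array([[2.0, 0.0], [0.0, -1.0]])        # toy indefinite Hessian
g = np.array([4.0, 3.0])                       # toy merit-function gradient
step = regularized_newton_step(H, g, chi_u=10.0)
```

Since \({\hat{H}}_t\) is positive definite after clipping, the resulting step always satisfies \({\hat{\varDelta }}_t^T{\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t < 0\) whenever the gradient is nonzero.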
Step 4: Estimate the merit function. Let the adopted search direction be either \({\bar{\varDelta }}_t\) obtained in Step 2 or \({\hat{\varDelta }}_t\) obtained in Step 3. We aim to perform stochastic line search by checking the Armijo condition (26) at the trial point
We estimate the merit function in this step and perform line search in Step 5.
First, we check if the trial primal point \({{\varvec{x}}}_{s_t}\) is in \(\mathcal {T}_{{\bar{\nu }}_t}\). In particular, if \({{\varvec{x}}}_{s_t} \notin \mathcal {T}_{{\bar{\nu }}_t}\), that is \(a_{s_t} = a({{\varvec{x}}}_{s_t}) > {\bar{\nu }}_t/2\) (cf. (5)), then we stop the current iteration and reject the trial point by letting \(({{\varvec{x}}}_{t+1}, \varvec{\mu }_{t+1}, {\varvec{\lambda }}_{t+1}) = ({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\), \({\bar{\epsilon }}_{t+1} = {\bar{\epsilon }}_t\), \({\bar{\alpha }}_{t+1} = {\bar{\alpha }}_t\), and \(\bar{\delta }_{t+1} = \bar{\delta }_t\). We also increase \({\bar{\nu }}_t\) by letting
where \(\lceil y\rceil \) denotes the least integer that is no less than y. The definition of \(j\ge 1\) in (22) ensures \({{\varvec{x}}}_{s_t} \in \mathcal {T}_{{\bar{\nu }}_{t+1}}\). However, \(j=1\) works as well, since \({{\varvec{x}}}_{t+1}={{\varvec{x}}}_t \in \mathcal {T}_{{\bar{\nu }}_t} \subseteq \mathcal {T}_{{\bar{\nu }}_{t+1}}\), as required for performing the next iteration. In the case of \({{\varvec{x}}}_{s_t} \notin \mathcal {T}_{{\bar{\nu }}_t}\), particularly if \(a_{s_t} \ge {\bar{\nu }}_t\), evaluating the merit function \(\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{s_t}\) is not informative, since the penalty term in \(\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{s_t}\) may be rescaled by a negative multiplier. Thus, we increase \({\bar{\nu }}_t\) and rerun the iteration at the current point.
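The \({\bar{\nu }}_t\) update can be sketched as follows; since display (22) is not reproduced here, the exponent \(j\) below is merely one choice consistent with the stated requirements that \(j\ge 1\) and \({{\varvec{x}}}_{s_t} \in \mathcal {T}_{{\bar{\nu }}_{t+1}}\), i.e. \(a_{s_t} \le {\bar{\nu }}_{t+1}/2\):

```python
import math

def increase_nu(nu, a_trial, rho):
    """Return rho**j * nu with the least j >= 1 such that a_trial <= rho**j * nu / 2."""
    j = max(1, math.ceil(math.log(2.0 * a_trial / nu, rho)))
    nu_next = rho**j * nu
    assert a_trial <= nu_next / 2.0            # the trial point lies in the enlarged T
    return nu_next

# Triggered because a_trial = 3 > nu/2 = 0.5; with rho = 2 we need j = 3, so nu -> 8.
nu_next = increase_nu(nu=1.0, a_trial=3.0, rho=2.0)
```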
If instead \({{\varvec{x}}}_{s_t} \in \mathcal {T}_{{\bar{\nu }}_t}\), we generate a batch of independent samples \(\xi _2^t\), which are also independent of \(\xi _1^t\), and estimate \(f_t, f_{s_t}, \nabla f_t, \nabla f_{s_t}\). Similar to Step 1, the estimators \({\bar{f}}_t, {\bar{f}}_{s_t}\) and \({\bar{{\bar{\nabla }}}}f_t, {\bar{{\bar{\nabla }}}}f_{s_t}\) need not be computed with the same number of samples. For example, \({\bar{f}}_t\) and \({\bar{f}}_{s_t}\) can be computed using \(\xi _2^t\) while \({\bar{{\bar{\nabla }}}}f_t\) and \({\bar{{\bar{\nabla }}}}f_{s_t}\) can be computed using a fraction of \(\xi _2^t\). The sample complexities are discussed in Sect. 3.4. Here, we distinguish \({\bar{{\bar{\nabla }}}}f_t\) from \({\bar{\nabla }}f_t\) in Step 1. While both are estimates of \(\nabla f_t\), the former is computed based on \(\xi _2^t\) and the latter is computed based on \(\xi _1^t\). Using \({\bar{f}}_t,{\bar{f}}_{s_t},{\bar{{\bar{\nabla }}}}f_t,{\bar{{\bar{\nabla }}}}f_{s_t}\), we compute \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\) and \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{s_t}\) according to (8).
We require \(|\xi _2^t|\) to be large enough that the event \(\mathcal {E}_2^t\),
satisfies
and
Similar to (15) and (16), \(P_{\xi _2^t}(\cdot )\) and \(\mathbb {E}_{\xi _2^t}[\cdot ]\) denote that the randomness is taken over sampling \(\xi _2^t\) only, while other random quantities are conditioned on. That is, \(P_{\xi _2^t}(\mathcal {E}_2^t) = P(\mathcal {E}_2^t\mid \mathcal {F}_{t-0.5})\) (similar for \(\mathbb {E}_{\xi _2^t}[\cdot ]\)) where the \(\sigma \)-algebra \(\mathcal {F}_{t-0.5} = \mathcal {F}_{t-1}\cup \sigma (\xi _1^t)\) is defined in (28) below.
Step 5: Perform line search. With the merit function estimates, we check the Armijo condition next.
(a) If the Armijo condition holds,
then the trial point is accepted by letting \(({{\varvec{x}}}_{t+1}, \varvec{\mu }_{t+1}, {\varvec{\lambda }}_{t+1}) = ({{\varvec{x}}}_{s_t}, \varvec{\mu }_{s_t}, {\varvec{\lambda }}_{s_t})\) and the stepsize is increased by \({\bar{\alpha }}_{t+1} = \rho {\bar{\alpha }}_t\wedge \alpha _{max}\). Furthermore, we check if the decrease of the merit function is reliable. In particular, if
then we increase \(\bar{\delta }_t\) by \(\bar{\delta }_{t+1} = \rho \bar{\delta }_t\); otherwise, we decrease \(\bar{\delta }_t\) by \(\bar{\delta }_{t+1} = \bar{\delta }_t/\rho \).
(b) If the Armijo condition (26) does not hold, then the trial point is rejected by letting \(({{\varvec{x}}}_{t+1}, \varvec{\mu }_{t+1}, {\varvec{\lambda }}_{t+1}) = ({{\varvec{x}}}_{t}, \varvec{\mu }_{t},{\varvec{\lambda }}_{t})\), \({\bar{\alpha }}_{t+1} = {\bar{\alpha }}_t/\rho \) and \(\bar{\delta }_{t+1} = \bar{\delta }_t/\rho \).
Finally, for both cases (a) and (b), we let \({\bar{\epsilon }}_{t+1} = {\bar{\epsilon }}_t\), \({\bar{\nu }}_{t+1} = {\bar{\nu }}_t\) and repeat the procedure from Step 1. From (27), we can see that \(\bar{\delta }_t\) (roughly) has the order \({\bar{\alpha }}_t\Vert {\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t,{\bar{\nu }}_t,\eta }^t\Vert ^2\), which justifies the definition of the right hand side of (16).
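Steps 5(a) and 5(b) condense into the following sketch (variable names are illustrative; `merit_t` and `merit_s` stand for the estimates \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\) and \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^{s_t}\) from Step 4):

```python
def line_search_update(merit_t, merit_s, directional_deriv, alpha, delta,
                       beta, rho, alpha_max, reliable_decrease):
    """One pass of Step 5; returns (accepted, alpha_next, delta_next)."""
    if merit_s <= merit_t + alpha * beta * directional_deriv:    # Armijo test, cf. (26)
        alpha_next = min(rho * alpha, alpha_max)                 # successful step: enlarge stepsize
        delta_next = rho * delta if reliable_decrease else delta / rho
        return True, alpha_next, delta_next
    return False, alpha / rho, delta / rho                       # unsuccessful step: shrink both

# A successful, reliable step: merit drops from 10 to 9, well below the Armijo threshold.
accepted, alpha_next, delta_next = line_search_update(
    merit_t=10.0, merit_s=9.0, directional_deriv=-4.0,
    alpha=0.5, delta=0.1, beta=0.3, rho=2.0, alpha_max=1.0, reliable_decrease=True)
```

Here `directional_deriv` is the (negative) directional derivative of the merit function along the adopted direction, and `reliable_decrease` encodes the check (27).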
The proposed scheme is summarized in Algorithm 1. We define three types of iterations for the line search. If the Armijo condition (26) holds, we call the iteration a successful step; otherwise we call it an unsuccessful step. For a successful step, if the sufficient decrease in (27) is satisfied, we call it a reliable step; otherwise we call it an unreliable step. The same terminology is used in [14, 40, 44].
To end this section, let us introduce the filtration induced by the randomness of the algorithm. Given a random sample sequence \(\{\xi _1^t,\xi _2^t\}_{t=0}^\infty \), we let \(\mathcal {F}_t = \sigma (\{\xi _1^j, \xi _2^j\}_{j=0}^t)\), \(t\ge 0\), be the \(\sigma \)-algebra generated by all the samples up to iteration t; \(\mathcal {F}_{t-0.5} = \sigma (\{\xi _1^j, \xi _2^j\}_{j=0}^{t-1}\cup \xi _1^t)\), \(t\ge 0\), be the \(\sigma \)-algebra generated by all the samples up to iteration \(t-1\) together with the sample \(\xi _1^t\); and \(\mathcal {F}_{-1}\) be the trivial \(\sigma \)-algebra generated by the initial iterate (which is deterministic). Throughout the presentation, we let \({\bar{\epsilon }}_t\) be the quantity obtained after Step 2; that is, \({\bar{\epsilon }}_t\) satisfies (17) and (18). With this setup, it is easy to see that
We analyze Algorithm 1 in the next subsection.
3.2 Assumptions and stability of parameters
We study the stability of the parameter sequence \(\{{\bar{\epsilon }}_t, {\bar{\nu }}_t\}_t\). We will show that, for each run of the algorithm, the sequence is stabilized after a finite number of iterations. Thus, Lines 5 and 14 of Algorithm 1 will not be performed when the iteration index t is large enough. We begin by introducing the assumptions.
Assumption 3
(Regularity condition) We assume the iterates \(\{({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\}\) and trial points \(\{({{\varvec{x}}}_{s_t}, \varvec{\mu }_{s_t}, {\varvec{\lambda }}_{s_t})\}\) are contained in a convex compact region \(\mathcal {X}\times \mathcal {M}\times \varLambda \). Further, if \({{\varvec{x}}}_{s_t}\in \mathcal {T}_{{\bar{\nu }}_t}\), then the segment \(\{\zeta {{\varvec{x}}}_t + (1-\zeta ){{\varvec{x}}}_{s_t}: \zeta \in (0, 1)\}\subseteq \mathcal {T}_{\theta {\bar{\nu }}_t}\) for some \(\theta \in [1, 2)\). We also assume the functions f, g, c are thrice continuously differentiable over \(\mathcal {X}\), and the realizations \(|F({{\varvec{x}}}, \xi )|\), \(\Vert \nabla F({{\varvec{x}}}, \xi )\Vert \), \(\Vert \nabla ^2 F({{\varvec{x}}}, \xi )\Vert \) are uniformly bounded over \({{\varvec{x}}}\in \mathcal {X}\) and \(\xi \sim {{\mathcal {P}}}\).
Assumption 4
(Constraint qualification) For any \({{\varvec{x}}}\in \varOmega \), we assume that \((J^T({{\varvec{x}}})\,\, G^T_{\mathcal {I}({{\varvec{x}}})}({{\varvec{x}}}))\) has full column rank, where \(\varOmega \) is the feasible set in (2) and \(\mathcal {I}({{\varvec{x}}})\) is the active set in (3). For any \({{\varvec{x}}}\in \mathcal {X}\backslash \varOmega \), we assume the linear system
has a solution for \({{\varvec{z}}}\in \mathbb {R}^d\).
The boundedness condition on realizations in Assumption 3 is widely used in StoSQP analyses to ensure a well-behaved stochastic penalty parameter sequence [3, 4, 18, 40]. The third derivatives of f, g, c are required only in the analysis and are not needed in the implementation; they are required since the existence of the (generalized) Hessian of the augmented Lagrangian needs the third derivatives. See, for example, [50, Section 6] for the same requirement. For deterministic schemes, the compactness condition on the iterates is typical in augmented Lagrangian and SQP analyses [7, Chapter 4] [41, Chapter 18]. Some literature relaxes it by assuming that all quantities (e.g., the objective gradient and constraint Jacobians) are uniformly upper bounded and the objective (and hence the merit function) is lower bounded. However, either condition is rather restrictive for StoSQP due to the underlying randomness of the scheme. That said, given that the StoSQP iterates presumably contract toward a deterministic feasible set, we believe that an unbounded iteration sequence is rare in general. Furthermore, compared to the fully stochastic schemes in [3, 4, 18], we generate a batch of samples to obtain a more precise estimate of the true model in each iteration; thus, our stochastic iterates have a better chance of closely tracking the underlying deterministic iterates.
The convexity of \(\mathcal {M}\times \varLambda \) can be removed by taking the closed convex hulls \(\overline{\text {conv}(\mathcal {M})} \times \overline{\text {conv}(\varLambda )}\). However, the convexity of the set containing the primal iterates is essential to enable a valid Taylor expansion. See [54, Proposition 2.2 and Section 4] [52, Proposition 2.4 and (14)] and references therein for the same requirement when performing line search with (8) and applying its Taylor expansion.
In particular, by the design of Algorithm 1, we have \({{\varvec{x}}}_t \in \mathcal {T}_{{\bar{\nu }}_t}\) for any t, while the trial point \({{\varvec{x}}}_{s_t}\) may be outside \(\mathcal {T}_{{\bar{\nu }}_t}\). If \({{\varvec{x}}}_{s_t}\notin \mathcal {T}_{{\bar{\nu }}_t}\), we enlarge \({\bar{\nu }}_t\) (Line 14) and rerun the iteration from the beginning. Assumption 3 states that if it turns out that \({{\varvec{x}}}_{s_t}\in \mathcal {T}_{{\bar{\nu }}_t}\), then the whole segment \(\zeta {{\varvec{x}}}_t+(1-\zeta ){{\varvec{x}}}_{s_t}\), which may not completely lie in \(\mathcal {T}_{{\bar{\nu }}_t}\) as \(\mathcal {T}_{{\bar{\nu }}_t}\) may be nonconvex, lies in the larger set \(\mathcal {T}_{\theta {\bar{\nu }}_t}\) with \(\theta \in [1,2)\). Since \(\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }\) is SC\(^1\) in \(\mathcal {T}_{2{\bar{\nu }}_t}^\circ \times \mathbb {R}^m\times \mathbb {R}^r\) and \(\mathcal {T}_{\theta {\bar{\nu }}_t} \subseteq \mathcal {T}_{2{\bar{\nu }}_t}^\circ \), where \(\mathcal {T}_{2{\bar{\nu }}_t}^\circ \) denotes the interior of \(\mathcal {T}_{2{\bar{\nu }}_t}\), the second-order Taylor expansion at \(({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\) is allowed [50]. Note that the range of \(\theta \) is inessential. If we replace \(\nu /2\) in (5) by \(\nu /\kappa \) for any \(\kappa >1\), then we would allow the existence of \(\theta \) in \([1, \kappa )\). In other words, \(\theta \) can be as large as any \(\kappa \). In fact, the condition on the segment always holds when the input \(\alpha _{max}\), the upper bound of \({\bar{\alpha }}_t\) (cf. Line 18), is suitably upper bounded. Specifically, supposing a uniform bound \(\varUpsilon \) on the relevant quantities (ensured by the compactness of iterates), for any \(\theta > 1\) and \(\zeta \in (0, 1)\), as long as \(\alpha _{max} \le (\theta -1){\bar{\nu }}_0/(2\varUpsilon ^2)\), we have \(\zeta {{\varvec{x}}}_t+(1-\zeta ){{\varvec{x}}}_{s_t} \in \mathcal {T}_{\theta {\bar{\nu }}_t}\) by noting that
Clearly, the condition on the segment is not required if \(\mathcal {T}_{\nu }\) in (5) is a convex set, which is the case, for example, if we have linear inequality constraints \({{\varvec{x}}}\le {{\varvec{0}}}\), or, more generally, if each \(g_i(\cdot )\) is a convex function. We further investigate the effect of the range of \(\theta \) by varying \(\kappa \) (\(\kappa = 2\) by default; cf. (5)) in the experiments.
By the compactness condition and noting that \({\bar{\nu }}_t\) is increased by at least a factor of \(\rho \) each time in (22), we immediately know that \({\bar{\nu }}_t\) stabilizes when t is large. Moreover, if we let
then \({\bar{\nu }}_t \le {\tilde{\nu }}\), \(t\ge 0\), almost surely. We will show a similar result for \({\bar{\epsilon }}_t\).
Assumption 4 imposes the constraint qualifications. In particular, for feasible points in \(\varOmega \), we assume the linear independence constraint qualification (LICQ), a standard condition ensuring the existence and uniqueness of the Lagrange multipliers [41]. For infeasible points in \(\mathcal {X}\backslash \varOmega \), we assume that the solution set of the linear system (29) is nonempty. The condition (29) restricts the behavior of the constraint functions outside the feasible set, which, together with the compactness condition, implies \(\varOmega \ne \emptyset \) (cf. [36, Proposition 2.5]). In fact, the condition (29) weakens the generalized Mangasarian–Fromovitz constraint qualification (MFCQ) [59, Definition 2.5], and relates to the weak MFCQ, which was proposed for problems with only inequalities in [36, Definition 1] and adopted in [50, Assumption A3] and [53, Assumption 3.2]. However, [36] requires the weak MFCQ to hold for feasible points in addition to LICQ, while [50, 53] and this paper remove such a condition. The condition (29) simplifies and generalizes the weak MFCQ in [36, 50, 53] by including equality constraints. We note that the weak MFCQ is slightly weaker than (29). By Gordan's theorem [26], (29) implies that \(\{c_i\cdot \nabla c_i\}_{i: c_i\ne 0}\cup \{\nabla g_i\}_{i: g_i>0}\) are positively linearly independent:
for any coefficients \(a_i, b_i\ge 0\) with \(\sum _i a_i^2+b_i^2 >0\). In contrast, the weak MFCQ only requires that the above linear combination be nonzero for a particular set of coefficients. We adopt the simplified but slightly stronger condition because (29) has a cleaner form and a clearer connection to SQP subproblems. The coefficients of the weak MFCQ in [36, 50, 53] are relatively hard to interpret: rather than depending only on the constraints themselves, they depend on the particular choice of the merit function, although the resulting assumption statement is sharper. That said, (29) is still weaker than the conditions in other literature on the augmented Lagrangian [34, 47, 49]; and weaker than what is widely assumed in SQP analysis [10], where the IQP system, \(c_i + \nabla ^Tc_i{{\varvec{z}}}= {{\varvec{0}}}\), \(1\le i\le m\), \(g_i + \nabla ^Tg_i{{\varvec{z}}}\le {{\varvec{0}}}\), \(1\le i\le r\), is assumed to have a solution. Moreover, we do not require the strict complementarity condition, which is often imposed for merit functions that apply (squared) slack variables to transform nonlinear inequality constraints [60, A2], [23, Proposition 3.8].
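To make the positive linear independence condition concrete, it can be checked numerically for a given finite set of vectors: after normalizing the nonnegative coefficients to sum to one, the vectors are positively linearly dependent exactly when the minimum of \(\Vert \sum _i a_i v_i\Vert ^2\) over the simplex is zero. The sketch below is our own illustration (not part of Algorithm 1) and assumes SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

def positively_linearly_independent(vectors, tol=1e-8):
    """Check positive linear independence of the given vectors: there are
    no coefficients a_i >= 0, not all zero, with sum_i a_i * v_i = 0.
    Normalizing sum(a_i) = 1, we minimize ||V a||^2 over the simplex;
    a strictly positive minimum certifies positive linear independence."""
    V = np.column_stack(vectors)          # d x n matrix of the vectors
    n = V.shape[1]
    a0 = np.full(n, 1.0 / n)              # start at the simplex center
    res = minimize(
        lambda a: float(np.dot(V @ a, V @ a)),
        a0,
        bounds=[(0.0, 1.0)] * n,
        constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
        method="SLSQP",
    )
    return bool(res.fun > tol)
```

For instance, the coordinate vectors \(e_1, e_2\) pass the check, while \(\{e_1, -e_1\}\) fail it, since \(\tfrac{1}{2}e_1 + \tfrac{1}{2}(-e_1) = 0\).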
The first lemma shows that (17) is satisfied for a sufficiently small \({\bar{\epsilon }}_t\). Although (17) is inspired by [40, (19)] for equality constraints, the proof is quite different from that of [40, Lemma 4].
Lemma 4
Under Assumptions 3 and 4, there exists a deterministic threshold \(\tilde{\epsilon }_1>0\) such that (17) holds for any \({\bar{\epsilon }}_t \le {\tilde{\epsilon }}_1\).
Proof
See Appendix B.1. \(\square \)
The second lemma shows that (18) is satisfied for small \({\bar{\epsilon }}_t\). The analysis is similar to Lemma 3. We need the following condition on the SQP system (12).
Assumption 5
We assume that, whenever (12) is solvable, \((J_t^T\; G_{t_a}^T)\) has full column rank, and there exist positive constants \(\varUpsilon _{B}\ge 1\ge \gamma _{B}\vee \gamma _{H}\) such that
and \({{\varvec{z}}}^TB_t{{\varvec{z}}}\ge \gamma _{B}\Vert {{\varvec{z}}}\Vert ^2\), \(\forall {{\varvec{z}}}\in \{{{\varvec{z}}}\in \mathbb {R}^d: J_t{{\varvec{z}}}= {{\varvec{0}}}, G_{t_a}{{\varvec{z}}}= {{\varvec{0}}}\}\).
Assumption 5 summarizes Assumptions 1 and 2. As shown in Lemma 3, the conditions on \(M_t\) and \((J_t^T\; G_{t_a}^T)\) hold locally. For the presented global analysis, a Hessian approximation \(B_t\) satisfying the condition is easy to construct, e.g., \(B_t = I\); however, such a choice is not suitable for achieving fast local rates. In practice, given a lower bound \(\gamma _{B}>0\), \(B_t\) is constructed by regularizing a subsampled Hessian (e.g., for finite-sum objectives) or a sketched Hessian (e.g., for regression objectives), which preserves certain second-order information and can be obtained at lower cost. With Assumption 5, we have the following result.
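As a minimal illustration of such a construction (the specific eigenvalue shift below is our own choice of regularization, not one prescribed in the text), a subsampled Hessian estimate can be shifted so that its smallest eigenvalue is at least \(\gamma _{B}\), which guarantees \({{\varvec{z}}}^TB_t{{\varvec{z}}}\ge \gamma _{B}\Vert {{\varvec{z}}}\Vert ^2\) for all \({{\varvec{z}}}\), and in particular on the null space required by Assumption 5:

```python
import numpy as np

def regularized_hessian(H_sub, gamma_B):
    """Shift a subsampled Hessian estimate H_sub so that the result B
    satisfies z^T B z >= gamma_B * ||z||^2 for every z, which is stronger
    than the reduced-space curvature condition in Assumption 5."""
    H_sub = (H_sub + H_sub.T) / 2.0            # symmetrize the estimate
    lam_min = np.linalg.eigvalsh(H_sub).min()  # smallest eigenvalue
    shift = max(0.0, gamma_B - lam_min)        # only shift when needed
    return H_sub + shift * np.eye(H_sub.shape[0])
```

When the subsampled estimate is already sufficiently positive definite, no shift is applied and the second-order information is kept intact.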
Lemma 5
Under Assumptions 3 and 5, there exists a deterministic threshold \(\tilde{\epsilon }_2>0\) such that (18) holds for any \({\bar{\epsilon }}_t \le {\tilde{\epsilon }}_2\).
Proof
See Appendix B.2. \(\square \)
We summarize (30), Lemmas 4 and 5 in the next theorem.
Theorem 1
Under Assumptions 3, 4, and 5, there exist deterministic thresholds \({\tilde{\nu }}\), \({\tilde{\epsilon }}>0\) such that \(\{{\bar{\nu }}_t, {\bar{\epsilon }}_t\}_t\) generated by Algorithm 1 satisfy \({\bar{\nu }}_t \le {\tilde{\nu }}\), \({\bar{\epsilon }}_t \ge {\tilde{\epsilon }}\). Moreover, almost surely, there exists an iteration threshold \(\bar{t}<\infty \), such that \({\bar{\epsilon }}_t = {\bar{\epsilon }}_{\bar{t}}\), \({\bar{\nu }}_t = {\bar{\nu }}_{\bar{t}}\), \( t\ge \bar{t}\).
Proof
The existence of \({\tilde{\nu }}\) is shown in (30). By Lemmas 4 and 5, and defining \({\tilde{\epsilon }}= ({\tilde{\epsilon }}_1\wedge {\tilde{\epsilon }}_2)/\rho \), we show the existence of \({\tilde{\epsilon }}\). The existence of the iteration threshold \({\bar{t}}\) is ensured by noting that \(\{{\bar{\nu }}_t, 1/{\bar{\epsilon }}_t\}_t\) are bounded from above, while each update increases these quantities by at least a factor of \(\rho >1\). \(\square \)
We mention that the iteration threshold \({\bar{t}}\) is random for stochastic schemes and varies between runs; however, it always exists. The following analysis supposes t is large enough that \(t\ge {\bar{t}}\) and \({\bar{\epsilon }}_t, {\bar{\nu }}_t\) have stabilized. We condition our analysis on the \(\sigma \)-algebra \(\mathcal {F}_{{\bar{t}}}\), meaning that we only consider the randomness of the samples generated after \({\bar{t}}+1\) iterations and, by (28), the parameters \({\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}\) are fixed. We should point out that, although it is standard to focus only on the tail of the iteration sequence when showing global convergence (even in the deterministic case [41, Theorem 18.3]), such an analysis misses non-asymptotic guarantees. In particular, we know the scheme changes the merit parameters at most \(\log ({\tilde{\nu }}{\bar{\epsilon }}_0/({\bar{\nu }}_0{\tilde{\epsilon }}))/\log (\rho )\) times; however, how many iterations all these changes span is not answered by our analysis. Establishing a bound on \({\bar{t}}\) in expectation or with high probability would further clarify the efficiency of the scheme. However, since any characterization of \({\bar{t}}\) is difficult even for deterministic schemes, we leave such a study to the future. Another missing aspect is the iteration complexity, i.e., the number of iterations needed to attain an \(\epsilon \)-first- or second-order stationary point (we abuse the \(\epsilon \) notation here to refer to the accuracy level). The iteration complexity has recently been studied for two StoSQP schemes under very particular setups [5, 17]; none of the existing works allow either stochastic line search or inequality constraints. We leave the iteration complexity of our scheme to the future as well.
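To make the counting argument concrete: each parameter change multiplies \({\bar{\nu }}\) by at least \(\rho \) (or divides \({\bar{\epsilon }}\) by at least \(\rho \)), and both quantities are bounded by their thresholds, so at most \(\log ({\tilde{\nu }}{\bar{\epsilon }}_0/({\bar{\nu }}_0{\tilde{\epsilon }}))/\log (\rho )\) changes can occur. The toy simulation below, with hypothetical numeric values for the thresholds, verifies the bound:

```python
import math

def max_parameter_updates(nu0, eps0, nu_tilde, eps_tilde, rho):
    """Upper bound on the number of merit-parameter changes: each change
    multiplies nu by >= rho or divides eps by >= rho, while nu <= nu_tilde
    and eps >= eps_tilde hold throughout."""
    return math.log(nu_tilde * eps0 / (nu0 * eps_tilde)) / math.log(rho)

def simulate_updates(nu0, eps0, nu_tilde, eps_tilde, rho):
    """Count updates in the worst case where every possible update of
    either parameter is triggered until both thresholds are reached."""
    nu, eps, count = nu0, eps0, 0
    while nu * rho <= nu_tilde:      # nu can still be increased
        nu *= rho
        count += 1
    while eps / rho >= eps_tilde:    # eps can still be decreased
        eps /= rho
        count += 1
    return count
```

For example, with \({\bar{\nu }}_0=1\), \({\tilde{\nu }}=8\), \({\bar{\epsilon }}_0=1\), \({\tilde{\epsilon }}=1/4\), and \(\rho =2\), the bound evaluates to \(\log (32)/\log (2) = 5\), and the worst-case simulation performs exactly 5 updates.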
3.3 Convergence analysis
We conduct the global convergence analysis for Algorithm 1. We prove that \(\lim _{t\rightarrow \infty } R_t = 0\) almost surely, where \(R_t = \Vert (\nabla _{{{\varvec{x}}}}\mathcal {L}_t, c_t, \max \{g_t,-{\varvec{\lambda }}_t\})\Vert \) is the KKT residual. We suppose the line search conditions (15), (16), (24), (25) hold; we will discuss the sample complexities that ensure these generic conditions in Sect. 3.4. All conditions hold for sufficiently large batch sizes.
Our proof structure closely follows [40]. The analyses of Lemmas 7, 9, 10, 11 and Theorem 3 are more involved: they account for the differences between equality and inequality constraints, as well as for our relaxations of the feasibility error condition and the increasing sample size requirement of [40]. The analysis in Theorem 5 is new and strengthens the “liminf" convergence in [40]. The analysis of Theorem 4 is slightly adjusted, while those of Lemma 8 and Theorem 2 are the same. The adopted potential function (or Lyapunov function) is
where \(\omega \in (0, 1)\) is a coefficient to be specified later. We note that using \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) by itself (i.e., \(\omega =1\)) to monitor the iteration progress is not suitable for the stochastic setting; it is possible that \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) increases while \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) decreases. In contrast, \(\varTheta _{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta , \omega }^t\) linearly combines different components and provides a composite measure of the progress. For example, the decrease of \(\varTheta _{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta , \omega }^t\) may come from \(\bar{\delta }_t\) (Lines 22 and 25 of Algorithm 1).
Since parameters \({\bar{\epsilon }}_{\bar{t}}, {\bar{\nu }}_{{\bar{t}}}, \eta \) in \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }\) are fixed (conditional on \(\mathcal {F}_{{\bar{t}}}\)), we denote \(\varTheta _{\omega }^t = \varTheta _{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}},\eta , \omega }^t\) for notational simplicity. In the presentation of the theoretical results, we only track the parameters \((\beta , \alpha _{max}, \kappa _{grad}, \kappa _{f}, p_{grad}, p_{f}, \chi _{grad}, \chi _{f})\) that relate to the line search conditions. In particular, we use \(C_1, C_2\ldots \) and \(\varUpsilon _1, \varUpsilon _2\ldots \) to denote deterministic constants that are independent of these parameters, but may depend on \((\gamma _{B}, \gamma _{H}, \varUpsilon _B, \chi _{u}, \chi _{err},\rho , \eta , {\bar{\epsilon }}_0, {\bar{\nu }}_0)\), and thus on the deterministic thresholds \({\tilde{\epsilon }}\) and \({\tilde{\nu }}\). Recall that \((\gamma _{B}, \gamma _{H}, \varUpsilon _B, \chi _{u})\) come from Assumption 5 and (20), while \((\chi _{err}, \rho , \eta , {\bar{\epsilon }}_0, {\bar{\nu }}_0)\) are algorithm inputs.
The first lemma presents a preliminary result.
Lemma 6
Under Assumptions 3, 4, 5, the following results hold deterministically conditional on \(\mathcal {F}_{t-1}\).
-
(a)
There exists \(C_1>0\) such that the following two inequalities hold for any iteration \(t \ge 0\) ((a2) also holds for \(s_t\)), any parameters \(\epsilon ,\nu \), and any generated sample set \(\xi \):
(a1) \(\left\| {\bar{\nabla }}\mathcal {L}_{\epsilon , \nu , \eta }^t - \nabla \mathcal {L}_{\epsilon , \nu , \eta }^t \right\| \le C_1\left\{ \left\| {\bar{\nabla }}f_t - \nabla f_t\right\| \vee ({\bar{R}}_t\wedge 1)\right\} \cdot \left\| {\bar{\nabla }}^2 f_t - \nabla ^2 f_t\right\| \);
(a2) \(\left| {\bar{\mathcal {L}}}_{\epsilon , \nu , \eta }^t - \mathcal {L}_{\epsilon , \nu , \eta }^t \right| \le C_1\{|{\bar{f}}_t - f_t| \vee [({\bar{R}}_t\vee \Vert {\bar{\nabla }}f_t-\nabla f_t\Vert )\wedge 1]\cdot \left\| {\bar{\nabla }}f_t - \nabla f_t\right\| \}\).
-
(b)
There exists \(C_2>0\) such that for any \(t\ge 0\) and set \(\xi \),
$$\begin{aligned} \left\| {\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\right\| \le C_2\left\{ \Vert {\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\Vert + \left\| ( c_t,\;\varvec{w}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t}^t) \right\| \right\} . \end{aligned}$$ -
(c)
There exists \(C_3>0\) such that for any \(t\ge 0\) and set \(\xi \), if (12) is solvable, then
$$\begin{aligned} \left\| {\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\right\| \le C_3\left\| \left( \begin{array}{c} {\bar{\varDelta }}{{\varvec{x}}}_t\\ J_t{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\\ G_t{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t + \varPi _c(\textrm{diag}^2(g_t){\varvec{\lambda }}_t) \end{array} \right) \right\| . \end{aligned}$$
Proof
See Appendix B.3. \(\square \)
The results in Lemma 6 hold deterministically conditional on \(\mathcal {F}_{t-1}\), because the samples \(\xi \) used to compute \({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\), \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\) are also given in the statement. The following result suggests that if both the gradient \(\nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) and the function evaluations \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\), \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^{s_t}\) are precisely estimated, in the sense that the event \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\) happens (cf. (14), (23)), then there is a uniform lower bound on \({\bar{\alpha }}_t\) that makes the Armijo condition hold.
Lemma 7
For \(t\ge {\bar{t}}+ 1\), suppose \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\) happens. There exists \(\varUpsilon _1>0\) such that the t-th step satisfies the Armijo condition (26) (i.e., is a successful step) if
Proof
See Appendix B.4. \(\square \)
The next result suggests that, if only the function evaluations \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\), \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^{s_t}\) are precisely estimated, in the sense that the event \(\mathcal {E}_2^t\) happens, then a sufficient decrease of \({\bar{\mathcal {L}}}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) implies a sufficient decrease of \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\). The proof directly follows [40, Lemma 6], and thus is omitted.
Lemma 8
For \(t\ge {\bar{t}}+ 1\), suppose \(\mathcal {E}_2^t\) happens. If the t-th step satisfies the Armijo condition (26), then
Based on Lemmas 7 and 8, we now establish an error recursion for the potential function \(\varTheta _{\omega }^t\) in (31). Our analysis is separated into three cases according to the events \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\), \((\mathcal {E}_1^t)^c\cap \mathcal {E}_2^t\) and \((\mathcal {E}_2^t)^c\). We will show that \(\varTheta _{\omega }^t\) decreases in the case of \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\), while it may increase in the other two cases. Fortunately, by letting \(p_{grad}\) and \(p_{f}\) be small, \(\varTheta _{\omega }^t\) still decreases in expectation.
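The three-case argument can be summarized by a back-of-the-envelope drift computation. With hypothetical magnitudes \(D, I_1, I_2\) standing in for the paper's constants (the actual decrease and increase terms involve \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2\) and \(\bar{\delta }_t\)), a union bound gives \(\mathbb {E}[\varDelta \varTheta ] \le -(1-p_{grad}-p_f)D + p_{grad}I_1 + p_fI_2\), which is negative once \(p_{grad}, p_f\) are small:

```python
def expected_decrease_bound(p_grad, p_f, D, I1, I2):
    """Union-bound sketch of the one-step drift of the potential function:
    the good event E1 ∩ E2 (prob >= 1 - p_grad - p_f) decreases Theta by
    at least D; the events (E1)^c ∩ E2 and (E2)^c (probs <= p_grad, p_f)
    may increase it by at most I1 and I2.  Returns an upper bound on
    E[Theta_{t+1} - Theta_t].  D, I1, I2 are illustrative placeholders."""
    return -(1.0 - p_grad - p_f) * D + p_grad * I1 + p_f * I2
```

For instance, with \(D=1\) and \(I_1=I_2=10\), the bound is negative for \(p_{grad}=p_f=0.01\) but positive for \(p_{grad}=p_f=0.3\), illustrating why the probabilities must be capped as in (33).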
We first show in Lemma 9 that \(\varTheta _{\omega }^t\) decreases when \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\) happens. We note that the decrease of \(\varTheta _{\omega }^t\) exceeds \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2\) by \(\bar{\delta }_t\) (up to a multiplier).
Lemma 9
For \(t\ge {\bar{t}}+1\), suppose \(\mathcal {E}_1^t\cap \mathcal {E}_2^t\) happens. There exists \(\varUpsilon _2>0\), such that if \(\omega \) satisfies
then
Proof
See Appendix B.5. \(\square \)
We then show in Lemma 10 that \(\varTheta _{\omega }^t\) may increase, if \(\nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\) is not precisely estimated (i.e., \((\mathcal {E}_1^t)^c\) happens) but \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\), \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^{s_t}\) are precisely estimated (i.e., \(\mathcal {E}_2^t\) happens). The increase is proportional to \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2\).
Lemma 10
For \(t\ge {\bar{t}}+1\), suppose \((\mathcal {E}_1^t)^c\cap \mathcal {E}_2^t\) happens. Under (32), we have
Proof
See Appendix B.6. \(\square \)
We finally show in Lemma 11 that \(\varTheta _{\omega }^t\) increases, and the increase can exceed \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2\), if \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\), \(\mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^{s_t}\) are not precisely estimated. In this case, the excess terms have to be controlled by making use of condition (25).
Lemma 11
For \(t\ge {\bar{t}}+ 1\), suppose \((\mathcal {E}_2^t)^c\) happens. Under (32), we have
Proof
See Appendix B.7. \(\square \)
Combining Lemmas 9, 10, 11, we derive the one-step error recursion of \(\varTheta _{\omega }^t\). The proof directly follows that of [40, Theorem 2] and is omitted.
Theorem 2
(One-step error recursion) For \(t\ge {\bar{t}}+1\), suppose \(\omega \) satisfies (32) and \(p_{grad}\) and \(p_{f}\) satisfy
Then
With Theorem 2, we derive the convergence of \({\bar{\alpha }}_tR_t^2\) in the next theorem, where \(R_t = \Vert (\nabla _{{{\varvec{x}}}}\mathcal {L}_t, c_t, \max \{g_t,-{\varvec{\lambda }}_t\})\Vert \) is the KKT residual.
Theorem 3
Under the conditions of Theorem 2, \(\lim \limits _{t\rightarrow \infty }{\bar{\alpha }}_t R_t^2 = 0\) almost surely.
Proof
See Appendix B.8. \(\square \)
Then, we show that the “liminf” of the KKT residuals converges to zero.
Theorem 4
(“liminf” convergence) Consider Algorithm 1 under Assumptions 3, 4, 5. Suppose \(\omega \) satisfies (32) and \(p_{grad}, p_{f}\) satisfy (33). Then, almost surely, we have that \(\liminf _{t\rightarrow \infty } R_t = 0\).
Proof
See Appendix B.9. \(\square \)
Finally, we strengthen the statement in Theorem 4 and complete the global convergence analysis of Algorithm 1.
Theorem 5
(Global convergence) Under the same conditions of Theorem 4, we have that
Proof
See Appendix B.10. \(\square \)
Our analysis generalizes the results of [40] to inequality constrained problems. The “lim" convergence guarantee in Theorem 5 strengthens the existing “liminf" convergence guarantee of stochastic line search for both unconstrained problems [44, Theorem 4.10] and equality constrained problems [40, Theorem 4]. Theorem 5 also differs from the results in [3, 4, 18], where the authors showed the (liminf) convergence of the expected KKT residual under a fully stochastic setup. Compared to [3, 4, 18], our scheme does not tune a deterministic sequence that controls the stepsizes and determines the convergence behavior (i.e., converging to a KKT point or only to its neighborhood). Instead, our scheme tunes two probability parameters \(p_{grad}, p_f\). As seen from (32) and (33), the upper bound conditions on \(p_{grad}, p_f\) depend on the inputs \((\rho , \beta , \kappa _{grad}, \alpha _{max})\) and a universal constant \(\varUpsilon _2\). Estimating \(\varUpsilon _2\) is often difficult in practice; however, \(p_{grad}, p_f\) affect the algorithm's performance only via the generated batch sizes, and the batch sizes depend on \(p_{grad}, p_f\) only through logarithmic factors (see (37) and (41) later). Thus, the algorithm is robust to \(p_{grad}, p_f\). We also empirically test the robustness of Algorithm 1 to its parameters in Sect. 4. In addition, (32) and (33) suggest that the larger the parameters \((\rho , 1/\beta , \kappa _{grad}, \alpha _{max})\) we use, the smaller the probabilities \(p_{grad}, p_f\) have to be. Such a dependence is consistent with the general intuition: the algorithm performs more aggressive updates with a less restrictive Armijo condition when \((\rho , 1/\beta , \kappa _{grad}, \alpha _{max})\) are large; thus, a more precise model estimation in each iteration is desired in this case.
3.4 Discussion on sample complexities
As introduced in Sect. 1, the stochastic line search is performed by generating a batch of samples in each iteration to obtain precise model estimates, which is standard in the literature [11, 13, 14, 20, 22, 29, 44]. The batch sizes are adaptively controlled based on the iteration progress. We now discuss the batch sizes \(|\xi _1^t|\) and \(|\xi _2^t|\) that ensure the generic conditions (15), (16), (24), (25) of Algorithm 1. We show that, if the KKT residual \(R_t\) does not vanish, all the conditions are satisfied by properly choosing \(|\xi _1^t|\) and \(|\xi _2^t|\).
Sample complexity of \(\xi _1^t\) The samples \(\xi _1^t\) are used to estimate \(\nabla f_t\) and \(\nabla ^2f_t\) in Step 1 of Algorithm 1. The estimators \({\bar{\nabla }}f_t\) and \({\bar{\nabla }}^2 f_t\) can be computed with different amounts of samples, and their samples may or may not be independent. Let us suppose \({\bar{\nabla }}f_t\) is computed from the samples \(\xi _1^t\), while \({\bar{\nabla }}^2 f_t\) is computed from a subset of samples \(\tau _1^t\subseteq \xi _1^t\). The case where \({\bar{\nabla }}f_t\) and \({\bar{\nabla }}^2 f_t\) are computed from two disjoint subsets of \(\xi _1^t\) can be studied by the same analysis. We define
By Lemma 6(a1), we know that (15) holds if, with probability \(1 - p_{grad}\),
where we suppress universal constants (such as the variance of a single sample) in the \(O(\cdot )\) notation. By the matrix Bernstein inequality [58, Theorem 7.7.1], (34) is satisfied if
Furthermore, we use the bound \(\mathbb {E}[\Vert {\bar{\nabla }}^2f_t - \nabla ^2 f_t\Vert ^2\mid \mathcal {F}_{t-1}] \le O(\log d/|\tau _1^t|)\) (cf. [58, (6.1.6)]) and know that (16) holds if
Combining (35) and (36) together, we know that the conditions (15) and (16) are satisfied if
Since (16) is imposed only when \(t-1\) is a successful step, the term \(\chi _{grad}^2\bar{\delta }_t/{\bar{\alpha }}_t\) in the denominator of (37) can be removed when \(t-1\) is an unsuccessful step. In contrast to [40], where the gradient \(\nabla f_t\) and Hessian \(\nabla ^2 f_t\) are estimated from the same set of samples, we sharpen the calculation and observe that the batch size \(|\tau _1^t|\) for \(\nabla ^2 f_t\) can be significantly smaller than \(|\xi _1^t|\) for \(\nabla f_t\). When \({\bar{R}}_t\) gets close to zero, the ratio \(|\tau _1^t|/|\xi _1^t|\) also decays to zero.
We mention that \({\bar{R}}_t\) on the right-hand side of the condition on \(|\xi _1^t|\) in (37) has to be computed from the samples \(\xi _1^t\). A practical algorithm can first specify \(\xi _1^t\), then compute \({\bar{R}}_t\), and finally check whether (37) holds. For example, a While loop can be designed to gradually increase \(|\xi _1^t|\) until (37) holds (cf. [40, Algorithm 4]). Such a While loop always terminates in finite time when \(R_t>0\), because \({\bar{R}}_t\rightarrow R_t\) as \(|\xi _1^t|\) increases (by the law of large numbers), so that the right-hand side of (37) does not diverge.
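The While loop described above can be sketched as follows; `estimate_R` and `required_size` are hypothetical placeholders for, respectively, the estimator of \({\bar{R}}_t\) from the current batch and the right-hand side of (37) (whose universal constants are suppressed in the paper):

```python
import math

def choose_batch_size(estimate_R, required_size, initial_size=2,
                      growth=2, max_size=10**7):
    """Grow the batch until it meets the batch-dependent sample-size
    requirement.  The estimate R_bar is recomputed from the current
    batch, so the right-hand side of the condition changes as the batch
    grows; `estimate_R(n)` returns the KKT-residual estimate from n
    samples and `required_size(R_bar)` the size the condition demands."""
    n = initial_size
    while n < max_size:
        R_bar = estimate_R(n)             # recompute with the larger batch
        if n >= required_size(R_bar):     # check the complexity condition
            return n
        n *= growth                       # e.g., double the batch size
    raise RuntimeError("batch-size loop did not terminate; is R_t = 0?")
```

In a toy setting where \({\bar{R}}_t \rightarrow R_t > 0\) as the batch grows and the requirement scales like \(\log (1/p_{grad})/{\bar{R}}_t^2\), the loop terminates after finitely many doublings, mirroring the argument above.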
Sample complexity of \(\xi _2^t\) The samples \(\xi _2^t\) are used to estimate \(f_t, f_{s_t}, \nabla f_t, \nabla f_{s_t}\) in Step 4 of Algorithm 1. Similar to the discussion above, the estimators \({\bar{f}}_t, {\bar{f}}_{s_t}\) and \({\bar{{\bar{\nabla }}}}f_t, {\bar{{\bar{\nabla }}}}f_{s_t}\) can be computed with different amounts of samples, and their samples may or may not be independent. Let us suppose \({\bar{f}}_t, {\bar{f}}_{s_t}\) are computed from the samples \(\xi _2^t\), while \({\bar{{\bar{\nabla }}}}f_t, {\bar{{\bar{\nabla }}}}f_{s_t}\) are computed from a subset of samples \(\tau _2^t\subseteq \xi _2^t\). We define (similarly for \({\bar{f}}_{s_t}, {\bar{{\bar{\nabla }}}}f_{s_t}\))
By Lemma 6(a2), we know that (24) holds if, with probability \(1-p_f\),
where \(\bar{{\bar{R}}}_t\) and \(\bar{{\bar{R}}}_{s_t}\) are computed by \(\tau _2^t\) and we use the fact that, for scalars a, b, \((a\wedge 1)\vee (b\wedge 1) = (a\vee b)\wedge 1\). By Bernstein inequality, (38) is satisfied if
Moreover, by \(\mathbb {E}[|{\bar{f}}_t-f_t|\mid \mathcal {F}_{t-0.5}]\le O(1/|\xi _2^t|)\) and \(\mathbb {E}[\Vert {\bar{{\bar{\nabla }}}}f_t - \nabla f_t\Vert ^4]\le O(1/|\tau _2^t|^2)\), we can see that (25) holds if
Combining (39) and (40) together, the conditions (24) and (25) are satisfied if
Similar to the complexity (37), (41) suggests that the batch size \(|\tau _2^t|\) for \(\nabla f_t\), \(\nabla f_{s_t}\) is significantly less than \(|\xi _2^t|\) for \(f_t\), \(f_{s_t}\), with the ratio \(|\tau _2^t|/|\xi _2^t|\) decaying to zero when t increases. The denominator in (41) is nonzero if \({\bar{R}}_t\ne 0\) (which is always the case; otherwise, we should stop the iteration). In particular, if , then
if , then
3.5 Discussion on computations and limitations
We now briefly discuss the per-iteration computational cost of Algorithm 1, and present some limitations and extensions of the algorithm.
Objective evaluations By Sect. 3.4 and the complexities in (37) and (41), Algorithm 1 generates \(|\xi _1^t| + |\xi _2^t|\) samples in each iteration, and evaluates \(2|\xi _2^t|\) function values, \(|\xi _1^t| + 2|\tau _2^t|\) gradients, and \(|\tau _1^t|\) Hessians for the objective. To see their orders from (37) and (41) clearly, let us suppose \({\bar{\alpha }}_t\) stabilizes at \(\alpha _{max}\) (i.e., the steps are successful) and (see (27) for the rationale). We also replace the stochastic quantities in (37), (41) by their deterministic counterparts and let \(R_t\approx R_{s_t}\). Then, we can see that \(|\xi _1^t| = |\tau _2^t| = O(1/R_t^2)\), \(|\xi _2^t| = O(1/R_t^4)\), and \(|\tau _1^t| = O(1)\). Thus, the objective evaluations are
We note that the numbers of function-value and gradient evaluations increase as the iteration proceeds, and the function evaluations scale as the square of the gradient evaluations. Under the same setup, our evaluation complexities for function values and gradients are consistent with those of the unconstrained stochastic line search [44, Section 2.3] with \(R_t\) replaced by \(\Vert \nabla f_t\Vert \). Although the augmented Lagrangian merit function requires Hessian evaluations, the Hessian complexity is significantly lower than that of function values and gradients, and does not have to increase during the iteration. Such an observation is missing in the prior work [40].
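The evaluation counts above can be tabulated, up to the suppressed constants, as functions of the KKT residual; the small helper below simply encodes the orders \(|\xi _1^t| = |\tau _2^t| = O(1/R_t^2)\), \(|\xi _2^t| = O(1/R_t^4)\), \(|\tau _1^t| = O(1)\) (constants set to one for illustration):

```python
def evaluation_orders(R_t):
    """Per-iteration objective evaluation counts of Algorithm 1, up to
    constants, under the simplifications of Sect. 3.5:
    |xi_1| = |tau_2| = O(1/R^2), |xi_2| = O(1/R^4), |tau_1| = O(1)."""
    xi1 = tau2 = 1.0 / R_t**2     # gradient batch sizes
    xi2 = 1.0 / R_t**4            # function-value batch size
    tau1 = 1.0                    # Hessian batch size: bounded
    return {
        "function values": 2 * xi2,       # 2|xi_2| evaluations
        "gradients": xi1 + 2 * tau2,      # |xi_1| + 2|tau_2| evaluations
        "Hessians": tau1,                 # |tau_1| evaluations
    }
```

For example, at \(R_t = 0.1\) the gradient count is of order \(3\times 10^2\) while the function-value count is of order \(2\times 10^4\): the quartic dependence dominates, while the Hessian count stays bounded.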
Constraint evaluations Since the constraints are deterministic, Algorithm 1 has the same constraint evaluations as deterministic schemes. In particular, the algorithm evaluates four function values (two for equalities and two for inequalities; and for each type of constraint, one at the current point and one at the trial point), four Jacobians, and two Hessians in each iteration.
Computational cost As in deterministic SQP schemes, solving the Newton system dominates the computational cost. If we do not exploit the potential sparse or block-diagonal structures that many problems have, solving the system (12) requires \(O((d+m+|\text {active set}|)^3) + O((m+r)^3) = O(d^3+m^3+r^3)\) flops. This cost is larger than that of solving a standard SQP system (see [50, (8.9)]) by the extra term \(O((m+r)^3)\). However, as explained in Remark 1, the analysis of the standard SQP system relies on the exact Hessian, which is inaccessible in our stochastic setting. When the SQP direction is not employed, the backup direction can be obtained with \(O(d+m+r)\) flops for the gradient step, \(O((d+m+r)^3)\) flops for the regularized Newton step, and a cost in between for the truncated Newton step. Such computational cost is standard in the literature [53, 55], where a safeguarding direction satisfying (20) is required to minimize the augmented Lagrangian. We should mention that, as an EQP scheme, the above computations are not directly comparable with those of IQP schemes. In that case, the SQP systems include inequality constraints and are more expensive to solve, although fewer iterations may be performed.
Limitations of the design Algorithm 1 has a few limitations. First, it solves the SQP systems exactly. In practice, one may apply conjugate gradient (CG) or minimum residual (MINRES) methods, or apply randomized iterative solvers, to solve the systems inexactly. The inexact direction can reduce the computational cost significantly [18]. Second, our backup direction does not fully utilize the computations of the SQP direction. Although our analysis allows any backup direction satisfying (20), and utilizing the Newton direction as a backup is standard in the literature [53, 55], a better choice would be to directly modify the SQP direction. Then, we may derive a direction that converges faster than the gradient direction and requires fewer computations than the (regularized) Newton direction. We leave these two refinements to the future.
4 Numerical experiments
We implement the following two algorithms on 39 nonlinear problems collected in the CUTEst test set [27]. We select the problems that have a non-constant objective with fewer than 1000 free variables. We also require the problems to have at least one inequality constraint, no infeasible constraints, and no network constraints; and we require the number of constraints to be less than the number of variables. The setup of each algorithm is as follows.
-
(a)
AdapNewton: the adaptive scheme in Algorithm 1 with the safeguarding direction given by the regularized Newton step. We set the inputs as \({\bar{\alpha }}_0 = \alpha _{max} = 1.5\), \(\beta =0.3\), \(\kappa _{f} = \beta /(4\alpha _{max}) = 0.05\), \( \kappa _{grad} = \chi _{grad} = \chi _{f} = \bar{\delta }_0=1\), \({\bar{\epsilon }}_0=10^{-2}\), \(\eta =10^{-4}\), \(p_{grad}=p_f=0.1\), \(\rho =2\). Here, we set \(\alpha _{max}>1\) since a stochastic scheme can select a stepsize that is greater than one (cf. Fig. 4). \(\beta \) is close to the middle of the interval (0, 0.5), which is a common range for deterministic schemes. \(({\bar{\epsilon }}_0, \bar{\delta }_0)\) are adaptively adjusted during the iteration, while we prefer a small initial \({\bar{\epsilon }}_0\) so as to run fewer adjustments on it. \(\kappa _{f}\) is set to the largest allowed value \(\beta /(4\alpha _{max})\) (cf. Algorithm 1); however, the parameters \((\kappa _{grad},\kappa _{f},\chi _{grad},\chi _{f},p_{grad},p_f)\) all affect the batch sizes and play the same role as the constant C that we study later. We let \(\eta \) be small so that the last penalty term of (8) is almost negligible, and the merit function (8) is close to a standard augmented Lagrangian function. We also test the robustness of the algorithm to three parameters C, \(\kappa \), \(\chi _{err}\). Here, C is the constant multiplier of the big “O" notation in (37) and (41) (the variance \(\sigma ^2\) of a single sample, which we introduce later, is also absorbed in the “O"). \(\kappa \) is a parameter of the set \(\mathcal {T}_\nu \) (\(\kappa =2\) in (5)), and \(\chi _{err}\) is a parameter of the feasibility error condition (17). Their default values are \(C = \kappa = 2\) and \(\chi _{err} = 1\), while we allow them to vary over wide ranges: \(C,\kappa \in \{2,2^3,2^6\}\) and \(\chi _{err}\in \{1,10,10^2\}\). When we vary one parameter, the other two are set to their defaults.
-
(b)
AdapGD: the adaptive scheme in Algorithm 1 with the safeguarding direction given by the steepest descent step. The setup is the same as in (a).
For both algorithms, the initial iterate \(({{\varvec{x}}}_0, \varvec{\mu }_0, {\varvec{\lambda }}_0)\) is specified by the CUTEst package. The package also provides the deterministic function, gradient, and Hessian evaluations, \(f_t, \nabla f_t, \nabla ^2f_t\), in each iteration. We generate their stochastic counterparts by adding Gaussian noise with variance \(\sigma ^2\). In particular, we let \({\bar{f}}_t \sim {{\mathcal {N}}}(f_t, \sigma ^2)\), \({\bar{\nabla }}f_t\sim {{\mathcal {N}}}(\nabla f_t, \sigma ^2(I + {{\varvec{1}}}{{\varvec{1}}}^T))\), and \(({\bar{\nabla }}^2f_t)_{ij}\sim {{\mathcal {N}}}((\nabla ^2 f_t)_{ij}, \sigma ^2)\). We try four levels of variance: \(\sigma ^2 \in \{ 10^{-8}, 10^{-4}, 10^{-2}, 10^{-1}\}\). Throughout the implementation, we let \(B_t = I\) (cf. (12), (21)) and set the iteration budget to \(10^4\). The stopping criterion is
The former two cases indicate that the iteration converges within the budget. For each algorithm, each problem, and each setup, we average the results over the convergent runs out of 5 runs. Our code is available at https://github.com/senna1128/Constrained-Stochastic-Optimization-Inequality.
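The noise model above can be reproduced as follows (a sketch; the covariance \(\sigma ^2(I + {{\varvec{1}}}{{\varvec{1}}}^T)\) is realized as independent coordinate noise plus a shared scalar noise, and we symmetrize the Hessian perturbation, a detail the entrywise description leaves implicit):

```python
import numpy as np

def noisy_oracle(f_t, grad_t, hess_t, sigma2, rng):
    """Stochastic counterparts of the deterministic CUTEst evaluations:
    f_bar ~ N(f_t, sigma2), grad_bar ~ N(grad_t, sigma2 * (I + 11^T)),
    and each Hessian entry perturbed by N(0, sigma2) noise (symmetrized
    here so the estimate stays symmetric, which slightly changes the
    entrywise variance)."""
    d = grad_t.shape[0]
    sigma = np.sqrt(sigma2)
    f_bar = f_t + sigma * rng.standard_normal()
    # N(0, sigma2 (I + 11^T)) = N(0, sigma2 I) noise + shared N(0, sigma2)
    grad_bar = (grad_t + sigma * rng.standard_normal(d)
                + sigma * rng.standard_normal() * np.ones(d))
    E = sigma * rng.standard_normal((d, d))
    hess_bar = hess_t + (E + E.T) / 2.0
    return f_bar, grad_bar, hess_bar
```

Setting \(\sigma ^2 = 0\) recovers the deterministic oracle, which corresponds to the near-noiseless regime \(\sigma ^2 = 10^{-8}\) of the experiments.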
KKT residuals We draw the KKT residual boxplots for AdapNewton and AdapGD in Fig. 1. From the figure, we see that both algorithms are robust to the tuning parameters \((C,\kappa ,\chi _{err})\). For both algorithms, the median of the KKT residuals gradually increases as \(\sigma ^2\) increases, which is reasonable since the model estimate from each sample is noisier when \(\sigma ^2\) is larger. However, the increase of the KKT residuals is mild since, regardless of \(\sigma ^2\), both methods generate enough samples in each iteration to enforce the model accuracy conditions (i.e., (15), (16), (24), (25)). Figure 1 also suggests that AdapNewton outperforms AdapGD, although the improvement is limited. In fact, the convergence on a few problems may be improved by utilizing the regularized Newton step as the backup for the SQP step; however, the SQP step is employed eventually.
Sample sizes We draw the sample size boxplots for AdapNewton and AdapGD in Fig. 2. From the figure, we see that both methods generate far fewer samples for estimating the objective Hessian than for estimating the objective value and gradient, and the objective gradient is estimated with fewer samples than the objective value. The sample size differences of the three quantities (objective value, gradient, and Hessian) become clearer as \(\sigma ^2\) increases. For a fixed \(\sigma ^2\), the sample sizes across different setups of \((C, \kappa , \chi _{err})\) do not vary much. In fact, the parameters \(\kappa \) and \(\chi _{err}\) do not directly affect the sample complexities. The parameter C plays a similar role to \(\sigma ^2\) and affects the sample complexities via changing the multipliers in (37) and (41). However, the effect of varying C from 2 to 64 is marginal compared to that of varying \(\sigma ^2\) from \(10^{-8}\) to \(10^{-1}\). Thus, Fig. 2 again illustrates the robustness of the designed adaptive algorithm.
Moreover, as discussed in Sects. 3.4 and 3.5, the objective value, gradient, and Hessian have different sample complexities in each iteration, which depend on different powers of the reciprocal of the KKT residual, \(1/R_t\). When \(\sigma ^2=10^{-8}\), the small variance dominates the effect of \(1/R_t\), so all three quantities can be estimated with very few samples. When \(\sigma ^2=0.1\), the different dependencies of the sample sizes on \(1/R_t\) are more evident. Overall, Fig. 2 reveals that different objective quantities can be estimated with different numbers of samples. This aspect improves upon the prior work [40], where quantities with different sample complexities are estimated based on the same set of samples, and the effect of the variance \(\sigma ^2\) on the sample complexities is neglected.
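The qualitative behavior described above can be mimicked with a toy batch-size rule. The formula below is only a schematic of adaptive sampling, not the paper's exact conditions (37) and (41): `power` stands for the exponent of \(1/R_t\) attached to a given quantity, and C rescales the variance as in the experiments.

```python
import math

def batch_size(sigma2, R_t, power, C=2.0):
    """Schematic adaptive batch size: proportional to the noise
    variance sigma2 and to a power of 1/R_t, floored at one sample."""
    return max(1, math.ceil(C * sigma2 / R_t ** power))
```

With \(\sigma ^2=10^{-8}\) this rule returns a single sample regardless of the power, while with \(\sigma ^2=0.1\) and a small residual \(R_t\) the returned sizes separate sharply across powers, mirroring the separation discussed above.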
In addition, we draw the trajectories of the sample size ratios. In particular, for both algorithms, we randomly pick 5 convergent problems and draw two ratio trajectories for each problem: the sample size of the gradient over the sample size of the value, and the sample size of the Hessian over the sample size of the gradient. We take \(C = 64\) as an example. The plot is shown in Fig. 3. From the figure, we note that the sample size ratios tend to stabilize at a small level, and the trend is more evident when \(\sigma ^2=0.1\). As explained for Fig. 2 above, this observation is consistent with our discussions in Sect. 3.4, and illustrates the improvement of our analysis over [40] for performing the stochastic line search on the augmented Lagrangian merit function.
Stepsize trajectories Figure 4 plots the stepsize trajectories selected by the stochastic line search. We take the default setup as an example, i.e., \(C=\kappa =2\), \(\chi _{err}=1\). As in Fig. 3, for each level of \(\sigma ^2\), we randomly pick 5 convergent problems to show the trajectories. Although there is no clear trend in the stepsize trajectories due to stochasticity, we clearly see for both methods that the stepsize can increase significantly from a very small value and even exceed 1. This distinctive property of the line search procedure ensures fast convergence of the scheme, and is not enjoyed by many non-adaptive schemes, where the stepsize often monotonically decays to zero.
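The stepsize behavior described above comes from a line search that first tries to enlarge the previous stepsize before backtracking. The following sketch conveys the mechanism only; it omits the paper's sample-size and accuracy conditions, and the names and constants (`beta`, `rho`, `alpha_max`) are our illustrative choices.

```python
def stochastic_line_search(merit, x, direction, deriv, alpha,
                           beta=1e-4, rho=0.5, alpha_max=1.5):
    """Backtracking Armijo search on an estimated merit function.

    merit:     callable returning a (possibly noisy) merit estimate.
    deriv:     estimated directional derivative at x (negative).
    alpha:     stepsize carried over from the previous iteration; it is
               enlarged first, so accepted stepsizes can grow over the
               iterations and even exceed 1.
    """
    alpha = min(alpha / rho, alpha_max)   # try a larger step first
    phi0 = merit(x)
    while alpha > 1e-16:
        if merit(x + alpha * direction) <= phi0 + beta * alpha * deriv:
            return alpha                  # sufficient decrease holds
        alpha *= rho                      # otherwise backtrack
    return alpha
```

For instance, on the quadratic merit \(\Vert x\Vert ^2\) with descent direction \(-x\) and a previous stepsize of 1, the enlarged trial stepsize 1.5 already satisfies sufficient decrease and is accepted, i.e., the accepted stepsize exceeds 1.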
We also examine some other aspects of the algorithm, such as the proportion of iterations with failed SQP steps, with unstabilized penalty parameters, or with a triggered feasibility error condition (17). We also study the effect of multiplicative noise, and implement the algorithm on an inequality constrained logistic regression problem. Due to space limits, these auxiliary experiments are provided in Appendix D.
5 Conclusion
This paper studied inequality constrained stochastic nonlinear optimization problems. We designed an active-set StoSQP algorithm that exploits an exact augmented Lagrangian merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and selects the stepsize via stochastic line search. We proved that the KKT residuals converge to zero almost surely, which generalizes and strengthens the results for unconstrained and equality constrained problems in [40, 44] and enables wider applications.
Extensions of this work include studying more advanced StoSQP schemes. As mentioned in Sect. 3.5, the proposed StoSQP scheme has to solve the SQP system exactly. We note that, recently, [18] designed a StoSQP scheme that employs an inexact Newton direction, and [3] designed a StoSQP scheme that relaxes the LICQ condition. It remains open how to design schemes that achieve such relaxations in the presence of inequality constraints. In addition, some advanced SQP schemes solve inequality constrained problems by mixing IQP with EQP: one solves a convex IQP to obtain an active set, and then solves an EQP to obtain the search direction; see the “SQP+” scheme in [37] for example. Investigating this kind of mixed scheme with a stochastic objective is promising. Besides SQP, other classical methods for solving nonlinear problems can be adapted to deal with stochastic objectives, such as augmented Lagrangian methods and interior point methods. Different methods have different benefits, and all of them deserve study in the setup where the model can only be accessed with certain noise.
Finally, as mentioned in Sect. 3.2, non-asymptotic analysis and iteration complexity of the proposed scheme are missing from our global analysis. Further, it is known in the deterministic setting that differentiable merit functions can overcome the Maratos effect and facilitate a fast local rate, while non-smooth merit functions (without advanced local modifications) cannot. This raises two questions: what is the local rate of the proposed StoSQP, and is it better than the rate obtained with non-smooth merit functions? To answer these questions, we need a better understanding of the local behavior of stochastic line search. Such a local study would complement the established global analysis, recognize the benefits of differentiable merit functions, and bridge the understanding gap between stochastic SQP and deterministic SQP.
Notes
Here, we mean that \(\mathcal {X}_{\epsilon ,\nu }\) and \(\varLambda _{\epsilon ,\nu }\) only directly depend on \(\epsilon ,\nu \) but not \(\eta \), in contrast to the neighborhoods \(\mathcal {X}_{\epsilon ,\nu ,\eta }\) and \(\varLambda _{\epsilon ,\nu ,\eta }\). However, since the threshold of \(\epsilon \), \(\gamma _{B}^2(\gamma _{B}\wedge \eta )/\left\{ (1\vee \nu )\varUpsilon \right\} \), is also determined by \(\eta \), the final local neighborhoods \(\mathcal {X}_{\epsilon ,\nu }\) and \(\varLambda _{\epsilon ,\nu }\) with \(\epsilon \) below the threshold also depend on \(\eta \) indirectly. Recall that \(\eta \) can be any positive constant throughout the paper.
We note that \(\xi _2^t\) may not be generated if Lines 13 and 14 of Algorithm 1 are performed. However, for simplicity we suppose a sample \(\xi _2^t\) is still generated in this case, although no quantity is determined by this sample.
References
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014). https://doi.org/10.1137/130915984
Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM J. Optim. 31(2), 1489–1518 (2021). https://doi.org/10.1137/19m1291832
Berahas, A.S., Curtis, F.E., O’Neill, M.J., Robinson, D.P.: A stochastic sequential quadratic optimization algorithm for nonlinear equality constrained optimization with rank-deficient Jacobians. arXiv preprint (2021). arXiv:2106.13015
Berahas, A.S., Curtis, F.E., Robinson, D., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31(2), 1352–1379 (2021). https://doi.org/10.1137/20m1354556
Berahas, A.S., Bollapragada, R., Zhou, B.: An adaptive sampling sequential quadratic programming method for equality constrained stochastic optimization. arXiv preprint (2022). arXiv:2206.00712
Berahas, A.S., Shi, J., Yi, Z., Zhou, B.: Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. arXiv preprint (2022). arXiv:2204.04161
Bertsekas, D.: Constrained Optimization and Lagrange Multiplier Methods. Elsevier, Belmont (1982). https://doi.org/10.1016/c2013-0-10366-2
Birge, J.R.: State-of-the-art-survey—stochastic programming: computation and applications. INFORMS J. Comput. 9(2), 111–133 (1997). https://doi.org/10.1287/ijoc.9.2.111
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS J. Optim. 1(2), 92–119 (2019). https://doi.org/10.1287/ijoo.2019.0016
Boggs, P.T., Tolle, J.W.: Sequential quadratic programming. Acta Numer. 4, 1–51 (1995). https://doi.org/10.1017/s0962492900002518
Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018). https://doi.org/10.1137/17m1154679
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018). https://doi.org/10.1137/16m1080173
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2017). https://doi.org/10.1007/s10107-017-1137-4
Chen, C., Tung, F., Vedula, N., Mori, G.: Constraint-aware deep neural network compression. In: Computer Vision—ECCV 2018. Springer, pp. 409–424 (2018). https://doi.org/10.1007/978-3-030-01237-3_25
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Math. Program. 169(2), 447–487 (2017). https://doi.org/10.1007/s10107-017-1141-8
Curtis, F.E., O’Neill, M.J., Robinson, D.P.: Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization. arXiv preprint (2021). arXiv:2112.14799
Curtis, F.E., Robinson, D.P., Zhou, B.: Inexact sequential quadratic optimization for minimizing a stochastic objective function subject to deterministic nonlinear equality constraints. arXiv preprint (2021). arXiv:2107.03512
di Serafino, D., Krejić, N., Jerinkić, N.K., Viola, M.: LSOS: line-search second-order stochastic optimization methods. arXiv preprint (2020). arXiv:2007.15966
De, S., Yadav, A., Jacobs, D., Goldstein, T.: Automated inference with adaptive batches. In: Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, vol. 54, pp. 1504–1513 (2017). http://proceedings.mlr.press/v54/de17a.html
Fasano, G., Lucidi, S.: A nonmonotone truncated Newton–Krylov method exploiting negative curvature directions, for large scale unconstrained optimization. Optim. Lett. 3(4), 521–535 (2009). https://doi.org/10.1007/s11590-009-0132-y
Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012). https://doi.org/10.1137/110830629
Fukuda, E.H., Fukushima, M.: A note on the squared slack variables technique for nonlinear optimization. J. Oper. Res. Soc. Jpn. 60(3), 262–270 (2017). https://doi.org/10.15807/jorsj.60.262
Gallager, R.G.: Stochastic Processes. Cambridge University Press, Cambridge (2013). https://doi.org/10.1017/cbo9781139626514
Goh, C.K., Liu, Y., Kong, A.W.K.: A constrained deep neural network for ordinal regression. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE (2018). https://doi.org/10.1109/cvpr.2018.00093
Goldman, A.J., Tucker, A.W.: 4. Theory of linear programming. In: Linear Inequalities and Related Systems. (AM-38). Princeton University Press, pp. 53–98 (1957). https://doi.org/10.1515/9781400881987-005
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2014). https://doi.org/10.1007/s10589-014-9687-3
Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2017). https://doi.org/10.1093/imanum/drx043
Krejić, N., Krklec, N.: Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl. Math. 245, 213–231 (2013). https://doi.org/10.1016/j.cam.2012.12.020
Liew, C.K.: Inequality constrained least-squares estimation. J. Am. Stat. Assoc. 71(355), 746–751 (1976). https://doi.org/10.1080/01621459.1976.10481560
Liew, C.K.: A two-stage least-squares estimation with inequality restrictions on parameters. Rev. Econ. Stat. 58(2), 234 (1976). https://doi.org/10.2307/1924031
Livieris, I.E., Pintelas, P.: An adaptive nonmonotone active set—weight constrained—neural network training algorithm. Neurocomputing 360, 294–303 (2019). https://doi.org/10.1016/j.neucom.2019.06.033
Livieris, I.E., Pintelas, P.: An improved weight-constrained neural network training algorithm. Neural Comput. Appl. 32(9), 4177–4185 (2019). https://doi.org/10.1007/s00521-019-04342-2
Lucidi, S.: New results on a class of exact augmented Lagrangians. J. Optim. Theory Appl. 58(2), 259–282 (1988). https://doi.org/10.1007/bf00939685
Lucidi, S.: Recursive quadratic programming algorithm that uses an exact augmented Lagrangian function. J. Optim. Theory Appl. 67(2), 227–245 (1990). https://doi.org/10.1007/bf00940474
Lucidi, S.: New results on a continuously differentiable exact penalty function. SIAM J. Optim. 2(4), 558–574 (1992). https://doi.org/10.1137/0802027
Morales, J.L., Nocedal, J., Wu, Y.: A sequential quadratic programming algorithm with an additional equality constrained phase. IMA J. Numer. Anal. 32(2), 553–579 (2011). https://doi.org/10.1093/imanum/drq037
Na, S.: Global convergence of online optimization for nonlinear model predictive control. Adv. Neural Inf. Process. Syst. 34, 12441–12453 (2021)
Na, S., Mahoney, M.W.: Asymptotic convergence rate and statistical inference for stochastic sequential quadratic programming. arXiv preprint (2022). arXiv:2205.13687
Na, S., Anitescu, M., Kolar, M.: An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Math. Program. (2022). https://doi.org/10.1007/s10107-022-01846-z
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Onuk, A.E., Akcakaya, M., Bardhan, J.P., Erdogmus, D., Brooks, D.H., Makowski, L.: Constrained maximum likelihood estimation of relative abundances of protein conformation in a heterogeneous mixture from small angle x-ray scattering intensity measurements. IEEE Trans. Signal Process. 63(20), 5383–5394 (2015). https://doi.org/10.1109/tsp.2015.2455515
Oztoprak, F., Byrd, R., Nocedal, J.: Constrained optimization in the presence of noise. arXiv preprint (2021). arXiv:2110.04355
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020). https://doi.org/10.1137/18m1216250
Phillips, R.F.: A constrained maximum-likelihood approach to estimating switching regressions. J. Econom. 48(1–2), 241–262 (1991). https://doi.org/10.1016/0304-4076(91)90040-k
Pillo, G.D., Grippo, L.: A new class of augmented Lagrangians in nonlinear programming. SIAM J. Control. Optim. 17(5), 618–628 (1979). https://doi.org/10.1137/0317044
Pillo, G.D., Grippo, L.: A new augmented Lagrangian function for inequality constraints in nonlinear programming problems. J. Optim. Theory Appl. 36(4), 495–519 (1982). https://doi.org/10.1007/bf00940544
Pillo, G.D., Grippo, L.: A continuously differentiable exact penalty function for nonlinear programming problems with inequality constraints. SIAM J. Control. Optim. 23(1), 72–84 (1985). https://doi.org/10.1137/0323007
Pillo, G.D., Grippo, L.: An exact penalty function method with global convergence properties for nonlinear programming problems. Math. Program. 36(1), 1–18 (1986). https://doi.org/10.1007/bf02591986
Pillo, G.D., Lucidi, S.: An augmented Lagrangian function with improved exactness properties. SIAM J. Optim. 12(2), 376–406 (2002). https://doi.org/10.1137/s1052623497321894
Pillo, G.D., Grippo, L., Lampariello, F.: A method for solving equality constrained optimization problems by unconstrained minimization. In: Optimization Techniques, Springer-Verlag, Lecture Notes in Control and Information Science, vol. 23, pp. 96–105 (1980). https://doi.org/10.1007/bfb0006592
Pillo, G.D., Lucidi, S., Palagi, L.: Convergence to second-order stationary points of a primal-dual algorithm model for nonlinear programming. Math. Oper. Res. 30(4), 897–915 (2005). https://doi.org/10.1287/moor.1050.0150
Pillo, G.D., Liuzzi, G., Lucidi, S., Palagi, L.: A truncated Newton method in an augmented Lagrangian framework for nonlinear programming. Comput. Optim. Appl. 45(2), 311–352 (2008). https://doi.org/10.1007/s10589-008-9216-3
Pillo, G.D., Liuzzi, G., Lucidi, S.: A primal-dual algorithm for nonlinear programming exploiting negative curvature directions. Numer. Algebra Control Optim. 1(3), 509–528 (2011). https://doi.org/10.3934/naco.2011.1.509
Pillo, G.D., Liuzzi, G., Lucidi, S.: An exact penalty-Lagrangian approach for large-scale nonlinear programming. Optimization 60(1–2), 223–252 (2011). https://doi.org/10.1080/02331934.2010.505964
Silvapulle, M.J., Sen, P.K.: Constrained Statistical Inference, vol. 912. Wiley, New York (2004)
Sun, S., Nocedal, J.: A trust region method for the optimization of noisy functions. arXiv preprint (2022). arXiv:2201.00973
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends® Mach. Learn. 8(1–2), 1–230 (2015). https://doi.org/10.1561/2200000048
Xu, M., Ye, J.J., Zhang, L.: Smoothing augmented Lagrangian method for nonsmooth constrained optimization problems. J. Glob. Optim. 62(4), 675–694 (2014). https://doi.org/10.1007/s10898-014-0242-7
Zavala, V.M., Anitescu, M.: Scalable nonlinear programming via exact differentiable penalty functions and trust-region Newton methods. SIAM J. Optim. 24(1), 528–558 (2014). https://doi.org/10.1137/120888181
Acknowledgements
We thank Associate Editor and two anonymous reviewers for instructive comments, which help us further enhance the algorithm design and presentation. This material was completed in part with resources provided by the University of Chicago Research Computing Center. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347 and by NSF through award CNS-1545046.
Appendices
A Proofs of Sect. 2
1.1 A.1 Proof of Lemma 2
Throughout the proof, we denote \(g^\star = g({{\varvec{x}}}^\star )\), \(\varvec{w}_{\epsilon , \nu }^\star = \varvec{w}_{\epsilon , \nu }({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\), \(\nabla \mathcal {L}^\star = \nabla \mathcal {L}({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) (similar for \(c^\star \), \(a_{\nu }^\star \), \(q_{\nu }^\star \) etc.) to be the quantities evaluated at \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\in \mathcal {T}_{\nu }\times \mathbb {R}^m\times \mathbb {R}^r\). Since \(\varvec{w}_{\epsilon , \nu }^\star = {{\varvec{0}}}\), we know from Lemma 1 that \(g^\star \le {{\varvec{0}}}\), \({\varvec{\lambda }}^\star \ge {{\varvec{0}}}\), \(({\varvec{\lambda }}^\star )^T g^\star = 0\). This implies that \(\textrm{diag}^2(g^\star ){\varvec{\lambda }}^\star = {{\varvec{0}}}\). Furthermore, by \(c^\star = {{\varvec{0}}}\), \(\varvec{w}_{\epsilon , \nu }^\star = {{\varvec{0}}}\), \(a_{\nu }^\star , \eta , \epsilon >0\), and \(\nabla _{\varvec{\mu }, {\varvec{\lambda }}}\mathcal {L}_{\epsilon , \nu , \eta }^\star = {{\varvec{0}}}\), we obtain from (10) that
Recalling the definition of \(M^\star \) in (9), we multiply the matrix \(\nabla _{{{\varvec{x}}}}^T\mathcal {L}^\star ((J^\star )^T\;\; (G^\star )^T)\) from the left and obtain
This implies \(\left( (J^\star )^TJ^\star + (G^\star )^TG^\star \right) \nabla _{{{\varvec{x}}}}\mathcal {L}^\star = {{\varvec{0}}}\). Multiplying \(\nabla _{{{\varvec{x}}}}\mathcal {L}^\star \) from the left, we have \(J^\star \nabla _{{{\varvec{x}}}}\mathcal {L}^\star = {{\varvec{0}}}\) and \(G^\star \nabla _{{{\varvec{x}}}}\mathcal {L}^\star = {{\varvec{0}}}\). Plugging into (10) and noting that \(\nabla _{{{\varvec{x}}}}\mathcal {L}_{\epsilon , \nu , \eta }^{\star } = {{\varvec{0}}}\), \(\varvec{w}_{\epsilon , \nu }^\star = {{\varvec{0}}}\), \(c^\star = {{\varvec{0}}}\), \(\textrm{diag}^2(g^\star ){\varvec{\lambda }}^\star = {{\varvec{0}}}\), and \(q_{\nu }^\star , a_{\nu }^\star , \epsilon >0\), we obtain \(\nabla _{{{\varvec{x}}}}\mathcal {L}^\star = {{\varvec{0}}}\). This shows \(({{\varvec{x}}}^\star , {\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) satisfies (4), and we complete the proof.
1.2 A.2 Proof of Lemma 3
We require the following two preparatory lemmas.
Lemma 12
Let \(\mathcal {I}({{\varvec{x}}}^\star )\) be the active set defined in (3), and \(\mathcal {I}^+({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star ) = \left\{ i\in \mathcal {I}({{\varvec{x}}}^\star ): {\varvec{\lambda }}^\star _i>0\right\} \). For any \(\epsilon , \nu >0\), there exists a compact set \(\mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu }\ni ({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\) depending on \((\epsilon ,\nu )\), such that
Proof
See Appendix A.3. \(\square \)
Lemma 13
Under Assumption 1, there exist a compact set \(X\ni {{\varvec{x}}}^\star \) and a constant \(\gamma _{H}\in ~(0,1]\) such that \(M({{\varvec{x}}})\succeq \gamma _{H} I\) for any \({{\varvec{x}}}\in X\), where \(M({{\varvec{x}}})\) is defined in (9). Furthermore, for any \(\epsilon ,\nu >0\), there exists a compact set \(\mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu }\ni ({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\) depending on \((\epsilon , \nu )\), such that
Proof
See Appendix A.4. \(\square \)
We now prove Lemma 3. We suppress the evaluation point and the iteration index t. Let \(\mathcal {X}\times \mathcal {M}\times \varLambda \subseteq \mathcal {T}_{\nu }\times \mathbb {R}^m\times \mathbb {R}^r\) be any compact set around \(({{\varvec{x}}}^\star ,{\varvec{\mu }^\star }, {\varvec{\lambda }}^\star )\) (independent of \(\epsilon , \nu , \eta \)) and suppose \(({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }})\in \mathcal {X}\times \mathcal {M}\times \varLambda \). By Lemma 13, we know there exist a constant \(\gamma _{H}\in (0,1]\) and, for any \(\epsilon , \nu >0\), a compact subset \(\mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu } \subseteq \mathcal {X}\times \varLambda \) such that for any point in the subset,
Thus, by Assumption 2, we know from [41, Lemma 16.1] that \(K_a\) is invertible, and thus (12) is solvable. Furthermore, we can also show that (see [40, Lemma 1] for a simple proof)
With the above two results, we conduct our analysis. Throughout the proof, we use \(\varUpsilon _1, \varUpsilon _2\ldots \) to denote generic upper bounds of functions evaluated in the set \(\mathcal {X}\times \mathcal {M}\times \varLambda \), which are independent of \((\epsilon , \nu , \eta , \gamma _{B}, \gamma _{H})\). As they are upper bounds, without loss of generality, \(\varUpsilon _i \ge 1\), \(\forall i\).
We start from \((\nabla \mathcal {L}_{\epsilon , \nu , \eta }^{(1)})^T\varDelta \) and suppose \(({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }})\in \mathcal {X}_{\epsilon ,\nu }\times \mathcal {M}\times \varLambda _{\epsilon ,\nu }\subseteq \mathcal {X}\times \mathcal {M}\times \varLambda \), where \(\mathcal {X}_{\epsilon ,\nu }\) and \(\varLambda _{\epsilon ,\nu }\) come from Lemma 13. We have
Since \(({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }})\in \mathcal {X}\times \mathcal {M}\times \varLambda \), there exists \(\varUpsilon _1\ge 1\) such that \(\left\| (Q_1\; Q_2)\right\| \le \varUpsilon _1\). Thus, we have
Moreover, we note that
By \(({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }})\in \mathcal {X}\times \mathcal {M}\times \varLambda \), there exist \(\varUpsilon _2, \varUpsilon _3, \varUpsilon _4\ge 1\) such that
and
Combining the above three displays,
where the second inequality also uses \(\Vert B\Vert \le \varUpsilon _{B}\) by Assumption 2. Combining (A.4), (A.5), (A.6), and using \(0<q_\nu \le \nu \) and \(\gamma _{H}\vee \gamma _{B} \le 1\),
where the last inequality holds by defining
To deal with \(\varDelta {{\varvec{x}}}^TB\varDelta {{\varvec{x}}}\) in (A.7), we decompose \(\varDelta {{\varvec{x}}}\) as \(\varDelta {{\varvec{x}}}= \varDelta {{\varvec{u}}}+ \varDelta {{\varvec{v}}}\) where \(\varDelta {{\varvec{u}}}\in ~\text {Image}\left\{ (J^T\;\; G_a^T)\right\} \) and \(\varDelta {{\varvec{v}}}\in \text {Ker}\left\{ (J^T\;\; G_a^T)^T\right\} \). Note that
Thus, by Assumption 2,
where the last inequality holds with \(\varUpsilon _6 = \varUpsilon _B + 4\varUpsilon _B^2 + 1\) by noting that \(\gamma _{B}\le 1\). Combining the above display with (A.7) and using the following Young’s inequality,
we have
Therefore, as long as
we have
Thus, letting \(\varUpsilon = \left\{ 8\varUpsilon _5 \vee (2\varUpsilon _5^2 + \varUpsilon _6)\right\} /\gamma _{H}^4\) and noting that (A.10a) is implied by (A.10b), we complete the first part of the statement.
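The splitting \(\varDelta {{\varvec{x}}}= \varDelta {{\varvec{u}}}+ \varDelta {{\varvec{v}}}\) used in the argument above can be checked numerically. The sketch below is ours, not part of the paper's algorithm: it builds the orthogonal projector onto \(\text {Image}\{A^T\}\) via the pseudoinverse, where A stands for the stacked constraint Jacobian \((J^T\;\; G_a^T)^T\).

```python
import numpy as np

def range_kernel_split(A, dx):
    """Decompose dx = du + dv with du in Image(A^T) and dv in Ker(A)."""
    P = np.linalg.pinv(A) @ A   # orthogonal projector onto Image(A^T)
    du = P @ dx
    dv = dx - du                # complementary part lies in Ker(A)
    return du, dv
```

Since A pinv(A) A = A, the component dv satisfies A dv = 0 exactly, and du + dv recovers dx, as the decomposition in the proof requires.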
We now prove the second part of the statement. By (13), \(({{\varvec{x}}},\varvec{\mu },{\varvec{\lambda }})\in \mathcal {X}\times \mathcal {M}\times \varLambda \) (and hence (A.12)), and the fact that \(a_\nu \ge \nu /2\), there exists \(\varUpsilon _7\ge 1\) such that
Since \(\epsilon \le 1\) by (A.10) (noting that \(\varUpsilon \ge 1 \ge \gamma _{H}\vee \gamma _{B}\)), we simplify the above display by
Noting that
and
we define \(\varUpsilon _8 = 6\sqrt{2}\varUpsilon _7\varUpsilon _1(\varUpsilon _2(\varUpsilon _{B}+1) + \varUpsilon _4 +1)\) and have
By Lemma 12, we can find a compact subset of \(\mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu }\) depending only on \((\epsilon ,\nu )\) such that \(\mathcal {A}_{\epsilon , \nu } \subseteq \mathcal {I}({{\varvec{x}}}^\star )\) and \(\mathcal {A}_{\epsilon , \nu }^c\subseteq \{\mathcal {I}^+({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\}^c\); thus
Furthermore, we let \(\mathcal {X}_{\epsilon ,\nu ,\eta }\times \varLambda _{\epsilon ,\nu ,\eta } \subseteq \mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu }\) be a compact subset depending additionally on \(\eta \), such that
Then, combining (A.11) with the above two displays leads to
This completes the proof.
1.3 A.3 Proof of Lemma 12
Let \(\mathcal {X}\times \varLambda \subseteq \mathcal {T}_{\nu }\times \mathbb {R}^r\) be any compact set around \(({{\varvec{x}}}^\star ,{\varvec{\lambda }}^\star )\). For any \(({{\varvec{x}}}, {\varvec{\lambda }})\in \mathcal {X}\times \varLambda \), we have
For any \(i \in \mathcal {I}^+({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\), we know \(g_i^\star = 0\) and \({\varvec{\lambda }}^\star _i >0\). Thus, \(g_i^\star + \epsilon \kappa _\nu {\varvec{\lambda }}^\star _i >0\). Consider the balls \(\mathcal {B}_i^{{{\varvec{x}}}} = \{{{\varvec{x}}}: \Vert {{\varvec{x}}}- {{\varvec{x}}}^\star \Vert \le r_i\}\cap \mathcal {X}\) and \(\mathcal {B}_i^{{\varvec{\lambda }}} = \{{\varvec{\lambda }}: \Vert {\varvec{\lambda }}- {\varvec{\lambda }}^\star \Vert \le r_i\}\cap \varLambda \). For a sufficiently small \(r_i\) (depending on \(\epsilon \) and \(\nu \)), we have \(({{\varvec{x}}}^\star ,{\varvec{\lambda }}^\star )\in \mathcal {B}_i^{{{\varvec{x}}}}\times \mathcal {B}_i^{{\varvec{\lambda }}} \subseteq \mathcal {X}\times \varLambda \) and, for any \(({{\varvec{x}}}, {\varvec{\lambda }})\in \mathcal {B}_i^{{{\varvec{x}}}}\times \mathcal {B}_i^{{\varvec{\lambda }}}\),
The first inequality is due to the continuity of \(g_i\). This implies \(i \in \mathcal {A}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\). Therefore, for any \(({{\varvec{x}}}, {\varvec{\lambda }})\) in the compact set \(\cap _{i\in \mathcal {I}^+({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )}\mathcal {B}_i^{{{\varvec{x}}}}\times \mathcal {B}_i^{{\varvec{\lambda }}}\), we have \(\mathcal {I}^+({{\varvec{x}}}^\star , {\varvec{\lambda }}^\star )\subseteq \mathcal {A}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\). The argument \(\mathcal {A}_{\epsilon , \nu }({{\varvec{x}}}, {\varvec{\lambda }})\subseteq \mathcal {I}({{\varvec{x}}}^\star )\) can be proved in the same way.
1.4 A.4 Proof of Lemma 13
By Assumption 1, there exists a compact set \(X\ni {{\varvec{x}}}^\star \) small enough such that \((J^T({{\varvec{x}}})\;G^T_{\mathcal {I}({{\varvec{x}}}^\star )}({{\varvec{x}}}))\) has full column rank for all \({{\varvec{x}}}\in X\). Furthermore, for any \(({{\varvec{a}}}, {{\varvec{b}}})\in \mathbb {R}^{m+r}\), we note that
where the first implication is due to \(\textrm{diag}(g({{\varvec{x}}})){{\varvec{b}}}= 0\) and \(\mathcal {I}^c({{\varvec{x}}}^\star ) \subseteq \mathcal {I}^c({{\varvec{x}}})\) (since X is small), and the second implication is due to \(\Vert J^T({{\varvec{x}}}){{\varvec{a}}}+ G^T({{\varvec{x}}}){{\varvec{b}}}\Vert = 0\). Therefore, \(M({{\varvec{x}}})\) is invertible. Moreover, for any \(\mathcal {A}\subseteq \mathcal {I}({{\varvec{x}}}^\star )\), we have
where \(\sigma _{\min }(\cdot )\) denotes the least singular value of a matrix. By (A.13), (A.14), and the compactness of X, we know that there exists \(\gamma _{H}\in (0, 1]\) such that
To show the second part of the statement, we apply Lemma 12, and know that there exists a compact set \(\mathcal {X}_{\epsilon ,\nu }\times \varLambda _{\epsilon ,\nu }\subseteq X\times \mathbb {R}^r\) such that \(\mathcal {A}({{\varvec{x}}},{\varvec{\lambda }})\subseteq \mathcal {I}({{\varvec{x}}}^\star )\), \(\forall ({{\varvec{x}}}, {\varvec{\lambda }})\in \mathcal {X}_{\epsilon ,\nu }\times ~\varLambda _{\epsilon ,\nu }\). Combining this fact with (A.15), we complete the proof.
B Proofs of Sect. 3
1.1 B.1 Proof of Lemma 4
It suffices to show that there exists a threshold \(\tilde{\epsilon }>0\) such that for any samples \(\xi _1\), any parameter \(\nu \in [{\bar{\nu }}_0, \tilde{\nu }]\), where \({\bar{\nu }}_0\) is the fixed initial input of Algorithm 1 and \({\tilde{\nu }}\) is defined in (30), and any point \(({{\varvec{x}}}, \varvec{\mu }, {\varvec{\lambda }}) \in \mathcal {X}\times \mathcal {M}\times \varLambda \) with \({{\varvec{x}}}\in \mathcal {T}_{\nu }\), if \(\epsilon \le \tilde{\epsilon }\), then
where \({\bar{\nabla }}\mathcal {L}_{\epsilon , \nu , \eta }\) is computed using samples in \(\xi _1\) and \(\eta , \chi _{err}>0\) are any given positive constants. Note that everything above is deterministic; that is, our analysis does not depend on a specific iteration sequence \(\{({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\}_t\). Thus, the threshold \(\tilde{\epsilon }\) is deterministic. Let us prove the above statement by contradiction. Without loss of generality, we suppose \(\chi _{err}\le 1\).
Suppose the statement is false. Then there exist a sequence \(\{\epsilon _j, \xi _1^j, \nu _j\}_j\) and an evaluation point sequence \(\{({{\varvec{x}}}_j, \varvec{\mu }_j, {\varvec{\lambda }}_j)\}_j\subseteq \mathcal {X}\times \mathcal {M}\times \varLambda \) such that \(\nu _j\in [{\bar{\nu }}_0, {\tilde{\nu }}]\), \({{\varvec{x}}}_j \in \mathcal {T}_{\nu _j}\), \(\epsilon _j\searrow 0\) and
where \({\bar{\nabla }}\mathcal {L}_{\epsilon _j, \nu _j, \eta }^j\) is computed using samples \(\xi _1^j\), and \(\eta \) and \(\chi _{err}\) are fixed constants. By the compactness condition, we suppose \(({{\varvec{x}}}_j, \varvec{\mu }_j, {\varvec{\lambda }}_j) \rightarrow (\tilde{{{\varvec{x}}}}, \tilde{\varvec{\mu }}, \tilde{{\varvec{\lambda }}}) \in \mathcal {X}\times \mathcal {M}\times \varLambda \) and \(\nu _j \rightarrow \nu \) as \(j \rightarrow \infty \) (otherwise, we can consider a convergent subsequence, which must exist). Noting that \(c_j = c({{\varvec{x}}}_j)\) and \(\varvec{w}_{\epsilon _j, \nu _j}^j = \max \{g({{\varvec{x}}}_j), -\epsilon _jq_{\nu _j}({{\varvec{x}}}_j, {\varvec{\lambda }}_j){\varvec{\lambda }}_j\}\) are bounded due to the compactness of \(({{\varvec{x}}}_j, \varvec{\mu }_j, {\varvec{\lambda }}_j)\) and the boundedness of \(\nu _j\) and \(\epsilon _j\), we have from (B.1) that
Moreover, since \({{\varvec{x}}}_j \in \mathcal {T}_{\nu _j}\), we have \(\sum _{i=1}^r\max \{(g_j)_i, 0\}^3\le \nu _j/2\). Taking the limit \(j\rightarrow \infty \) leads to \(\tilde{{{\varvec{x}}}}\in \mathcal {T}_{\nu }\). Furthermore, by (10), (B.2), and the convergence of \(({{\varvec{x}}}_j, \varvec{\mu }_j, {\varvec{\lambda }}_j)\), we get
which is further simplified as
Suppose \(\tilde{{{\varvec{x}}}} \in \mathcal {X}\backslash \varOmega \) and let \(\mathcal {I}_c(\tilde{{{\varvec{x}}}}) = \{i: 1\le i\le m, c_i(\tilde{{{\varvec{x}}}}) \ne 0\}\), and \(\mathcal {I}_g(\tilde{{{\varvec{x}}}}) = \{i: 1\le i\le r, g_i(\tilde{{{\varvec{x}}}}) >0\}\). By Assumption 4, the set
is nonempty. By Gordan's theorem [26], for any \(a_i, b_i\ge 0\) such that
we have \(a_i = b_i = 0\). Comparing (B.4) with (B.3), and noting that the coefficients of (B.3) are all positive (since \(\tilde{{{\varvec{x}}}}\in \mathcal {T}_{\nu }\)), we immediately obtain a contradiction. Thus, \(\tilde{{{\varvec{x}}}}\in \varOmega \).
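For reference, the theorem of the alternative invoked above can be recorded as follows; this is a standard formulation of Gordan's theorem, and the exact statement in [26] may differ in presentation:

```latex
% Gordan's theorem (theorem of the alternative):
% for a matrix A in R^{n x d}, exactly one of the two systems is solvable.
\textbf{Gordan's theorem.}\quad \text{For } A\in\mathbb{R}^{n\times d},
\text{ exactly one of the following holds:}
\begin{align*}
&\text{(i)}\quad A\mathbf{u} > \mathbf{0} \ \text{ for some } \mathbf{u}\in\mathbb{R}^{d};\\
&\text{(ii)}\quad A^{T}\mathbf{y} = \mathbf{0},\ \ \mathbf{y}\ge \mathbf{0},\ \ \mathbf{y}\ne\mathbf{0}
\ \text{ for some } \mathbf{y}\in\mathbb{R}^{n}.
\end{align*}
```

In the proof above, solvability of the system defined by the index sets \(\mathcal {I}_c(\tilde{{{\varvec{x}}}})\) and \(\mathcal {I}_g(\tilde{{{\varvec{x}}}})\) rules out any nontrivial nonnegative combination of the constraint gradients summing to zero, which forces \(a_i = b_i = 0\).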
By Assumption 4 and following the same reasoning as (A.13), \(M(\tilde{{{\varvec{x}}}})\) is invertible and, particularly, is positive definite. Thus, \(M_j\) is invertible for large enough j. Let us suppose \(\Vert M_j^{-1}\Vert \le \varUpsilon _M\) for some \(\varUpsilon _M>0\). Further, by direct calculation, we have
Thus, we can obtain
Let us focus on \({{\mathcal {H}}}_{2,j}\). We know that
Recalling that \(\sigma _{\min }(\cdot )\) denotes the least singular value of a matrix, by Weyl's inequality,
Since \(\epsilon _j\rightarrow 0\) and \(\varvec{w}_{\epsilon _j, \nu _j}^j \rightarrow 0\) as \(j\rightarrow \infty \) (because \(\tilde{{{\varvec{x}}}}\in \varOmega \)), we know \(\varDelta {{\mathcal {H}}}_{2,j}\rightarrow {{\varvec{0}}}\). In addition, since \(M_j\rightarrow M(\tilde{{{\varvec{x}}}})\) with \(M(\tilde{{{\varvec{x}}}})\) being positive definite, and \(q_{\nu _j}^j \le \nu _j = {\tilde{\nu }}\), we know for some constant \(\varphi >0\) and sufficiently large j,
Now we bound the first term in (B.6). By (10) and the invertibility of \(M_j\), we know
Moreover, by the compactness condition, we have \(\Vert {{\mathcal {H}}}_{1,j}\Vert \le \varUpsilon _1\) and \(\Vert (J_j^T\; G_j^T)\Vert \le \varUpsilon _2\) for some constants \(\varUpsilon _1, \varUpsilon _2>0\). Combining (B.7), (B.8) with (B.6), we have
Noting that \(\varphi _j \rightarrow 0\) as \(j\rightarrow \infty \) (since \(\varvec{w}_{\epsilon _j, \nu _j}^j\rightarrow 0\) and \(\epsilon _j\rightarrow 0\)), we obtain for large j that
which cannot hold because \(\epsilon _j\searrow 0\). This is a contradiction, and thus we complete the proof.
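The singular-value perturbation bound invoked via Weyl's inequality in the proof above can be recorded, in the form needed for the least singular value (assuming the standard statement), as:

```latex
% Weyl's inequality for singular values: each singular value is perturbed
% by at most the spectral norm of the perturbation. In particular, for the
% least singular value,
\sigma_{\min}(A + E) \;\ge\; \sigma_{\min}(A) - \Vert E \Vert,
\qquad A, E \in \mathbb{R}^{n\times n}.
```

Applied with \(A\) the limiting positive definite matrix and \(E = \varDelta {{\mathcal {H}}}_{2,j} \rightarrow {{\varvec{0}}}\), it yields the uniform lower bound on \(\sigma _{\min }\) used for large \(j\).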
B.2 Proof of Lemma 5
The proof closely follows the proof of Lemma 3 in Appendix A.2. We suppress the iteration index t and let \(\xi _1^t\) be any sample set. Our analysis is independent of the sample set \(\xi _1^t\) used for computing \({\bar{\nabla }}\mathcal {L}_{{\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta }^t\), and we will see that the threshold is independent of t. As in Lemma 3, we use \(\varUpsilon _1, \varUpsilon _2, \ldots \) to denote generic constants that are independent of \(({\bar{\epsilon }}_t, {\bar{\nu }}_t, \eta , \gamma _{B}, \gamma _{H})\), whose existence is ensured by the compactness of the iterates.
Following the derivation of (A.4), we have
where \((\bar{{\tilde{\varDelta }}}\varvec{\mu }, \bar{{\tilde{\varDelta }}}{\varvec{\lambda }}_a)\) is the dual solution of (12a) with \(\nabla _{{{\varvec{x}}}}\mathcal {L}\) being replaced by \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}\). Following the derivation of (A.5), there exists \(\varUpsilon _1>0\) such that
Following the derivation of (A.6), there exists \(\varUpsilon _2>0\) such that
Following the derivation of (A.7) by combining (B.9), (B.10), and (B.11), and noting that \(0<q_{{\bar{\nu }}}\le {\bar{\nu }}\le \tilde{\nu }\) where \({\tilde{\nu }}\) is defined in (30), there exists \(\varUpsilon _3>0\) such that
Following the derivation of (A.9), there exists \(\varUpsilon _4>0\) such that
Combining the above display with (B.12) and using the following Young’s inequality
we have
Therefore, as long as
we have
Thus, we can define
which implies (B.13) and completes the proof.
B.3 Proof of Lemma 6
We let \(C_1, C_2,\ldots \) be generic constants that are independent of \((\beta , \alpha _{max}, \kappa _{grad}, \kappa _{f}, p_{grad}, p_{f}, \chi _{grad},\chi _{f})\). These constants may not be consistent with the constants \(C_1, C_2, C_3\) in the statement. However, the existence of \(C_1,C_2, C_3\) in the statement follows directly from our proof.
(a1) By the definition of \(\nabla \mathcal {L}_{\epsilon , \nu , \eta }\) in (10), all quantities that depend on \(\epsilon , \nu \) are independent of the batch samples. We have
By Assumption 3, the definition (9), and the facts that \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t - \nabla _{{{\varvec{x}}}}\mathcal {L}_t = {\bar{\nabla }}f_t - \nabla f_t\) and \({\bar{\nabla }}_{{{\varvec{x}}}}^2\mathcal {L}_t - \nabla _{{{\varvec{x}}}}^2\mathcal {L}_t = {\bar{\nabla }}^2 f_t - \nabla ^2 f_t\), there exists \(C_1>0\) (depending on \(\eta \)) such that
Since \(\Vert \textrm{diag}^2(g_t){\varvec{\lambda }}_t\Vert \le C_2\Vert \max \{g_t, -{\varvec{\lambda }}_t\}\Vert \) for some constant \(C_2>0\), we apply the definition of \({\bar{R}}_t\) in (14) and the uniform boundedness of \({\bar{R}}_t\), and know that the above inequality leads to the statement.
(a2) By the definition of \(\mathcal {L}_{\epsilon ,\nu ,\eta }\) in (8), all quantities that depend on \(\epsilon ,\nu \) are independent of the batch samples. We have
By Assumption 3 and the facts that \({\bar{\mathcal {L}}}_t - \mathcal {L}_t = {\bar{f}}_t-f_t\) and \(\Vert \textrm{diag}^2(g_t){\varvec{\lambda }}_t\Vert \le C_2\Vert \max \{g_t, -{\varvec{\lambda }}_t\}\Vert \), there exists \(C_3>0\) (depending on \(\eta \)) such that
Using \(R_t\le {\bar{R}}_t + \Vert {\bar{\nabla }}f_t-\nabla f_t\Vert \le 2({\bar{R}}_t \vee \Vert {\bar{\nabla }}f_t-\nabla f_t\Vert )\), we prove the statement.
(b) By (10) and Assumption 3, there exists \(C_4>0\) such that
By Theorem 1, we have
Thus, there exists \(C_5>0\) such that
Moreover, there exists \(C_6>0\) such that
and
Combining the above three displays, there exists \(C_7>0\) such that
We deal with the middle term. We know that
Multiplying both sides by \(((J_t{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t)^T\; (G_t{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t)^T)\), we know that there exists \(C_8>0\) such that
Furthermore,
Combining the above display with (B.17), there exists \(C_9>0\) such that
where the second inequality is due to Young’s inequality \(a^{3/4}b^{1/4} \le 3a/4 + b/4\). Thus,
(c) By (10) and using (B.14), (B.15) and (B.16), there exists \(C_{10}>0\) such that
For \({\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\), we have the following decomposition
By Assumptions 3 and 5, we know \(\Vert (I - {{\mathcal {P}}}_{JG}^t){\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t\Vert \le C_{11}\Vert (J_t{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t, G_{t_a}{\bar{\nabla }}_{{{\varvec{x}}}}\mathcal {L}_t)\Vert \) for some constant \(C_{11}>0\). Furthermore, for some constant \(C_{12}>0\), we also have
Combining the last two displays, we have
Moreover, there exists \(C_{13}>0\) such that
Combining (B.20), (B.21), and (B.22) together, we complete the proof.
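The weighted Young's (AM–GM) inequality \(a^{3/4}b^{1/4} \le 3a/4 + b/4\) invoked in the proof above holds for all \(a, b\ge 0\); a quick numerical sanity check (this script is only an illustration, not part of the analysis):

```python
# Numerical sanity check of the weighted Young's inequality
# a^(3/4) * b^(1/4) <= 3a/4 + b/4 for all a, b >= 0,
# a consequence of the weighted AM-GM inequality with weights 3/4 and 1/4.

def young_gap(a: float, b: float) -> float:
    """Return (3a/4 + b/4) - a^(3/4) b^(1/4); nonnegative when a, b >= 0."""
    return 0.75 * a + 0.25 * b - (a ** 0.75) * (b ** 0.25)

# Sweep a grid of nonnegative values; the gap should never be negative
# (up to floating-point round-off). Equality holds at a = b.
grid = [0.0, 1e-3, 0.1, 0.5, 1.0, 2.0, 10.0, 1e3]
min_gap = min(young_gap(a, b) for a in grid for b in grid)
```

The minimum gap over the grid stays nonnegative up to round-off, with equality attained at \(a = b\).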
B.4 Proof of Lemma 7
Analogous to the proof of Lemma 6, we only track the dependence on \((\beta , \alpha _{max}, \kappa _{grad}, \kappa _{f}, p_{grad}, p_{f}, \chi _{grad}, \chi _{f})\). We use \(\varUpsilon _1, \varUpsilon _2, \ldots \) to denote generic constants that are independent of \((\beta , \alpha _{max}, \kappa _{grad}, \kappa _{f}, p_{grad}, p_{f}, \chi _{grad}, \chi _{f})\). Note that \(\varUpsilon _1\) in the proof may not be consistent with \(\varUpsilon _1\) in the statement, while the existence of \(\varUpsilon _1\) in the statement follows directly from our proof.
Let \(\varUpsilon _{\epsilon , \nu , \eta }\) be the upper bound of the generalized Hessian of \(\mathcal {L}_{\epsilon , \nu , \eta }\) in the compact set \((\mathcal {X}\cap \mathcal {T}_{\theta \nu })\times \mathcal {M}\times \varLambda \) (see [50] for the definition of the generalized Hessian). In particular, \(\varUpsilon _{\epsilon , \nu , \eta } = \sup _{(\mathcal {X}\cap \mathcal {T}_{\theta \nu })\times \mathcal {M}\times \varLambda }\Vert \partial ^2\mathcal {L}_{\epsilon , \nu , \eta }\Vert \). Without loss of generality, we suppose \({\tilde{\epsilon }}\) in Theorem 1 satisfies \(\tilde{\epsilon } = {\bar{\epsilon }}_0/\rho ^{\tilde{i}}\) for some integer \(\tilde{i}\). Then, with the definition of \(\tilde{j}\) in (30), we let
and have \(\varUpsilon _{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta } \le \varUpsilon _{{\tilde{\epsilon }}, {\tilde{\nu }}, \eta }\). Noting that \({{\varvec{x}}}_{s_t}, {{\varvec{x}}}_t \in \mathcal {T}_{{\bar{\nu }}_{{\bar{t}}}}\), we apply the Taylor expansion and have
We consider the following two cases.
Case 1. Combining (18) with (19), we have
By (B.10), there exists \(\varUpsilon _1>0\) such that
Furthermore, we have
and thus, by (B.21), (B.22), there exists \(\varUpsilon _2>0\) such that
Plugging (B.25) and (B.26) into (B.23), we have
where \(\varUpsilon _3 = 4\varUpsilon _1\varUpsilon _2/(\gamma _{B}\wedge \eta ) \vee 2\varUpsilon _1^2\varUpsilon _{{\tilde{\epsilon }}, {\tilde{\nu }}, \eta }/(\gamma _{B}\wedge \eta )\).
Case 2. By Lemma 6(b), Lemma 14, (17), and (B.14), there exists \(\varUpsilon _4>0\) such that
Plugging (20) and (B.28) into (B.23), we have
where \(\varUpsilon _5 = \varUpsilon _4\chi _{u}^2 \vee \varUpsilon _{{\tilde{\epsilon }}, {\tilde{\nu }}, \eta }\chi _{u}^3/2\).
Combining (B.27) and (B.29), and letting \(\varUpsilon _6 = \varUpsilon _3\vee \varUpsilon _5\vee 2\), we obtain
By the event \(\mathcal {E}_2^t\), we have
Therefore, as long as
we have
This completes the proof.
B.5 Proof of Lemma 9
Algorithm 1 has three types of steps: a reliable step (Line 19), an unreliable step (Line 21), and an unsuccessful step (Line 24). For each type of step, one of two complementary conditions holds, which distinguishes the “a” and “b” subcases below. Thus, we analyze the following six cases.
Case 1a (reliable step). By Lemma 8, we have
Note that
Combining the above display with (B.26), Lemma 6(c), and using \({\bar{\alpha }}_t\le \alpha _{max}\), there exists \(\varUpsilon _1>0\) such that
Combining the above inequality with (B.31), we have
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\). By the Taylor expansion and \({\bar{\alpha }}_{t+1} \le \rho {\bar{\alpha }}_t\) (Line 18), there exists \(\varUpsilon _2>0\) such that
Combining the above two displays with (31), we obtain
Let
which is further implied by
if we define \(\varUpsilon _3 = (36\rho \varUpsilon _{{\tilde{\epsilon }}, {\tilde{\nu }}, \eta }^2\varUpsilon _2\vee 36\rho \varUpsilon _1^2)/(\gamma _{B}\wedge \eta )\). Then, we obtain
Case 2a (unreliable step). By Lemma 8, we have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.33) still holds. Thus, under (B.35), we have
Case 3a (unsuccessful step). In this case, \(({{\varvec{x}}}_{t+1},\varvec{\mu }_{t+1}, {\varvec{\lambda }}_{t+1}) = ({{\varvec{x}}}_t, \varvec{\mu }_t, {\varvec{\lambda }}_t)\), \({\bar{\alpha }}_{t+1} = {\bar{\alpha }}_t/\rho \), and \(\bar{\delta }_{t+1} = \bar{\delta }_t/\rho \). Thus, we immediately have
Combining (B.36), (B.37), (B.38), and noting that
with the right-hand side being implied by (B.34) and further by (B.35), we know that (B.38) holds in all three cases.
Case 1b (reliable step). By Lemma 8, we have
Note that
Combining the above display with (B.28) and using \({\bar{\alpha }}_t\le \alpha _{max}\), there exists \(\varUpsilon _4>0\) such that
Combining the above three displays,
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\). By the Taylor expansion and \({\bar{\alpha }}_{t+1} \le \rho {\bar{\alpha }}_t\) (Line 18),
Combining the above two displays,
Let
which is implied by (B.35) if we re-define \(\varUpsilon _3 \leftarrow \varUpsilon _3\vee 12\rho \varUpsilon _{{\tilde{\epsilon }}, {\tilde{\nu }},\eta }^2\chi _{u}^3\vee 12\rho \varUpsilon _4^2\chi _{u}\). Then,
Case 2b (unreliable step). By Lemma 8, we have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.40) still holds. Thus, under (B.35), we have
Case 3b (unsuccessful step). In this case, (B.38) holds. Combining (B.42), (B.43), (B.38), and noting that
as implied by (B.41) and further by (B.35), we know that (B.38) holds in all three cases. In summary, under (B.35), (B.38) holds in all six cases. This completes the proof.
B.6 Proof of Lemma 10
The proof follows the proof of Lemma 9, except that (B.32) and (B.39) do not hold due to \((\mathcal {E}_1^t)^c\). We consider the following six cases.
Case 1a (reliable step). By Lemma 8, we have
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\), while (B.33) still holds. By the condition of \(\omega \) in (B.34) and (B.35), we know that under (32) (which implies (B.35)),
Case 2a (unreliable step). By Lemma 8, we have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.33) still holds. Thus, under (32),
Case 3a (unsuccessful step). In this case, (B.38) holds. Combining (B.44), (B.45), and (B.38), we have
Case 1b (reliable step). By Lemma 8, we have
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\), while (B.40) still holds. By the condition of \(\omega \) in (B.41), we know that under (32) (which implies (B.41)),
Case 2b (unreliable step). By Lemma 8, we have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.40) still holds. Thus, under (32),
Case 3b (unsuccessful step). In this case, (B.38) holds. Combining (B.47), (B.48), and (B.38), we note that (B.46) holds as well. Thus, (B.46) holds in all six cases. This completes the proof.
B.7 Proof of Lemma 11
The proof follows the proof of Lemma 10, except that Lemma 8 is not applicable. We consider the following six cases.
Case 1a (reliable step). We have
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\), while (B.33) still holds. By the condition of \(\omega \) in (B.34) and (B.35), we know that under (32) (which implies (B.35)),
Case 2a (unreliable step). We have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.33) still holds. Thus, under (32),
Case 3a (unsuccessful step). In this case, (B.38) holds. Combining (B.49), (B.50), and (B.38), we obtain
Case 1b (reliable step). We have
By Line 20 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = (\rho -1)\bar{\delta }_t\), while (B.40) still holds. By the condition of \(\omega \) in (B.41), we know that under (32) (which implies (B.41)),
Case 2b (unreliable step). We have
By Line 22 of Algorithm 1, \(\bar{\delta }_{t+1} - \bar{\delta }_t = -(1-1/\rho )\bar{\delta }_t\), while (B.40) still holds. Thus, under (32),
Case 3b (unsuccessful step). In this case, (B.38) holds. Combining (B.52), (B.53), and (B.38), we note that (B.51) holds as well. Thus, (B.51) holds in all six cases. This completes the proof.
B.8 Proof of Theorem 3
We suppose there are infinitely many successful steps. Otherwise, \({\bar{\alpha }}_t\) decreases to zero (cf. Line 25 of Algorithm 1) and the argument holds trivially. We use \({\bar{t}}<t_1<t_2<\ldots \) to denote the subsequence such that \(t_i-1\), \(\forall i\ge 1\), is a successful step. By Lemma 14, Lemma 6(b), and (B.14), there exist \(\varUpsilon _1, \varUpsilon _2>0\) such that for any \(i\ge 1\),
Since \(t_i\ge {\bar{t}}+1\), the two parameters \({\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}\) are fixed conditional on any \(\sigma \)-algebra \(\mathcal {F}\supseteq \mathcal {F}_{{\bar{t}}}\). Thus, for any \(i\ge 1\),
Combining the above two displays, we know there exists \(\varUpsilon _3>0\) such that
which implies
On the other hand, by Theorem 2, we sum up the error recursion for \(t\ge {\bar{t}}+1\), take conditional expectation on \(\mathcal {F}_{{\bar{t}}}\), and have
Thus, applying Fubini's theorem to exchange the summation and expectation, we know that \(\mathbb {E}[\limsup \nolimits _{t\rightarrow \infty } {\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2 + \bar{\delta }_t\mid \mathcal {F}_{{\bar{t}}}]=0 \). Since \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2 + \bar{\delta }_t\) is non-negative, we further obtain \({\bar{\alpha }}_t\Vert \nabla \mathcal {L}_{{\bar{\epsilon }}_{{\bar{t}}}, {\bar{\nu }}_{{\bar{t}}}, \eta }^t\Vert ^2 + \bar{\delta }_t \rightarrow 0\) as \(t\rightarrow \infty \) almost surely. By (B.54), we have \({\bar{\alpha }}_{t_i}R_{t_i}^2\rightarrow 0\) as \(i\rightarrow \infty \). Noting that \({\bar{\alpha }}_t R_t^2\le {\bar{\alpha }}_{t_i}R_{t_i}^2\) for any \(t_i\le t<t_{i+1}\), we complete the proof.
B.9 Proof of Theorem 4
We adapt the proof of [40, Theorem 4]. By Theorem 3, it suffices to show that the “limsup” of the random stepsize sequence \(\{{\bar{\alpha }}_t\}_t\) is lower bounded away from zero. To show this, we define two stepsize sequences as follows. For any \(t> {\bar{t}}+1\), we let
and let \(\phi _{{\bar{t}}+1} = \varphi _{{\bar{t}}+1} = \log ({\bar{\alpha }}_{{\bar{t}}+1})\). Here, c is a deterministic constant such that
and \(c = \rho ^{-i}\alpha _{max}\) for some \(i>0\). The first condition comes from Lemma 7. We aim to show \(\phi _t\ge \varphi _t\), \(\forall t\ge {\bar{t}}+1\).
First, we note that by the stepsize specification in Lines 18 and 25 of Algorithm 1 (Line 13 is not performed since \(t\ge {\bar{t}}+ 1\)), \({\bar{\alpha }}_t = \rho ^{j_t}c\) for some integer \(j_t\). Second, we note that \(\phi _t\) and \(\varphi _t\) are both \(\mathcal {F}_{t-1}\)-measurable, that is, they are fixed conditional on \(\mathcal {F}_{t-1}\). Third, we show that \(\phi _t\ge \varphi _t\) by induction. Note that \(\phi _{{\bar{t}}+1} = \varphi _{{\bar{t}}+1}\). Suppose \(\phi _t\ge \varphi _t\), we consider the following three cases.
(a) If \(\phi _t>\log (c)\), then \(\phi _t\ge \log (c) + \log (\rho )\). Thus, \(\phi _{t+1} \ge \phi _t - \log (\rho ) \ge \log (c) \ge \varphi _{t+1}\).
(b) If \(\phi _t\le \log (c)\) and \({{\varvec{1}}}_{\mathcal {E}_1^t\cap \mathcal {E}_2^t} = 1\), then Lemma 7 leads to
(c) If \(\phi _t\le \log (c)\) and \({{\varvec{1}}}_{\mathcal {E}_1^t\cap \mathcal {E}_2^t} = 0\), then
Combining the above three cases, we have \(\phi _t\ge \varphi _t\), \(\forall t\ge {\bar{t}}+1\). Note that, conditional on \(\mathcal {F}_{{\bar{t}}}\), \(\{\varphi _t\}_{t\ge {\bar{t}}+1}\) is a random walk with a maximum and an upward drift (cf. [24, Example 6.1.2]). Thus, \(\limsup _{t\rightarrow \infty }\varphi _t\ge \log (c)\) almost surely. In particular, we have
which means that the “limsup” of \({\bar{\alpha }}_t\) is lower bounded almost surely. Using Theorem 3, we complete the proof.
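The random-walk comparison at the end of the proof can be illustrated numerically: a walk capped at a ceiling \(\log (c)\), moving up by \(\log (\rho )\) with probability \(p > 1/2\) and down by \(\log (\rho )\) otherwise, returns to the ceiling infinitely often, so its limsup equals the ceiling almost surely. A minimal Monte Carlo sketch; the parameters p, log_rho, log_c below are illustrative choices, not values from the paper:

```python
import math
import random

# Illustration of the comparison process {varphi_t} in the proof above:
# a random walk capped at a ceiling log(c), which moves up by log(rho)
# with probability p > 1/2 and down by log(rho) otherwise.

def tail_maximum(p: float, log_rho: float, log_c: float,
                 steps: int, seed: int = 0) -> float:
    """Simulate the capped walk and return its maximum over the tail."""
    rng = random.Random(seed)
    phi = log_c - 10 * log_rho  # start well below the ceiling
    tail_max = -math.inf
    for t in range(steps):
        if rng.random() < p:
            phi = min(log_c, phi + log_rho)  # up-step, capped at the ceiling
        else:
            phi -= log_rho                   # down-step
        if t > steps // 2:  # record the maximum over the tail of the walk
            tail_max = max(tail_max, phi)
    return tail_max

# With an upward drift (p > 1/2), the tail maximum hits the ceiling log(c) = 0.
tail_max = tail_maximum(p=0.7, log_rho=0.1, log_c=0.0, steps=20000)
```

With the drift upward, the tail maximum coincides with the ceiling, mirroring \(\limsup _{t\rightarrow \infty }\varphi _t\ge \log (c)\) almost surely.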
B.10 Proof of Theorem 5
Suppose \(\limsup _{t\rightarrow \infty }R_t = \epsilon >0\). By Theorem 4, we know there exist two sequences \(\{n_i\}_i\) and \(\{m_i\}_i\) with \(n_i<m_i<n_{i+1}\) for all i, such that
For each interval \([n_i, m_i]\), we use \(\{t_{i,j}\}_{j=1}^{J_i}\) to denote a subsequence within the interval such that \(n_i= t_{i,1}<\ldots< t_{i,j}< \ldots <t_{i,J_i} = m_i\) and \(t_{i,j}-1\) is a successful step. In other words, \(t_{i,j}\) is the first index at which we arrive at a new point. Here, we suppose \(n_i-1\) is a successful step; that is, the index \(n_i\) is the first time we arrive at the point \(({{\varvec{x}}}_{n_i}, \varvec{\mu }_{n_i}, {\varvec{\lambda }}_{n_i})\) (one can always choose \(n_i\) to satisfy this condition). We also note that \(t_{i,J_i} = m_i\) because \(R_{m_i-1}\ge \epsilon /3\) while \(R_{m_i}<\epsilon /3\). With this notation, there exist \(\varUpsilon _1, \varUpsilon _2>0\) such that
Let us define the set \({{\mathcal {T}}}= \{t: t-1 \text { is successful and } R_t\ge \epsilon /3\}\). We can see from (B.54) and (B.55) that \(\sum _{t\in {{\mathcal {T}}}}{\bar{\alpha }}_t<\infty \). This contradicts (B.56) since \(\sum _{t\in {{\mathcal {T}}}}{\bar{\alpha }}_t \ge \sum _i\sum _{j=1}^{J_i-1}{\bar{\alpha }}_{t_{i,j}}{\mathop {=}\limits ^{B.56}}\infty \). Thus, \(\limsup _{t\rightarrow \infty }R_t=0\), which completes the proof.
C Auxiliary lemmas
Lemma 14
Let \(\epsilon , \nu >0\) and \(({{\varvec{x}}}, {\varvec{\lambda }})\in \mathcal {T}_{\nu } \times \mathbb {R}^r\). Then
Proof
To prove Lemma 14, we require the following lemma.
Lemma 15
For any two scalars a, b and a scalar \(c>0\), \(|\max \{a, b\}| \le \frac{1}{c\wedge 1}|\max \{a, cb\}|\).
Proof
Without loss of generality, we assume \(b\ne 0\) and \(c\ne 1\). We consider four cases.
Case 1: \(b>0\), \(c<1\). If \(a \le cb< b\), then \(|\max \{a, b\}| = b = \frac{1}{c}|\max \{a, cb\}|\). If \(cb<a\le b\), then \(|\max \{a, b\}| = b \le \frac{1}{c}a = \frac{1}{c}|\max \{a, cb\}|\). If \(cb<b<a\), then \(|\max \{a, b\}| = a\le \frac{1}{c}|\max \{a, cb\}|\). Thus, the result holds.
Case 2: \(b>0\), \(c>1\). If \(a \le b <cb\), then \(|\max \{a, b\} |= b \le cb = |\max \{a, cb\}|\). If \(b<a\le cb\), then \(|\max \{a, b\}| = a \le cb = |\max \{a, cb\}|\). If \(b<cb<a\), then \(|\max \{a, b\}| = a = |\max \{a, cb\}|\). Thus, the result holds.
Case 3: \(b<0\), \(c<1\). If \(a \le b < cb\), then \(|\max \{a, b\}| = |b| = \frac{1}{c}|\max \{a, cb\}|\). If \(b<a \le cb\), then \(|\max \{a, b\}| = |a| \le |b| = \frac{1}{c}|\max \{a, cb\}|\). If \(b<cb< a\), then \(|\max \{a, b\}| = |a| \le \frac{|a|}{c} = \frac{1}{c}|\max \{a, cb\}|\). Thus, the result holds.
Case 4: \(b<0\), \(c>1\). If \(a \le cb<b\), then \(|\max \{a, b\}| = |b| \le c|b| = |\max \{a, cb\}|\). If \(cb< a \le b\), then \(|\max \{a, b\}| = |b| \le |a| = |\max \{a, cb\}|\). If \(cb<b<a\), then \(|\max \{a, b\}| = |a| = |\max \{a, cb\}|\). Thus, the result holds.
Combining the above four cases, we complete the proof. \(\square \)
Since \(\epsilon , \nu >0\), \(({{\varvec{x}}}, {\varvec{\lambda }})\in \mathcal {T}_{\nu }\times \mathbb {R}^r\), and \(q_\nu ({{\varvec{x}}}, {\varvec{\lambda }})>0\), we have for any \(i\in \{1,2,\ldots , r\}\),
where both inequalities are from Lemma 15. Taking the \(\ell _2\) norm on both sides, we finish the proof.
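Lemma 15 is elementary but easy to sanity-check numerically; the sweep below (an illustrative script, not part of the analysis) verifies the bound \(|\max \{a, b\}| \le \frac{1}{c\wedge 1}|\max \{a, cb\}|\) over a grid of signed values of a, b and positive values of c:

```python
# Numerical sanity check of Lemma 15:
# |max(a, b)| <= (1 / min(c, 1)) * |max(a, c*b)| for any scalars a, b and c > 0.

def lemma15_holds(a: float, b: float, c: float, tol: float = 1e-12) -> bool:
    """Check the Lemma 15 inequality for one triple (a, b, c)."""
    lhs = abs(max(a, b))
    rhs = abs(max(a, c * b)) / min(c, 1.0)
    return lhs <= rhs + tol

# Sweep signed values of a, b and positive values of c, covering all four
# sign/scale cases treated in the proof of Lemma 15.
vals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
cs = [0.1, 0.5, 1.0, 2.0, 10.0]
ok = all(lemma15_holds(a, b, c) for a in vals for b in vals for c in cs)
```

The check passes over the whole grid, consistent with the four-case argument above.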
D Auxiliary experiments
We follow the experiments in Sect. 4 and provide additional results. We first examine three proportions: (1) the proportion of the iterations with failed SQP steps, (2) the proportion of the iterations with unstabilized penalty parameters, and (3) the proportion of the iterations with a triggered feasibility error condition. We then investigate multiplicative noise and apply the method to an inequality constrained logistic regression problem.
Failed SQP steps. Figure 5 plots the proportion of the iterations with failed SQP steps. From the figure, we see that the proportion varies from \(10\%\) to \(60\%\) across the problems, and AdapNewton tends to have a smaller proportion than AdapGD. Although the proportion does not exhibit a clear dependence on the variance \(\sigma ^2\), the noticeable proportion of failed SQP steps illustrates the differences between equality and inequality constrained problems. As analyzed in Sect. 2, the active-set SQP steps may not be informative if the identified active set differs substantially from the true active set. Due to the potential failure of the SQP steps, utilizing a safeguarding direction is critical to achieving global convergence for the algorithm.
Non-stationary penalty parameters. Figure 6 plots the proportion of the iterations with unstabilized penalty parameters, i.e., the index of the last iteration at which we update \({\bar{\epsilon }}_0\) divided by the total number of iterations. From the figure, we observe that the proportion varies from \(20\%\) to \(70\%\), and AdapNewton and AdapGD have comparable results. In fact, the proportion highly depends on the adopted initial \({\bar{\epsilon }}_0\) and the updating rule of \({\bar{\epsilon }}_0\). For example, a large \(\rho \) and a small \({\bar{\epsilon }}_0\) will reduce the proportion significantly; and the updating rules \({\bar{\epsilon }}_0\leftarrow {\bar{\epsilon }}_0/\rho \) and \({\bar{\epsilon }}_0\leftarrow \exp (-1/{\bar{\epsilon }}_0)\) will also lead to different proportions. The large variation in Fig. 6 suggests that different problems stabilize \({\bar{\epsilon }}_0\) at different levels; thus, a problem-dependent tuning of \({\bar{\epsilon }}_0\) is desired in practice. We note in the experiments that the results on some problems can be improved if \({\bar{\epsilon }}_0=10^{-4}\), while such a setup may not be suitable for other problems. Thus, designing a robust scheme to select the penalty parameters deserves further study.
Feasibility error condition. Figure 7 plots the proportion of the iterations with a triggered feasibility error condition. We do not show the results for the different setups of \(\chi _{err}\). In fact, when \(\chi _{err}=1\), the results are identical to those with \(C =2\) and \(\kappa =2\) (see the left column of Fig. 7), while when \(\chi _{err} = 10\) or 100, the feasibility error condition is never triggered. From Fig. 7, we see that the proportion is extremely small (e.g., as small as \(1\%\)). This suggests that the condition (17) is hardly triggered in practice. Figure 7 also plots the proportion of the iterations at which (17) is triggered for an unsuccessful step. We see that such a proportion is even smaller (e.g., less than \(0.5\%\)). Given these negligible proportions, we conclude that the condition (17) does not negatively affect the performance of the designed StoSQP scheme.
Multiplicative noise. We also investigate multiplicative noise in the experiments. In particular, we employ the default setup \((C, \kappa , \chi _{err}) = (2,2,1)\) but replace the noise variance \(\sigma ^2\) by \((1 +\Vert {{\varvec{x}}}_t\Vert ^2)\sigma ^2\). Thus, the variance scales linearly with the squared magnitude of the (primal) iterate. The KKT residual and sample size boxplots are shown in Fig. 8. Compared to Figs. 1 and 2, we see that the algorithm achieves results comparable to those under additive noise. This observation is expected because, regardless of the noise type, the algorithm enforces the same stochastic conditions on the model estimation accuracy in each iteration, and adaptively selects the batch sizes, which are mainly characterized by the current KKT residual.
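To make the multiplicative-noise setup concrete, the sketch below generates a gradient estimate whose per-entry noise variance is \((1 +\Vert {{\varvec{x}}}\Vert ^2)\sigma ^2\), as described above. The function name, the use of an entrywise Gaussian perturbation, and the quadratic test objective are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def noisy_gradient(grad_f, x, sigma2, rng=random):
    """Illustrative sketch: perturb each entry of the true gradient
    grad_f(x) by Gaussian noise with variance (1 + ||x||^2) * sigma2,
    so the noise scales with the magnitude of the (primal) iterate.
    This is an assumed construction, not the paper's code."""
    g = grad_f(x)
    scale = math.sqrt((1.0 + sum(xi * xi for xi in x)) * sigma2)
    return [gi + rng.gauss(0.0, scale) for gi in g]

# Usage with a quadratic objective f(x) = 0.5 ||x||^2, whose gradient is x.
grad = noisy_gradient(lambda x: list(x), [3.0, 4.0], sigma2=1e-4,
                      rng=random.Random(0))
```

For a small \(\sigma ^2\), the estimate stays close to the true gradient, while the noise level grows with \(\Vert {{\varvec{x}}}\Vert \), matching the experimental setup.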
Logistic regression problem. We study an inequality constrained logistic regression problem, where we let
We set \(d = 10\), \(r= 5\), and generate each entry of the matrix \(C\in \mathbb {R}^{5\times 10}\) and the vector \(q\in \mathbb {R}^5\) from the standard Gaussian distribution. We let \(\xi _b\) be a Rademacher variable (i.e., taking values in \(\{-1,1\}\) with equal probability), and consider different design distributions for \(\xi _{{{\varvec{a}}}}\). In particular, we consider a light-tail design \((\xi _{{{\varvec{a}}}})_i\sim {{\mathcal {N}}}(0, \sigma _{{{\varvec{a}}}}^2)\) with \(\sigma _{{{\varvec{a}}}}^2\in \{10^{-8}, 10^{-4}, 10^{-2}\}\), and a heavy-tail design \((\xi _{{{\varvec{a}}}})_i\sim \text {Exp}(\lambda _{{{\varvec{a}}}})\) with \(\lambda _{{{\varvec{a}}}}\in \{10,10^2,10^4\}\). Note that \(\text {Exp}(\lambda _{{{\varvec{a}}}})\) has variance \(1/\lambda _{{{\varvec{a}}}}^2\). For each design, we run AdapNewton and AdapGD 20 times. The default algorithm setup is the same as in Sect. 4.
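The data-generating setup described above can be sketched as follows. This is a hedged reconstruction: the variable names, the standard logistic loss, and the linear constraint form \(g({{\varvec{x}}}) = C{{\varvec{x}}} - q \le 0\) are assumptions based on the description, not the paper's exact formulation:

```python
import math
import random

rng = random.Random(42)
d, r = 10, 5  # dimension and number of inequality constraints, as in the text

# Constraint data: entries of C in R^{r x d} and q in R^r are standard Gaussian.
C = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]
q = [rng.gauss(0.0, 1.0) for _ in range(r)]

def sample_light_tail(sigma2_a: float):
    """One sample (a, b): Gaussian features, Rademacher label."""
    a = [rng.gauss(0.0, math.sqrt(sigma2_a)) for _ in range(d)]
    b = rng.choice([-1.0, 1.0])  # Rademacher: +/-1 with equal probability
    return a, b

def logistic_loss(x, a, b):
    """Standard realized logistic objective log(1 + exp(-b * a^T x))
    (an assumed form of F(x; xi) consistent with the description)."""
    z = -b * sum(ai * xi for ai, xi in zip(a, x))
    return math.log1p(math.exp(z))

def g(x):
    """Assumed inequality constraints g(x) = Cx - q <= 0."""
    return [sum(Ci[j] * x[j] for j in range(d)) - qi for Ci, qi in zip(C, q)]

a, b = sample_light_tail(sigma2_a=1e-2)
loss = logistic_loss([0.0] * d, a, b)  # at x = 0, the loss equals log(2)
```

Swapping `rng.gauss` for an exponential draw gives the heavy-tail design; the rest of the setup is unchanged.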
Figure 9 shows the KKT residual boxplots. From the figure, we observe that AdapNewton performs slightly better than AdapGD. Both methods achieve reasonable performance on all setups of the two designs, although both perform better on the Gaussian design, which has a lighter tail than the exponential design. Overall, the experiments demonstrate the effectiveness of the proposed algorithm.
Na, S., Anitescu, M. & Kolar, M. Inequality constrained stochastic nonlinear optimization via active-set sequential quadratic programming. Math. Program. 202, 279–353 (2023). https://doi.org/10.1007/s10107-023-01935-7
Keywords
- Inequality constraints
- Stochastic optimization
- Exact augmented Lagrangian
- Sequential quadratic programming