Inequality constrained stochastic nonlinear optimization via active-set sequential quadratic programming

We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. Global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al. (Math Program, 2022. https://doi.org/10.1007/s10107-022-01846-z). Specifically, we allow nonlinear inequality constraints without requiring the strict complementarity condition; refine some of the designs in Na et al. (2022), such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in the CUTEst test set and on constrained logistic regression problems.


Introduction
We study stochastic nonlinear optimization problems with deterministic equality and inequality constraints: where $f: \mathbb{R}^d \to \mathbb{R}$ is an expected objective, $c: \mathbb{R}^d \to \mathbb{R}^m$ are deterministic equality constraints, $g: \mathbb{R}^d \to \mathbb{R}^r$ are deterministic inequality constraints, $\xi \sim \mathcal{P}$ is a random variable following the distribution $\mathcal{P}$, and $F(\cdot; \xi): \mathbb{R}^d \to \mathbb{R}$ is a realized objective. In the stochastic optimization regime, the direct evaluation of $f$ and its derivatives is not accessible. Instead, it is assumed that one can generate independent and identically distributed samples $\{\xi_i\}_i$ from $\mathcal{P}$, and estimate $f$ and its derivatives based on the realizations $\{F(\cdot; \xi_i)\}_i$. Problem (1) appears widely in a variety of industrial applications including finance, transportation, manufacturing, and power systems [8,56]. It includes constrained empirical risk minimization (ERM) as a special case, where $\mathcal{P}$ can be regarded as a uniform distribution over $n$ data points $\{\xi_i = (y_i, z_i)\}_{i=1}^n$, with $(y_i, z_i)$ being the feature-outcome pairs. Thus, the objective has a finite-sum form. The goal of (1) is to find the optimal parameter $x^\star$ that fits the data best. One of the most common choices of $F$ is the negative log-likelihood of the underlying distribution of $(y_i, z_i)$. In this case, the optimizer $x^\star$ is called the maximum likelihood estimator (MLE). Constraints on parameters are also common in practice; they are used to encode prior model knowledge or to restrict model complexity. For example, [29,30] studied inequality constrained least-squares problems, where inequality constraints maintain structural consistency such as non-negativity of the elasticities. [41,44] studied statistical properties of constrained MLEs, where constraints characterize the parameter space of interest.
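To make the sampling setup concrete, the following sketch (our own illustration, not from the paper) estimates the finite-sum gradient from a minibatch for a toy realized least-squares loss $F(x; (y, z)) = \tfrac12(y^T x - z)^2$; the function name and the synthetic data are purely illustrative.

```python
import numpy as np

def sample_gradient(x, data, batch_size, rng):
    """Minibatch estimate of the finite-sum gradient (1/n) * sum_i grad F(x; xi_i)
    for the toy realized loss F(x; (y, z)) = 0.5 * (y @ x - z)**2."""
    ys, zs = data
    idx = rng.choice(len(zs), size=batch_size, replace=False)
    residuals = ys[idx] @ x - zs[idx]          # per-sample residuals y_i^T x - z_i
    return ys[idx].T @ residuals / batch_size  # average of (y_i^T x - z_i) * y_i

rng = np.random.default_rng(0)
n, d = 200, 3
ys = rng.normal(size=(n, d))
zs = ys @ np.ones(d)                   # noiseless targets with x_true = (1, 1, 1)
x0 = np.zeros(d)
g_full = ys.T @ (ys @ x0 - zs) / n     # exact finite-sum gradient at x0
g_batch = sample_gradient(x0, (ys, zs), batch_size=n, rng=rng)  # full batch recovers it
```

With a batch size below $n$, the estimate is unbiased but noisy; with the full batch, it coincides with the exact finite-sum gradient.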
More recently, a growing literature on training constrained neural networks has been reported [15,24,31,32], where constraints are imposed to avoid weights either vanishing or exploding, and objectives are in the above finite-sum form. This paper aims to develop a numerical procedure to solve (1) with a global convergence guarantee. When the objective $f$ is deterministic, numerous nonlinear optimization methods with well-understood convergence results are applicable, such as exact penalty methods, augmented Lagrangian methods, sequential quadratic programming (SQP) methods, and interior-point methods [40]. However, methods to solve constrained stochastic nonlinear problems with satisfactory convergence guarantees have been developed only recently. In particular, with only equality constraints, [4] designed one of the first stochastic SQP (StoSQP) schemes using an $\ell_1$-penalized merit function, and showed that for any initialization, the KKT residuals $\{R_t\}_t$ converge in two different regimes, determined by a prespecified deterministic stepsize-related sequence $\{\alpha_t\}_t$: (a) (constant sequence) if $\alpha_t = \alpha$ for some small $\alpha > 0$, then $\sum_{i=0}^{t-1}\mathbb{E}[R_i^2]/t \le \Upsilon/(\alpha t) + \Upsilon\alpha$ for some $\Upsilon > 0$; (b) (decaying sequence) if $\alpha_t$ satisfies $\sum_{t=0}^{\infty}\alpha_t = \infty$ and $\sum_{t=0}^{\infty}\alpha_t^2 < \infty$, then $\liminf_{t\to\infty}\mathbb{E}[R_t^2] = 0$. Both convergence regimes are well known for unconstrained stochastic problems, where $R_t = \|\nabla f(x_t)\|$ (see [12] for a recent review), and [4] generalized the results to equality constrained problems. Within the algorithm of [4], the authors designed a stepsize selection scheme (based on the prespecified deterministic sequence) to bring some adaptivity into the algorithm. However, it turns out that the prespecified sequence, which can be aggressive or conservative, still strongly affects the performance.
To address the adaptivity issue, [39] proposed an alternative StoSQP, which exploits a differentiable exact augmented Lagrangian merit function and enables a stochastic line search procedure to adaptively select the stepsize. Under a different setup (where the model is precisely estimated with high probability), [39] proved a different guarantee: for any initialization, $\liminf_{t\to\infty} R_t = 0$ almost surely. Subsequently, a series of extensions have been reported. [3] designed a StoSQP scheme to deal with rank-deficient constraints. [18] designed a StoSQP that exploits inexact Newton directions. [6] designed an accelerated StoSQP via variance reduction for finite-sum problems. [5] further developed [4] to achieve adaptive sampling. [17] established the worst-case iteration complexity of StoSQP, and [38] established the asymptotic local rate of StoSQP and performed statistical inference. In addition, [42] investigated a deterministic SQP where the objective and constraints are evaluated with noise. However, none of the aforementioned works includes inequality constraints.
Our paper develops this line of research by designing a StoSQP method that works with nonlinear inequality constraints. In order to do so, we have to overcome a number of intrinsic difficulties that arise in dealing with inequality constraints, which were already noted in the classical nonlinear optimization literature [7,40]. Our work is built upon [39], where we exploited an augmented Lagrangian merit function under the SQP framework. We enhance some of the designs in [39] (e.g., the feasibility error condition, the increasing batch size, and the complexity of Hessian sampling; more on these later), and the analysis of this paper is more involved. To generalize [39], we address the following two subtleties. (a) With inequalities, SQP subproblems are inequality constrained (nonconvex) quadratic programs (IQPs), which are themselves difficult to solve in most cases. Some SQP literature (e.g., [10]) assumes that a QP solver is applied to solve IQPs exactly; however, a practical scheme should embed a finite number of inner-loop iterations of active-set methods or interior-point methods into the main SQP loop, to solve IQPs approximately. The inner loop may then introduce an approximation error for the search direction in each iteration, which complicates the analysis. (b) When applied to deterministic objectives with inequalities, the SQP search direction is a descent direction of the augmented Lagrangian only in a neighborhood of a KKT point [49, Propositions 8.3, 8.4]. This is in contrast to equality constrained problems, where the descent property of the SQP direction holds globally, provided the penalty parameters of the augmented Lagrangian are suitably chosen. Such a difference is indeed brought by inequality constraints: to make the (active-set) SQP direction informative, the estimated active set has to be close to the optimal active set (see Lemma 3 for details). Thus, simply changing the merit function in [39] does not work for Problem (1).
The existing literature on inequality constrained SQP has addressed (a) and (b) via various tools for deterministic objectives, while we provide new insights for stochastic objectives. To resolve (a), we design an active-set StoSQP scheme where, given the current iterate, we first identify an active set that includes all inequality constraints that are likely to be equalities. We then obtain the search direction by solving an SQP subproblem, in which we include all inequality constraints in the identified active set but regard them as equalities. In this case, the subproblem is an equality constrained QP (EQP), and can be solved exactly provided the matrix factorization is within the computational budget. To resolve (b), we provide a safeguarding direction for the scheme. In each step, we check whether the SQP subproblem is solvable and generates a descent direction of the augmented Lagrangian merit function. If yes, we keep the SQP direction, as it typically enjoys a fast local rate; if no, we switch to the safeguarding direction (e.g., one gradient/Newton step of the augmented Lagrangian), along which the iterates still decrease the augmented Lagrangian, although the convergence may not be as effective as that of SQP.
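The accept-or-fallback rule above can be sketched in a few lines. This is our own illustrative reduction with hypothetical names: `grad_merit` stands in for the merit-function gradient $\nabla\mathcal{L}_{\epsilon,\nu,\eta}$, and the safeguard is taken to be a plain steepest-descent step of the merit function.

```python
import numpy as np

def choose_direction(grad_merit, sqp_dir, c=1e-8):
    """Keep the SQP direction when it is a sufficient descent direction of the
    merit function; otherwise fall back to the safeguarding step (here: one
    steepest-descent step of the merit function). All names are illustrative."""
    if sqp_dir is not None and grad_merit @ sqp_dir < -c * (sqp_dir @ sqp_dir):
        return sqp_dir, "sqp"
    return -np.asarray(grad_merit), "safeguard"

g = np.array([1.0, -2.0])
d_desc, tag_desc = choose_direction(g, np.array([-1.0, 2.0]))  # g.d = -5: keep SQP
d_safe, tag_safe = choose_direction(g, np.array([1.0, 0.0]))   # g.d = +1: fallback
d_none, tag_none = choose_direction(g, None)                   # unsolvable subproblem
```

Passing `None` models the case where the SQP system is not solvable (e.g., a singular KKT matrix), in which case the scheme searches along the safeguarding direction.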
Furthermore, to design a scheme that adaptively selects the penalty parameters and stepsizes for Problem (1), additional challenges have to be resolved. In particular, we know that there are unknown deterministic thresholds for the penalty parameters that ensure a one-to-one correspondence between the stationary points of the merit function and the KKT points of Problem (1). However, due to the stochasticity of the scheme, the stabilized penalty parameters are random. We are unsure whether the stabilized values are above (or below, depending on the context) the thresholds. Thus, we cannot directly conclude that the iterates converge to a KKT point, even if we ensure a sufficient decrease of the merit function in each step and enforce the iterates to converge to one of its stationary points.
The above difficulty has been resolved for the $\ell_1$-penalized merit function in [4], where the authors imposed a probability condition on the noise (satisfied by symmetric noise; see [4, Proposition 3.16]). [39] resolved this difficulty for the augmented Lagrangian merit function by modifying the SQP scheme when selecting the penalty parameters. In particular, [39] required the feasibility error to be bounded by the gradient magnitude of the augmented Lagrangian in each step, and generated monotonically increasing sample sizes to estimate the gradient. Although that analysis does not require noise conditions, adjusting the penalty parameters to enforce the feasibility error condition may not be necessary for iterates that are far from stationarity. Also, generating increasing samples is not satisfactory, since the sample size should be adaptively chosen based on the iterates. In this paper, we refine the techniques of [39] and generalize them to inequality constraints. We weaken the feasibility error condition by using a (large) multiplier to rescale the augmented Lagrangian gradient and, more significantly, enforcing it only when the magnitude of the rescaled augmented Lagrangian gradient is smaller than the estimated KKT residual. In other words, the feasibility error condition is imposed only when we have stronger evidence that the iterate is approaching a stationary point than approaching a KKT point. Such a relaxation matches the motivation of the feasibility error condition, i.e., bridging the gap between stationary points and KKT points. We also remove the increasing sample size requirement by adaptively controlling the absolute deviation of the augmented Lagrangian gradient for new iterates only (i.e., when the previous step is a successful step; see Section 3). Following [39], we perform a stochastic line search procedure.
However, instead of using the same sample set to estimate the gradient $\nabla f$ and Hessian $\nabla^2 f$ as in [39], we sharpen the analysis and show that significantly fewer samples are needed for $\nabla^2 f$ than for $\nabla f$.
With all of the above extensions of [39], we finally prove that the KKT residual $R_t$ satisfies $\lim_{t\to\infty} R_t = 0$ almost surely for any initialization. Such a result is stronger than [43, Theorem 4.10] for unconstrained problems and [39, Theorem 4] for equality constrained problems, which only showed the "liminf" type of convergence. Our result also differs from the (liminf) convergence of the expected KKT residual $\mathbb{E}[R_t^2]$ established in [3,4,5,6,18] (under a different setup). Related work. A number of methods have been proposed to optimize stochastic objectives without constraints, varying from first-order methods to second-order methods [12]. For all methods, adaptively choosing the stepsize is particularly important for practical deployment. A line of literature selects the stepsize by adaptively controlling the batch size and embedding natural (stochastic) line search into the schemes [11,13,19,21,28]. Although empirical experiments suggest the validity of stochastic line search, a rigorous analysis was missing until recently, when researchers revisited unconstrained stochastic optimization through the lens of classical nonlinear optimization methods and were able to show promising convergence guarantees. In particular, [1,9,16,27,57] studied stochastic trust-region methods, and [2,14,43,55] studied stochastic line search methods. Moreover, [3,4,5,6,18,39] designed a variety of StoSQP schemes to solve equality constrained stochastic problems. Our paper contributes to this line of work by proposing an active-set StoSQP scheme to handle inequality constraints.
There are numerous methods for solving deterministic problems with nonlinear constraints, including exact penalty methods, augmented Lagrangian methods, interior-point methods, and sequential quadratic programming (SQP) methods [40]. Our paper is based on SQP, which is a very effective (or at least competitive) approach for both small and large problems. When inequality constraints are present, SQP methods can be classified into IQP and EQP approaches. The former solves inequality constrained subproblems; the latter, to which our method belongs, solves equality constrained subproblems. A clear advantage of EQP over IQP is that the subproblems are less expensive to solve, especially when the quadratic matrix is indefinite. See [40, Chapter 18.2] for a comparison. Within SQP schemes, an exact penalty function is used as the merit function to monitor the progress of the iterates towards a KKT point. The $\ell_1$-penalized merit function, $f(x) + \mu(\|c(x)\|_1 + \|\max\{g(x), 0\}\|_1)$, is always a plausible choice because of its simplicity. However, a disadvantage of such non-differentiable merit functions is that they impede fast local rates. A nontrivial local modification of SQP has to be employed to relieve this issue [10]. As a resolution, multiple differentiable merit functions have been proposed [7]. We exploit an augmented Lagrangian merit function, which was first proposed for equality constrained problems by [45,50], and then extended to inequality constrained problems by [46,47]. [49] further improved this series of works by designing a new augmented Lagrangian, and established the exact property under weaker conditions. Although they are not crucial for that exactness analysis, equality constraints were not included in [49]. In this paper, we enhance the augmented Lagrangian in [49] to include both equality and inequality constraints, and study the case where the objective is stochastic.
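The simplicity of the $\ell_1$-penalized merit function quoted above is easy to see in code; the following is a minimal sketch on a toy problem of our own construction (the problem data are illustrative, not from the paper).

```python
import numpy as np

def l1_merit(x, f, c, g, mu):
    """l1-penalized merit function f(x) + mu * (||c(x)||_1 + ||max{g(x), 0}||_1)."""
    return f(x) + mu * (np.abs(c(x)).sum() + np.maximum(g(x), 0.0).sum())

# toy problem (our own): min x1^2 + x2^2  s.t.  x1 + x2 = 1,  x1 <= 0.5
f = lambda x: x[0]**2 + x[1]**2
c = lambda x: np.array([x[0] + x[1] - 1.0])
g = lambda x: np.array([x[0] - 0.5])
x = np.array([1.0, 0.0])              # feasible in c, violates g by 0.5
val = l1_merit(x, f, c, g, mu=10.0)   # 1 + 10 * (0 + 0.5) = 6
```

The `max{., 0}` term is what makes this function non-differentiable at the boundary of the inequality constraints, which is the drawback the differentiable augmented Lagrangian avoids.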
When inequality constraints are suppressed, our algorithm and analysis naturally reduce to [39] (with refinements). We should mention that differentiable merit functions are often more expensive to evaluate, and their benefits are mostly revealed for local rates (see [37, Figure 1] for a comparison between the augmented Lagrangian and $\ell_1$ merit functions on an optimal control problem). Thus, with only global analysis established, we do not aim to claim the benefits of the augmented Lagrangian over the popular $\ell_1$ merit function. On the other hand, the augmented Lagrangian is a very common alternative to non-differentiable penalty functions, and has been widely utilized for inequality constrained problems with promising performance [51,52,53,54,60]. Also, our global analysis is the first step towards understanding the local rate of StoSQP when differentiable merit functions are employed. Structure of the paper. We introduce the exploited augmented Lagrangian merit function and active-set SQP subproblems in Section 2. We propose our StoSQP scheme and analyze it in Section 3. The experiments and conclusions are in Sections 4 and 5. Due to the space limit, we defer all proofs to the Appendix. Notation. We use $\|\cdot\|$ to denote the $\ell_2$ norm for vectors and the spectral norm for matrices. For two scalars $a$ and $b$, $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For two vectors $a$ and $b$ with the same dimension, $\min\{a, b\}$ and $\max\{a, b\}$ are vectors obtained by taking the entrywise minimum and maximum, respectively. For $a \in \mathbb{R}^r$, $\mathrm{diag}(a) \in \mathbb{R}^{r\times r}$ is a diagonal matrix whose diagonal entries are specified by $a$ sequentially. $I$ denotes the identity matrix, whose dimension is clear from the context. For a set $\mathcal{A} \subseteq \{1, 2, \ldots, r\}$ and a vector $a \in \mathbb{R}^r$ (or a matrix $A \in \mathbb{R}^{r\times d}$), $a_{\mathcal{A}} \in \mathbb{R}^{|\mathcal{A}|}$ (or $A_{\mathcal{A}} \in \mathbb{R}^{|\mathcal{A}|\times d}$) is the sub-vector (or sub-matrix) including only the indices in $\mathcal{A}$; $\Pi_{\mathcal{A}}(\cdot): \mathbb{R}^r \to \mathbb{R}^r$ (or $\mathbb{R}^{r\times d} \to \mathbb{R}^{r\times d}$) is the projection operator with $[\Pi_{\mathcal{A}}(a)]_i = a_i$ if $i \in \mathcal{A}$ and $[\Pi_{\mathcal{A}}(a)]_i = 0$ if $i \notin \mathcal{A}$ (for $A \in \mathbb{R}^{r\times d}$, $\Pi_{\mathcal{A}}(A)$ is applied column-wise); $\mathcal{A}^c = \{1, 2, \ldots, r\}\backslash\mathcal{A}$. Finally, we reserve the notation for the Jacobian matrices of the constraints: $J(x) = \nabla^T c(x) = (\nabla c_1(x), \ldots, \nabla c_m(x))^T \in \mathbb{R}^{m\times d}$ and $G(x) = \nabla^T g(x) = (\nabla g_1(x), \ldots, \nabla g_r(x))^T \in \mathbb{R}^{r\times d}$.
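The sub-vector, complement, and projection notation can be illustrated in a few lines of numpy (a sketch of our own; note we use 0-based indices here, whereas the paper's index sets are 1-based):

```python
import numpy as np

def proj(a, A):
    """Pi_A(a): keep entries indexed by A, zero out the rest (0-based indices)."""
    out = np.zeros_like(a)
    idx = sorted(A)
    out[idx] = a[idx]
    return out

a = np.array([3.0, -1.0, 4.0, 0.5])
A = {0, 2}
a_A = a[sorted(A)]                       # sub-vector a_A = [3., 4.]
Ac = sorted(set(range(len(a))) - A)      # complement A^c = [1, 3]
p = proj(a, A)                           # [3., 0., 4., 0.]
```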

Preliminaries
Throughout this section, we suppose $f, c, g$ are twice continuously differentiable (i.e., $f, g, c \in C^2$). The Lagrangian function of Problem (1) is denoted by $\mathcal{L}(x, \mu, \lambda)$. We denote by $\Omega = \{x \in \mathbb{R}^d: c(x) = 0,\ g(x) \le 0\}$ (2) the feasible set, and by $\mathcal{I}(x) = \{1 \le i \le r: g_i(x) = 0\}$ (3) the active set. We aim to find a KKT point $(x^\star, \mu^\star, \lambda^\star)$ of (1) satisfying the KKT conditions (4). When a constraint qualification holds, the existence of a dual pair $(\mu^\star, \lambda^\star)$ satisfying (4) is a first-order necessary condition for $x^\star$ to be a local solution of (1). In most cases, it is difficult to have an initial iterate that satisfies all inequality constraints, and to enforce the inequality constraints to hold as the iteration proceeds. This motivates us to consider a perturbed set $\mathcal{T}_\nu$, defined for $\nu > 0$ in (5). Here, the perturbation radius $\nu/2$ is not essential and can be replaced by $\nu/\kappa$ for any $\kappa > 1$. Also, the cubic power in $a(x)$ can be replaced by any power $s$ with $s > 2$, which ensures that $a(x) \in C^2$ provided $g_i(x) \in C^2$ for all $i$. We also define a scaling function $q_\nu(x, \lambda)$ in (6), where $a_\nu(x)$ measures the distance of $a(x)$ to the boundary $\nu$, and $q_\nu(x, \lambda)$ rescales $a_\nu(x)$ by penalizing $\lambda$ with a large magnitude. In the definitions of (5) and (6), $\nu > 0$ is a parameter to be chosen: given the current primal iterate $x_t$, we choose $\nu = \nu_t$ large enough so that $x_t \in \mathcal{T}_\nu$. Note that while it is difficult to have $x_t \in \Omega$, it is easy to choose $\nu$ to have $x_t \in \mathcal{T}_\nu$. We also note that $\frac{\nu}{2(1 + \|\lambda\|^2)} \le q_\nu(x, \lambda) \le \nu$ for all $(x, \lambda) \in \mathcal{T}_\nu \times \mathbb{R}^r$, and $q_\nu(x, \lambda) \to 0$ as $\|\lambda\| \to \infty$.
An implication of Lemma 1 is that, when the iteration sequence converges to a KKT point, $w_{\epsilon,\nu}(x, \lambda)$ converges to 0, i.e., $g(x) = b_{\epsilon,\nu}(x, \lambda)$. This motivates us to define the augmented Lagrangian function $\mathcal{L}_{\epsilon,\nu,\eta}$ in (8), where $\eta > 0$ is a prespecified parameter, which can be any positive number throughout the paper. The augmented Lagrangian (8) generalizes the one in [49] by including equality constraints and introducing $\eta$ to enhance flexibility ($\eta = 2$ in [49]). Without inequalities, (8) reduces to the augmented Lagrangian studied in [39]. The penalty in (8) consists of two parts. The first part characterizes the feasibility error and consists of $\|c(x)\|^2$ and $\|g(x)\|^2 - \|b_{\epsilon,\nu}(x, \lambda)\|^2$. The latter term is rescaled by $1/q_\nu(x, \lambda)$ to penalize $\lambda$ with a large magnitude. In fact, if $\|\lambda\| \to \infty$, then $q_\nu(x, \lambda)\lambda \to 0$, so that $b_{\epsilon,\nu}(x, \lambda) \to \min\{0, g(x)\}$ (cf. (7)). Thus, the penalty term $(\|g(x)\|^2 - \|b_{\epsilon,\nu}(x, \lambda)\|^2)/q_\nu(x, \lambda) \to \infty$, which is impossible when the iterates decrease $\mathcal{L}_{\epsilon,\nu,\eta}$. The second part characterizes the optimality error and does not depend on the parameters $\epsilon$ and $\nu$. We mention that there are alternative forms of the augmented Lagrangian, some of which transform nonlinear inequalities using (squared) slack variables [7,60]. In that case, additional variables are involved, and the strict complementarity condition is often needed to ensure the equivalence between the original and transformed problems [22].
The exact property of (8) can be studied similarly as in [49]; however, this is incremental and not crucial for our analysis. We will only use (a stochastic version of) (8) to monitor the progress of the iterates. By direct calculation, we obtain the gradient $\nabla\mathcal{L}_{\epsilon,\nu,\eta}$. We first suppress the evaluation point for conciseness, and define the matrices in (9), where $e_{i,m} \in \mathbb{R}^m$ is the $i$-th canonical basis vector of $\mathbb{R}^m$ (similarly for $e_{i,r} \in \mathbb{R}^r$). Then, $\nabla\mathcal{L}_{\epsilon,\nu,\eta}$ takes the form (10), where $l = l(x) = \mathrm{diag}(\max\{g(x), 0\})\max\{g(x), 0\}$. Clearly, the evaluation of $\nabla\mathcal{L}_{\epsilon,\nu,\eta}$ requires $\nabla f$ and $\nabla^2 f$, which have to be replaced by their stochastic counterparts $\bar\nabla f$ and $\bar\nabla^2 f$ for Problem (1). Based on (10), we note that, if the feasibility error vanishes, then $\nabla\mathcal{L}_{\epsilon,\nu,\eta} = 0$ implies that the KKT conditions (4) hold for any $\epsilon, \nu, \eta > 0$. We summarize this observation in the next lemma. The result holds without any constraint qualifications.
In the next subsection, we introduce an active-set SQP direction that is motivated by the augmented Lagrangian (8).

An active-set SQP direction via EQP
Let $\epsilon, \nu, \eta > 0$ be fixed parameters. Suppose we have the $t$-th iterate $(x_t, \mu_t, \lambda_t) \in \mathcal{T}_\nu \times \mathbb{R}^m \times \mathbb{R}^r$, and denote by $J_t = J(x_t)$, $G_t = G(x_t)$ (similarly for $\nabla f_t$, $c_t$, $g_t$, $q_\nu^t$, etc.) the quantities evaluated at the $t$-th iterate. We generally use the index $t$ as a subscript, except for quantities (e.g., $q_\nu^t$) whose subscripts are already taken by $\epsilon$, $\nu$, or $\eta$. For an active set $\mathcal{A} \subseteq \{1, \ldots, r\}$, we denote by $\lambda_{ta} = (\lambda_t)_{\mathcal{A}}$, $\lambda_{tc} = (\lambda_t)_{\mathcal{A}^c}$ (similarly for $g_{ta}$, $g_{tc}$, $G_{ta}$, $G_{tc}$, etc.) the sub-vectors (or sub-matrices), and write $\Pi_a(\cdot) = \Pi_{\mathcal{A}}(\cdot)$, $\Pi_c(\cdot) = \Pi_{\mathcal{A}^c}(\cdot)$ for shorthand.
With the $t$-th iterate $(x_t, \mu_t, \lambda_t)$ and the above notation, we first define the identified active set as in (11). We then solve the coupled linear system (12) for some $B_t$ that approximates the Hessian $\nabla_x^2\mathcal{L}_t$. Our active-set SQP direction is then $\Delta_t := (\Delta x_t, \Delta\mu_t, \Delta\lambda_t)$. Finally, we update the iterate as $(x_{t+1}, \mu_{t+1}, \lambda_{t+1}) = (x_t, \mu_t, \lambda_t) + \alpha_t\Delta_t$ (13), with $\alpha_t$ chosen to ensure a certain sufficient decrease of the merit function (8).
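Since the displayed system (12) does not survive extraction here, the following sketch solves only the EQP subproblem described in Remark 1, assuming the standard KKT form of an equality constrained QP; it is our illustrative reconstruction with toy data, not the paper's full coupled system (which additionally updates the inactive multipliers).

```python
import numpy as np

def eqp_step(B, grad_f, J, c, G_a, g_a):
    """Solve the KKT system of the EQP subproblem
        min_{dx} 0.5 dx^T B dx + grad_f^T dx
        s.t.  c + J dx = 0,   g_A + G_A dx = 0,
    returning (dx, new equality multiplier, new active-set multiplier)."""
    d, m, ra = B.shape[0], J.shape[0], G_a.shape[0]
    K = np.block([
        [B,   J.T,                 G_a.T],
        [J,   np.zeros((m, m)),    np.zeros((m, ra))],
        [G_a, np.zeros((ra, m)),   np.zeros((ra, ra))],
    ])
    rhs = -np.concatenate([grad_f, c, g_a])
    sol = np.linalg.solve(K, rhs)  # fails if the KKT matrix is singular
    return sol[:d], sol[d:d + m], sol[d + m:]

# toy data: d = 2, one equality, one identified (active) inequality
B = np.eye(2)
grad_f = np.array([1.0, 1.0])
J = np.array([[1.0, 0.0]]); c_t = np.array([0.2])
G_a = np.array([[0.0, 1.0]]); g_a = np.array([-0.1])
dx, mu_new, lam_new = eqp_step(B, grad_f, J, c_t, G_a, g_a)
```

The returned step linearly corrects both constraint residuals, and the stationarity residual $B\,\Delta x + \nabla f + J^T\mu + G_{\mathcal{A}}^T\lambda$ vanishes at the new multipliers.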
The definition of the active set was introduced in [49, (8.5)] and has been utilized, e.g., in [52]. Intuitively, for the $i$-th inequality constraint, if $g_i^\star = (g(x^\star))_i = 0$ and $\lambda_i^\star > 0$, then $i$ will be identified when $(x_t, \lambda_t)$ is close to $(x^\star, \lambda^\star)$; if $g_i^\star < 0$ and $\lambda_i^\star = 0$, then $i$ will not be identified. The stepsize $\alpha_t$ is usually chosen by line search. In Section 3, we will design a stochastic line search scheme to select $\alpha_t$ adaptively. Compared to fully stochastic SQP schemes [3,4,18], we need a more precise model estimation. We explain the SQP direction (12) in the next remark.
Remark 1 Our dual direction $(\Delta\mu_t, \Delta\lambda_t)$ differs from the usual SQP direction introduced, for example, in [49, (8.9)]. In particular, the system (12a) is nothing but the KKT conditions of an EQP. Thus, $(\Delta x_t, \mu_t + \Delta\mu_t, \lambda_{ta} + \Delta\lambda_{ta})$ solved from (12a) is also the primal-dual solution of this EQP. However, instead of using $(\tilde\Delta\mu_t, \tilde\Delta\lambda_{ta}, -\lambda_{tc})$, we solve the dual direction $(\Delta\mu_t, \Delta\lambda_t)$ for both active and inactive constraints from (12b). As $B_t$ converges to $\nabla_x^2\mathcal{L}_t$ and $(x_t, \mu_t, \lambda_t)$ converges to a KKT point $(x^\star, \mu^\star, \lambda^\star)$, it is fairly easy to see that $(\Delta\mu_t, \Delta\lambda_t)$ converges to $(\tilde\Delta\mu_t, \tilde\Delta\lambda_t)$ (where we denote $\tilde\Delta\lambda_{tc} = -\lambda_{tc}$) at a higher order. Thus, the fast local rate of the SQP direction $(\Delta x_t, \tilde\Delta\mu_t, \tilde\Delta\lambda_t)$ is preserved by $\Delta_t$. However, it turns out that the adjustment in $\Delta_t$ is crucial for the merit function (8) when $B_t$ is far from $\nabla_x^2\mathcal{L}_t$. A similar coupled SQP system is employed for equality constrained problems in [34,39], while we extend it to inequality constraints here. In fact, [49, Proposition 8.2] showed that $(\Delta x_t, \tilde\Delta\mu_t, \tilde\Delta\lambda_t)$ is a descent direction of $\mathcal{L}^t_{\epsilon,\nu,\eta}$ if $(x_t, \mu_t, \lambda_t)$ is near a KKT point and $B_t = \nabla_x^2\mathcal{L}_t$. However, $B_t = \nabla_x^2\mathcal{L}_t$ (i.e., no Hessian modification) is restrictive even for a deterministic line search, and that descent result does not hold if $B_t \ne \nabla_x^2\mathcal{L}_t$. In contrast, as shown in Lemma 3, $\Delta_t$ is a descent direction even if $B_t$ is not close to $\nabla_x^2\mathcal{L}_t$.

The descent property of ∆ t
In this subsection, we present a descent property of $\Delta_t$. We focus on the term $(\nabla\mathcal{L}^t_{\epsilon,\nu,\eta})^T\Delta_t$. Different from SQP for equality constrained problems, $\Delta_t$ may not be a descent direction of $\mathcal{L}^t_{\epsilon,\nu,\eta}$ at some points, even if $\epsilon$ is chosen small enough.
To see this clearly, we suppress the iteration index, denote $g_a = g_{ta}$ (similarly for $\lambda_a$, $\lambda_c$, etc.), and divide $\nabla\mathcal{L}_{\epsilon,\nu,\eta}$ (cf. (10)) into two terms: a dominating term that depends on $(g_a, \lambda_c)$ linearly, and a higher-order term that depends on $(g_a, \lambda_c)$ at least quadratically. In particular, we write $\nabla\mathcal{L}_{\epsilon,\nu,\eta} = \nabla\mathcal{L}^{(1)}_{\epsilon,\nu,\eta} + \nabla\mathcal{L}^{(2)}_{\epsilon,\nu,\eta}$. Loosely speaking (see Lemma 3 for a rigorous result), $(\nabla\mathcal{L}^{(1)}_{\epsilon,\nu,\eta})^T\Delta$ provides a sufficient decrease provided the penalty parameters are suitably chosen, while $(\nabla\mathcal{L}^{(2)}_{\epsilon,\nu,\eta})^T\Delta$ has no such guarantee in general. Since $\nabla\mathcal{L}^{(2)}_{\epsilon,\nu,\eta}$ depends on $(g_a, \lambda_c)$ quadratically, to ensure $\nabla\mathcal{L}_{\epsilon,\nu,\eta}^T\Delta < 0$, we require $\|g_a\| \vee \|\lambda_c\|$ to be small enough so that the linear term $(\nabla\mathcal{L}^{(1)}_{\epsilon,\nu,\eta})^T\Delta$ dominates. This essentially requires the iterate to be close to a KKT point, since $\|g_a\| = \|\lambda_c\| = 0$ at a KKT point. With this discussion in mind, if the iterate is far from a KKT point, $\Delta$ may not be a descent direction of $\mathcal{L}_{\epsilon,\nu,\eta}$. In fact, for an iterate that is far from a KKT point, the KKT matrix $K_a$ (and its component $G_a$) is likely to be singular due to the imprecisely identified active set. Thus, the Newton system (12) is not even solvable at this iterate, let alone generating a descent direction. Without inequalities, the quadratic term $\nabla\mathcal{L}^{(2)}_{\epsilon,\nu,\eta}$ disappears and our analysis reduces to the one in [39]. We realize that the existence of $\nabla\mathcal{L}^{(2)}_{\epsilon,\nu,\eta}$ results in an augmented Lagrangian very different from the one in [39], and brings difficulties in designing a global algorithm to deal with inequality constraints.
We point out that requiring the iterate to be local is not an artifact of our proof technique. Such a requirement is imposed for different search directions in the related literature. For example, [49] showed that the SQP direction obtained by either EQP or IQP is a descent direction of $\mathcal{L}_{\epsilon,\nu,\eta}$ in a neighborhood of a KKT point (cf. Propositions 8.2 and 8.4). That work also required $B_t = \nabla_x^2\mathcal{L}_t$, which we relax by considering a coupled Newton system. Subsequently, [52,54] studied truncated Newton directions, whose descent properties hold only locally as well (cf. [52, Proposition 3.7], [54, Proposition 10]). Now, we introduce two assumptions and formalize the descent property.
Assumption 1 (LICQ) We assume that $(J^T(x^\star)\; G_{\mathcal{I}(x^\star)}^T(x^\star))$ has full column rank, where $\mathcal{I}(x^\star)$ is the active inequality set defined in (3).
The above condition on $B_t$ is standard in the nonlinear optimization literature [7]. In fact, $B_t = I$ with $\gamma_B = \Upsilon_B = 1$ is sufficient for the analysis in this paper.
The condition $\Upsilon_B \ge 1 \ge \gamma_B > 0$ (and similar conditions for other constants defined later) is inessential and only simplifies the presentation. Without such a requirement, our analyses hold with $\gamma_B$ replaced by $\gamma_B \wedge 1$ and $\Upsilon_B$ replaced by $\Upsilon_B \vee 1$.
Lemma 3 Let $\nu, \eta > 0$ and suppose Assumptions 1 and 2 hold. There exist a constant $\Upsilon > 0$ depending on $\Upsilon_B$ but not on $(\nu, \eta, \gamma_B)$, and a compact set $\mathcal{X}_{\epsilon,\nu} \times \mathcal{M} \times \Lambda_{\epsilon,\nu}$ around $(x^\star, \mu^\star, \lambda^\star)$ depending on $(\epsilon, \nu)$ but not on $\eta$, such that if $(x_t, \mu_t, \lambda_t) \in \mathcal{X}_{\epsilon,\nu} \times \mathcal{M} \times \Lambda_{\epsilon,\nu}$ with $\epsilon$ satisfying $\epsilon \le \gamma_B^2(\gamma_B \wedge \eta)/\{(1 \vee \nu)\Upsilon\}$, then the stated descent bound holds. Furthermore, there exists a compact subset $\mathcal{X}_{\epsilon,\nu,\eta} \times \mathcal{M} \times \Lambda_{\epsilon,\nu,\eta} \subseteq \mathcal{X}_{\epsilon,\nu} \times \mathcal{M} \times \Lambda_{\epsilon,\nu}$, depending additionally on $\eta$, such that if $(x_t, \mu_t, \lambda_t) \in \mathcal{X}_{\epsilon,\nu,\eta} \times \mathcal{M} \times \Lambda_{\epsilon,\nu,\eta}$, then the stronger descent bound holds. Similar arguments for other directions can be found in [52, Proposition 3.5] and [54, Proposition 9]. By the proof of Lemma 3, we know that as long as $M_t$ and $(J_t^T\; G_{ta}^T)$ in the SQP system (12) have full (column) rank, $(\nabla\mathcal{L}^{t,(1)}_{\epsilon,\nu,\eta})^T\Delta_t$ ensures a sufficient decrease provided $\epsilon$ is small enough. However, from (A.11) in the proof, we also see that $(\nabla\mathcal{L}^{t,(2)}_{\epsilon,\nu,\eta})^T\Delta_t$ is only bounded by a term proportional to $\Upsilon_1\big(\frac{1\vee\nu}{\epsilon(1\wedge\nu^2)} \vee \eta\big)(\|g_{ta}\| + \|\lambda_{tc}\|)$, where $\Upsilon_1 > 0$ is a constant independent of $(\epsilon, \nu, \eta)$. Thus, to ensure $(\nabla\mathcal{L}^t_{\epsilon,\nu,\eta})^T\Delta_t$ to be negative, we have to restrict to a neighborhood in which $\|g_{ta}\| \vee \|\lambda_{tc}\|$ is small enough so that $\Upsilon_1\big(\frac{1\vee\nu}{\epsilon(1\wedge\nu^2)} \vee \eta\big)(\|g_{ta}\| + \|\lambda_{tc}\|) \le (\gamma_B \wedge \eta)/4$. This requirement is achievable near a KKT pair $(x^\star, \lambda^\star)$, where the active set is correctly identified (implying that $\|g_{ta}\| \le \|(g_t)_{\mathcal{I}(x^\star)}\|$ and $\|\lambda_{tc}\| \le \|(\lambda_t)_{\{i:\, 1\le i\le r,\, \lambda_i^\star = 0\}}\|$); the radius of the neighborhood clearly depends on $(\epsilon, \nu, \eta)$.
In the next section, we exploit the introduced augmented Lagrangian merit function (8) and the active-set SQP direction (12) to design a StoSQP scheme for Problem (1). We will adaptively choose proper $\epsilon$ and $\nu$ (recall that $\eta > 0$ can be any positive number in this paper), incorporate stochastic line search to select the stepsize, and globalize the scheme by utilizing a safeguarding direction (e.g., a Newton or steepest-descent step) of the merit function $\mathcal{L}_{\epsilon,\nu,\eta}$. If the system (12) is not solvable, or is solvable but does not generate a descent direction, we search along the alternative direction to decrease the merit function. However, since $\Delta_t$ usually enjoys a fast local rate (see [49, Proposition 8.3] for a local analysis of $(\Delta x_t, \tilde\Delta\mu_t, \tilde\Delta\lambda_t)$, and Remark 1), we prefer to preserve $\Delta_t$ as much as possible.

An Adaptive Active-Set StoSQP Scheme
We design an adaptive scheme for Problem (1) that embeds stochastic line search, originally designed and analyzed for unconstrained problems in [14,43], into an active-set StoSQP. There are two challenges in designing adaptive schemes for constrained problems. First, the merit function has penalty parameters that are random and adaptively specified, while for unconstrained problems one simply uses the objective function in the line search. To show global convergence, it is crucial that the stochastic penalty parameters stabilize almost surely: then, for each run, after a few iterations we always target a stabilized merit function. Otherwise, if each iteration decreases a different merit function, the decreases across iterations may not accumulate. Second, since the stabilized parameters are random, they may not be below the unknown deterministic thresholds. Such a condition is critical to ensure the equivalence between the stationary points of the merit function and the KKT points of Problem (1). Thus, even if we converge to a stationary point of the (stabilized) merit function, that stationary point is not necessarily a KKT point of Problem (1).
With only equality constraints, [4,39] addressed the first challenge under a boundedness condition, and our paper follows the same type of analysis. A similar boundedness condition is also required in deterministic analyses to have the penalty parameters stabilized [7, Chapter 4.3.3]. [4] resolved the second challenge by introducing a noise condition (satisfied by symmetric noise), while [39] resolved it by adjusting the SQP scheme when selecting the penalty parameters. As introduced in Section 1, the technique of [39] has multiple flaws: (i) it requires generating increasing samples to estimate the gradient of the augmented Lagrangian (cf. [39, Step 1]); (ii) it imposes a feasibility error condition in each step (cf. [39, (19)]). In this paper, we refine the technique of [39] and enable inequality constraints. As revealed by Section 2, the present analysis of inequality constraints is much more involved; and, more importantly, our "lim" convergence guarantee strengthens the existing "liminf" convergence of the stochastic line search in [39,43]. In what follows, we use $\bar{(\cdot)}$ to denote random quantities, except for the iterate $(x_t, \mu_t, \lambda_t)$. For example, $\bar\alpha_t$ denotes a random stepsize.
Step 1: Estimate objective derivatives. We generate a batch of independent samples ξ t 1 to estimate the gradient ∇f t and Hessian ∇ 2 f t . The estimators∇f t and∇ 2 f t need not be computed with the same amount of samples, since they have different sample complexities. For example, we can compute∇f t using ξ t 1 while computing∇ 2 f t using a fraction of ξ t 1 (more on this in Section 3.4). With∇f t ,∇ 2 f t , we then compute∇ x L t ,Q 1,t , andQ 2,t used in the system (12). We require the batch size |ξ t 1 | to be large enough to make the gradient error of the merit function small. In particular, we define the gradient error∆p∇L t η q in (10); a key feature of (10) is that∆p∇L t η q is independent of¯ t (andν t ), which will be selected later (Step 2). We require |ξ t 1 | to satisfy two conditions, (15) and (16) (the latter involvesδ t ; see Step 5 for its meaning). The sample complexities to ensure (15) and (16) will be discussed in Section 3.4. Compared to [39], we do not let |ξ t 1 | increase monotonically; instead, we impose an expectation condition (16) when we arrive at a new iterate. By our analysis, it is easy to see that (16) can also be replaced by requiring the subsequence t|ξ t 1 | : t´1 is a successful stepu to increase to infinity (e.g., increase by at least one each time), which is still weaker than [39]. The right hand side of (16) will be clear when we utilizeδ t later in Step 5 (cf. (27)). We use P ξ t 1 p¨q and E ξ t 1 r¨s to denote the probability and expectation evaluated over the randomness of sampling ξ t 1 only, while other random quantities, such as px t , µ t , λ t q andᾱ t , are conditioned on. Step 2: Set parameter¯ t . With the currentν t , we decrease¯ t Ð¯ t {ρ until¯ t is small enough to satisfy the following two conditions simultaneously: (a) the feasibility error is proportionally bounded by the gradient of the merit function, whenever the iterate is closer to a stationary point than to a KKT point, as stated in (17). (We use the same multiplier χ err only to simplify the notation.)
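As a concrete illustration of Step 1, the following minimal sketch estimates the objective gradient from the full batch and the Hessian from a subsampled fraction of it. The per-sample oracles `grad_F` and `hess_F`, the function name, and the subsampling fraction are all hypothetical stand-ins, not the paper's API:

```python
import numpy as np

def estimate_derivatives(x, grad_F, hess_F, xi, hess_fraction=0.25, rng=None):
    """Average per-sample derivatives over a batch.

    grad_F(x, s) and hess_F(x, s) are hypothetical per-sample oracles for
    the realized gradient and Hessian of F(x; xi).  The Hessian estimate
    uses only a subset of the batch, reflecting its lower sample complexity
    (cf. Section 3.4).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    grads = np.array([grad_F(x, s) for s in xi])
    bar_grad = grads.mean(axis=0)                  # gradient from the full batch
    n_hess = max(1, int(hess_fraction * len(xi)))  # Hessian subsample size
    sub = rng.choice(len(xi), size=n_hess, replace=False)
    bar_hess = np.mean([hess_F(x, xi[i]) for i in sub], axis=0)
    return bar_grad, bar_hess
```

With these estimates one can then assemble the quantities∇ x L t ,Q 1,t ,Q 2,t used in the SQP system (12).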
(b) if the SQP system (12) with∇ x L t ,Q 1,t , andQ 2,t is solvable, then we obtain ∆ t " p∆x t ,∆µ t ,∆λ t q and require We prove in Lemma 4 and Lemma 5 that both (17) and (18) can be satisfied for sufficiently small¯ t . In fact, Lemma 3 has already established (18) for the deterministic case. Even though∆ t is not always used as the search direction, we still enforce (18) to hold for p∇L t p1q t,νt,η q T∆ t . The reason for this is to avoid ruling out∆ t just because¯ t is not small enough, which would result in a positive dominating term p∇L t p1q t,νt,η q T∆ t . If (12) is not solvable (e.g., the active set is imprecisely identified so that K ta is singular), then (18) is not needed.
The condition (17) is the key to ensure that the stationary point of the merit function that we converge to is a KKT point of (1). Motivated by Lemma 2, we know that "the stationarity of the merit function plus vanishing feasibility error" implies vanishing KKT residual. (17) states that the feasibility error is roughly controlled by the gradient of the merit function. (17) relaxes [39, (19)] in two aspects. First, [39] had no multiplier, while we allow any (large) multiplier χ err . Second, [39] enforced (17) for each step, while we enforce it only when we observe stronger evidence that the scheme is approaching a stationary point than a KKT point. Both relaxations are driven by the purpose for which the condition is imposed. When adjusting¯ t , if }∇L t t,νt,η } exceedsR t before }pc t , w t t,νt q} does (which easily happens for a largeν t ), then one can immediately stop the adjustment of¯ t . Compared to [39], where the SQP system is supposed to be always solvable, (17) has extra usefulness: when∆ t is not available, (17) ensures that the safeguarding direction can be computed using the samples in Step 1. Such a property is not trivially achieved, and further relaxations of (17) can be designed if we generate new samples for the safeguarding direction (in Step 3). The subtlety lies in the fact that no penalty parameters are involved when we generate ξ t 1 in Step 1, while (17) builds a connection between ξ t 1 and the penalty parameters. In particular, it implies that a sample set ξ t 1 satisfying (15) and (16) also satisfies the corresponding conditions for the safeguarding direction.
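The adjustment of¯ t in Step 2 amounts to a backtracking loop on the penalty parameter. A minimal sketch, where `feas_cond` and `descent_cond` are hypothetical predicates standing in for conditions (17) and (18), and the `eps_min` safeguard is our addition:

```python
def adjust_eps(eps, rho, feas_cond, descent_cond, eps_min=1e-12):
    """Shrink the penalty parameter eps <- eps/rho until both Step-2
    conditions hold.  feas_cond(eps) plays the role of (17) and
    descent_cond(eps) that of (18); both are hypothetical callbacks."""
    while not (feas_cond(eps) and descent_cond(eps)):
        eps /= rho
        if eps < eps_min:  # safeguard for this sketch only
            break
    return eps
```

Lemmas 4 and 5 guarantee that for sufficiently small¯ t both conditions hold, so such a loop terminates.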
Step 3: Decide the search direction. We may obtain a stochastic SQP direction∆ t from Step 2. However, if (12) is not solvable, or it is solvable but∆ t is not a sufficient descent direction (in the sense of (19)), then an alternative safeguarding direction∆ t must be employed to ensure the decrease of the merit function. In that case, we follow [52,54] and regard L¯ t,νt,η as a penalized objective. We require∆ t to satisfy (20) for a constant χ u ě 1. Similar to (17), we use the same constant χ u for the two multipliers to simplify the notation; when using two different constants χ 1,u and χ 2,u , we can always set χ u to be the larger of the two. One example that satisfies (20) and is computationally cheap is the steepest descent direction∆ t "´∇L t t,νt,η with χ u " 1. Such a direction can be computed (almost) without any extra cost, since the two components of∇L t t,νt,η are already available. A more computationally expensive example is the regularized Newton stepĤ t∆t "´∇L t t,νt,η , whereĤ t captures second-order information of L t t,νt,η and satisfies 1{χ u I ĺĤ t ĺ χ u I. In particular,Ĥ t can be obtained by regularizing the (generalized) Hessian matrix H t , which is provided and discussed in [49,52] and has the form given in (21). Other examples that improve upon the regularized Newton step include the choices in [20,53], where a truncated conjugate gradient method is applied to an indefinite Newton system [53, Proposition 3.3, (14)]. We will numerically implement the regularized Newton and the steepest descent steps in Section 4.
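One concrete way to build aĤ t with 1{χ u I ĺĤ t ĺ χ u I, as required for the regularized Newton safeguard, is to clip the eigenvalues of the (generalized) Hessian. This is a sketch of one possible regularization, not necessarily the exact recipe of [49,52]:

```python
import numpy as np

def regularized_newton_dir(H, grad_L, chi_u=10.0):
    """Clip the spectrum of a symmetric matrix H into [1/chi_u, chi_u],
    giving an SPD approximation H_hat, then solve H_hat @ d = -grad_L."""
    w, V = np.linalg.eigh(H)            # eigendecomposition of symmetric H
    w = np.clip(w, 1.0 / chi_u, chi_u)  # enforce the two-sided spectral bound
    H_hat = (V * w) @ V.T               # reconstruct the clipped matrix
    return -np.linalg.solve(H_hat, grad_L)
```

SinceĤ t is SPD with a bounded condition number, the resulting direction is a descent direction satisfying bounds of the type (20); with χ u " 1 the construction collapses to the steepest descent direction.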
Step 4: Estimate the merit function. Let q ∆ t denote the adopted search direction; thus q ∆ t "∆ t from Step 2 or q ∆ t "∆ t from Step 3. We aim to perform stochastic line search by checking the Armijo condition (26) at the trial point px st , µ st , λ st q. We estimate the merit function in this step and perform the line search in Step 5. If x st R Tν t (cf. (5)), then we stop the current iteration, reject the trial point by letting px t`1 , µ t`1 , λ t`1 q " px t , µ t , λ t q, and increaseν t as in (22), where rys denotes the least integer that is no less than y. The definition of j ě 1 in (22) ensures x st P Tν t`1 . However, j " 1 works as well, since x t`1 " x t P Tν t Ď Tν t`1 , as required for performing the next iteration. In the case of x st R Tν t , particularly if a st ěν t , evaluating the merit function L st t,νt,η is not informative, since the penalty term in L st t,νt,η may be rescaled by a negative multiplier. Thus, we increaseν t and rerun the iteration at the current point.
Otherwise, if x st P Tν t , we generate a batch of independent samples ξ t 2 , which are also independent of ξ t 1 , and estimate f t , f st , ∇f t , ∇f st . Similar to Step 1, the estimatorsf t ,f st and∇f t ,∇f st need not be computed with the same amount of samples. For example,f t andf st can be computed using ξ t 2 while∇f t and∇f st can be computed using a fraction of ξ t 2 . The sample complexities are discussed in Section 3.4. Here, we distinguish∇f t in this step from∇f t in Step 1: while both are estimates of ∇f t , the former is computed based on ξ t 2 and the latter based on ξ t 1 . Usingf t ,f st ,∇f t ,∇f st , we computeL t t,νt,η andL st t,νt,η according to (8).
We require |ξ t 2 | to be large enough such that the event E t 2 , defined in (23), satisfies (24) and (25). Similar to (15) and (16), P ξ t 2 p¨q and E ξ t 2 r¨s denote that the randomness is taken over sampling ξ t 2 only, while other random quantities are conditioned on. Step 5: Perform line search. With the merit function estimates, we check the Armijo condition (26). (a) If the Armijo condition holds, then the trial point is accepted by letting px t`1 , µ t`1 , λ t`1 q " px st , µ st , λ st q and the stepsize is increased byᾱ t`1 " ρᾱ t^αmax . Furthermore, we check if the decrease of the merit function is reliable. In particular, if (27) holds, then we increaseδ t byδ t`1 " ρδ t ; otherwise, we decreaseδ t byδ t`1 "δ t {ρ.
(b) If the Armijo condition (26) does not hold, then the trial point is rejected by letting px t`1 , µ t`1 , λ t`1 q " px t , µ t , λ t q and the stepsize is decreased toᾱ t`1 "ᾱ t {ρ. Finally, for both cases (a) and (b), we let¯ t`1 "¯ t ,ν t`1 "ν t and repeat the procedure from Step 1. From (27), we can see thatδ t (roughly) has the orderᾱ t }∇L t t,νt,η } 2 , which justifies the definition of the right hand side of (16).
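The Step-5 updates can be sketched as follows; the Armijo right hand side and the reliability test (27) are passed in precomputed, the function and argument names are ours, and the backtracking rule on an unsuccessful step is the standard one assumed here:

```python
def line_search_update(L_bar_t, L_bar_st, armijo_rhs, reliable,
                       alpha, delta, rho, alpha_max):
    """One Step-5 update.  Accept the trial point iff the estimated Armijo
    condition holds; on success, grow the stepsize (capped at alpha_max) and
    grow/shrink delta by rho depending on reliability (27); on failure,
    backtrack the stepsize."""
    if L_bar_st <= L_bar_t + armijo_rhs:   # Armijo test (26): successful step
        alpha = min(rho * alpha, alpha_max)
        delta = rho * delta if reliable else delta / rho
        accepted = True
    else:                                  # unsuccessful step
        alpha = alpha / rho
        accepted = False
    return accepted, alpha, delta
```

The accepted/rejected branches correspond to the successful/unsuccessful steps, and the `reliable` flag to the reliable/unreliable distinction defined next.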
The proposed scheme is summarized in Algorithm 1. We define three types of iterations for the line search. If the Armijo condition (26) holds, we call the iteration a successful step; otherwise, we call it an unsuccessful step. For a successful step, if the sufficient decrease in (27) is satisfied, we call it a reliable step; otherwise, we call it an unreliable step. The same notions are used in [14,39,43].
To end this section, let us introduce the filtration induced by the randomness of the algorithm. Given a random sample sequence tξ t 1 , ξ t 2 u 8 t"0 , we let F t " σptξ j 1 , ξ j 2 u t j"0 q, t ě 0, be the σ-algebra generated by all the samples till t; F t´0.5 " σptξ j 1 , ξ j 2 u t´1 j"0 Y ξ t 1 q, t ě 0, be the σ-algebra generated by all the samples till t´1 and the sample ξ t 1 ; and F´1 be the trivial σ-algebra generated by the initial iterate (which is deterministic).

[Algorithm 1: An Adaptive Active-Set StoSQP with Augmented Lagrangian. In each iteration, generate ξ t 1 so that (15) holds and, if t´1 is a successful step, (16) holds; compute∇ x L t ,Q 1,t ,Q 2,t as in (9); generate ξ t 2 and computeL t t,νt,η ,L st t,νt,η so that (24) and (25) hold.]

Throughout the presentation, we let¯ t be the quantity obtained after Step 2; that is,¯ t satisfies (17) and (18). With this setup, it is easy to see that the generated iterates and parameters are adapted to the filtration. We analyze Algorithm 1 in the next subsection.

Assumptions and stability of parameters
We study the stability of the parameter sequence t¯ t ,ν t u t . We will show that, for each run of the algorithm, the sequence is stabilized after a finite number of iterations. Thus, Lines 5 and 14 of Algorithm 1 will not be performed when the iteration index t is large enough. We begin by introducing the assumptions.

Assumption 3 (Regularity condition)
We assume the iterates tpx t , µ t , λ t qu and trial points tpx st , µ st , λ st qu are contained in a convex compact region X ˆMˆΛ. Further, if x st P Tν t , then the segment tζx t`p 1´ζqx st : ζ P p0, 1qu Ď T θνt for some θ P r1, 2q. We also assume the functions f, g, c are thrice continuously differentiable over X , and the realizations |F px, ξq|, }∇F px, ξq}, }∇ 2 F px, ξq} are uniformly bounded over x P X and ξ " P.
Assumption 4 (Constraint qualification) For any x P Ω, we assume that pJ T pxq G T Ipxq pxqq has full column rank, where Ω is the feasible set in (2) and Ipxq is the active set in (3). For any x P X zΩ, we assume the linear system has a solution for z P R d .
The boundedness condition on the realizations in Assumption 3 is widely used in StoSQP analyses to obtain a well-behaved stochastic penalty parameter sequence [3,4,18,39]. The third derivatives of f, g, c are only required in the analysis and are not needed in the implementation. They are required since the existence of the (generalized) Hessian of the augmented Lagrangian needs the third derivatives. See, for example, [49, Section 6] for the same requirement. For deterministic schemes, the compactness condition on the iterates is typical for augmented Lagrangian and SQP analyses [7, Chapter 4] [40, Chapter 18]. Some literature relaxes it by assuming all quantities (e.g., the objective gradient and constraints Jacobian) are uniformly upper bounded with a lower bounded objective (and hence a lower bounded merit function). However, either condition is rather restrictive for StoSQP due to the underlying randomness of the scheme. That said, given that the StoSQP iterates presumably contract to a deterministic feasible set, we believe that an unbounded iteration sequence is rare in general. Furthermore, compared to the fully stochastic schemes in [3,4,18], we generate a batch of samples to have a more precise estimation of the true model in each iteration; thus, our stochastic iterates have a higher chance of closely tracking the underlying deterministic iterates.
The convexity of MˆΛ can be removed by defining the closed convex hull convpMqˆconvpΛq. However, the convexity of the set for the primal iterates is essential to enable a valid Taylor expansion. See [ (14)] and references therein for the same requirement for doing line search with (8) and applying its Taylor expansion.
In particular, by the design of Algorithm 1, we have x t P Tν t for any t, while the trial point x st may be outside Tν t . If x st R Tν t , we enlargeν t (Line 14) and rerun the iteration from the beginning. Assumption 3 states that if it turns out that x st P Tν t , then the whole segment ζx t`p 1´ζqx st , which may not completely lie in Tν t as Tν t may be nonconvex, is supposed to lie in the larger set T θνt with θ P r1, 2q. Since L¯ t,νt,η is SC 1 in T˝ 2νt , where T˝ 2νt denotes the interior of T 2νt , the second-order Taylor expansion at px t , µ t , λ t q is allowed [49]. Note that the range of θ is inessential. If we replace ν{2 in (5) by ν{κ for any κ ą 1, then we would allow the existence of θ in r1, κq. In other words, θ can be as large as any κ. In fact, the condition on the segment always holds when the input α max , the upper bound ofᾱ t (cf. Line 18), is suitably upper bounded. Specifically, supposing sup X }∇apxq} _ sup t } q ∆x t } ď Υ (ensured by compactness of the iterates), for any θ ą 1 and ζ P p0, 1q, as long as α max ď pθ´1qν 0 {p2Υ 2 q, we have ζx t`p 1´ζqx st P T θνt , since the change of each constraint value along the segment is at most α max Υ 2 ď pθ´1qν 0 {2. Clearly, the condition on the segment is not required if T ν in (5) is a convex set, which is the case, for example, if we have linear inequality constraints x ď 0; or, more generally, if each g i p¨q is a convex function. We further investigate the effect of the range of θ by varying κ (κ " 2 by default; cf. (5)) in the experiments. By the compactness condition, and noting thatν t is increased by at least a factor of ρ each time in (22), we immediately know thatν t stabilizes when t is large. Moreover, withν defined in (30), we haveν t ďν, t ě 0, almost surely. We will show a similar result for¯ t . Assumption 4 imposes the constraint qualifications. In particular, for feasible points in Ω, we assume the linear independence constraint qualification (LICQ), which is a standard condition to ensure the existence and uniqueness of the Lagrangian multipliers [40].
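The segment argument above can be written out explicitly. The following display is a sketch in clean notation, assuming $T_\nu = \{x : g_i(x) \le \nu/2,\ \forall i\}$ as suggested by the $\nu/2$ in (5), with $\Upsilon$ the common bound on $\sup_{\mathcal X}\|\nabla a(x)\|$ and $\sup_t\|\widetilde\Delta x_t\|$:

```latex
% Why alpha_max <= (theta-1)*nu_0/(2*Upsilon^2) keeps the segment in T_{theta*nu_t}.
% Uses x_{s_t} - x_t = \bar\alpha_t \widetilde\Delta x_t and \bar\alpha_t \le \alpha_{\max}.
\begin{align*}
g_i\big(\zeta x_t + (1-\zeta)x_{s_t}\big)
  &\le g_i(x_t) + \Upsilon \,\big\|(1-\zeta)\,\bar\alpha_t\,\widetilde\Delta x_t\big\|
     && \text{(Lipschitz continuity of } g_i \text{ over } \mathcal{X}\text{)}\\
  &\le \frac{\nu_t}{2} + \alpha_{\max}\Upsilon^2
     && (x_t \in T_{\nu_t})\\
  &\le \frac{\nu_t}{2} + \frac{(\theta-1)\nu_0}{2}
   \;\le\; \frac{\theta\nu_t}{2}
     && (\nu_0 \le \nu_t),
\end{align*}
```

so every point of the segment lies in $T_{\theta\nu_t}$.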
For infeasible points X zΩ, we assume that the solution set of the linear system (29) is nonempty. [35] requires the weak MFCQ to hold for feasible points in addition to LICQ, while [49,52] and this paper remove such a condition. The condition (29) simplifies and generalizes the weak MFCQ in [35,49,52] by including equality constraints. We note that the weak MFCQ is slightly weaker than (29). By Gordan's theorem [25], (29) implies that tc i¨∇ c i u i:ci‰0 Y t∇g i u i:gią0 are positively linearly independent. In contrast, the weak MFCQ only requires that such a linear combination is nonzero for a particular set of coefficients. However, we adopt the simplified but slightly stronger condition only because (29) has a cleaner form and a clearer connection to SQP subproblems. The coefficients of the weak MFCQ in [35,49,52] are relatively hard to interpret: rather than being intrinsic to the constraints, they depend on the particular choice of the merit function, although that assumption statement is sharper. That said, (29) is still weaker than other conditions in the literature on the augmented Lagrangian [33,46,48], and weaker than what is widely assumed in SQP analysis [10], where the IQP system is required to be solvable. Moreover, we do not require the strict complementarity condition, which is often imposed for merit functions that apply (squared) slack variables to transform nonlinear inequality constraints [60, A2], [22, Proposition 3.8].
The first lemma shows that (17) is satisfied for a sufficiently small¯ t . Although (17) is inspired by [39, (19)] for equalities, the proof is quite different from that paper (cf. Lemma 4 there).

Lemma 4 Under Assumptions 3 and 4, there exists a deterministic threshold˜ 1 ą 0 such that (17) holds whenever¯ t ď˜ 1 .
The second lemma shows that (18) is satisfied for small¯ t . The analysis is similar to that of Lemma 3. We need the following condition on the SQP system (12).
Assumption 5 We assume that, whenever (12) is solvable, pJ T t G T ta q has full column rank, and there exist positive constants such that the boundedness conditions on M t and B t hold. Assumption 5 summarizes Assumptions 1 and 2. As shown in Lemma 3, the conditions on M t and pJ T t G T ta q hold locally. For the present global analysis, the Hessian approximation B t is easy to construct so as to satisfy the condition, e.g., B t " I; however, such a choice is not proper for fast local rates. In practice, given a lower bound γ B ą 0, B t is constructed by regularizing a subsampled Hessian (e.g., for finite-sum objectives) or a sketched Hessian (e.g., for regression objectives), which can preserve certain second-order information and be obtained at lower expense. With Assumption 5, we have the following result. We summarize (30), Lemmas 4 and 5 in the next theorem.
Theorem 1 Under Assumptions 3, 4, and 5, there exist deterministic thresholdsν,˜ ą 0 such that tν t ,¯ t u t generated by Algorithm 1 satisfyν t ďν and¯ t ě˜ . Moreover, almost surely, there exists an iteration thresholdt ă 8 such that the parameters are fixed for all t ět. Proof The existence ofν is shown in (30). By Lemmas 4 and 5, and by defining˜ " p˜ 1^˜ 2 q{ρ, we show the existence of˜ . The existence of the iteration thresholdt is ensured by noting that tν t , 1{¯ t u t are bounded from above and that each update increases the parameters by at least a factor of ρ ą 1.
We mention that the iteration thresholdt is random for stochastic schemes and varies across runs. However, it always exists. The following analysis supposes t is large enough such that t ět and¯ t ,ν t have stabilized. We condition our analysis on the σ-algebra Ft, which means that we only consider the randomness of the generated samples aftert`1 iterations and, by (28), the parameters¯ t,νt are fixed. We should point out that, although it is standard to focus only on the tail of the iteration sequence to show the global convergence (even in the deterministic case [40, Theorem 18.3]), an important aspect missed by such an analysis is the non-asymptotic guarantee. In particular, we know the scheme changes the merit parameters at most logpν¯ 0 {pν 0˜ qq{ logpρq times; however, how many iterations these changes span is not answered by our analysis. Establishing a bound ont in expectation or in a high probability sense would help us further understand the efficiency of the scheme. However, since any characterization oft is difficult even for deterministic schemes, we leave such a study to the future. Another missing aspect is the iteration complexity, where we are interested in the number of iterations to attain an -first- or second-order stationary point (we abuse notation here to refer to the accuracy level). The iteration complexity is recently studied for two StoSQP schemes under very particular setups [5,17]; none of the existing works allow either stochastic line search or inequality constraints. We leave the iteration complexity of our scheme to the future as well.

Convergence analysis
We conduct the global convergence analysis for Algorithm 1. We prove that lim tÑ8 R t " 0 almost surely, where R t " }p∇ x L t , c t , maxtg t ,´λ t uq} is the KKT residual. We suppose the line search conditions (15), (16), (24), (25) hold. We will discuss the sample complexities that ensure these generic conditions in Section 3.4. It is fairly easy to see that all conditions hold for large batch sizes.
Our proof structure closely follows [39]. The analyses are more involved in Lemmas 7, 9, 10, 11 and Theorem 3, which account for the differences between equality and inequality constraints, and for our relaxations of the feasibility error condition and the monotonically increasing sample size requirement of [39]. The analysis in Theorem 5 is new and strengthens the "liminf" convergence in [39]. The analyses are slightly adjusted in Theorem 4, and unchanged in Lemma 8 and Theorem 2. The adopted potential function (or Lyapunov function) is given in (31), where ω P p0, 1q is a coefficient to be specified later. We note that using L t t,νt,η by itself (i.e., ω " 1) to monitor the iteration progress is not suitable for the stochastic setting; it is possible that L t t,νt,η increases whileL t t,νt,η decreases. In contrast, Θ t t,νt,η,ω linearly combines different components and yields a composite measure of the progress. For example, the decrease of Θ t t,νt,η,ω may come fromδ t (Lines 22 and 25 of Algorithm 1).
The first lemma presents a preliminary result.
Lemma 6 Under Assumptions 3, 4, 5, the following results hold deterministically conditional on F t´1 . (a) There exists C 1 ą 0 such that the following two inequalities hold for any iteration t ě 0 ((a2) also holds for s t ), any parameters , ν, and any generated sample set ξ. (b) There exists C 2 ą 0 such that the corresponding bound holds for any t ě 0 and set ξ. (c) There exists C 3 ą 0 such that for any t ě 0 and set ξ, if (12) is solvable, then the corresponding bound holds. The results in Lemma 6 hold deterministically conditional on F t´1 because the samples ξ for computing∇L t t,νt,η and∇ x L t are also given in the statement. The following result suggests that if both the gradient ∇L t t,νt,η and the function evaluations L t t,νt,η , L st t,νt,η are precisely estimated, in the sense that the event E t 1 X E t 2 happens (cf. (14), (23)), then there is a uniform lower bound onᾱ t that makes the Armijo condition hold.
Lemma 7 For t ět`1, suppose E t 1 X E t 2 happens. There exists Υ 1 ą 0 such that the t-th step satisfies the Armijo condition (26) (i.e., is a successful step) wheneverᾱ t is below a uniform threshold characterized by Υ 1 . The next result suggests that, if only the function evaluations L t t,νt,η , L st t,νt,η are precisely estimated, in the sense that the event E t 2 happens, then a sufficient decrease ofL t t,νt,η implies a sufficient decrease of L t t,νt,η . The proof directly follows [39, Lemma 6], and thus is omitted.
Lemma 8 For t ět`1, suppose E t 2 happens. If the t-th step satisfies the Armijo condition (26), then the sufficient decrease ofL t t,νt,η implies a sufficient decrease of L t t,νt,η . Based on Lemmas 7 and 8, we now establish an error recursion for the potential function Θ t ω in (31). Our analysis is separated into three cases according to the events: E t 1 X E t 2 , pE t 1 q c X E t 2 and pE t 2 q c . We will show that Θ t ω decreases in the case of E t 1 X E t 2 , while it may increase in the other two cases. Fortunately, by letting p grad and p f be small, Θ t ω always decreases in expectation. We first show in Lemma 9 that Θ t ω decreases when E t 1 X E t 2 happens. We note that the decrease of Θ t ω exceedsᾱ t }∇L t t,νt,η } 2 byδ t (up to a multiplier).
Proof See Appendix B.5.
Lemma 10 For t ět`1, suppose pE t 1 q c X E t 2 happens. Under (32), we have Proof See Appendix B.6.
We finally show in Lemma 11 that Θ t ω increases, and that the increase can exceedᾱ t }∇L t t,νt,η } 2 , if L t t,νt,η , L st t,νt,η are not precisely estimated. In this case, the excess terms have to be controlled by making use of the condition (25).

Lemma 11
For t ět`1, suppose pE t 2 q c happens. Under (32), the potential Θ t ω may increase, with the increase controlled via (25). Combining Lemmas 9, 10, 11, we derive the one-step error recursion of Θ t ω . The proof directly follows that of [39, Theorem 2] and is omitted.
Theorem 2 (One-step error recursion) For t ět`1, suppose ω satisfies (32) and p grad and p f satisfy (33); then the one-step error recursion of Θ t ω holds. With Theorem 2, we derive the convergence ofᾱ t R 2 t in the next theorem, where R t " }p∇ x L t , c t , maxtg t ,´λ t uq} is the KKT residual. Then, we show that the "liminf" of the KKT residuals converges to zero.
Proof See Appendix B.9.
Finally, we strengthen the statement in Theorem 4 and complete the global convergence analysis of Algorithm 1. Our analysis generalizes the results of [39] to inequality constrained problems. The "lim" convergence guarantee in Theorem 5 strengthens the existing "liminf" convergence guarantee of stochastic line search for both unconstrained problems [43, Theorem 4.10] and equality constrained problems [39, Theorem 4]. Theorem 5 also differs from the results in [3,4,18], where the authors showed the (liminf) convergence of the expected KKT residual under a fully stochastic setup. Compared to [3,4,18], our scheme does not tune a deterministic sequence that controls the stepsizes and determines the convergence behavior (i.e., converging to a KKT point or only to its neighborhood); instead, our scheme tunes two probability parameters p grad , p f . As seen from (32) and (33), the upper bound conditions on p grad , p f depend on the inputs pρ, β, κ grad , α max q and a universal constant Υ 2 . Estimating Υ 2 is often difficult in practice; however, p grad , p f affect the algorithm's performance only via the generated batch sizes, and the batch sizes depend on p grad , p f only via logarithmic factors (see (37) and (41) later). Thus, the algorithm is robust to p grad , p f . We will also empirically test the robustness of Algorithm 1 to its parameters in Section 4. In addition, (32) and (33) suggest that the larger the parameters pρ, 1{β, κ grad , α max q we use, the smaller the probabilities p grad , p f have to be. Such a dependence is consistent with the general intuition: the algorithm performs more aggressive updates with a less restrictive Armijo condition when pρ, 1{β, κ grad , α max q are large; thus, a more precise model estimation in each iteration is desired.

Discussion on sample complexities
As introduced in Section 1, the stochastic line search is performed by generating a batch of samples in each iteration to have a precise model estimation, which is standard in the literature [11,13,14,19,21,28,43]. The batch sizes are adaptively controlled based on the iteration progress. We now discuss the batch sizes |ξ t 1 | and |ξ t 2 | that ensure the generic conditions (15), (16), (24), (25) of Algorithm 1. We show that, if the KKT residual R t does not vanish, all the conditions are satisfied by properly choosing |ξ t 1 | and |ξ t 2 |. Sample complexity of ξ t 1 . The samples ξ t 1 are used to estimate ∇f t and ∇ 2 f t in Step 1 of Algorithm 1. The estimators∇f t and∇ 2 f t can be computed with different amounts of samples, and their samples may or may not be independent. Let us suppose∇f t is computed using the samples ξ t 1 , while∇ 2 f t is computed using a subset of samples τ t 1 Ď ξ t 1 . The case where∇f t and∇ 2 f t are computed by two disjoint subsets of ξ t 1 can be studied following the same analysis. We define the sampled KKT residualR t by replacing ∇ x L t in R t with its estimate. By Lemma 6(a1), we know that (15) holds if |ξ t 1 | satisfies the condition (35), which ensures that the gradient error is small with probability 1´p grad ; here we suppress universal constants (such as the variance of a single sample) in the Op¨q notation.
Furthermore, we use the bound Er}∇ 2 f t´∇ 2 f t } 2 | F t´1 s ď Oplog d{|τ t 1 |q (cf. [58, (6.1.6)]) and know that (16) holds if (36) is satisfied. Combining (35) and (36), we know that the conditions (15) and (16) are satisfied if (37) holds. Since (16) is imposed only when t´1 is a successful step, the term χ 2 gradδ t {ᾱ t in the denominator in (37) can be removed when t´1 is an unsuccessful step. In contrast to [39], where the gradient ∇f t and Hessian ∇ 2 f t are computed based on the same set of samples, we sharpen the calculation and realize that the batch size |τ t 1 | for ∇ 2 f t can be significantly smaller than |ξ t 1 | for ∇f t . WhenR t gets close to zero, the ratio |τ t 1 |{|ξ t 1 | also decays to zero. We mention thatR t on the right hand side of the condition on |ξ t 1 | in (37) has to be computed using the samples ξ t 1 . A practical algorithm can first specify ξ t 1 , then computeR t , and finally check if (37) holds. For example, a While loop can be designed to gradually increase |ξ t 1 | until (37) holds (cf. [39, Algorithm 4]). Such a While loop always terminates in finite time when R t ą 0, becauseR t Ñ R t as |ξ t 1 | increases (by the law of large numbers), so that the right hand side of (37) does not diverge.
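The While loop just described can be sketched as follows; `sample`, `residual_est`, and `needed_size` are hypothetical stand-ins for drawing ξ t 1 , computing the sampled residualR t from it, and evaluating the right hand side of (37):

```python
def grow_batch(sample, residual_est, batch_size, needed_size, grow=2):
    """Enlarge the batch until its size meets a (37)-type requirement,
    which itself depends on the residual estimated from the batch."""
    xi = sample(batch_size)
    while len(xi) < needed_size(residual_est(xi)):
        batch_size *= grow
        xi = sample(batch_size)  # redraw a larger batch and re-check
    return xi
```

When R t ą 0, the loop terminates becauseR t converges to R t , so the required size stays bounded.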

Discussion on computations and limitations
We now briefly discuss the per-iteration computational cost of Algorithm 1, and present some limitations and extensions of the algorithm.
We note that the number of function-value and gradient evaluations increases as the iteration proceeds, and the number of function evaluations is the square of the number of gradient evaluations. Under the same setup, our evaluation complexities for function values and gradients are consistent with those of the unconstrained stochastic line search [43, Section 2.3], with R t replaced by }∇f t }. Although the augmented Lagrangian merit function requires Hessian evaluations, the Hessian complexity is significantly lower than that of function values and gradients, and does not have to increase during the iteration. Such an observation is missing in the prior work [39].
Constraint evaluations. Since the constraints are deterministic, Algorithm 1 has the same constraint evaluation cost as deterministic schemes. In particular, the algorithm evaluates four function values (two for equalities and two for inequalities; and, for each type of constraint, one at the current point and one at the trial point), four Jacobians, and two Hessians in each iteration. Computational cost. As in deterministic SQP schemes, solving the Newton system dominates the computational cost. If we do not consider the potential sparse or block-diagonal structures that many problems have, solving the system (12) requires Oppd`m`|active set|q 3 q`Oppm`rq 3 q " Opd 3`m3`r3 q flops. Such computational cost is larger than that of solving a standard SQP system (see [49, (8.9)]) by the extra term Oppm`rq 3 q. However, as explained in Remark 1, the analysis of the standard SQP system relies on the exact Hessian, which is inaccessible in our stochastic setting. When the SQP direction is not employed, the backup direction can be obtained with Opd`m`rq flops for the gradient step, Oppd`m`rq 3 q flops for the regularized Newton step, and a cost in between for the truncated Newton step. Such computational cost is standard in the literature [52,54], where a safeguarding direction satisfying (20) is required to minimize the augmented Lagrangian. We should mention that, since ours is an EQP scheme, the above computations are not directly comparable with those of IQP schemes: the IQP subproblems include inequality constraints and are more expensive to solve, although fewer iterations may be performed.
Limitations of the design. Algorithm 1 has a few limitations. First, it solves the SQP systems exactly. In practice, one may apply conjugate gradient (CG) or minimum residual (MINRES) methods, or apply randomized iterative solvers, to solve the systems inexactly. The inexact direction can reduce the computational cost significantly [18]. Second, our backup direction does not fully utilize the computations of the SQP direction. Although our analysis allows any backup direction satisfying (20), and utilizing the Newton direction as a backup is standard in the literature [52,54], a better choice is to directly modify the SQP direction. Then, we may derive a direction that converges faster than the gradient direction and requires fewer computations than the (regularized) Newton direction. We leave the refinement of these two limitations to the future.

Numerical Experiments
We implement the following two algorithms on 39 nonlinear problems collected in the CUTEst test set [26]. We select the problems that have a non-constant objective with fewer than 1000 free variables. We also require the problems to have at least one inequality constraint, no infeasible constraints, and no network constraints; and we require the number of constraints to be less than the number of variables. The setup of each algorithm is as follows.
(a) AdapNewton: the adaptive scheme in Algorithm 1 with the safeguarding direction given by the regularized Newton step. We set the inputs asᾱ 0 " α max " 1.5, β " 0.3, κ f " β{p4α max q " 0.05, and κ grad " χ grad " χ f "δ 0 " 1. We let α max ą 1 since the stochastic scheme can select a stepsize that is greater than one (cf. Figure 4). β is close to the middle of the interval p0, 0.5q, which is a common range for deterministic schemes. p¯ 0 ,δ 0 q are adaptively selected during the iteration, while we prefer a small initial¯ 0 to run fewer adjustments on it. κ f is set as the largest allowed value β{p4α max q (cf. Algorithm 1); the parameters pκ grad , κ f , χ grad , χ f , p grad , p f q all affect the batch sizes and play the same role as the constant C that we study later. We let η be small so that the last penalty term of (8) is almost negligible, and the merit function (8) is close to a standard augmented Lagrangian function. We also test the robustness of the algorithm to three parameters C, κ, χ err . Here, C is the constant multiplier of the big "O" notation in (37) and (41) (the variance σ 2 of a single sample, which we introduce later, is also absorbed in "O"). κ is a parameter of the set T ν (κ " 2 in (5)), and χ err is a parameter of the feasibility error condition (17). Their default values are C " κ " 2 and χ err " 1, while we vary them over wide ranges: C, κ P t2, 2 3 , 2 6 u and χ err P t1, 10, 10 2 u. When we vary one parameter, the other two are set to default. (b) AdapGD: the adaptive scheme in Algorithm 1 with the safeguarding direction given by the steepest descent step. The setup is the same as (a). For both algorithms, the initial iterate px 0 , µ 0 , λ 0 q is specified by the CUTEst package. The package also provides the deterministic function, gradient and Hessian evaluations, f t , ∇f t , ∇ 2 f t , in each iteration. We generate their stochastic counterparts by adding Gaussian noise with variance σ 2 .
In particular, we let f̄_t ~ N(f_t, σ²), ∇̄f_t ~ N(∇f_t, σ²(I + 11ᵀ)), and (∇̄²f_t)_ij ~ N((∇²f_t)_ij, σ²). We try four levels of variance: σ² ∈ {10⁻⁸, 10⁻⁴, 10⁻², 10⁻¹}. Throughout the implementation, we let B_t = I (cf. (12), (21)) and set the iteration budget to 10⁴. The former two cases of the stopping criterion indicate that the iteration converges within the budget. For each algorithm, each problem, and each setup, we average the results of all convergent runs among 5 runs. Our code is available at https://github.com/senna1128/Constrained-Stochastic-Optimization-Inequality.
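The noise model above can be sketched in a few lines (a hypothetical helper, not the authors' released code; the single shared Gaussian component realizes the 11ᵀ term of the gradient covariance):

```python
import numpy as np

def noisy_oracle(f_t, grad_t, hess_t, sigma2, rng):
    """Corrupt deterministic evaluations (f_t, grad f_t, hess f_t) with
    Gaussian noise of variance sigma2; the gradient noise has covariance
    sigma2 * (I + 11^T)."""
    d = grad_t.shape[0]
    f_bar = f_t + rng.normal(0.0, np.sqrt(sigma2))
    # i.i.d. noise plus one shared component: covariance sigma2 * (I + 11^T).
    shared = rng.normal(0.0, np.sqrt(sigma2))
    grad_bar = grad_t + rng.normal(0.0, np.sqrt(sigma2), size=d) + shared
    hess_bar = hess_t + rng.normal(0.0, np.sqrt(sigma2), size=(d, d))
    return f_bar, grad_bar, hess_bar
```

With sigma2 = 0, the oracle reduces to the deterministic CUTEst evaluations.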
KKT residuals. We draw the KKT residual boxplots for AdapNewton and AdapGD in Figure 1. From the figure, we see that both algorithms are robust to the tuning parameters (C, κ, χ_err). For both algorithms, the median of the KKT residuals gradually increases as σ² increases, which is reasonable since the model estimate from each sample is noisier when σ² is larger. However, the increase of the KKT residuals is mild since, regardless of σ², both methods generate enough samples in each iteration to enforce the model accuracy conditions (i.e., (15), (16), (24), (25)). Figure 1 also suggests that AdapNewton outperforms AdapGD, although the improvement is limited. In fact, the convergence on a few problems may be improved by utilizing the regularized Newton step as the backup of the SQP step; however, the SQP step is employed eventually. Sample sizes. We draw the sample size boxplots for AdapNewton and AdapGD in Figure 2. From the figure, we see that both methods generate far fewer samples for estimating the objective Hessian than for estimating the objective value and gradient, and the objective gradient is estimated with fewer samples than the objective value. The sample size differences among the three quantities (objective value, gradient, Hessian) become clearer as σ² increases. For a fixed σ², the sample sizes under the different setups of (C, κ, χ_err) do not vary much. In fact, the parameters κ, χ_err do not directly affect the sample complexities. The parameter C plays a similar role to σ² and affects the sample complexities by changing the multipliers in (37) and (41). However, varying C from 2 to 64 is marginal compared to varying σ² from 10⁻⁸ to 10⁻¹. Thus, Figure 2 again illustrates the robustness of the designed adaptive algorithm.
Moreover, as discussed in Sections 3.4 and 3.5, the objective value, gradient, and Hessian have different sample complexities in each iteration, which depend on different powers of the reciprocal of the KKT residual, 1/R_t. When σ² = 10⁻⁸, the small variance dominates the effect of 1/R_t, so all three quantities can be estimated with very few samples. When σ² = 0.1, the different dependencies of the sample sizes on 1/R_t are more evident. Overall, Figure 2 reveals that different objective quantities can be estimated with different numbers of samples. This aspect improves on the prior work [39], where quantities with different sample complexities are estimated from the same set of samples, and the effect of the variance σ² on the sample complexities is neglected.
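To make this dependence concrete, here is an illustrative sketch; the function name and the specific powers p_f, p_grad, p_hess are assumptions for illustration, standing in for the complexities in (37) and (41), not the paper's exact rules:

```python
import math

def batch_sizes(R_t, sigma2, C=2.0, p_f=4, p_grad=2, p_hess=1):
    """Illustrative per-iteration sample sizes of the form
    O(C * sigma2 / R_t**p), with a different (assumed) power p for the
    objective value, gradient, and Hessian; at least one sample each."""
    n = lambda p: max(1, math.ceil(C * sigma2 / R_t ** p))
    return n(p_f), n(p_grad), n(p_hess)
```

For a tiny variance such as sigma2 = 1e-8, all three sizes collapse to one sample, matching the observation for σ² = 10⁻⁸; for sigma2 = 0.1 the three quantities separate clearly.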
In addition, we draw the trajectories of the sample size ratios. In particular, for both algorithms, we randomly pick 5 convergent problems and draw two ratio trajectories for each problem: the sample size of the gradient over that of the value, and the sample size of the Hessian over that of the gradient. We take C = 64 as an example. The plot is shown in Figure 3. From the figure, we note that the sample size ratios tend to stabilize at a small level, and the trend is more evident when σ² = 0.1. As we explained for Figure 2 above, this observation is consistent with our discussion in Section 3.4, and illustrates the improvement of our analysis over [39] for performing the stochastic line search on the augmented Lagrangian merit function.
Stepsize trajectories. Figure 4 plots the stepsize trajectories selected by the stochastic line search. We take the default setup as an example, i.e., C = κ = 2, χ_err = 1. Similar to Figure 3, for each level of σ², we randomly pick 5 convergent problems to show the trajectories. Although there is no clear trend in the stepsize trajectories due to stochasticity, we clearly see for both methods that the stepsize can increase significantly from a very small value and even exceed 1. This distinctive property of the line search procedure ensures fast convergence of the scheme, which is not enjoyed by many non-adaptive schemes where the stepsize often monotonically decays to zero.
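The mechanism that produces such trajectories can be sketched as follows (an assumed grow-on-success, shrink-on-failure rule; the factor ρ = 1.2 and the function name are illustrative, not taken from Algorithm 1):

```python
def update_stepsize(alpha, success, rho=1.2, alpha_max=1.5):
    """Grow the trial stepsize after a successful (Armijo-type) step,
    capped at alpha_max; shrink it after an unsuccessful step."""
    return min(rho * alpha, alpha_max) if success else alpha / rho

# Starting from a tiny stepsize, repeated successes push alpha above 1.
alpha = 1e-3
for _ in range(50):
    alpha = update_stepsize(alpha, success=True)
```

In contrast, a non-adaptive scheme with a prescribed decaying sequence, e.g. alpha_t = alpha_0 / t, can only shrink toward zero.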
We also examine some other aspects of the algorithm, such as the proportion of the iterations with failed SQP steps, with unstabilized penalty parameters, or with a triggered feasibility error condition (17). We also study the effect of a multiplicative noise, and implement the algorithm on an inequality constrained logistic regression problem. Due to the space limit, these auxiliary experiments are provided in Appendix D.

Conclusion
This paper studied inequality constrained stochastic nonlinear optimization problems. We designed an active-set StoSQP algorithm that exploits the exact augmented Lagrangian merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and chooses the stepsize via stochastic line search. We proved that the KKT residuals converge to zero almost surely, which generalizes and strengthens the results for unconstrained and equality constrained problems in [39,43], thereby enabling wider applications.
Extensions of this work include studying more advanced StoSQP schemes. As mentioned in Section 3.5, the proposed StoSQP scheme has to solve the SQP system exactly. We note that, recently, [18] designed a StoSQP scheme where an inexact Newton direction is employed, and [3] designed a StoSQP scheme that relaxes the LICQ condition. It remains open how to design related schemes to achieve such relaxations with inequality constraints. In addition, some advanced SQP schemes solve inequality constrained problems by mixing IQP with EQP: one solves a convex IQP to obtain an active set, and then solves an EQP to obtain the search direction; see the "SQP+" scheme in [36] for example. Investigating this kind of mixed scheme with a stochastic objective is promising. Besides SQP, there are other classical methods for solving nonlinear problems that can be adapted to stochastic objectives, such as augmented Lagrangian methods and interior point methods. Different methods have different benefits, and all of them deserve study in the setup where the model can only be accessed with noise. Finally, as mentioned in Section 3.2, a non-asymptotic analysis and iteration complexity of the proposed scheme are missing from our global analysis. Further, it is known in the deterministic setting that differentiable merit functions can overcome the Maratos effect and facilitate a fast local rate, while non-smooth merit functions (without advanced local modifications) cannot. This raises the questions: what is the local rate of the proposed StoSQP, and is it better than the rate obtained with non-smooth merit functions? To answer these questions, we need a better understanding of the local behavior of stochastic line search. Such a local study would complement the established global analysis, recognize the benefits of differentiable merit functions, and bridge the gap in understanding between stochastic SQP and deterministic SQP.

Acknowledgments
We thank the Associate Editor and two anonymous reviewers for instructive comments, which helped us further enhance the algorithm design and presentation. This material was completed in part with resources provided by the University of Chicago Research Computing Center. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347 and by NSF through award CNS-1545046.

A.2 Proof of Lemma 3
We require the following two preparation lemmas.

Lemma 13
Under Assumption 1, there exist a compact set X ∋ x* and a constant γ_H ∈ (0, 1] such that M(x) ⪰ γ_H I for any x ∈ X, where M(x) is defined in (9). Furthermore, for any ϵ, ν > 0, there exists a compact set X_{ϵ,ν} × Λ_{ϵ,ν} ∋ (x*, λ*), depending on (ϵ, ν), such that

(J(x); G_{A_{ϵ,ν}(x,λ)}(x)) · (J(x)^T  G_{A_{ϵ,ν}(x,λ)}(x)^T) ⪰ γ_H I,   ∀(x, λ) ∈ X_{ϵ,ν} × Λ_{ϵ,ν},

where the semicolon denotes row stacking.
Proof See Appendix A.4.
We now prove Lemma 3. We suppress the evaluation point and the iteration index t. Let X × M × Λ ⊆ T_ν̄ × R^m × R^r be any compact set around (x*, µ*, λ*) (independent of ϵ, ν, η), and suppose (x, µ, λ) ∈ X × M × Λ. By Lemma 13, we know there exist a constant γ_H ∈ (0, 1] and, for any ϵ, ν > 0, a compact subset X_{ϵ,ν} × Λ_{ϵ,ν} ⊆ X × Λ on which the lower bound of Lemma 13 holds. Thus, by Assumption 2, we know from [40, Lemma 16.1] that K is invertible, and thus (12) is solvable. Furthermore, we can also show a companion bound (see [39, Lemma 1] for a simple proof). With the above two results, we conduct our analysis. Throughout the proof, we use Υ_1, Υ_2, ... to denote generic upper bounds of functions evaluated on the set X × M × Λ, which are independent of (ϵ, ν, η, γ_B, γ_H). As they are upper bounds, without loss of generality, Υ_i ≥ 1 for all i.
Furthermore, we let X_{ϵ,ν,η} × Λ_{ϵ,ν,η} ⊆ X_{ϵ,ν} × Λ_{ϵ,ν} be a compact subset, depending additionally on η, on which the preceding bounds hold. Then, combining (A.11) with the above two displays leads to the claimed estimate. This completes the proof.

A.3 Proof of Lemma 12
Let X × Λ ⊆ T_ν̄ × R^r be any compact set around (x*, λ*). For any (x, λ) ∈ X × Λ, we have

q_ν(x, λ) ≥ (ν/2) · 1/(1 + max_{λ∈Λ} ‖λ‖²) =: κ_ν,   (A.12)

where the inequality follows from (6). For any i ∈ I⁺(x*, λ*), we know g_i* = 0 and λ_i* > 0. Thus, g_i* + ϵ κ_ν λ_i* > 0. Consider the balls B_i^x = {x : ‖x − x*‖ ≤ r_i} ∩ X and B_i^λ = {λ : ‖λ − λ*‖ ≤ r_i} ∩ Λ. For a sufficiently small r_i (depending on ϵ and ν), we have (x*, λ*) ∈ B_i^x × B_i^λ ⊆ X × Λ and, for any (x, λ) ∈ B_i^x × B_i^λ, g_i(x) + ϵ q_ν(x, λ) λ_i > 0, where the first inequality is due to the continuity of g_i. This implies i ∈ A_{ϵ,ν}(x, λ). Therefore, for any (x, λ) in the compact set ∩_{i∈I⁺(x*,λ*)} B_i^x × B_i^λ, we have I⁺(x*, λ*) ⊆ A_{ϵ,ν}(x, λ). The inclusion A_{ϵ,ν}(x, λ) ⊆ I(x*) can be proved in the same way.
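For completeness, the continuity estimate in the argument above can be written out as follows (a sketch, under the assumption that A_{ϵ,ν}(x, λ) collects the indices i with g_i(x) ≥ −ϵ q_ν(x, λ) λ_i; the modulus of continuity ω is an auxiliary symbol not in the original):

```latex
g_i(x) + \epsilon\, q_\nu(x,\lambda)\, \lambda_i
  \;\ge\; g_i(x) + \epsilon\, \kappa_\nu \lambda_i
  % since q_\nu \ge \kappa_\nu by (A.12) and \lambda_i \ge 0 near \lambda_i^\star > 0
  \;\ge\; g_i^\star + \epsilon\, \kappa_\nu \lambda_i^\star - \omega(r_i)
  % by continuity of g_i on B_i^x \times B_i^\lambda
  \;>\; 0 \quad \text{for } r_i \text{ small enough},
```

which is exactly the condition ensuring i ∈ A_{ϵ,ν}(x, λ).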

A.4 Proof of Lemma 13
By Assumption 1, there exists a compact set X ∋ x* small enough such that (J^T(x)  G^T_{I(x*)}(x)) has full column rank for all x ∈ X. Furthermore, for any (a, b) ∈ R^{m+r}, the first implication is due to diag(g(x))b = 0 and I^c(x*) ⊆ I^c(x) (since X is small), and the second implication is due to ‖J^T(x)a + G^T(x)b‖ = 0. Therefore, M(x) is invertible. Moreover, for any A ⊆ I(x*), we have the corresponding bound, where σ_min(·) denotes the least singular value of a matrix. By (A.13), (A.14), and the compactness of X, we know that there exists γ_H ∈ (0, 1] such that the first claim holds. To show the second part of the statement, we apply Lemma 12 and see that there exists a compact set X_{ϵ,ν} × Λ_{ϵ,ν} ⊆ X × R^r such that A_{ϵ,ν}(x, λ) ⊆ I(x*) for all (x, λ) ∈ X_{ϵ,ν} × Λ_{ϵ,ν}. Combining this fact with (A.15), we complete the proof.

B.1 Proof of Lemma 4
It suffices to show that there exists a threshold ϵ̃ > 0 such that, for any sample set ξ_1, any parameter ν ∈ [ν_0, ν̃] (where ν_0 is the fixed initial input of Algorithm 1 and ν̃ is defined in (30)), and any point (x, µ, λ) ∈ X × M × Λ with x ∈ T_ν̃, if ϵ ≤ ϵ̃, then

‖(c(x), w_{ϵ,ν}(x, λ))‖ ≤ χ_err · ‖∇̄L_{ϵ,ν,η}(x, µ, λ)‖,

where ∇̄L_{ϵ,ν,η} is computed using the samples in ξ_1, and η, χ_err > 0 are any given positive constants. Note that everything above is deterministic; that is, our analysis does not depend on a specific iteration sequence {(x_t, µ_t, λ_t)}_t. Thus, the threshold ϵ̃ is deterministic. Let us prove the above statement by contradiction. Without loss of generality, we suppose χ_err ≤ 1. If the statement is false, then there exist sequences {ϵ_j, ξ_1^j, ν_j}_j and an evaluation point sequence {(x_j, µ_j, λ_j)}_j ⊆ X × M × Λ such that ν_j ∈ [ν_0, ν̃], x_j ∈ T_{ν_j}, ϵ_j ↘ 0, and

‖∇̄L^j_{ϵ_j,ν_j,η}‖ < (1/χ_err) · ‖(c_j, w^j_{ϵ_j,ν_j})‖,   ∀j ≥ 0,   (B.1)

where ∇̄L^j_{ϵ_j,ν_j,η} is computed using the samples ξ_1^j, and η and χ_err are fixed constants. By compactness, we may suppose (x_j, µ_j, λ_j) → (x̄, µ̄, λ̄) ∈ X × M × Λ and ν_j → ν̄ as j → ∞ (otherwise, we consider a convergent subsequence, which must exist). Noting that c_j = c(x_j) and w^j_{ϵ_j,ν_j} = max{g(x_j), −ϵ_j q_{ν_j}(x_j, λ_j)λ_j} are bounded, due to the compactness of (x_j, µ_j, λ_j) and the boundedness of ν_j and ϵ_j, we have from (B.1) that the left-hand side is bounded as well. Moreover, since x_j ∈ T_{ν_j}, we have Σ_{i=1}^r max{(g_j)_i, 0}³ ≤ ν_j/2. Taking the limit j → ∞ leads to x̄ ∈ T_ν̄. Furthermore, by (10), (B.2), and the convergence of (x_j, µ_j, λ_j), we get a limiting identity, which is further simplified as (B.3). Suppose x̄ ∈ X \ Ω and let I_c(x̄) = {i : 1 ≤ i ≤ m, c_i(x̄) ≠ 0} and I_g(x̄) = {i : 1 ≤ i ≤ r, g_i(x̄) > 0}. By Assumption 4, this set is nonempty. By Gordan's theorem [25], for any a_i, b_i ≥ 0 satisfying (B.3), and noting that the coefficients of (B.3) are all positive (since x̄ ∈ T_ν̄), we immediately get a contradiction. Thus, x̄ ∈ Ω.
By Assumption 4 and following the same reasoning as (A.13), M(x̄) is invertible and, in particular, positive definite. Thus, M_j is invertible for large enough j, and we may suppose ‖M_j^{-1}‖ ≤ Υ_M for some Υ_M > 0. Further, by direct calculation, we obtain a decomposition and focus on H_{2,j}. Recalling that σ_min(·) denotes the least singular value of a matrix, by Weyl's inequality, σ_min(H_{2,j}) is controlled by the least singular value of its limit up to the perturbation ∆H_{2,j}. Since ϵ_j → 0 and w^j_{ϵ_j,ν_j} → 0 as j → ∞ (because x̄ ∈ Ω), we know ∆H_{2,j} → 0. In addition, since M_j → M(x̄) with M(x̄) positive definite, and q^j_{ν_j} ≤ ν_j ≤ ν̄, we know that for some constant ϕ > 0 and sufficiently large j,

σ_min(H_{2,j}) ≥ ϕ.   (B.7)

Now we bound the first term in (B.6) using (10) and the invertibility of M_j. Moreover, by compactness, we have ‖H_{1,j}‖ ≤ Υ_1 and ‖(J_j^T  G_j^T)‖ ≤ Υ_2 for some constants Υ_1, Υ_2 > 0. Combining (B.7) and (B.8) with (B.6), and noting that ϕ_j → 0 as j → ∞ (since w^j_{ϵ_j,ν_j} → 0 and ϵ_j → 0), we obtain for large j an inequality that cannot hold because ϵ_j ↘ 0. This is a contradiction, and thus we complete the proof.

B.2 Proof of Lemma 5
The proof closely follows the proof of Lemma 3 in Appendix A.2. We suppress the iteration index t and let ξ_1^t be any sample set; our analysis is independent of the sample set ξ_1^t used for computing ∇̄L^t. Therefore, the desired bound holds as long as the threshold is small enough, and we can define ϵ̃_2 accordingly, which implies (B.13) and completes the proof.

B.5 Proof of Lemma 9
Algorithm 1 has three types of steps: a reliable step (Line 19), an unreliable step (Line 21), and an unsuccessful step (Line 24). For each type of step, either ∆̌_t = ∆̄_t or ∆̌_t = ∆̃_t. Thus, we analyze the following six cases.

B.6 Proof of Lemma 10
The proof follows the proof of Lemma 9, except that (B.32) and (B.39) do not hold due to (E_1^t)^c. We consider the following six cases.

B.7 Proof of Lemma 11
The proof follows the proof of Lemma 10, except that Lemma 8 is not applicable. We consider the following six cases.
Case 1a, reliable step, ∆̌_t = ∆̄_t. We have
Combining the above three cases, we have φ_t ≥ ϕ_t for all t ≥ t̄ + 1. Note that, conditional on F_t̄, {ϕ_t}_{t≥t̄+1} is a random walk with a maximum and an upward drift (cf. [23, Example 6.1.2]). Thus, lim sup_{t→∞} ϕ_t ≥ log(c) almost surely. In particular, we have P(lim sup_{t→∞} φ_t ≥ log(c)) = 1, which means that the limsup of ᾱ_t is lower bounded almost surely. Using Theorem 3, we complete the proof.

B.10 Proof of Theorem 5
Suppose lim sup_{t→∞} R_t = ϵ > 0. By Theorem 4, we know there exist two sequences {n_i}_i and {m_i}_i with n_i < m_i < n_{i+1} for all i, such that R_{n_i} ≥ 2ϵ/3; R_t ≥ ϵ/3 for t = n_i + 1, ..., m_i − 1; and R_{m_i} < ϵ/3.
For each interval [n_i, m_i], we use {t_{i,j}}_{j=1}^{J_i} to denote a subsequence within the interval such that n_i = t_{i,1} < ... < t_{i,j} < ... < t_{i,J_i} = m_i and each t_{i,j} − 1 is a successful step. In other words, t_{i,j} is the first index at which we arrive at a new point. Here, we suppose n_i − 1 is a successful step; that is, the index n_i is the first time we arrive at the point (x_{n_i}, µ_{n_i}, λ_{n_i}) (one can always choose n_i to satisfy this condition). We also note that t_{i,J_i} = m_i because R_{m_i − 1} ≥ ϵ/3, where both inequalities are from Lemma 15. Taking the ℓ_2 norm on both sides, we finish the proof.

D Auxiliary Experiments
We follow the experiments in Section 4 and provide additional results. We first examine three proportions: (1) the proportion of the iterations with failed SQP steps, (2) the proportion of the iterations with unstabilized penalty parameters, (3) the proportion of the iterations with a triggered feasibility error condition. We then investigate a multiplicative noise, and apply the method on an inequality constrained logistic regression problem.
Failed SQP steps. Figure 5 plots the proportion of the iterations with failed SQP steps. From the figure, we see that the proportion varies from 10% to 60% across the problems, and AdapNewton tends to have a smaller proportion than AdapGD. Although the proportion does not show a clear dependency on the variance σ², the noticeable proportion of failed SQP steps illustrates the differences between equality and inequality constrained problems. As analyzed in Section 2, the active-set SQP steps may not be informative if the identified active set differs substantially from the true active set. Due to the potential failure of the SQP steps, utilizing a safeguarding direction is critical for achieving global convergence of the algorithm.
Non-stationary penalty parameters. Figure 6 plots the proportion of the iterations with unstabilized penalty parameters; i.e., the index of the last iteration at which we update ϵ̄_0 divided by the total number of iterations. From the figure, we observe that the proportion varies from 20% to 70%, and AdapNewton and AdapGD have comparable results. In fact, the proportion depends heavily on the adopted initial ϵ̄_0 and the updating rule of ϵ̄_0. For example, a large ρ and a small ϵ̄_0 will reduce the proportion significantly; and the updating rules ϵ̄_0 ← ϵ̄_0/ρ and ϵ̄_0 ← exp(−1/ϵ̄_0) will also lead to different proportions. The large variation in Figure 6 suggests that different problems stabilize ϵ̄_0 at different levels; thus, problem-dependent tuning of ϵ̄_0 is desirable in practice. We note in the experiments that the results on some problems can be improved if ϵ̄_0 = 10⁻⁴, while such a setup may not be suitable for other problems. Thus, designing a robust scheme to select the penalty parameters deserves further study.
Feasibility error condition. Figure 7 plots the proportion of the iterations with a triggered feasibility error condition. We do not show the results for the different setups of χ_err: when χ_err = 1, the results are identical to those for C = 2 and κ = 2 (see the left column of Figure 7), while when χ_err = 10 or 100, the feasibility error condition is never triggered. From Figure 7, we see that the proportion is extremely small (e.g., as small as 1%). This suggests that the condition (17) is rarely triggered in practice. Figure 7 also plots the proportion of iterations at which (17) is triggered for an unsuccessful step. We see that such a proportion is even smaller (e.g., less than 0.5%). Given these negligible proportions, we conclude that the condition (17) does not negatively affect the performance of the designed StoSQP scheme.
Multiplicative noise. We also investigate multiplicative noise in the experiments. In particular, we employ the default setup (C, κ, χ_err) = (2, 2, 1) but replace the noise variance σ² by (1 + ‖x_t‖²)σ². Thus, the variance scales linearly with the squared magnitude of the (primal) iterate. The KKT residual and sample size boxplots are shown in Figure 8. Compared to Figures 1 and 2, we see that the algorithm achieves results comparable to the additive-noise case. This observation is expected because, regardless of the noise type, the algorithm enforces the same stochastic conditions on the model estimation accuracy in each iteration, and adaptively selects the batch sizes, which are mainly characterized by the current KKT residual.
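The multiplicative variant amounts to a one-line change to the noise model (an illustrative helper; the name is an assumption, not from the paper's code):

```python
import numpy as np

def noise_variance(x_t, sigma2, multiplicative=True):
    """Additive noise uses variance sigma2; the multiplicative variant
    scales it by (1 + ||x_t||^2), the squared magnitude of the iterate."""
    scale = 1.0 + float(np.dot(x_t, x_t)) if multiplicative else 1.0
    return scale * sigma2
```

The resulting variance can then be fed to the same Gaussian corruption routine used for the additive-noise experiments.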
Logistic regression problem. We study an inequality constrained logistic regression problem, where we let F(x; (ξ_a, ξ_b)) = log{1 + exp(−ξ_b · ξ_a^T x)} and g(x) = Cx + q. We set d = 10, r = 5, and generate each entry of the matrix C ∈ R^{5×10} and the vector q ∈ R⁵ from the standard Gaussian distribution. We let ξ_b be a Rademacher variable (i.e., taking values in {−1, 1} with equal probability), and consider different design distributions for ξ_a. In particular, we consider both a light-tail design (ξ_a)_i ~ N(0, σ_a²) with σ_a² ∈ {10⁻⁸, 10⁻⁴, 10⁻²}, and a heavy-tail design (ξ_a)_i ~ Exp(λ_a) with λ_a ∈ {10, 10², 10⁴}. Note that Exp(λ_a) has variance 1/λ_a². For each design, we run AdapNewton and AdapGD 20 times. The default algorithm setup is the same as in Section 4. Figure 9 shows the KKT residual boxplots. From the figure, we observe that AdapNewton performs slightly better than AdapGD. Both methods achieve reasonable performance for all setups of the two designs, although they perform better on the Gaussian design, which has a lighter tail than the exponential design. Overall, the experiments demonstrate the effectiveness of the proposed algorithm.
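The problem instance can be generated in a few lines (a sketch of the setup described above; the helper names are ours, and np.logaddexp is used for a numerically stable log(1 + exp(·))):

```python
import numpy as np

def make_constraints(d=10, r=5, rng=None):
    """Inequality constraints g(x) = C x + q with standard Gaussian entries."""
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal((r, d)), rng.standard_normal(r)

def F(x, xi_a, xi_b):
    """Realized objective F(x; (xi_a, xi_b)) = log(1 + exp(-xi_b xi_a^T x))."""
    return np.logaddexp(0.0, -xi_b * (xi_a @ x))

def sample_xi(d, sigma2_a, rng):
    """Light-tail design: (xi_a)_i ~ N(0, sigma2_a); xi_b is Rademacher."""
    xi_a = rng.normal(0.0, np.sqrt(sigma2_a), size=d)
    xi_b = rng.choice([-1.0, 1.0])
    return xi_a, xi_b
```

For the heavy-tail design, rng.normal would be replaced by rng.exponential(1.0 / lambda_a, size=d), since NumPy parameterizes the exponential distribution by its scale 1/λ_a.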