A two-level distributed algorithm for nonconvex constrained optimization

This paper aims to develop distributed algorithms for nonconvex optimization problems with complicated constraints associated with a network. The network can be a physical one, such as an electric power network, where the constraints are nonlinear power flow equations, or an abstract one that represents constraint couplings between decision variables of different agents. Despite the recent development of distributed algorithms for nonconvex programs, highly complicated constraints still pose a significant challenge in theory and practice. We first identify some difficulties with the existing algorithms based on the alternating direction method of multipliers (ADMM) for dealing with such problems. We then propose a reformulation that enables us to design a two-level algorithm, which embeds a specially structured three-block ADMM at the inner level in an augmented Lagrangian method framework. Furthermore, we prove the global and local convergence as well as iteration complexity of this new scheme for general nonconvex constrained programs, and show that our analysis can be extended to handle more complicated multi-block inner-level problems. Finally, we demonstrate with computation that the new scheme provides convergent and parallelizable algorithms for various nonconvex applications, and is able to complement the performance of state-of-the-art distributed algorithms in practice by achieving faster convergence in the optimality gap, in feasibility, or both.


Introduction
This paper develops a new two-level distributed algorithm with global and local convergence guarantees for solving general smooth and nonsmooth constrained nonconvex optimization problems. We start with a general constrained optimization model that is motivated by nonlinear network flow problems, and then explore reformulations for distributed computation. This process of reformulation leads us to observe two structural properties that a distributed reformulation should possess, which in fact pose challenges to existing distributed algorithms in terms of convergence and practical performance. This observation inspired us to develop the two-level distributed algorithm. We summarize our contributions at the end of this section.

Constrained Nonconvex Optimization over a Network
Consider a connected, undirected graph G(V, E) with a set of nodes V and a set of edges E. A centralized constrained optimization problem on G is given as

  min_x  Σ_{i∈V} f_i(x_i)   (1a)
  s.t.  h_i(x_i, {x_j}_{j∈δ(i)}) = 0, ∀i ∈ V,   (1b)
        g_i(x_i, {x_j}_{j∈δ(i)}) ≤ 0, ∀i ∈ V,   (1c)
        x_i ∈ X_i, ∀i ∈ V,   (1d)

where each node i ∈ V of the graph G is associated with a decision variable x_i and a cost function f_i(x_i) as in (1a). The variable x_i and the variables x_j of i's adjacent nodes j ∈ δ(i) are coupled through constraints (1b)-(1c), and X_i in (1d) represents constraints on x_i alone. The functions f_i, h_i, g_i and the set X_i may be nonconvex. Any constrained optimization problem can be reformulated as (1) after proper transformation. An especially interesting motivation for us is the class of nonlinear network flow problems. In this case, the graph G represents a physical network such as an electric power network, a natural gas pipeline network, or a water transport network; the variables x_i in (1) are nodal potentials such as electric voltages, gas pressures, or water pressures, and the constraints h_i and g_i are usually nonconvex functions that describe the physical relations between nodal potentials and flows on the edges, flow balance at nodes, and flow capacity constraints. Notice that a node i in the graph can also represent a sub-network of the entire physical network, in which case the constraints may involve variables in adjacent sub-networks. There has been much recent interest in solving nonlinear network flow problems, e.g. the optimal power flow problem in electric power networks [34], the natural gas nomination problem [52], and the water network scheduling problem [14].
In many situations, it is desirable to solve problem (1) in a distributed manner, where each node i represents an individual agent that solves a localized problem, while agents coordinate with their neighbors to solve the overall problem. Each agent needs to handle its own set of local constraints h_i, g_i, and X_i. For example, agents may be geographically dispersed with local constraints representing the physics of their subsystems, which cannot be controlled by other agents; agents may have private data in their constraints, which cannot be shared with other agents; or the sheer amount of data needed to describe the constraints or objective may be too large to be stored or transmitted between agents. These practical considerations impose the restriction that each agent in a distributed algorithm has to deal with a set of complicated, potentially nonconvex, constraints.

Necessary Structures of Distributed Formulations
In order to do distributed computation, the centralized formulation (1) first needs to be transformed into a formulation to which a distributed algorithm could be applied. We call such a formulation a distributed formulation, whose form may depend on specific distributed algorithms as well as on the structure of the distributed computation, e.g. which variables and constraints are controlled by which agents and in what order computation and communication can be carried out. Despite the great variety of distributed formulations, we want to identify some desirable and necessary features for a distributed formulation.
One desirable feature is the capability of parallel decomposition so that all agents can solve their local problems in parallel, rather than in sequence. To realize this, each agent needs a local copy of its neighboring agents' variables. For problem (1), we may introduce a local copy x_i^j of the original variable x_j as well as a global copy x̄_j, and enforce consensus as

  x̄_j = x_j,   x̄_j = x_i^j,   ∀i ∈ V, j ∈ δ(i).   (2)

Using this duplication scheme, a distributed formulation of (1) can be written as

  min_{x, x̄}  Σ_{i∈V} f_i(x_i)   (3a)
  s.t.  Ax + Bx̄ = 0,   (3b)
        x_i ∈ X_i, ∀i ∈ V,   (3c)
        x̄ ∈ X̄.   (3d)

In problem (3), the optimization variables x collect all the local variables controlled by the agents, i.e., for each agent i, the original variable x_i and the local copies x_i^j; each subvector x̄_j of x̄ denotes a global copy of x_j. The set X_i ⊆ R^{n_{1i}} is defined as X_i := {v : h_i(v) = 0, g_i(v) ≤ 0}, so the original constraints (1b)-(1c) are decoupled into each agent's local constraints X_i, which also absorb the constraints (1d). Additionally, the global copy x̄ is constrained in some simple convex set X̄ ⊆ R^{n_2}. The only coupling among agents is (3b), which formulates the consensus constraints (2) with A ∈ R^{m×n_1} and B ∈ R^{m×n_2}. An alternating optimization scheme is then natural: all the agents can solve their subproblems over the x_i's in parallel once x̄ is fixed; and once the x_i's are updated and fixed, the subproblems over x̄ can also be solved in parallel.
In fact, for any constrained optimization problem, not necessarily of network flow type, if distributed computation is considered, the variables of the centralized problem need to be grouped into variables x_i in a distributed formulation for agents i according to the decision structure, and duplicate variables x̄ need to be introduced to decouple the constraints from the agents. In this way, problem (3) provides a general formulation for distributed computation of constrained optimization problems. Conversely, due to the necessity of duplicating variables, any distributed formulation of a constrained program necessarily shares some key structures of (3). In particular, problem (3) has two simple but crucial properties. Namely,
-Property 1: As the matrices A and B are defined in (2), the image of A strictly contains the image of B, i.e., Im(A) ⊋ Im(B).
-Property 2: Each agent i may face local nonconvex constraints X_i.
Property 1 follows from the fact that, for any given value of x̄_j in (2), there is always a feasible solution (x_j, x_i^j) that satisfies the equalities in (2); but if x_j ≠ x_i^j, then there does not exist an x̄_j that satisfies both equalities in (2). Property 2 follows from our desire to decompose the computation across agents.
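To make Property 1 concrete, the following minimal sketch builds A and B for a two-node toy graph under the duplication scheme (2) and checks the strict image containment numerically. The tiny instance and all names are ours, not taken from the paper.

```python
import numpy as np

# Two agents joined by one edge.  Agent 1 holds x1 plus a local copy x1_2 of
# agent 2's variable (agent 2 symmetrically holds x2 and x2_1), and
# xbar = (xbar1, xbar2) is the global copy.  Stacking the consensus
# equalities x_j - xbar_j = 0 and x_i^j - xbar_j = 0 with
# x = (x1, x1_2, x2, x2_1) gives A x + B xbar = 0.
A = np.array([[1., 0., 0., 0.],   # x1   = xbar1
              [0., 0., 0., 1.],   # x2_1 = xbar1
              [0., 0., 1., 0.],   # x2   = xbar2
              [0., 1., 0., 0.]])  # x1_2 = xbar2
B = np.array([[-1.,  0.],
              [-1.,  0.],
              [ 0., -1.],
              [ 0., -1.]])

# Property 1: Im(A) strictly contains Im(B).  Containment holds because
# augmenting A with B does not raise the rank; strictness because B's rank
# is smaller than A's.
rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_AB = np.linalg.matrix_rank(np.hstack([A, B]))
im_B_inside_im_A = (rank_AB == rank_A)
strictly = im_B_inside_im_A and (rank_B < rank_A)

# For any xbar there is an x with A x + B xbar = 0 (set every copy to the
# corresponding global value), illustrating the first half of Property 1.
xbar = np.array([0.3, -0.7])
x = np.array([xbar[0], xbar[1], xbar[1], xbar[0]])
```

Here rank([A B]) = rank(A) certifies Im(B) ⊆ Im(A), and rank(B) < rank(A) makes the containment strict.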
In this paper, we will show that the above two properties of distributed constrained optimization pose a significant challenge to the theory and practice of existing distributed optimization algorithms. In particular, existing distributed algorithms based on the alternating direction method of multipliers (ADMM) may fail to converge for the general nonconvex constrained problem (3) without further reformulation or relaxation. Before proceeding, we summarize our contributions.

Summary of Contributions
The contributions of the paper can be summarized below.
Firstly, we propose a new reformulation and a two-level distributed algorithm for solving nonconvex constrained optimization problem (1)-(2), which embeds a specially structured three-block ADMM at the inner level in an augmented Lagrangian method (ALM) framework. The proposed algorithm maintains the flexibility of ADMM in achieving distributed computation.
Secondly, we prove global and local convergence as well as iteration complexity results for the proposed two-level algorithm, and illustrate that the underlying algorithmic framework can be extended to more complicated nonconvex multi-block problems. For the convergence of ADMM, we allow each nonconvex subproblem to be solved to a stationary point with certain improvement in the objective function compared to the previous iterate, which mildly relaxes the global optimality of nonconvex subproblems commonly assumed in the ADMM literature. Our convergence analysis builds on the classical and recent works on ADMM and ALM, and our results are derived by relating these two methods in an analytical way.
Thirdly, we provide extensive computational tests of our two-level algorithm on nonconvex network flow problems, parallel minimization of nonconvex functions over compact manifolds, and a robust tensor PCA problem from machine learning. Numerical results demonstrate the advantages of the proposed algorithm over existing ones, including randomized updates, modified ADMM, and a centralized solver, in the speed of closing the optimality gap, the feasibility gap, or both. Moreover, our test results on the multi-block robust tensor PCA problem suggest that the proposed two-level algorithm not only ensures convergence for a wider range of applications where ADMM may fail, but also tends to accelerate ADMM on problems where convergence of ADMM is already guaranteed.

Notation
Throughout this paper, we use Z_+ (resp. Z_{++}) to denote the set of nonnegative (resp. positive) integers, and R^n to denote the n-dimensional real Euclidean space. For x, y ∈ R^n, the inner product is denoted by x^⊤y or ⟨x, y⟩; the Euclidean norm is denoted by ‖x‖ = √⟨x, x⟩. A vector x may consist of J subvectors x_j ∈ R^{n_j} with Σ_{j=1}^J n_j = n; in this case, we write x = [{x_j}_{j∈[J]}], where [J] = {1, · · ·, J}. Occasionally, we use x_i to denote the i-th component of x if there is no confusion in doing so. For a matrix A, we denote its largest singular value by ‖A‖ and its image space by Im(A). We use B_r(x) to denote the Euclidean ball centered at x with radius r > 0. For a closed set C ⊆ R^n, the interior of C is denoted by Int C, the projection operator onto C is denoted by Proj_C(x), and the indicator function of C is denoted by I_C(x), which takes value 0 if x ∈ C and +∞ otherwise.
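As a small illustration of the last two notational objects, here is a sketch (function names are ours) of the projection operator Proj_C and the indicator function I_C for a Euclidean ball C = B_r(c):

```python
import numpy as np

def proj_ball(x, center, r):
    """Euclidean projection onto the closed ball B_r(center):
    points inside are unchanged, points outside are pulled to the boundary."""
    d = x - center
    dist = np.linalg.norm(d)
    return x if dist <= r else center + (r / dist) * d

def indicator_ball(x, center, r):
    """Indicator function I_C(x): 0 inside the ball, +inf outside."""
    return 0.0 if np.linalg.norm(x - center) <= r else np.inf
```

For instance, projecting (3, 0) onto the unit ball at the origin yields (1, 0), at which the indicator evaluates to 0.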
The rest of this paper is organized as follows. In Section 2, we review the literature and summarize two conditions that are crucial to the convergence of ADMM and that essentially contradict Properties 1 and 2. In Section 3, we propose our new reformulation and a two-level algorithm for solving problem (3) in a distributed way. In Section 4, we provide the global convergence as well as the iteration complexity results, and show that our scheme can be applied to more complicated multi-block problems. Then in Section 5, we show the local convergence result under standard second-order assumptions. Finally, we present computational results in Section 6 and conclude in Section 7.

Related Literature
In this section, we review the literature on ADMM and other distributed algorithms, and identify some limitations of the standard ADMM approach in solving problem (3).

ADMM for Nonconvex Problems
The convergence of ADMM has been observed for many nonconvex problems with various applications in matrix completion and factorization [72,57,74,73], optimal power flow [61,16,45], asset allocation [69], and polynomial optimization [33], among others. For convergence theory, several conditions have been proposed to guarantee convergence on structured nonconvex problems that can be abstracted in the following form:

  min_{x_1,···,x_p,z}  Σ_{i=1}^p f_i(x_i) + h(z)   s.t.  Σ_{i=1}^p A_i x_i + Bz = b.   (4)

We summarize some convergence conditions in Table 1. The work [20] provided an alternative convergence rate proof of proximal ADMM applied to convex problems, which was shown to be an instance of a more general non-Euclidean hybrid proximal extragradient framework. The two-block, multi-block, and Jacobi-type extensions of this framework to nonconvex problems can be found in [21,49,48], where an iteration complexity of O(1/√k) was also established. Jiang et al. [32] proposed two variants of proximal ADMM: some proximal terms are added to the first p block updates, and for the last block, either a gradient step is performed or a quadratic approximation of the augmented Lagrangian is minimized. For general nonconvex and nonsmooth problems, we note that the convergence of ADMM relies on the following two conditions:
-Condition 1: the image of the last block's coefficient matrix contains the images of the other blocks, i.e., Im([A_1, · · ·, A_p]) ⊆ Im(B);
-Condition 2: the last block variable z is otherwise unconstrained and its objective h is Lipschitz differentiable.
Due to the sequential update order of ADMM, z^k is obtained after x^k is calculated. If Condition 1 on the images of A and B is not satisfied, then it is possible that x^k converges to some x* such that there is no z* satisfying Ax* + Bz* = b. In addition, Condition 2 provides a way to control dual iterates by primal iterates via the optimality condition of the z-subproblem. This relation requires the unconstrained optimality condition of the z-update, so the last block variable z cannot be constrained elsewhere. See also [67] for some relevant discussions. As indicated in Table 1, these two conditions (and their variants) are almost necessary for ADMM to converge in the absence of convexity.
We also note that, even for convex problems, these two conditions are used to relax the strong convexity assumption on the objective [40] or to accelerate ADMM with O(1/k²) iteration complexity [63]. It turns out that the two conditions and the two properties we mentioned in Section 1.2 may conflict with each other. By Property 1, the image of A strictly contains the image of B, so by Condition 1, we should update the local variables after the global variable in each ADMM iteration to ensure feasibility. However, by Property 2, each local variable is subject to some local constraints, so Condition 2 cannot be satisfied; technically speaking, we cannot utilize the unconstrained optimality condition of the last block to link primal and dual variables, which again makes it difficult to ensure primal feasibility of the solution. When ADMM is directly applied to nonconvex problems, divergence is indeed observed [61,45,67]. As a result, for many applications in the form of (3) where the above two conditions are not available, the ADMM framework cannot guarantee convergence.
After completing a draft of this paper, we were informed of an ADMM-based approach in [32], where the authors proposed to solve the following relaxed problem of (3):

  min_{x∈X, x̄∈X̄, z}  f(x) + (β(ε)/2)‖z‖²   s.t.  Ax + Bx̄ + z = 0.   (5)

Notice first that, as proved in [32], in order to achieve a desired feasibility ‖Ax + Bx̄‖ = O(ε), the coefficient β(ε) and the ADMM penalty need to be as large as O(1/ε²). Such large parameters may lead to slow convergence and large optimality gaps. Also notice that applying ADMM to (5) may produce an approximate stationary solution to (3) even when the problem is infeasible to begin with. As we will show in Section 4, our proposed two-level algorithm achieves the same order of iteration complexity as the reformulation (5) and the one-level ADMM approach proposed in [32], while additionally providing information on ill-conditioning and infeasibility; in Section 6, we empirically demonstrate that the proposed algorithm robustly converges on large-scale constrained nonconvex programs with a faster speed and obtains solutions of higher quality.

Other Distributed Algorithms
Some other distributed algorithms not based on ADMM have also been studied in the literature. Hong [29] introduced a proximal primal-dual algorithm for distributed optimization problems, where a proximal term is added to cancel out cross-product terms in the augmented Lagrangian function. Lan and Zhou [36] proposed a randomized incremental gradient algorithm for a class of convex problems over a multi-agent network. Lan and Yang [35] proposed accelerated stochastic algorithms for nonconvex finite-sum and multi-block problems; interestingly, the analysis for the multi-block problem also requires the last block variable to be unconstrained with an invertible coefficient matrix and a Lipschitz differentiable objective, which further confirms the necessity of Conditions 1 and 2. We end this subsection with a recent work by Shi et al. [58]. They studied the problem

  min_{x,y}  f(x, y) + Σ_j φ_j(y_j)   s.t.  h(x, y) = 0,  g_i(x_i) ≤ 0,  x_i ∈ X_i,  ∀i,

where the variables x and y are divided into n and m subvectors, respectively; f(x, y), h(x, y), and g_i(x_i) are continuously differentiable, φ_j(y_j) is a composite function, and the X_i's are convex. The authors proposed a double-loop penalty dual decomposition (PDD) method. The overall algorithm uses the ALM framework, where the coupling constraint h(x, y) = 0 is relaxed and each ALM subproblem is solved by a randomized block update scheme. We note that randomization is crucial in their convergence analysis, and a deterministic implementation of the inner-level algorithm for solving the ALM subproblem may not converge when nonconvex functional constraints are present.

A Key Reformulation and A Two-level Algorithm
We first define the stationarity concept used throughout the paper: a point (x*, x̄*, y*) with x* ∈ X and x̄* ∈ X̄ is called a stationary point of problem (3) if

  0 ∈ ∂f(x*) + N_X(x*) + A^⊤y*,   0 ∈ N_X̄(x̄*) + B^⊤y*,   Ax* + Bx̄* = 0.   (7)

In equations (7) and (8), the notation N_X(x) denotes the general normal cone of X at x ∈ X [56, Def 6.3], and ∂L(·) denotes the general subdifferential of L(·) [56, Def 8.3]. Some properties and calculus rules of normal cones and the general subdifferential can be found in [56, Chapters 6, 8, 10].
It can be shown that if (x*, x̄*) is a local minimum of (3) and satisfies some mild regularity condition, then condition (7) is satisfied [56, Thm 8.15]. If X and X̄ are defined by finitely many continuously differentiable constraints, then condition (7) is equivalent to the well-known KKT condition of problem (3) under some constraint qualification. Therefore, condition (7) can be viewed as a generalized first-order necessary optimality condition for nonsmooth constrained problems. Our goal is to find such a stationary point (x*, x̄*, y*) of problem (3).

A Key Reformulation
As analyzed in the previous section, since directly applying ADMM to a distributed formulation of the general constrained nonconvex problem (3) cannot guarantee convergence without using the relaxation scheme in [32], we want to go beyond the standard ADMM framework. We propose two steps for achieving this. The first step is taken in this subsection to propose a new reformulation, and the second step is taken in the next subsection to propose a new two-level algorithm for the new reformulation.
We consider the following reformulation of (3):

  min_{x∈X, x̄∈X̄, z}  f(x)   s.t.  Ax + Bx̄ + z = 0,  z = 0.   (9)

The idea of adding a slack variable z ∈ R^m has two consequences. The first consequence is that the linear coupling constraint Ax + Bx̄ + z = 0 now has three blocks, and the last block has an identity coefficient matrix I_m, whose image is the whole space: given any x and x̄, we can always let z = −Ax − Bx̄ to make the constraint satisfied. The second consequence is that the artificial constraint z = 0 can be treated separately from the coupling constraint. Notice that a direct application of ADMM to problem (9) still does not guarantee convergence, since Conditions 1 and 2 are not satisfied yet; it is therefore necessary to separate the linear constraints into two levels. If we ignore z = 0 for the moment, existing techniques in ADMM analysis can be applied to the rest of the problem. Since we want to utilize the unconstrained optimality condition of the last block, we can relax z = 0. This observation motivates us to choose ALM. To be more specific, consider the problem

  min_{x∈X, x̄∈X̄, z}  f(x) + ⟨λ^k, z⟩ + (β^k/2)‖z‖²   s.t.  Ax + Bx̄ + z = 0,   (10)

which is obtained by dualizing the constraint z = 0 with multiplier λ^k ∈ R^m and adding a quadratic penalty (β^k/2)‖z‖² with β^k > 0. The augmented Lagrangian term ⟨λ^k, z⟩ + (β^k/2)‖z‖² can be viewed as an objective function in the variable z, which is not only Lipschitz differentiable but also strongly convex. Problem (10) can be solved by a three-block ADMM in a distributed fashion when a separable structure is available. Notice that the first-order optimality condition of problem (10) at a stationary solution (x^k, x̄^k, z^k, y^k) is

  0 ∈ ∂f(x^k) + N_X(x^k) + A^⊤y^k,   (11a)
  0 ∈ N_X̄(x̄^k) + B^⊤y^k,   (11b)
  λ^k + β^k z^k + y^k = 0,   (11c)
  Ax^k + Bx̄^k + z^k = 0.   (11d)

However, such a solution may not satisfy the primal feasibility Ax + Bx̄ = 0, which is the only difference from the optimality condition (7) (note that (11c) is analogous to the dual feasibility in the variable z in the KKT condition). Fortunately, the ALM offers a scheme to drive the slack variable z to zero by updating λ, and we can expect the iterates to converge to a stationary point of the original problem (3).
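The two consequences above can be checked numerically. The sketch below uses random illustrative data of our own and reconstructs the z-part of the objective in (10) from the terms described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 2))
x, xbar = rng.standard_normal(6), rng.standard_normal(2)

# Consequence 1: the identity block makes the coupling constraint of (9)
# satisfiable for ANY (x, xbar) by choosing the slack accordingly.
z = -(A @ x + B @ xbar)
feasible = np.allclose(A @ x + B @ xbar + z, 0.0)

# Consequence 2: the dualized term <lam, z> + (beta/2)||z||^2 in (10) is
# strongly convex in z -- the midpoint value sits strictly below the chord.
lam, beta = rng.standard_normal(4), 5.0
def alm_term(z):
    return lam @ z + 0.5 * beta * z @ z

z1, z2 = rng.standard_normal((2, 4))
gap = 0.5 * alm_term(z1) + 0.5 * alm_term(z2) - alm_term(0.5 * (z1 + z2))
# For this quadratic, the convexity gap equals (beta/8) ||z1 - z2||^2 exactly
# (the linear part cancels), so gap > 0 whenever z1 != z2.
```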
In summary, reformulation (9) separates the complication of the original problem into two levels, where the inner level (10) provides a formulation that simultaneously satisfies Conditions 1 and 2, and the outer level drives z to zero. We propose a two-level algorithmic architecture in the next subsection to realize this.

A Two-level Algorithm
The proposed algorithm consists of two levels, both of which are based on the augmented Lagrangian framework. The inner-level algorithm is described in Algorithm 1, which uses a three-block ADMM to solve problem (10) and its iterates are indexed by t. The outer-level algorithm is described in Algorithm 2 with iterates indexed by k.
Given λ^k ∈ R^m and β^k > 0, the augmented Lagrangian function associated with the k-th inner-level problem (10) is defined as

  L_{ρ^k}(x, x̄, z, y) = f(x) + ⟨λ^k, z⟩ + (β^k/2)‖z‖² + ⟨y, Ax + Bx̄ + z⟩ + (ρ^k/2)‖Ax + Bx̄ + z‖²,   (12)

where y ∈ R^m is the dual variable for the constraint Ax + Bx̄ + z = 0 and ρ^k is a penalty parameter for ADMM. In view of (11), the k-th inner-level ADMM aims to find an approximate stationary solution (x^k, x̄^k, z^k, y^k) of (10) in the sense of condition (13), where the ε_i^k's are positive tolerances. The optimality conditions of x^t in Line 5 and x̄^t in Line 7 of Algorithm 1, combined with the dual update in Line 11, allow the dual residuals to be measured by successive iterate differences. As a result, Algorithm 1 can be terminated if it finds (x^t, x̄^t, z^t) satisfying (14). Notice that ρ^k does not appear in (14c), so we can use different tolerances for the above three measures. Since (13c) is always maintained by ADMM with (y^k, z^k) = (y^t, z^t), a solution satisfying (14) is an approximate stationary solution to problem (10) by assigning (x^k, x̄^k, z^k, y^k) := (x^t, x̄^t, z^t, y^t).

The first block update in Algorithm 1 reads as (15), so Line 5 of Algorithm 1 searches for a stationary solution x^t of the constrained subproblem (15). The second and third block updates in Lines 7 and 9 admit closed-form solutions, so in view of the network flow problem (1), the proposed reformulation (9) does not introduce additional computational burden. All primal and dual updates in Algorithm 1 can be implemented in parallel when f and X admit separable structures: in each ADMM iteration, agents solve their own local problems independently and only need to communicate with their immediate neighbors. The inner-level ADMM, however, only solves the subproblem (10) for fixed λ^k and β^k; the slack z still needs to be driven to zero. We resolve this by updating λ and β in the outer-level iterations, which are indexed by k in Algorithm 2.
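As an illustration of the closed-form third-block update, the following sketch derives z from the first-order condition of the augmented Lagrangian. The AL form used here is our reconstruction from the terms described above (the dualized ⟨λ, z⟩ + (β/2)‖z‖² plus the usual ADMM multiplier and penalty terms), so treat the formula as a sketch rather than the paper's exact update:

```python
import numpy as np

def z_update(r, lam, y, beta, rho):
    """Minimizer of the z-part of the augmented Lagrangian
        <lam, z> + (beta/2)||z||^2 + <y, r + z> + (rho/2)||r + z||^2,
    where r = A x + B xbar is the current coupling residual.  Setting the
    gradient lam + beta z + y + rho (r + z) to zero gives the closed form."""
    return -(lam + y + rho * r) / (beta + rho)

rng = np.random.default_rng(1)
m = 5
r, lam, y = rng.standard_normal((3, m))
beta, rho = 2.0, 7.0
z_star = z_update(r, lam, y, beta, rho)

def phi(z):  # the z-part of the augmented Lagrangian, for verification
    return lam @ z + 0.5 * beta * z @ z + y @ (r + z) + 0.5 * rho * (r + z) @ (r + z)
```

Since phi is strongly convex (modulus beta + rho), z_star is its unique minimizer: the gradient vanishes there and every perturbation increases phi.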
In Algorithm 2, we choose some predetermined bounds [λ̲, λ̄] and explicitly project the "true" dual variable λ^k + β^k z^k onto this hyper-cube to obtain the λ^{k+1} used in the next outer iteration. Such a safeguarding technique is essential to establish the global convergence of ALM [1,43]. We increase the outer-level penalty β^k if there is no significant improvement in reducing ‖z^k‖.
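Putting the two levels together, here is a runnable toy instantiation of the scheme. Everything below is our simplification: A = I and B = −I (so the coupling constraint of (9) is x − x̄ + z = 0), closed-form block updates, a fixed inner iteration budget in place of the stopping test (14), and illustrative parameter values.

```python
import numpy as np

# Toy convex instance:  min c.x  s.t.  x = xbar,  x in [0,2]^n,  xbar in [-1,1]^n.
n = 3
c = np.array([1.0, -1.0, 0.5])          # the minimizer is x = xbar = (0, 1, 0)
rho, beta, gamma, omega = 10.0, 1.0, 10.0, 0.5
lam = np.zeros(n)
x, xbar, z, y = (np.zeros(n) for _ in range(4))
z_norm_prev = np.inf

for k in range(20):                      # outer ALM loop (Algorithm 2 sketch)
    for t in range(300):                 # inner three-block ADMM (Algorithm 1)
        # x-block: box-constrained separable quadratic, exact via clipping
        x = np.clip(xbar - z - (c + y) / rho, 0.0, 2.0)
        # xbar-block: same structure on the second box
        xbar = np.clip(x + z + y / rho, -1.0, 1.0)
        # z-block: closed form from the strongly convex AL term in z
        z = -(lam + y + rho * (x - xbar)) / (beta + rho)
        # dual ascent on the coupling constraint x - xbar + z = 0
        y = y + rho * (x - xbar + z)
    # outer level: safeguarded multiplier update, then penalty growth
    lam = np.clip(lam + beta * z, -100.0, 100.0)
    if np.linalg.norm(z) > omega * z_norm_prev:
        beta *= gamma
    z_norm_prev = np.linalg.norm(z)
```

At termination the slack z has been driven to zero by the outer multiplier λ, while each inner pass only ever solved the ADMM-friendly subproblem (10).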
Before proceeding to the next section, we note that the key reformulation (9) is inspired by the hope to reconcile the conflict between the two properties and the two conditions so that ADMM can be applied. The introduction of the additional variable z is not necessary in the sense that any method that achieves distributed computation for the subproblem

  min_{x∈X, x̄∈X̄}  f(x) − ⟨λ^k, Ax + Bx̄⟩ + (β^k/2)‖Ax + Bx̄‖²   (16)

can be embedded inside the ALM framework. The aforementioned PDD method [58] is such an approach. There are some other update schemes [6,71] that can handle the functional constraints in (16), assuming that the (Euclidean) projection oracle onto the nonconvex set X is available. It would be interesting to compare their performance with ADMM when used in the inner level, and we leave this to future work. Meanwhile, as we will demonstrate in Section 6, the proposed two-level algorithm preserves the desirable properties of ADMM in practice, such as fast convergence in early stages and scalability to large-scale problems.

Global Convergence
In this section, we prove global convergence and convergence rate of the proposed two-level algorithm. Starting from any initial point, iterates generated by the proposed algorithm have a limit point; every limit point is a stationary solution to the original problem under some mild condition. In particular, we make the following assumptions.
Assumption 1 Problem (9) is feasible and the set of stationary points satisfying (7) is nonempty.

Assumption 2
The objective function f : R^n → R is continuously differentiable, X ⊆ R^n is a compact set, and X̄ is convex and compact.
Assumption 3 Given λ^k, β^k, and ρ^k, the first block update can find a stationary point x^t of the x-subproblem whose augmented Lagrangian value is no greater than that of the previous iterate x^{t−1}.

We give some comments below. Assumption 1 ensures the feasibility of problem (9), which is standard. Though it is desirable to design an algorithm that can guarantee feasibility of the limit point, usually this is too much to ask: even the powerful ALM may converge to an infeasible limit point when the original problem is feasible. If this situation happens, or if problem (9) is infeasible in the first place, our algorithm will converge to a limit point that is stationary to a related problem, as stated in Theorem 1. The compactness required in Assumption 2 ensures that the sequence generated by our algorithm stays bounded, and it can be dropped if the existence of a limit point is directly assumed or derived from elsewhere. We do not make any explicit assumptions on the matrices A and B in this section, and our analysis does not rely on any convenient structures that A and B may possess, such as full row or column rank.
For Assumption 3, we note that finding a stationary point can usually be achieved at the successful termination of standard nonlinear solvers. In addition, the state-of-the-art nonlinear solver IPOPT [64] accepts a trial point if either the objective or the constraint violation is decreased in each iteration. In the first block update of Algorithm 1, since x^{t−1} is already a feasible solution, if we warm-start from x^{t−1}, it is reasonable to expect that a new stationary point x^t is reached with an improved objective value. Assumption 3 is slightly weaker and more realistic than assuming that the nonconvex subproblem can be solved globally, which is commonly adopted in the nonconvex ADMM literature.
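The descent-oracle reading of Assumption 3 can be sketched as follows. The quadratic subproblem and the projected-gradient "solver" below are placeholders of our own, standing in for a nonlinear solver such as IPOPT warm-started at the previous iterate:

```python
import numpy as np

b = np.array([2.0, -3.0])              # illustrative data, not from the paper

def subproblem_obj(x):
    """Toy smooth subproblem objective over the box X = [-1, 1]^2."""
    return 0.5 * x @ x - b @ x

def solve_to_stationary(x0, steps=200, lr=0.1):
    """Projected gradient on the box; returns a stationary point of the toy
    subproblem (here its unique minimizer, since the model is convex)."""
    x = x0.copy()
    for _ in range(steps):
        x = np.clip(x - lr * (x - b), -1.0, 1.0)
    return x

def descent_oracle(x_prev):
    """Assumption 3 sketch: accept the new stationary point only if it does
    not increase the subproblem objective; otherwise keep the old iterate."""
    x_new = solve_to_stationary(x_prev)
    return x_new if subproblem_obj(x_new) <= subproblem_obj(x_prev) else x_prev

x_t = descent_oracle(np.zeros(2))      # lands at the boundary point (1, -1)
```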
In Section 4.1, we show that each inner-level ADMM converges to a solution that approximately satisfies the stationarity condition (11) of problem (10). The sequence of solutions obtained at the termination of the inner ADMM is referred to as the outer-level iterates. Then in Section 4.2, we first characterize the limit points of the outer-level iterates, whose existence is guaranteed, and then show that a limit point is stationary to problem (3) if some mild constraint qualification is satisfied.

Convergence of Inner-level Iterations
In this subsection, we show that, by applying the three-block ADMM to problem (10), we obtain an approximate stationary point (x^k, x̄^k, z^k, y^k) satisfying the approximate stationarity condition (13). The convergence analysis of the inner-level ADMM in this subsection uses some techniques from the literature, e.g., [67]. We present a self-contained proof in the appendix and demonstrate that the descent oracle assumed in Assumption 3 relaxes the global optimality of subproblems without affecting the overall convergence.
Proposition 1 Suppose Assumptions 2-3 hold. Then the k-th inner-level ADMM of Algorithm 1 terminates, i.e., the stopping criterion (14) is satisfied, within a finite number of iterations.

Proof See Appendix A.1.
In particular, the approximate stationary condition (13) is satisfied with the solution returned by ADMM.

Convergence of Outer-level Iterations
In this subsection, we prove the convergence of outer-level iterations. In general, when the method of multipliers is used as a global method, there is no guarantee that the constraint being relaxed can be satisfied at the limit. Due to the special structure of our reformulation, we are able to give a characterization of limit points of outer-level iterates.
Theorem 1 Suppose Assumptions 2-3 hold. Let {(x^k, x̄^k, z^k, y^k)}_{k∈Z_{++}} be the sequence of outer-level iterates of Algorithm 2 satisfying condition (13). Then the sequence of primal solutions {(x^k, x̄^k, z^k)}_{k∈Z_{++}} is bounded, and every limit point (x*, x̄*, z*) of this sequence satisfies one of the following: either z* = 0, so (x*, x̄*) is feasible for problem (3); or (x*, x̄*) is a stationary point of the feasibility problem

  min_{x∈X, x̄∈X̄}  ‖Ax + Bx̄‖².   (18)

Proof See Appendix A.2.
Theorem 1 gives a complete characterization of the limit points of outer-level iterates. If the limit point is infeasible, i.e., z* ≠ 0, then (x*, x̄*) is a stationary point of the problem (18). This is also the case if problem (3) is infeasible, i.e., the feasible region defined by X and X̄ does not intersect the affine plane Ax + Bx̄ = 0, since each inner-level problem (10) is always feasible and the first case in Theorem 1 cannot happen. We also note that even if (x*, x̄*) falls into the second case of Theorem 1, it is still possible that the associated z* = 0, but then (x*, x̄*) will be some irregular feasible solution. In both cases, we believe (x*, x̄*) generated by the two-level algorithm has its own significance and may provide useful information regarding the problem structure. Since stationarity and optimality are maintained in all subproblems, we should expect that any feasible limit point of the outer-level iterates is stationary for the original problem. As we will prove in the next theorem, this is indeed the case if some mild constraint qualification is satisfied.
Theorem 2 Suppose Assumptions 2-3 hold. Let (x*, x̄*, z*) be a limit point of the outer-level iterates with z* = 0, and suppose the corresponding dual sequence {y^k} has a limit point y*. Then (x*, x̄*, y*) satisfies the stationarity condition (7) of problem (3).

Proof See Appendix A.3.
In Theorem 2, we assume the dual sequence {y^k} has a limit point y*. Since by (38) we have λ^k + β^k z^k + y^k = 0, the "true" multiplier λ̃^{k+1} := λ^k + β^k z^k also has a limit point. We note that the existence of a limit point can be ensured by the existence of a bounded dual subsequence, which is known as the sequentially bounded constraint qualification (SBCQ) [44]. More specifically, in the context of smooth nonlinear problems, the constant positive linear dependence (CPLD) condition proposed by Qi and Wei [54] also guarantees that the sequence of dual variables has a bounded subsequence. Therefore, we regard our assumption on y* as analogous to a constraint qualification in the KKT condition for smooth problems, and it does not restrict the range of problems to which our algorithm is applicable.
We also give some comments regarding the predetermined bound [λ̲, λ̄] on the outer-level dual variable λ. In principle, the bound should be chosen large enough at the beginning of the algorithm. Otherwise λ^k will likely stay at λ̲ or λ̄ all the time; in this case, the outer-level ALM effectively reduces to the penalty method, which usually requires β^k to go to infinity because, in general, exact penalization does not hold for a quadratic penalty function. In contrast, a proper choice of the dual variable can restore asymptotic exactness even when the penalty function is not sharp at the origin. In terms of convergence analysis, one may notice that the choice of λ is actually not that important: if we set λ^k = 0 for all k, the analysis still goes through. This is because, in the framework of ALM, the dual variable λ is closely related to local optimal solutions; while we study global convergence, it is not clear which local solution the algorithm will converge to, so the role of λ is not significant there. It seems difficult to establish the uniform boundedness of dual variables without the projection step, especially when there are nonconvex constraints. In Section 5, we will show that our algorithm inherits some nice local convergence properties of ALM, where λ does play an important role, and in Section 6, we will demonstrate that keeping λ indeed enables the algorithm to converge faster than the penalty method.

Iteration Complexity
In this subsection, we provide an iteration complexity analysis of the proposed algorithm. In view of (7), our goal is to bound the number of ADMM iterations needed to find an ε-stationary solution (x^K, x̄^K, y^K), in the sense that there exist d_1, d_2, d_3 satisfying the approximate stationarity system (19). In order to present the main result concisely, we slightly modify the outer-level Algorithm 2 as follows.
Remark 1 This assumption can be satisfied if ADMM makes significant progress in reducing ‖z^k‖, or equivalently ‖Ax^k + Bx̄^k‖. A naive way to enforce it is as follows: suppose a feasible point (x, x̄) is known a priori, i.e., (x, x̄) ∈ X × X̄ and Ax + Bx̄ = 0; then initializing the k-th ADMM with (x^0, x̄^0, z^0, y^0) = (x, x̄, 0, −λ^k) guarantees that L_{ρ^k}(x^0, x̄^0, z^0, y^0) ≤ L̄, where L̄ = max_{x∈X} f(x).
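The initialization in Remark 1 can be checked numerically. The sketch below assumes the inner augmented Lagrangian has the three-block form L_ρ(x, x̄, z, y) = f(x) + λᵀz + (β/2)‖z‖² + yᵀ(Ax + Bx̄ + z) + (ρ/2)‖Ax + Bx̄ + z‖² — an assumption consistent with the surrounding text, not a quotation of (10). At (x, x̄, 0, −λ) with Ax + Bx̄ = 0, every multiplier and penalty term vanishes and the value collapses to f(x) ≤ L̄.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((3, n))
B = rng.standard_normal((3, n))
x = rng.standard_normal(n)
xbar = np.linalg.lstsq(B, -A @ x, rcond=None)[0]  # feasible pair: A x + B xbar = 0
lam = rng.standard_normal(3)
beta, rho = 10.0, 20.0
f = lambda v: 0.5 * v @ v                         # any smooth objective, for illustration

def aug_lagrangian(x, xbar, z, y):
    """Assumed three-block augmented Lagrangian (a sketch of (10))."""
    r = A @ x + B @ xbar + z
    return f(x) + lam @ z + 0.5 * beta * z @ z + y @ r + 0.5 * rho * r @ r

# Initialization (x0, xbar0, z0, y0) = (x, xbar, 0, -lam): all coupling
# terms vanish, so L_rho collapses to f(x) <= max_{x in X} f(x) = Lbar.
val = aug_lagrangian(x, xbar, np.zeros(3), -lam)
```

Since B is 3×4 with full row rank (generically), an exact feasible x̄ exists, and the computed value equals f(x) up to rounding.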
Theorem 3 Under Assumptions 1-4, Algorithm 3 finds an ε-stationary solution (x^K, x̄^K, y^K) of (3) in the sense of (19) in no more than O(1/ε^4) inner ADMM iterations. Furthermore, if λ̃^k := λ^k + β^k z^k is bounded, then the iteration complexity can be improved to O(1/ε^3).

Proof See Appendix A.4.
We acknowledge that {λ̃^k} may not be bounded for some applications. The second part of Theorem 3 (as well as Theorem 4, to be presented next) is meant to justify the performance of the proposed algorithm under this boundedness condition.

Extension to Multi-block Problems
In this section, we discuss the extension of the two-level framework to the more general class of multi-block problems (4). In particular, we are interested in the case where Conditions 1 and 2 are not satisfied. As mentioned earlier, Jiang et al. [32] proposed to solve the following perturbed problem of (4) for λ = 0, where the f_i's are lower semi-continuous, and f_p and g are Lipschitz differentiable. Notice that we change h and B in (4) to f_p and A_p for ease of presentation. The iteration complexity of this one-level workaround is O(1/ε^4) when the dual variable is bounded, and O(1/ε^6) otherwise. In contrast, we can apply our two-level framework to the multi-block problem (4) as well: with some initial guess λ and a moderate β, we solve (20) approximately using ADMM, and then update λ and β. We define dual residuals similarly to (14a)-(14b) for each block variable, and an ε-stationary solution as a pair of primal-dual points whose primal residual and dual residuals (with respect to each primal block) are less than some ε > 0. An extension of the two-level framework is presented in Algorithm 4 below.
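The two-level scheme just outlined can be sketched as a driver around any inner solver. Here `solve_admm` is a hypothetical stand-in for the inner ADMM applied to the slacked problem (20); only the outer-loop logic — check the slack, update the multiplier, increase the penalty — follows the text.

```python
import numpy as np

def two_level(solve_admm, lam0, beta0, gamma, tol, max_outer=50):
    """Outer loop of the two-level framework (a sketch). `solve_admm(lam, beta)`
    is a hypothetical inner solver returning the primal blocks and the
    slack z at an approximate stationary point of the slacked problem."""
    lam, beta = lam0, beta0
    blocks, z = None, None
    for _ in range(max_outer):
        blocks, z = solve_admm(lam, beta)   # inner level: approximate stationarity
        if np.linalg.norm(z) <= tol:        # slack ~ 0: original constraint met
            break
        lam = lam + beta * z                # outer ALM multiplier step
        beta *= gamma                       # increase penalty
    return blocks, lam, beta

# Toy inner solver whose slack shrinks like 1/beta: the driver keeps
# increasing beta until the slack meets the tolerance.
blocks, lam, beta = two_level(lambda lam, beta: (None, np.array([1.0]) / beta),
                              np.zeros(1), 1.0, 2.0, 1e-6)
```

With γ = 2 the toy run stops once β reaches 2^20 ≈ 10^6, the first penalty for which the slack 1/β drops below the tolerance.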
Although the proposed algorithm invokes a series of ADMM runs with varying outer-level dual variables and penalties, Theorem 4 suggests that its iteration complexity for finding an ε-stationary solution is no worse than that of the single-loop ADMM variant proposed in [32]. In Section 5, local convergence results are presented as an alternative perspective to help us understand the behavior of the proposed algorithm.

Local Convergence
We show in this section that the proposed algorithm inherits some nice local convergence properties of the augmented Lagrangian method. The analysis builds on the classic local convergence theory of ALM [5], and our purpose is to provide some quantitative justification for the fast convergence of the two-level algorithm, which will be demonstrated in Section 6.
To begin with, we note that the inner-level problem (10) solved by ADMM is closely related to the problem (21) of minimizing over x ∈ X, x̄ ∈ X̄. It is straightforward to verify that (x^k, x̄^k) is a stationary point of (21) if and only if (x^k, x̄^k, z^k, y^k) is a stationary point of (10) satisfying (11) with z^k = −Ax^k − Bx̄^k and y^k = −λ^k + β^k(Ax^k + Bx̄^k). In addition, an approximate stationary solution of (10) can be mapped to an approximate stationary solution of (21).
Thus we will mainly focus on problem (21) and its approximate stationarity system (23) in this section. We add the following assumptions on problem (3).

Assumption 5
The set X = {x ∈ R^{n_1} : h(x) = 0} is compact with h : R^{n_1} → R^p twice continuously differentiable, the objective f is twice continuously differentiable over some open set containing X, and X̄ is a convex set with nonempty interior in R^{n_2}. The matrix B has full column rank.
Remark 2 Any inequality constraint in X can be converted to the form h(x) = 0 by adding the squares of additional slack variables. The twice continuous differentiability of f and h is standard for establishing local convergence of the augmented Lagrangian method. In addition, we explicitly require B to have full column rank, which can be justified by the reformulation (2).

Definition 1 Let x* ∈ X = {x | h(x) = 0} and ∇h(x*) = [∇h_1(x*), · · · , ∇h_p(x*)] ∈ R^{n_1×p}.

Assumption 6 Problem (3) satisfies the second-order conditions (24) at a local minimizer (x*, x̄*). Moreover, there exists R > 0 such that x is quasiregular for all x ∈ B_R(x*) ∩ X.
Remark 3 Assumption 6 can be regarded as a second-order sufficient condition at a local minimizer (x*, x̄*) of problem (3), and B having full column rank is necessary for (24b) to hold. The quasiregularity assumption is satisfied under a wide range of constraint qualifications.
The quasiregularity condition bridges the normal cone stationarity condition to the well-known KKT condition.
Proposition 2 If x^k ∈ X is quasiregular, x̄^k ∈ Int X̄, and (x^k, x̄^k) satisfies condition (23) with some d̄_1^k and d̄_2^k, then there exists some µ^k ∈ R^p such that (x^k, x̄^k) satisfies the approximate KKT conditions of problem (21); in particular, h(x^k) = 0.

Proof The claim uses the fact that the normal cone N_X(x) is the polar cone of the tangent cone T_X(x), and that N_X̄(x̄) = {0} for x̄ ∈ Int X̄. The existence of µ^k follows from Farkas' Lemma [3, Prop 4.3.12].

Proposition 4 Suppose Assumptions 5 and 6 hold, and let M and S be defined as in Proposition 3. Suppose that, for some iterate k, there exists a positive constant η < β^k/M such that the stated conditions hold. Denote λ̃^k := λ^k + β^k z^k. Then the claimed bounds hold.

Proof See Appendix B.2.
Theorem 5 Suppose Assumptions 5 and 6 hold. Let β, δ, M, and S be defined as in Proposition 3. Suppose the three conditions in Proposition 4 are satisfied for all iterates k ∈ Z_+, and the initial penalty satisfies β^0 > M(1 + η + η̄)/σ for some σ ∈ (0, 1). Then the following results hold: 1. the sequence {λ^k}_{k∈Z_{++}} stays inside the interior of [λ̲, λ̄], i.e., λ^{k+1} = λ̃^k = λ^k + β^k z^k; 2. the dual variable λ^k converges to λ* at least at a linear rate, i.e., (31) holds.

Proof The coefficient on the right-hand side of (31) is less than σ when β^k > M(1 + η + η̄)/σ, and converges to 0 if β^k → +∞; thus the first two parts of the theorem follow. Part 3 is due to (29) and the same derivation as in Proposition 4.
Theorem 5 suggests that if we have a good initial point (inside the set S defined in Proposition 3) and each inner ADMM locates the approximate stationary solution specified by the implicit function theorem (as in Proposition 4), then the two-level algorithm exhibits local linear or superlinear convergence in its outer level. These results are consistent with our empirical observations in Section 6, where usually only a few outer-level updates are required before convergence.

Examples
We present some applications of the two-level algorithm. All programs are coded in the Julia programming language 1.1.0 with the JuMP package 0.18 [13] and run on a 64-bit laptop with a 2.6 GHz Intel Core i7 processor (6 cores) and 16 GB RAM. All nonlinear constrained problems are solved by the interior point solver IPOPT (version 3.12.8) [64] with the linear solver MA27.

Nonlinear Network Flow Problem
We consider a specific class of network flow problems, which is covered by the motivating formulation (1). Suppose a connected graph G(V, E) is given, where some nodes have demands of a certain commodity and such demands need to be satisfied by some supply nodes. Each node i keeps local variables (p_i, x_i, x_ij, y_ij). Variable p_i is the production variable at node i, and (x_i, x_ij, y_ij) determine the flow from node i to node j. For example, in an electric power network or a natural gas network, the variables (x_i, x_ij, y_ij) are usually related to electric voltages or gas pressures of local utilities. Moreover, for each (i, j) ∈ E, the nodal variables (x_i, x_j, x_ij, y_ij) are coupled together in a nonlinear fashion; as an analogy, this coupling represents physical laws on nodal potentials.

We consider problem (32). In (32), the generation cost of each node, denoted by f_i(·), is a function of its production level p_i, and the goal is to minimize the total generation cost over the network. Each node is associated with a demand d_i and has to satisfy the injection balance constraint (32b). Formulation (32) covers a wide range of problems and falls into the class of GNF problems studied in [60].

Suppose the network is partitioned into a few subregions, and (i, j) is an edge crossing two subregions with i (resp. j) in region 1 (resp. 2). In order to facilitate parallel implementation, we replace constraint (32d) by constraints (33) with additional variables; similarly, we replace p_ij and p_ji in (32c) as in (34). Notice that (x_i^1, x_j^1, x_ij, y_ij) are controlled by region 1 and (x_i^2, x_j^2, x_ji, y_ji) are controlled by region 2. After incorporating constraints (33)-(34) for all crossing edges (i, j) into problem (32), the resulting problem is in the form of (3) and ready for our two-level algorithm.
We consider the case where the coupling constraint (32c) is linear with parameters (a_i, b_ij, c_ij), while the nonconvex constraint (32d) restricts (x_i, x_j, x_ij, y_ij) to the surface of a rotated second-order cone.
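As a concrete illustration of such a surface constraint, consider the cone x_ij² + y_ij² = x_i x_j, a common form in power-flow-type problems; this specific h_ij is our assumption for illustration, not necessarily the paper's exact instance. The equality (the surface) is nonconvex, while the relaxation h_ij ≤ 0 (the cone's interior plus surface) is convex, which is what the "LB" relaxation in the next subsection exploits.

```python
import numpy as np

def h(xi, xj, xij, yij):
    """One plausible concrete instance of the nonconvex coupling (32d)
    (an illustrative assumption, not the paper's exact h_ij): the point
    lies on the surface of the rotated cone x_ij^2 + y_ij^2 = x_i * x_j."""
    return xij**2 + yij**2 - xi * xj

on_surface = h(2.0, 2.0, 2.0, 0.0)  # h = 0: the nonconvex equality constraint
inside = h(2.0, 2.0, 1.0, 0.5)      # h < 0: feasible for the convex relaxation
```

Relaxing the equality h = 0 to the inequality h ≤ 0 enlarges the feasible set from the cone's surface to a convex region, so the relaxed optimal value is a lower bound on the original one.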
We use the underlying network topologies from [75] to generate our test networks. Each network is partitioned into two, three, or four subregions. The graph information and centralized objectives from IPOPT are recorded in the first three columns of Table 2. The column "LB" records the objective value obtained by relaxing the constraint (32d) to h_ij(x_i, x_j, x_ij, y_ij) ≤ 0; this relaxation makes problem (32) convex and provides a lower bound on the global optimal value. Partition information is given in the last two columns.

We compare our algorithm with the PDD in [58] as well as the proximal ADMM-g proposed in [32] (which solves problem (5) instead). We set an absolute tolerance ε = 1.0e−5, and initialize (x_i, x_j, x_ij, y_ij) with (1, 1, 1, 0) and p_i with the initial value provided in [75]. For our two-level algorithm, we choose ω = 0.75, γ = 1.5, and β^1 = 1000. Each component of λ is restricted to [−10^6, 10^6]. The stopping criteria (14) suggest that ε_1^k and ε_2^k should be of the order O(ρ^k ε_3^k). Motivated by this observation, we terminate the inner-level ADMM when ‖Ax^t + Bx̄^t + z^t‖ ≤ max{ε, √m/(k · ρ^k)}, where m is the dimension of the vector and ρ^k is the inner ADMM penalty at outer iteration k. For PDD, as suggested in [58, Section V.B], we terminate the inner level of PDD when the relative gap of two consecutive augmented Lagrangian values is less than max{ε, 100 × (2/3)^k}; at the end of each inner-level rBSUM, primal feasibility is checked and the penalty is updated with the same ω and γ. Notice that the parameters used in the proposed algorithm and PDD are matched in our experiments. For proximal ADMM-g, we choose β = 1/ε^2 and ρ = 3/ε^2; additional proximal terms (1/2)‖x − x^t‖²_H and (1/2)‖x̄ − x̄^t‖²_H are added to the subproblem updates, where H = 0.01 I. All three algorithms terminate when ‖Ax^k + Bx̄^k‖ ≤ √m × ε. Test results are presented in Table 3.
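The inner stopping tolerance max{ε, √m/(k·ρ^k)} described above can be computed as follows; the function name `inner_tol` is ours, but the formula is taken directly from the text.

```python
import math

def inner_tol(eps, m, k, rho_k):
    """Inner ADMM stopping tolerance from the experiments: loose in early
    outer iterations, tightening to the absolute tolerance eps as the
    outer index k and the inner penalty rho_k grow."""
    return max(eps, math.sqrt(m) / (k * rho_k))

early = inner_tol(1.0e-5, 100, 1, 1000.0)  # dominated by sqrt(m)/(k * rho_k)
late = inner_tol(1.0e-5, 100, 20, 1.0e6)   # floors at the absolute tolerance eps
```

Early outer iterations thus solve the inner problem only coarsely, and the accuracy requirement tightens automatically as k and ρ^k grow, which is exactly the warm-start behavior the order estimate O(ρ^k ε_3^k) suggests.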
The number of outer-level updates (ALM multiplier updates for PDD and the two-level algorithm) and the total number of inner-level updates (rBSUM iterations for PDD and ADMM iterations for the two-level algorithm) are reported in columns "Outer" and "Inner", respectively. We see that both the proposed algorithm and PDD converge in all test cases, and both of them take around 10-30 outer-level iterations to drive the constraint violation ‖Ax + Bx̄‖ close to zero. PDD converges fast in three cases of network 300; however, in most cases it requires more total inner and outer iterations for convergence than the proposed algorithm. Such performance is consistent with the analysis in [58], where the inner-level rBSUM algorithm needs to run long enough to guarantee that each block variable achieves stationarity. The objective values and duality gaps of solutions generated by the three algorithms are recorded in "Obj" and "Gap (%)". We can see that both the proposed algorithm and PDD are able to achieve near global optimality, while the proposed algorithm finds solutions of even higher quality than PDD at termination. The algorithm running time (model building time excluded) is recorded in the last column "Time (s)". We would like to emphasize that, under similar algorithmic settings, the proposed two-level algorithm in general converges faster and shows better scalability than the other two algorithms. Even with a sufficiently large penalty on the slack variable z, the proximal ADMM-g does not achieve the desired primal feasibility for cases 300-4, 1354-3, and 1354-4 within 1000 iterations; for the other cases, it usually takes more time than the proposed algorithm. We point out that ADMM-g usually finds suboptimal solutions, and the duality gap can be as large as 42%. We believe this happens because problem (5) requires the introduction of large β(ε) and ρ(ε), which alter the structure of the original problem (3) and result in solutions of poor quality.
Moreover, such large parameters also cause numerical issues for the IPOPT solver and slow down the overall convergence, which is why ADMM-g takes a long time even when the number of iterations is relatively small for the first four test cases. We also tried a smaller penalty of order O(1/ε), in which case ADMM-g cannot achieve the desired feasibility level.

Minimization over Compact Manifold
We consider the following problem (35). Problem (35) is taken from the benchmark set COPS 3.0 [11] of nonlinear optimization problems. The same problem is used in [70] to test algorithms that preserve spherical constraints through curvilinear search. We compare solutions and computation times of our distributed algorithm with those obtained from the centralized IPOPT solver. Each test problem is first solved in a centralized way; the objective value and total running time are recorded in the second and third columns of Table 4. Using additional variables to break couplings in the objective (35a), we divide each test problem into three subproblems, which have the same numbers of variables, constraints, and objective terms (as in (35a)). For our two-level algorithm, we choose γ = 2 and ω = 0.5; the initial penalty β^1 is set to 100 for n_p ∈ {60, 90}, 200 for n_p ∈ {120, 180}, and 500 for n_p ∈ {240, 300}. The initial point for IPOPT is set to (x_i, y_i, z_i) = (0.2, 0.3, 0.1) for all i ∈ [n_p]. We set the bounds on each component of λ to ±10^6. The inner-level ADMM terminates when ‖Ax^t + Bx̄^t + z^t‖ ≤ √(3n_p)/(2500k), where k is the current outer-level index; the outer level terminates when ‖Ax^k + Bx̄^k‖ ≤ √(3n_p) × 1.0e−6.
The quality of the centralized solution is slightly better than that of the distributed solutions, while our proposed algorithm reduces the running time significantly in all but one case (n_p = 90) while ensuring feasibility. In addition, as indicated in Table 4, the numbers of inner- and outer-level iterations stay stable across all test cases, which suggests that the proposed algorithm scales well with the size of the problem. In view of the discussion in Section 4.2, we also compare with the penalty method, obtained by setting λ^k = 0 for all k, to demonstrate the effect of the outer-level dual variable. Without updating λ, the penalty method requires more inner/outer updates and substantially longer time.

A Multi-block Problem: Robust Tensor PCA
In this section, we use the robust tensor PCA problem considered in [32] to illustrate that the two-level framework can be generalized to the multi-block problem (4), and that, when Conditions 1 and 2 are satisfied, the resulting two-level algorithm can potentially accelerate one-level ADMM. In particular, given an estimate R of the CP-rank, the problem of interest is cast as a constrained minimization in which A ∈ R^{I_1×R}, B ∈ R^{I_2×R}, C ∈ R^{I_3×R}, and ⟦A, B, C⟧ denotes the sum of column-wise outer products of A, B, and C. We denote the mode-i unfolding of a tensor Z by Z_(i), the Khatri-Rao product of matrices by ⊙, the Hadamard product by •, and the soft shrinkage operator by S. We implement the two-level framework as in Algorithm 5. We first perform ADMM-g in steps 3-10.
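The soft shrinkage operator S mentioned above is the standard elementwise soft-thresholding map, i.e., the proximal operator of a scaled ℓ1 norm; a minimal sketch:

```python
import numpy as np

def soft_shrink(X, tau):
    """Soft shrinkage (soft-thresholding) operator S_tau: the proximal
    map of tau * ||.||_1, applied elementwise; entries with magnitude
    below tau are zeroed, larger ones are pulled toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

shrunk = soft_shrink(np.array([3.0, -0.5, 1.0]), 1.0)
```

In robust PCA formulations this operator produces the sparse component update in closed form, which is why the E-update of ADMM-g is cheap.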
We note some modifications to the ADMM-g described in [32]: since our two-level framework requires the introduction of an additional slack variable S, steps 6-8 have an additional term S_(1)^k, and S_(1)^{k+1} is then updated in step 9 via a gradient step as in ADMM-g; moreover, during the update of B, we also add a proximal term with coefficient δ_6/2. When the residual ‖Z^{k+1} + E^{k+1} + B^{k+1} + S^{k+1} − T‖_F is small enough, which serves as an indicator of the convergence of ADMM-g, we multiply the penalty β by some γ as long as β < 1.0e+6, and update the outer-level dual variable Λ as in step 12, where the projection step is omitted.
We experiment on tensors with dimensions I_1 = 30, I_2 = 50, and I_3 = 70, which match the largest instances tested in [32]; the initial estimate R is given by R_CP + 0.2 · R_CP. In our implementation, we set γ = 1.5, c = 3, and the initial β to 2; the inner-level ADMM-g terminates if the residual ‖Z^{k+1} + E^{k+1} + B^{k+1} + S^{k+1} − T‖_F is less than max{1e−5, 1e−3/K_out}, where K_out is the current outer-level iteration count. All other parameters, generation of problem data, and initialization follow the description in [32, Section 5]. For each value of the CP-rank, we generate 10 cases and let ADMM-g and the proposed two-level algorithm perform 2000 (inner) iterations.

Algorithm 5: Two-level Algorithm for Robust Tensor PCA
1: Initialize primal variables A^0, B^0, C^0, E^0, Z^0, B^0, S^0, dual variables Y^0, Λ^0, penalty parameters β, ρ = cβ, stepsize τ = 1/ρ, constants δ_i > 0 for i ∈ [6], γ > 1;
2: for k = 0, 1, 2, · · · do
3-10: (ADMM-g updates of the primal blocks and the dual variable Y)
11: if ‖Z^{k+1} + E^{k+1} + B^{k+1} + S^{k+1} − T‖_F is smaller than some threshold then
12: Λ ← Λ + βS^{k+1}, β ← γβ, ρ ← cβ, τ ← 1/ρ;
13: end if
14: end for

We calculate the geometric mean r_k^Geo of the primal residuals r_k = ‖Z^k + E^k + B^k − T‖_F over the 10 cases, and plot lg r_k^Geo as a function of the iteration count k in Figure 1. We also calculate the geometric mean e_k^Geo of the relative errors ‖Z^k − Z_true‖_F / ‖Z_true‖_F over the 10 cases, where Z_true is the generated true low-rank tensor, and plot e_k^Geo in Figure 2. For our two-level algorithm, the primal residual decreases relatively slowly during the first few inner ADMM-g runs; however, as we update the outer-level dual variable Λ and the penalty β, r_k^Geo drops significantly faster than that of ADMM-g, and achieves feasibility with high precision in around 500 inner iterations. The relative error e_k^Geo of the two-level algorithm converges slightly more slowly than that of ADMM-g, while it is able to catch up and obtain the same level of optimality.
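The geometric-mean aggregation used for r_k^Geo and e_k^Geo above can be computed as follows; the helper name `geo_mean` is ours.

```python
import numpy as np

def geo_mean(vals):
    """Geometric mean over test cases, as used for r_k^Geo and e_k^Geo;
    computed via exp(mean(log)), it is less distorted than the arithmetic
    mean by a single run whose residual is orders of magnitude off."""
    return float(np.exp(np.mean(np.log(np.asarray(vals)))))

g = geo_mean([1.0, 100.0])
```

For residuals spanning several orders of magnitude (as in Figures 1-2, plotted on a log scale), averaging in log space is the natural choice.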
The result suggests that our proposed two-level algorithm not only ensures convergence for a wider range of applications where ADMM may fail, but also accelerates ADMM on problems where convergence is already guaranteed.

Conclusion
This paper proposes a two-level distributed algorithm to solve the nonconvex constrained optimization problem (3). We identify some limitations of the standard ADMM algorithm, which in general cannot guarantee convergence when parallelization of constrained subproblems is considered. In order to overcome such difficulties, we propose a novel yet concise distributed reformulation, which enables us to separate the underlying complications into two levels: the inner level utilizes multi-block ADMM to facilitate parallel implementation, while the outer level uses the classic ALM to guarantee convergence to feasible solutions. Global convergence, local convergence, and iteration complexity of the proposed two-level algorithm are established, and we show how to extend the underlying algorithmic framework to more complicated nonconvex multi-block problems (4). In comparison to other existing algorithms capable of solving the same class of nonconvex constrained programs, the proposed algorithm exhibits advantages in terms of speed, scalability, and robustness. Thus, for general nonconvex constrained multi-block problems, the two-level algorithm can serve as an alternative to the workaround proposed in [32] when Condition 1 or 2 fails, and can potentially accelerate ADMM on problems where slow convergence is frequently encountered.
Proof We first show descent over the x and x̄ updates. By Assumption 3, we have the stated bounds; here the second equality is due to the choice of a and c = Bx̄^t, and the last inequality is due to (37) of Lemma 2. Next we show descent over the z and y updates. Notice that if we define h(z) = λᵀz + (β/2)‖z‖², then by Lemma 2 we can bound ∇h. The equality is due to the update of the dual variable in Algorithm 1, the optimality condition (38), and the identity involving −ρ(a + b)ᵀ(a + c) with c = z^{t−1}; the inequality is due to h(z) being β-strongly convex and (38) of Lemma 2. Since ρ = 2β, adding (41)-(43) proves (39).
To see that L_ρ(x^t, x̄^t, z^t, y^t) is bounded from below, we note that the function h(z) defined above is also Lipschitz differentiable with constant β, so we can define s^t accordingly. As a result, for all t ∈ Z_+, the claimed lower bound holds, where the last inequality uses the Lipschitz property of ∇h at s^t with constant β. Since λ is bounded, there exists M ∈ R such that ‖λ‖² ≤ M; since the outer-level penalty β^k is nondecreasing, we can define L := f* − M/β^1, where f* = min_{x∈X} f(x). The minimum is achievable due to Assumption 2. Now we are ready to prove Proposition 1.
Proof By Lemma 3, for any T ∈ Z_{++} we have a telescoping bound, which implies the existence of a particular index t ∈ [T] such that (45) holds. Using the fact that ‖Ax^t + Bx̄^t + z^t‖ = (β/ρ)‖z^{t−1} − z^t‖ = (1/2)‖z^{t−1} − z^t‖, the KKT errors can be bounded as claimed, where the first inequality is due to the triangle inequality, the second is due to the Cauchy-Schwarz inequality and ρ = ρ^k = 2β^k ≥ 2β^0 ≥ 1/2, the third is due to (45), and the last is due to the claimed upper bound on T.

A.2 Proof of Theorem 1
Proof Since x^k ∈ X, x̄^k ∈ X̄, and X, X̄ are bounded, we know Ax^k + Bx̄^k is bounded; since ‖Ax^k + Bx̄^k + z^k‖ ≤ ε_3^k and ε_3^k → 0, {z^k} is also bounded. We conclude that {(x^k, x̄^k, z^k)} is bounded and therefore has at least one limit point, denoted by (x*, x̄*, z*). We use {k_r} to denote a subsequence along which the iterates converge to (x*, x̄*, z*). Since X, X̄ are also closed, we have x* ∈ X and x̄* ∈ X̄. Moreover, Ax* + Bx̄* + z* = lim_{r→∞} (Ax^{k_r} + Bx̄^{k_r} + z^{k_r}) = 0. Therefore (x*, x̄*) is feasible for problem (3) if and only if z* = 0. If β^k is bounded, then according to the update scheme we have z^k → 0, so z* = 0. Now suppose β^k is unbounded. Since β^k is nondecreasing, any subsequence is also unbounded. By (13c), we have λ^{k_r}/β^{k_r} + z^{k_r} + y^{k_r}/β^{k_r} = 0.
Since {λ^{k_r}} is bounded, we may assume λ^{k_r} → λ*. Again we consider two cases. In the first case, suppose {y^{k_r}} has a bounded subsequence, and therefore a limit point y*. Then taking the limit on both sides of (46) along the subsequence converging to y*, we have z* = 0, so (x*, x̄*) is feasible. Otherwise, in the second case, lim_{r→∞} ‖y^{k_r}‖ = +∞. Denote ỹ^{k_r} := y^{k_r}/β^{k_r}. We know the sequence {ỹ^{k_r}} converges to −z*, because lim_{r→∞} ỹ^{k_r} = lim_{r→∞} y^{k_r}/β^{k_r} = lim_{r→∞} (−z^{k_r} − λ^{k_r}/β^{k_r}) = −z*.
Since N_X(x^{k_r}) and N_X̄(x̄^{k_r}) are cones and β^{k_r} > 0, we have (48), where ỹ^{k_r} := y^{k_r}/β^{k_r}. Due to the closedness of normal cones, we can take the limit in (48); then (46) and (13d) imply that (x*, x̄*) is a stationary point of problem (18).

A.3 Proof of Theorem 2
Proof We assume the subsequence {(x^{k_r}, x̄^{k_r}, z^{k_r}, y^{k_r})} converges to the limit point (x*, x̄*, z*, y*). Using a similar argument as in the proof of Theorem 1, we have x* ∈ X, x̄* ∈ X̄, and Ax* + Bx̄* + z* = 0. It remains to show z* = 0 to establish primal feasibility. If β^k is bounded, then we have z^k → 0, so z* = 0; if β^k is unbounded, by taking limits on both sides of (46), we also have z* = 0, since λ^k is bounded and y^{k_r} converges to y*. Therefore (x*, x̄*) satisfies (7c). Taking limits in (13a) and (13b) as k → ∞, we get (7a) and (7b), respectively. This completes the proof.

A.4 Proof of Theorem 3
Proof We use k to index outer-level iterations of Algorithm 3 and t to index inner-level iterations of Algorithm 1. By Proposition 1, Assumption 4, and the fact that β^k = β^0 γ^k, the number of iterations T_k of the k-th inner ADMM, defined in (17), satisfies a bound growing geometrically in k. Summing T_k over k ∈ [K], we obtain the bound (50) on the total number of ADMM iterations. Since conditions (19a) and (19b) are maintained at the termination of each inner-level ADMM, the total number of outer-level ALM iterations, K, depends on the rate at which (19c) is satisfied. By inequality (44) and Assumption 4, at the termination of each ADMM we obtain a bound on the augmented Lagrangian value; Assumption 2, the boundedness of λ^k, and this inequality then bound the constraint violation. As a result, there exists an index K such that ‖Ax^K + Bx̄^K‖ ≤ ε and γ^K = O(1/ε²). Plugging γ^K = O(1/ε²) into (50) gives the claimed O(1/ε⁴) complexity upper bound. For the second claim, consider the K-th inner ADMM, at the termination of which we have ‖Ax^K + Bx̄^K + z^K‖ ≤ ε/2. Since ‖Ax^K + Bx̄^K‖ ≤ ‖Ax^K + Bx̄^K + z^K‖ + ‖z^K‖ ≤ ε/2 + ‖z^K‖, it suffices to find an index K such that ‖z^K‖ ≤ ε/2. Since λ̃^k and λ^k are bounded, we have ‖z^k‖ = ‖λ̃^k − λ^k‖/β^k = O(1/γ^k). As a result, we can choose K such that γ^K = O(1/ε). Plugging γ^K = O(1/ε) into (50) gives the claimed O(1/ε³) complexity upper bound.
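The geometric-sum step in this proof — summing per-iteration counts that grow like γ^{2k} when β^k = β^0 γ^k — can be sanity-checked numerically: the sum is within a constant factor of its last term, which is why only γ^K enters the final bound.

```python
# Sanity check of the geometric-sum step: with beta^k = beta^0 * gamma^k,
# sum_{k=1}^K gamma^(2k) is dominated by its last term up to the constant
# gamma^2 / (gamma^2 - 1), so the total inner work is O(gamma^(2K)).
gamma, K = 1.5, 30
total = sum(gamma ** (2 * k) for k in range(1, K + 1))
last_term = gamma ** (2 * K)
assert last_term <= total <= last_term * gamma**2 / (gamma**2 - 1)
```

This is the closed form of the geometric series, γ²(γ^{2K} − 1)/(γ² − 1), bounded above by γ^{2K} · γ²/(γ² − 1); the constant depends only on γ, not on K or ε.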

A.5 Proof of Theorem 4
Proof According to [32, Theorem 4.2], given the inner ADMM penalty ρ^k, which is a constant multiple of β^k, it is sufficient to let the k-th ADMM run T_k = O((ρ^k)²/ε²) iterations in order to have some t ∈ [T_k] such that the primal and dual residuals of ADMM at iteration t are less than ε/2. Denote this solution by x^k = (x_1^k, · · · , x_p^k). Since we update penalties in each outer iteration as β^k = β^0 γ^k, the total number of inner-level iterations is bounded accordingly, where K is the total number of outer-level iterations. It remains to choose K such that ‖Ax^K − b‖ ≤ ε, and we consider two cases.