1 Introduction

This paper develops a new two-level distributed algorithm with global and local convergence guarantees for solving general smooth and nonsmooth constrained nonconvex optimization problems. We will start with a general constrained optimization model that is motivated by nonlinear network flow problems, and then explore reformulations for distributed computation. This process of reformulation leads us to observe two structural properties that a distributed reformulation should possess, which in fact pose some challenges to existing distributed algorithms in terms of convergence and practical performance. This observation inspired us to develop the two-level distributed algorithm. We summarize our contributions at the end of this section.

1.1 Constrained nonconvex optimization over a network

Consider a connected, undirected graph \(G(\mathcal {V},\mathcal {E})\) with a set of nodes \(\mathcal {V}\) and a set of edges \(\mathcal {E}\). A centralized constrained optimization problem on G is given as

$$\begin{aligned} \min \quad&\sum _{i\in \mathcal {V}}f_i(x_i) \end{aligned}$$
(1a)
$$\begin{aligned} \mathrm {s.t.} \quad&h_i(x_i,\{x_j\}_{j\in \delta (i)}) = 0, \quad \forall i \in \mathcal {V}, \end{aligned}$$
(1b)
$$\begin{aligned}&g_i(x_i, \{x_j\}_{j\in \delta (i)}) \le 0,\quad \forall i \in \mathcal {V}, \end{aligned}$$
(1c)
$$\begin{aligned}&x_i\in \mathcal {X}_i, \quad \forall i \in \mathcal {V}, \end{aligned}$$
(1d)

where each node \(i\in \mathcal {V}\) of the graph G is associated with a decision variable \(x_i\) and a cost function \(f_i(x_i)\) as in (1a). Variable \(x_i\) and the variables \(x_j\) of i’s adjacent nodes \(j\in \delta (i)\) are coupled through constraints (1b)–(1c), and \(\mathcal {X}_i\) in (1d) represents constraints that involve only \(x_i\). The functions \(f_i\), \(h_i\), \(g_i\), and the set \(\mathcal {X}_i\) may be nonconvex.

Any constrained optimization problem can be reformulated as (1) after proper transformation. An especially interesting motivation for us is the class of nonlinear network flow problems. In this case, the graph G represents a physical network such as an electric power network, a natural gas pipeline network, or a water transport network, where the variables \(x_i\) in (1) are nodal potentials such as electric voltages, gas pressures, or water pressures, and the constraints \(h_i\) and \(g_i\) are usually nonconvex functions that describe the physical relations between nodal potentials and flows on the edges, flow balance at nodes, and flow capacity constraints. Notice that a node i in the graph can also represent a sub-network of the entire physical network, in which case the constraints could involve variables in adjacent sub-networks. There has been much recent interest in solving nonlinear network flow problems, e.g. the optimal power flow problem in electric power networks [33], the natural gas nomination problem [52], and the water network scheduling problem [9].

In many situations, it is desirable to solve problem (1) in a distributed manner, where each node i represents an individual agent that solves a localized problem, while agents coordinate with their neighbors to solve the overall problem. Each agent needs to handle its own set of local constraints \(h_i\), \(g_i\), and \(\mathcal {X}_i\). For example, agents may be geographically dispersed with local constraints representing the physics of the subsystems, which cannot be controlled by other agents; or agents may have private data in their constraints, which cannot be shared with other agents; or the sheer amount of data needed to describe the constraints or objective could be too large to be stored or transmitted in distributed computation between agents. These practical considerations mean that each agent in a distributed algorithm has to deal with a set of complicated, potentially nonconvex, constraints.

1.2 Necessary structures of distributed formulations

In order to do distributed computation, the centralized formulation (1) first needs to be transformed into a formulation to which a distributed algorithm could be applied. We call such a formulation a distributed formulation, whose form may depend on specific distributed algorithms as well as on the structure of the distributed computation, e.g. which variables and constraints are controlled by which agents and in what order computation and communication can be carried out. Despite the great variety of distributed formulations, we want to identify some desirable and necessary features for a distributed formulation.

One desirable feature is the capability of parallel decomposition so that all agents can solve their local problems in parallel, rather than in sequence. To realize this, each agent needs a local copy of its neighboring agents’ variables. For problem (1), we may introduce a local copy \(x^i_j\) of the original variable \(x_j\) and a global copy \(\bar{x}_j\), and enforce consensus as

$$\begin{aligned} x_j = \bar{x}_j, \; x^i_j = \bar{x}_j, \quad \forall j\in \mathcal {V}, \; i\in \delta (j). \end{aligned}$$
(2)

Using this duplication scheme, a distributed formulation of (1) can be written as

$$\begin{aligned} \min _{x, \bar{x}} \quad&f({x})=\sum _{i\in \mathcal {V}} f_i(x^i) \end{aligned}$$
(3a)
$$\begin{aligned} \mathrm {s.t.} \quad&A{x}+B\bar{x}=0, \end{aligned}$$
(3b)
$$\begin{aligned}&x^i \in \mathcal {X}_i,~~\forall i\in \mathcal {V},\quad {\bar{x}\in \bar{\mathcal {X}}}. \end{aligned}$$
(3c)

In problem (3), the optimization variables are \({x}=[\{x^i\}_{i\in \mathcal {V}}]\in \mathbb {R}^{n_1}\) and \(\bar{x}=[\{\bar{x}_j\}_{j\in \mathcal {V}}]\in \mathbb {R}^{n_2}\). Each subvector \(x^i=[x_i,\{x^i_j\}_{j\in \delta (i)}]\in \mathbb {R}^{n_{1i}}\) of x denotes all the local variables controlled by agent i, including the original variable \(x_i\) and the local copies \(x^i_j\); each subvector \(\bar{x}_j\) of \(\bar{x}\) denotes a global copy of \(x_j\). The set \(\mathcal {X}_i\subseteq \mathbb {R}^{n_{1i}}\) is defined as \(\mathcal {X}_i:=\{v:~h_i(v) = 0,~g_i(v)\le 0\}\), so the original constraints (1b)–(1c) are decoupled into each agent’s local constraints \(\mathcal {X}_i\), which also absorb the constraints (1d). Additionally, the global copy \(\bar{x}\) is constrained in some simple convex set \(\bar{\mathcal {X}}\subseteq \mathbb {R}^{n_2}\). The only coupling among agents is (3b), which formulates the consensus constraints (2) with \(A\in \mathbb {R}^{m\times n_1}\) and \(B\in \mathbb {R}^{m\times n_2}\). An alternating optimization scheme is then natural, as all the agents can solve their subproblems over the \(x^i\)’s in parallel once \(\bar{x}\) is fixed; and once the \(x^i\)’s are updated and fixed, the subproblems over \(\bar{x}\) can also be solved in parallel.

In fact, for any constrained optimization problem, not necessarily a network flow type problem, if distributed computation is considered, the variables of the centralized problem need to be grouped into variables \(x^i\) in a distributed formulation for agents i according to the decision structure, and duplicate variables \(\bar{x}\) need to be introduced to decouple the agents’ constraints. In this way, problem (3) provides a general formulation for distributed computation of constrained optimization problems. Conversely, due to the necessity of duplicating variables, any distributed formulation of a constrained program necessarily shares some key structures of (3). In particular, problem (3) has two simple but crucial properties. Namely,

  • Property 1: With the matrices A and B defined through the consensus constraints (2), the image of A strictly contains the image of B, i.e. \(\textrm{Im}(A) \supsetneq \textrm{Im}(B)\).

  • Property 2: Each agent i may face local nonconvex constraints \(\mathcal {X}_i\).

Property 1 follows from the fact that, for any given value of \(\bar{x}_j\) in (2), there is always a feasible solution \((x_j, x^i_j)\) that satisfies the equalities in (2), which shows \(\textrm{Im}(B)\subseteq \textrm{Im}(A)\); on the other hand, if \(x_j\ne x^i_j\), then there does not exist an \(\bar{x}_j\) that satisfies both equalities in (2), so the inclusion is strict. Property 2 follows from our desire to decompose the computation for different agents.
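As a small illustration of Property 1 (our own example, not one of the paper’s applications), consider a graph with two nodes joined by a single edge. Then (2) consists of four equalities, and ordering the variables as \(x=(x_1,x^1_2,x_2,x^2_1)\) and \(\bar{x}=(\bar{x}_1,\bar{x}_2)\), one valid choice in (3b) is

$$\begin{aligned} A=\begin{pmatrix} 1 &amp; 0 &amp; 0 &amp; 0\\ 0 &amp; 0 &amp; 0 &amp; 1\\ 0 &amp; 0 &amp; 1 &amp; 0\\ 0 &amp; 1 &amp; 0 &amp; 0 \end{pmatrix},\qquad B=\begin{pmatrix} -1 &amp; 0\\ -1 &amp; 0\\ 0 &amp; -1\\ 0 &amp; -1 \end{pmatrix}, \end{aligned}$$

so that \(\textrm{Im}(A)=\mathbb {R}^4\) while \(\textrm{Im}(B)\) is only two-dimensional, hence \(\textrm{Im}(A)\supsetneq \textrm{Im}(B)\).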

In this paper, we will show that the above two properties of distributed constrained optimization pose a significant challenge to the theory and practice of existing distributed optimization algorithms. In particular, existing distributed algorithms based on the alternating direction method of multipliers (ADMM) may fail to converge for the general nonconvex constrained problem (3) without further reformulation or relaxation. Before proceeding, we summarize our contributions.

1.3 Summary of contributions

The contributions of the paper can be summarized below.

Firstly, we propose a new reformulation and a two-level distributed algorithm for solving nonconvex constrained optimization problem (1)–(2), which embeds a specially structured three-block ADMM at the inner level in an augmented Lagrangian method (ALM) framework. The proposed algorithm maintains the flexibility of ADMM in achieving distributed computation.

Secondly, we prove global and local convergence as well as iteration complexity results for the proposed two-level algorithm, and illustrate that the underlying algorithmic framework can be extended to more complicated nonconvex multi-block problems. For the convergence of ADMM, we allow each nonconvex subproblem to be solved to a stationary point with a certain improvement in the objective function compared to the previous iterate, which mildly relaxes the global optimality of nonconvex subproblems commonly assumed in the ADMM literature. Our convergence analysis builds on classical and recent works on ADMM and ALM, and our results are derived by relating these two methods analytically.

Thirdly, we provide extensive computational tests of our two-level algorithm on nonconvex network flow problems, parallel minimization of nonconvex functions over compact manifolds, and a robust tensor PCA problem from machine learning. Numerical results demonstrate the advantages of the proposed algorithm over existing ones, including randomized updates, modified ADMM, and a centralized solver, in terms of the speed at which the optimality gap, the feasibility gap, or both are closed. Moreover, our test results on the multi-block robust tensor PCA problem suggest that the proposed two-level algorithm not only ensures convergence for a wider range of applications where ADMM may fail, but also tends to accelerate ADMM on problems where convergence of ADMM is already guaranteed.

1.4 Notation

Throughout this paper, we use \(\mathbb {Z}_{+}\) (resp. \(\mathbb {Z}_{++}\)) to denote the set of nonnegative (resp. positive) integers, and \(\mathbb {R}^n\) to denote the n-dimensional real Euclidean space. For \(x,y\in \mathbb {R}^n\), the inner product is denoted by \(x^\top y\) or \(\langle x, y \rangle \); the Euclidean norm is denoted by \(\Vert x\Vert = \sqrt{\langle x, x\rangle }\). A vector x may consist of J subvectors \(x_j\in \mathbb {R}^{n_j}\) with \(\sum _{j=1}^J n_j=n\); in this case, we will write \(x = [\{x_j\}_{j\in [J]}]\), where \([J] = \{1,\ldots , J\}\). Occasionally, we use \(x_i\) to denote the i-th component of x if there is no confusion to do so. For a matrix A, denote its largest singular value by \(\Vert A\Vert \) and image space by \(\textrm{Im}(A)\). We use \(B_r(x)\) to denote the Euclidean ball centered at x with radius \(r>0\). For a closed set \(C\subset \mathbb {R}^n\), the interior of C is denoted by \(\textbf{Int}~C\), the projection operator onto C is denoted by \(\textrm{Proj}_C(x)\), and the indicator function of C is denoted by \(\mathbb {I}_C(x)\), which takes value 0 if \(x\in C\) and \(+\infty \) otherwise.

The rest of this paper is organized as follows. In Sect. 2, we review the literature and summarize two conditions that are crucial to the convergence of ADMM and that essentially contradict Properties 1 and 2. In Sect. 3, we propose our new reformulation and a two-level algorithm for solving problem (3) in a distributed way. In Sect. 4, we provide the global convergence as well as iteration complexity results, and show that our scheme can be applied to more complicated multi-block problems. Then in Sect. 5, we show the local convergence result under standard second-order assumptions. Finally, we present computational results in Sect. 6 and conclude in Sect. 7.

2 Related literature

In this section, we review the literature on ADMM and other distributed algorithms, and identify some limitations of the standard ADMM approach in solving problem (3).

2.1 Earlier works and ADMM for convex problems

ALM and the method of multipliers (MoM) were proposed in the late 1960s by Hestenes [27] and Powell [53]. ALM enjoys more robust convergence properties than dual decomposition [3, 55], and convergence for partial elimination of constraints has been studied [4]. ADMM was proposed by Glowinski and Marrocco [18] and Gabay and Mercier [17] in the mid-1970s, and has deep roots in maximal monotone operator theory and numerical algorithms for solving partial differential equations [12, 14, 51]. ADMM solves the subproblems in ALM by alternately optimizing through blocks of variables and in this way achieves distributed computation. The convergence of ADMM with two block variables is proved for convex optimization problems [14, 16–18] and the \(\mathcal {O}(1/k)\) convergence rate is established [25, 26, 49]. Some applications in distributed consensus problems include [6, 45, 46, 50, 59, 68]. More recent convergence results on multi-block convex ADMM can be found in [7, 8, 10, 22–24, 29, 37–41].

2.2 ADMM for nonconvex problems

The convergence of ADMM has been observed for many nonconvex problems with various applications in matrix completion and factorization [57, 72–74], optimal power flow [15, 44, 61], asset allocation [69], and polynomial optimization [32], among others. For convergence theory, several conditions have been proposed to guarantee convergence on structured nonconvex problems that can be abstracted in the following form

$$\begin{aligned} \min _{x_1,\ldots , x_p, z}~&\sum _{i=1}^p f_i(x_i) + h(z) + g(x_1,\ldots ,x_p,z)\nonumber \\ ~\mathrm {s.t.}~&\sum _{i=1}^pA_ix_i + Bz = b,~ x_i\in \mathcal {X}_i~\forall i\in [p]. \end{aligned}$$
(4)

We summarize some convergence conditions in Table 1. For instance, Hong et al. [30] studied ADMM for nonconvex consensus and sharing problems under cyclic or randomized update orders. Li and Pong [36] and Guo et al. [21] studied two-block ADMM, where the coefficient matrix of one of the blocks is the identity. One of the most general frameworks for proving convergence of multi-block ADMM was proposed by Wang et al. [67], where the authors showed global subsequential convergence with a rate of \(o(1/\sqrt{k})\). A more recent work by Themelis and Patrinos [62] established a primal equivalence of nonconvex ADMM and Douglas-Rachford splitting.

Another line of research explores some variants of ADMM. Wang et al. [65, 66] studied the nonconvex Bregman-ADMM, where a Bregman divergence term is added to the augmented Lagrangian function during each block update to facilitate the descent of a certain potential function. Gonçalves, Melo, and Monteiro [19] provided an alternative convergence rate proof of proximal ADMM applied to convex problems, which was shown to be an instance of a more general non-Euclidean hybrid proximal extragradient framework. The two-block, multi-block, and Jacobi-type extensions of this framework to nonconvex problems can be found in [20, 47, 48], where an iteration complexity of \(\mathcal {O}(1/\sqrt{k})\) was also established. Jiang et al. [31] proposed two variants of proximal ADMM: some proximal terms are added to the first p block updates, while for the last block, either a gradient step is performed or a quadratic approximation of the augmented Lagrangian is minimized.

Table 1 Comparisons of the nonconvex ADMM literature

For general nonconvex and nonsmooth problems, we note that the convergence of ADMM relies on the following two conditions.

  • Condition 1: Denote \(A:=[A_1, \ldots , A_p]\); then \(\textrm{Im}([A,b])\) \(\subseteq \textrm{Im}(B)\).

  • Condition 2: The last block objective function h(z) is Lipschitz differentiable.

Due to the sequential update order of ADMM, \(z^{k}\) is obtained after \(x^{k}\) is calculated. If Condition 1 on the images of A and B is not satisfied, then it is possible that \(x^k\) converges to some \(x^*\) such that there is no \(z^*\) satisfying \(Ax^*+Bz^*=b\). In addition, Condition 2 provides a way to control dual iterates by primal iterates via the optimality condition of the z-subproblem. This relation requires the unconstrained optimality condition of the z-update, so the last block variable z cannot carry any additional constraints. See also [67] for some relevant discussions. As indicated in Table 1, these two conditions (and their variants) are almost necessary for ADMM to converge in the absence of convexity. We also note that, even for convex problems, these two conditions are used to relax the strong convexity assumption on the objective [39] or to accelerate ADMM with \(\mathcal {O}(1/k^2)\) iteration complexity [63].

It turns out that the two conditions and the two properties we mentioned in Sect. 1.2 may conflict with each other. By Property 1, the image of A strictly contains the image of B, so by Condition 1, we should update the local variables after the global variable in each ADMM iteration to ensure feasibility. However, by Property 2, each local variable is subject to some local constraints, so Condition 2 cannot be satisfied; technically speaking, we cannot utilize the unconstrained optimality condition of the last block to link primal and dual variables, which again makes it difficult to ensure primal feasibility of the solution. When ADMM is directly applied to nonconvex problems, divergence is indeed observed [44, 61, 67]. As a result, for many applications in the form of (3) where the above two conditions are not available, the ADMM framework cannot guarantee convergence.
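To make the conflict concrete, return to the two-node example in Sect. 1.2 (again our own illustration): there \(b=0\), and

$$\begin{aligned} \textrm{Im}([A,b]) = \textrm{Im}(A)=\mathbb {R}^4 \not \subseteq \textrm{Im}(B), \end{aligned}$$

so Condition 1 fails when the global copy \(\bar{x}\) is the last block; if instead the local variables x are placed last so that Condition 1 holds, the nonconvex sets \(\mathcal {X}_i\) sit on the last block and Condition 2 fails.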

After completing a draft of this paper, we were informed of an ADMM-based approach in [31], where the authors proposed to solve the following relaxation of (3):

$$\begin{aligned} \min _{x\in \mathcal {X}, \bar{x}\in \bar{\mathcal {X}},z} \quad&f({x}) + \frac{\beta (\epsilon )}{2}\Vert z\Vert ^2\quad \mathrm {s.t.}\quad A{x}+B\bar{x}+z=0. \end{aligned}$$
(5)

Notice first that, as proved in [31], in order to achieve a desired feasibility level \(\Vert Ax+B\bar{x}\Vert = \mathcal {O}(\epsilon )\), the coefficient \(\beta (\epsilon )\) and the ADMM penalty need to be as large as \(\mathcal {O}(1/\epsilon ^2)\). Such large parameters may lead to slow convergence and large optimality gaps. Also notice that applying ADMM to (5) may produce an approximate stationary solution to (3) even when problem (3) is infeasible to begin with. As we will show in Sect. 4, our proposed two-level algorithm achieves the same order of iteration complexity as the reformulation (5) and the one-level ADMM approach proposed in [31], while additionally providing information on ill-conditioning and infeasibility; in Sect. 6, we demonstrate computationally that the proposed algorithm robustly converges on large-scale constrained nonconvex programs at a faster speed and obtains solutions of higher quality.

2.3 Other distributed algorithms

Some other distributed algorithms not based on ADMM are also studied in the literature. Hong [28] introduced a proximal primal-dual algorithm for distributed optimization problems, where a proximal term is added to cancel out cross-product terms in the augmented Lagrangian function. Lan and Zhou [35] proposed a randomized incremental gradient algorithm for a class of convex problems over a multi-agent network. Lan and Yang [34] proposed accelerated stochastic algorithms for nonconvex finite-sum and multi-block problems; interestingly, the analysis for the multi-block problem also requires the last block variable to be unconstrained with an invertible coefficient matrix and a Lipschitz differentiable objective, which further confirms the necessity of Conditions 1 and 2. We end this subsection with a recent work by Shi et al. [58]. They studied the problem

$$\begin{aligned} \min _{\textbf{x}, \textbf{y}}~~f(\textbf{x}, \textbf{y}) + \sum _{j=1}^m \tilde{\phi }_j(\textbf{y}_j)~~ \mathrm {s.t.}~~ h(\textbf{x}, \textbf{y}) = 0,~~ g_i(\textbf{x}_i)\le 0,~~\textbf{x}_i\in \mathcal {X}_i~~ \forall i \in [n]. \end{aligned}$$
(6)

The variables \(\textbf{x}\) and \(\textbf{y}\) are divided into n and m subvectors, respectively. \(f(\textbf{x}, \textbf{y})\), \(h(\textbf{x}, \textbf{y})\), and \(g_i(\textbf{x}_i)\) are continuously differentiable, \(\tilde{\phi }_j(\textbf{y}_j)\) is a composite function, and the \(\mathcal {X}_i\)’s are convex. The authors proposed a double-loop penalty dual decomposition method (PDD). The overall algorithm follows the ALM framework, where the coupling constraint \(h(\textbf{x}, \textbf{y})=0\) is relaxed and each ALM subproblem is solved by a randomized block update scheme. We note that randomization is crucial in their convergence analysis, and a deterministic implementation of the inner-level algorithm for solving the ALM subproblem may not converge when nonconvex functional constraints are present.

3 A key reformulation and a two-level algorithm

We say \(({x}^*, \bar{x}^*, y^*)\in \mathbb {R}^{n_1}\times \mathbb {R}^{n_2}\times \mathbb {R}^{m}\) is a stationary point of problem (3) if it satisfies the following condition

$$\begin{aligned}&0\in \nabla f({x}^*)+ A^\top y^* + N_{\mathcal {X}}({x}^*), \end{aligned}$$
(7a)
$$\begin{aligned}&0 \in B^\top y^* + N_{\bar{\mathcal {X}}}(\bar{x}^*), \end{aligned}$$
(7b)
$$\begin{aligned}&0 = A{x}^*+B\bar{x}^*; \end{aligned}$$
(7c)

or equivalently, \(0\in \partial L({x}^*, \bar{x}^*, y^*)\), where

$$\begin{aligned} L({x}, \bar{x}, y) := f(x) + \mathbb {I}_{\mathcal {X}}({x}) + \mathbb {I}_{\bar{\mathcal {X}}}(\bar{x})+ \langle y, A{x} + B\bar{x}\rangle . \end{aligned}$$
(8)

In equations (7) and (8), the notation \(N_{\mathcal {X}}(x)\) denotes the general normal cone of \(\mathcal {X}\) at \(x\in \mathcal {X}\) [56, Def 6.3], and \(\partial L(\cdot )\) denotes the general subdifferential of \(L(\cdot )\) [56, Def 8.3]. Some properties and calculus rules of normal cones and the general subdifferential can be found in [56, Chap 6, 8, 10].

It can be shown that if \(({x}^*, \bar{x}^*)\) is a local minimum of (3) and satisfies some mild regularity condition, then condition (7) is satisfied [56, Thm 8.15]. If \(\mathcal {X}\) and \(\bar{\mathcal {X}}\) are defined by finitely many continuously differentiable constraints, then condition (7) is equivalent to the well-known KKT condition of problem (3) under some constraint qualification. Therefore, condition (7) can be viewed as a generalized first-order necessary optimality condition for nonsmooth constrained problems. Our goal is to find such a stationary point \(({x}^*, \bar{x}^*, y^*)\) for problem (3).

3.1 A key reformulation

As analyzed in the previous section, since directly applying ADMM to a distributed formulation of the general constrained nonconvex problem (3) cannot guarantee convergence without using the relaxation scheme in [31], we want to go beyond the standard ADMM framework. We propose two steps for achieving this. The first step is taken in this subsection to propose a new reformulation, and the second step is taken in the next subsection to propose a new two-level algorithm for the new reformulation.

We consider the following reformulation of (3)

$$\begin{aligned} \min _{ x\in \mathcal {X}, \bar{x} \in \bar{\mathcal {X}},z} \quad f(x) \quad \mathrm {s.t.} \quad A{x} + B\bar{x} +z = 0, ~z=0. \end{aligned}$$
(9)

The idea of adding a slack variable \(z\in \mathbb {R}^m\) has two consequences. The first consequence is that the linear coupling constraint \(Ax+B\bar{x}+z=0\) now has three blocks, and the last block is an identity matrix \(I_m\), whose image is the whole space. Given any x and \(\bar{x}\), we can always let \(z = -Ax-B\bar{x}\) to make the constraint satisfied. The second consequence is that the artificial constraint \(z=0\) can be treated separately from the coupling constraint. Notice that a direct application of ADMM to problem (9) still does not guarantee convergence since Conditions 1 and 2 are not satisfied yet. So it is necessary to separate the linear constraints into two levels. If we ignore \(z=0\) for the moment, existing techniques in ADMM analysis can be applied to the rest of the problem. Since we want to utilize the unconstrained optimality condition of the last block, we can relax \(z=0\). This observation motivates us to choose ALM. To be more specific, consider the problem

$$\begin{aligned} \min _{ x\in \mathcal {X}, \bar{x} \in \bar{\mathcal {X}},z} \quad f(x)+\langle \lambda ^k, z\rangle + \frac{\beta ^k}{2}\Vert z\Vert ^2\quad \mathrm {s.t.} \quad A{x}+B\bar{x}+z=0, \end{aligned}$$
(10)

which is obtained by dualizing constraint \(z=0\) with \(\lambda ^k\in \mathbb {R}^m\) and adding a quadratic penalty \(\frac{\beta ^k}{2}\Vert z\Vert ^2\) with \(\beta ^k>0\). The augmented Lagrangian term \(\langle \lambda ^k, z\rangle + \frac{\beta ^k}{2}\Vert z\Vert ^2\) can be viewed as an objective function in variable z, which is not only Lipschitz differentiable but also strongly convex. Problem (10) can be solved by a three-block ADMM in a distributed fashion when a separable structure is available. Notice that the first-order optimality condition of problem (10) at a stationary solution \(({x}^{k}, \bar{x}^{k}, z^{k}, y^{k})\) is

$$\begin{aligned}&0\in \nabla f(x^{k}) +A^\top y^{k}+ N_{\mathcal {X}}({x}^{k}), \end{aligned}$$
(11a)
$$\begin{aligned}&0\in B^\top y^{k} +N_{\bar{\mathcal {X}}}(\bar{x}^{k}), \end{aligned}$$
(11b)
$$\begin{aligned}&0 =\lambda ^k+\beta ^kz^{k} +y^{k}, \end{aligned}$$
(11c)
$$\begin{aligned}&0 = A{x}^{k}+B\bar{x}^{k} +z^{k}. \end{aligned}$$
(11d)

However, such a solution may not satisfy primal feasibility \(Ax+B\bar{x}=0\), which is the only difference from the optimality condition (7) (note that (11c) is analogous to the dual feasibility in variable z in the KKT condition). Fortunately, ALM offers a scheme to drive the slack variable z to zero by updating \(\lambda \), and we can expect the iterates to converge to a stationary point of the original problem (3). In summary, reformulation (9) separates the complication of the original problem into two levels, where the inner level (10) provides a formulation that simultaneously satisfies Conditions 1 and 2, and the outer level drives z to zero. We propose a two-level algorithmic architecture in the next subsection to realize this.
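Before moving on, one way to see how the outer level stitches (10) back to (3): combining the inner-level dual relation (11c) with the classical (unprojected) ALM multiplier update for the dualized constraint \(z=0\) gives

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k + \beta ^k z^{k} = -y^{k}, \end{aligned}$$

so the outer multiplier simply tracks the negative of the inner dual variable; and once \(z^k\rightarrow 0\), conditions (11a), (11b), and (11d) collapse to the stationarity condition (7) of problem (3). (The safeguarded update used in the next subsection additionally projects \(\lambda ^k+\beta ^k z^k\) onto a box.)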

3.2 A two-level algorithm

The proposed algorithm consists of two levels, both of which are based on the augmented Lagrangian framework. The inner-level algorithm is described in Algorithm  1, which uses a three-block ADMM to solve problem (10) and its iterates are indexed by t. The outer-level algorithm is described in Algorithm  2 with iterates indexed by k.

Given \(\lambda ^k\in \mathbb {R}^m\) and \(\beta ^k>0\), the augmented Lagrangian function associated with the k-th inner-level problem (10) is defined as

$$\begin{aligned} L_{\rho ^k}({x},\bar{x}, z,y) :=&f(x) + \mathbb {I}_{\mathcal {X}}({x}) + \mathbb {I}_{\bar{\mathcal {X}}}(\bar{x}) + \langle \lambda ^k, z\rangle +\frac{\beta ^k}{2}\Vert z\Vert ^2 \nonumber \\&+ \langle y, A{x}+B\bar{x}+z\rangle + \frac{\rho ^k}{2}\Vert A{x}+B\bar{x}+z\Vert ^2, \end{aligned}$$
(12)

where \(y \in \mathbb {R}^{m}\) is the dual variable for constraint \(Ax+B\bar{x}+z=0\) and \(\rho ^k\) is a penalty parameter for ADMM. In view of (11), the k-th inner-level ADMM aims to find an approximate stationary solution \((x^k, \bar{x}^k, z^k, y^k)\) of (10) in the sense that there exist \(d_1^k\), \(d_2^k\), and \(d_3^k\) such that

$$\begin{aligned}&d^{k}_1\in \nabla f({x}^k) +A^\top y^{k}+N_{\mathcal {X}}({x}^k), \end{aligned}$$
(13a)
$$\begin{aligned}&d^k_2 \in B^\top y^k+ N_{\bar{\mathcal {X}}}(\bar{x}^k), \end{aligned}$$
(13b)
$$\begin{aligned}&0 = \lambda ^k + \beta ^kz^k + y^k, \end{aligned}$$
(13c)
$$\begin{aligned}&d^k_3 = A{x}^{k}+B\bar{x}^{k}+z^{k}, \end{aligned}$$
(13d)
$$\begin{aligned}&\Vert d_i^k\Vert \le \epsilon _i^k, \ \forall i\in [3], \end{aligned}$$
(13e)

where \(\epsilon _i^k\)’s are positive tolerances. The optimality conditions of \(x^{t}\) in Line 5 and \(\bar{x}^t\) in Line 7 of Algorithm  1 read:

$$\begin{aligned} 0 \in&\nabla f(x^{t}) + A^\top y^{t-1} + \rho ^k A^\top (Ax^{t} + B\bar{x}^{t-1} + z^{t-1}) + N_{\mathcal {X}}(x^t), \\ 0 \in&B^\top y^{t-1} + \rho ^k B^\top (Ax^{t} + B\bar{x}^{t} + z^{t-1}) + N_{\bar{\mathcal {X}}}(\bar{x}^t). \end{aligned}$$

With the dual update in Line 11, we can see that

$$\begin{aligned} -\rho ^k A^\top (B\bar{x}^{t-1}+z^{t-1}-B\bar{x}^{t}-z^{t}) \in&\nabla f(x^{t}) + A^\top y^{t} + N_{\mathcal {X}}(x^t), \\ -\rho ^k B^\top (z^{t-1}-z^{t})\in&B^\top y^t+ N_{\bar{\mathcal {X}}}(\bar{x}^t). \end{aligned}$$

As a result, Algorithm  1 can be terminated if it finds \(({x}^{t}, \bar{x}^{t}, z^{t})\) such that

$$\begin{aligned} \Vert \rho ^k A^\top (B\bar{x}^{t-1}+z^{t-1}-B\bar{x}^{t}-z^{t})\Vert&\le \epsilon ^{k}_1, \end{aligned}$$
(14a)
$$\begin{aligned} \Vert \rho ^k B^\top (z^{t-1}-z^{t}) \Vert&\le \epsilon ^{k}_2, \end{aligned}$$
(14b)
$$\begin{aligned} \Vert A{x}^{t}+ B\bar{x}^{t}+z^{t}\Vert&\le \epsilon ^{k}_3. \end{aligned}$$
(14c)

Notice that \(\rho ^k\) does not appear in (14c), so we can use different tolerances for the above three measures. Since (13c) is always maintained by ADMM with \((y^k, z^k) = (y^t, z^t)\), a solution satisfying (14) is an approximate stationary solution to problem (10) by assigning \((x^k, \bar{x}^k, z^k, y^k) := (x^t, \bar{x}^t, z^t, y^t)\).

[Algorithm 1: Inner-level three-block ADMM for solving problem (10)]
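Since the pseudocode box is not reproduced above, the following Julia sketch reconstructs the inner-level ADMM from the description in the surrounding text: the x-update (line 5) and \(\bar{x}\)-update (line 7) are abstracted as user-supplied subproblem oracles (the names `solve_x` and `solve_xbar` are ours), the z-update (line 9) uses the unconstrained minimizer of (12) in z, the dual update (line 11) is the usual ADMM step, and termination follows (14). This is a minimal illustrative sketch, not the authors’ implementation.

```julia
using LinearAlgebra

# Minimal sketch of the inner-level three-block ADMM (Algorithm 1) for problem (10).
# `solve_x` and `solve_xbar` are user-supplied oracles (hypothetical names) returning
# a stationary point of the x-subproblem (15) and of the xbar-subproblem, respectively.
function inner_admm(solve_x, solve_xbar, A, B, λ, β, ρ, x, xbar, z, y;
                    ϵ1=1e-4, ϵ2=1e-4, ϵ3=1e-4, maxit=10_000)
    for t in 1:maxit
        xbar_old, z_old = copy(xbar), copy(z)
        x    = solve_x(x, xbar, z, y, ρ)                      # line 5: x-update, solves (15)
        xbar = solve_xbar(x, xbar, z, y, ρ)                   # line 7: xbar-update
        z    = -(λ .+ y .+ ρ .* (A*x .+ B*xbar)) ./ (β + ρ)   # line 9: minimizer of (12) in z
        r    = A*x .+ B*xbar .+ z                             # primal residual of the coupling
        y    = y .+ ρ .* r                                    # line 11: dual update
        # stopping criteria (14)
        d1 = ρ .* (A' * (B*(xbar_old .- xbar) .+ (z_old .- z)))
        d2 = ρ .* (B' * (z_old .- z))
        if norm(d1) <= ϵ1 && norm(d2) <= ϵ2 && norm(r) <= ϵ3
            break
        end
    end
    return x, xbar, z, y
end
```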

The first block update in Algorithm  1 reads as

$$\begin{aligned} \min _{{x}\in \mathcal {X}} ~~f({x}) + \langle y^{t-1}, A{x}+B\bar{x}^{t-1}+z^{t-1}\rangle + \frac{\rho ^k}{2}\Vert A{x}+B\bar{x}^{t-1}+z^{t-1}\Vert ^2, \end{aligned}$$
(15)

so line 5 of Algorithm 1 searches for a stationary solution \({x}^{t}\) of the constrained problem (15). The second and third block updates in lines 7 and 9 admit closed form solutions, so in view of the network flow problem (1), the proposed reformulation (9) does not introduce additional computational burden. All primal and dual updates in Algorithm 1 can be implemented in parallel as f and \(\mathcal {X}\) admit separable structures. In each ADMM iteration, agents solve their own local problems independently and only need to communicate with their immediate neighbors. However, the solution returned by the inner-level ADMM satisfies (13) but need not satisfy the relaxed constraint \(z=0\). We resolve this by updating \(\lambda \) and \(\beta \), which is referred to as the outer-level iterations, indexed by k in Algorithm 2.

[Algorithm 2: Outer-level augmented Lagrangian updates of \(\lambda \) and \(\beta \)]

In Algorithm 2, we choose some predetermined bounds \([\underline{\lambda }, \overline{\lambda }]\) and explicitly project the “true” dual variable \(\lambda ^k +\beta ^k z^k\) onto this hyper-cube to obtain \(\lambda ^{k+1}\) used in the next outer iteration. Such a safeguarding technique is essential to establish the global convergence of ALM [1, 42]. We increase the outer-level penalty \(\beta ^k\) if there is no significant improvement in reducing \(\Vert z^k\Vert \).
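To make the outer level concrete, here is a minimal Julia sketch of the safeguarded loop just described, built on the `inner_admm` sketch above. The parameter names (ω for the required reduction factor of \(\Vert z^k\Vert \), γ for the penalty increase) and the coupling ρ = 2β are our own assumptions; the actual Algorithm 2 may use different rules.

```julia
using LinearAlgebra

# Minimal sketch of the outer-level safeguarded ALM loop (Algorithm 2); it reuses the
# inner_admm sketch above. λlo, λhi are the predetermined bounds; ω ∈ (0,1) and γ > 1
# are hypothetical parameters, as is the coupling ρ = 2β between the two penalties.
function two_level(solve_x, solve_xbar, A, B, x, xbar, z, y;
                   λlo=-1e4, λhi=1e4, β=10.0, ω=0.75, γ=10.0,
                   tol=1e-6, maxouter=50)
    λ = zeros(size(A, 1))
    znorm_prev = Inf
    for k in 1:maxouter
        ρ = 2β                                      # ADMM penalty tied to β (our assumption)
        x, xbar, z, y = inner_admm(solve_x, solve_xbar, A, B, λ, β, ρ, x, xbar, z, y)
        norm(A*x .+ B*xbar) <= tol && break         # (x, xbar) approximately feasible for (3)
        λ = clamp.(λ .+ β .* z, λlo, λhi)           # safeguarded multiplier update
        if norm(z) > ω * znorm_prev                 # no significant improvement in reducing z
            β *= γ                                  # increase the outer-level penalty
        end
        znorm_prev = norm(z)
    end
    return x, xbar, z, y, λ
end
```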

Before proceeding to the next section, we note that the key reformulation (9) is inspired by the hope of reconciling the conflict between the two properties and the two conditions so that ADMM can be applied. The introduction of the additional variable z is not strictly necessary in the sense that any method that achieves distributed computation for the subproblem

$$\begin{aligned} \min _{x\in \mathcal {X}, \bar{x} \in \bar{\mathcal {X}}} f(x)+ \langle \lambda ^k, Ax+B\bar{x} \rangle + \frac{\beta ^k}{2}\Vert Ax+B\bar{x}\Vert ^2 \end{aligned}$$
(16)

can be embedded inside the ALM framework. The aforementioned PDD method [58] is such an approach. There are some other update schemes [5, 71] that can handle functional constraints in (16), assuming that the (Euclidean) projection oracle onto the nonconvex set \(\mathcal {X}\) is available. It would be interesting to compare their performances with ADMM when used in the inner level, and we leave this to future work. Meanwhile, as we will demonstrate in Sect. 6, the proposed two-level algorithm preserves the desirable properties of ADMM in practice, such as fast convergence in early stages and scalability to handle large-scale problems.

4 Global convergence

In this section, we prove global convergence and establish the convergence rate of the proposed two-level algorithm. Starting from any initial point, the iterates generated by the proposed algorithm have a limit point, and every limit point is a stationary solution to the original problem under some mild condition. In particular, we make the following assumptions.

Assumption 1

Problem (9) is feasible and the set of stationary points satisfying (7) is nonempty.

Assumption 2

The objective function \(f: \mathbb {R}^n\rightarrow \mathbb {R}\) is continuously differentiable, \(\mathcal {X}\subseteq \mathbb {R}^n\) is a compact set, and \(\bar{\mathcal {X}}\) is convex and compact.

Assumption 3

Given \(\lambda ^k\), \(\beta ^k\), and \(\rho ^k\), the first block update can find a stationary solution \(x^t\) such that \(0 \in \partial _x L_{\rho ^k} (x^t, \bar{x}^{t-1}, z^{t-1},y^{t-1} )\) and

$$\begin{aligned} L_{\rho ^k} ({x}^{t}, \bar{x}^{t-1},z^{t-1}, y^{t-1}) \le L_{\rho ^k}( {x}^{t-1}, \bar{x}^{t-1},z^{t-1},y^{t-1})< +\infty \end{aligned}$$

for all \(t\in \mathbb {Z}_{++}\).

We give some comments below. Assumption 1 ensures the feasibility of problem (9), which is standard. Though it is desirable to design an algorithm that can guarantee feasibility of the limit point, usually this is too much to ask: even the powerful ALM may converge to an infeasible limit point when the original problem is feasible. If this situation happens, or if problem (9) is infeasible in the first place, our algorithm will converge to a limit point that is stationary for a certain feasibility problem, as stated in Theorem 1. The compactness required in Assumption 2 ensures that the sequence generated by our algorithm stays bounded, and can be dropped if the existence of a limit point is directly assumed or derived from elsewhere. We do not make any explicit assumptions on the matrices A and B in this section, and our analysis does not rely on any convenient structures that A and B may possess, such as full row or column rank.

For Assumption 3, we note that finding a stationary point can usually be achieved at the successful termination of some nonlinear solvers. In addition, the state-of-the-art nonlinear solver IPOPT [64] will accept a trial point if either the objective or the constraint violation is decreased in each iteration. In the first block update of Algorithm 1, since \({x}^{t-1}\) is already a feasible solution, if we start from \({x}^{t-1}\), it is reasonable to expect that a new stationary point \({x}^{t}\) with an improved objective value is reached. Assumption 3 is slightly weaker and more realistic than assuming that the nonconvex subproblem can be solved globally, which is commonly adopted in the nonconvex ADMM literature.

In Sect. 4.1, we show that each inner-level ADMM converges to a solution that approximately satisfies the stationarity condition (11) of problem (10). The sequence of solutions obtained at the termination of the inner ADMM is referred to as the outer-level iterates. In Sect. 4.2, we first characterize the limit points of the outer-level iterates, whose existence is guaranteed, and then show that a limit point is stationary for problem (3) if some mild constraint qualification is satisfied.

4.1 Convergence of inner-level iterations

In this subsection, we show that, by applying the three-block ADMM to problem (10), we will get an approximate stationary point \(({x}^{k}, \bar{x}^{k}, z^{k}, y^{k})\) satisfying the approximate stationary condition (13). The convergence of the inner-level ADMM in this subsection uses some techniques from the literature, e.g., [67]. We present a self-contained proof in the appendix and demonstrate that the descent oracle assumed in Assumption 3 relaxes the global optimality of subproblems without affecting the overall convergence.

Proposition 1

Suppose Assumptions 2 and 3 hold. The k-th inner-level ADMM of Algorithm 1 terminates, i.e., the stopping criterion (14) is satisfied, in at most

$$\begin{aligned} T_k:= \left\lceil \frac{8 \max \{\Vert A\Vert ^2, \Vert B\Vert ^2, 1\}\beta ^k (\overline{L}_k - \underline{L})}{\min \{\epsilon ^k_1,\epsilon ^k_2, \epsilon ^k_3 \}^2 } \right\rceil \end{aligned}$$
(17)

iterations, where \(\overline{L}_k := L_{\rho ^k}(x^0, \bar{x}^0, z^0, y^0)\) and \(\underline{L}\in \mathbb {R}\) is a finite constant independent of outer-level index k.

Proof

See Appendix “Proof of Proposition 1”. \(\square \)

In particular, the approximate stationary condition (13) is satisfied with the solution returned by ADMM.

4.2 Convergence of outer-level iterations

In this subsection, we prove the convergence of outer-level iterations. In general, when the method of multipliers is used as a global method, there is no guarantee that the constraint being relaxed can be satisfied at the limit. Due to the special structure of our reformulation, we are able to give a characterization of limit points of outer-level iterates.

Theorem 1

Suppose Assumptions 2 and 3 hold. Let \(\{({x}^k, \bar{x}^k, z^k, y^k)\}_{k\in \mathbb {Z}_{++}}\) be the sequence of outer-level iterates of Algorithm  2 satisfying condition (13). Then the sequence of primal solutions \(\{({x}^k, \bar{x}^k, z^k)\}_{k\in \mathbb {Z}_{++}}\) is bounded, and every limit point \(({x}^*, \bar{x}^*, z^*)\) of this sequence satisfies one of the following:

  1. 1.

    \(({x}^*, \bar{x}^*)\) is feasible for problem (3), i.e., \(z^*=0\);

  2. 2.

    \(({x}^*, \bar{x}^*)\) is a stationary point of the problem

    $$\begin{aligned} \min _{{x}\in \mathcal {X}, \bar{x}\in \bar{\mathcal {X}}} \quad \frac{1}{2}\Vert A{x}+B\bar{x}\Vert ^2. \end{aligned}$$
    (18)

Proof

See Appendix “Proof of Theorem 1”. \(\square \)

Theorem 1 gives a complete characterization of limit points of outer-level iterates. If the limit point is infeasible, i.e. \(z^*\ne 0\), then \((x^*,\bar{x}^*)\) is a stationary point of the problem (18). This is also the case if problem (3) is infeasible, i.e. the feasible region defined by \(\mathcal {X}\) and \(\bar{\mathcal {X}}\) does not intersect the affine plane \(Ax+B\bar{x}=0\), since each inner-level problem (10) is always feasible and the first case in Theorem 1 cannot happen. We also note that even if \((x^*,\bar{x}^*)\) falls into the second case of Theorem 1, it is still possible that the associated \(z^*=0\), but then \((x^*, \bar{x}^*)\) will be some irregular feasible solution. In both cases, we believe \((x^*, \bar{x}^*)\) generated by the two-level algorithm has its own significance and may provide some useful information regarding the problem structure. Since stationarity and optimality are maintained in all subproblems, we should expect that any feasible limit point of the outer-level iterates is stationary for the original problem. As we will prove in the next theorem, this is indeed the case if some mild constraint qualification is satisfied.

Theorem 2

Suppose Assumptions 1–3 hold. Let \(({x}^*, \bar{x}^*, z^*)\) be a limit point of the outer-level iterates \(\{({x}^{k}, \bar{x}^{k}, z^{k})\}_{k\in \mathbb {Z}_{++}}\) of Algorithm  2. If \(\{y^{k}\}_{k\in \mathbb {Z}_{++}}\) has a limit point \(y^*\) along a subsequence converging to \(({x}^*, \bar{x}^*, z^*)\), then \(({x}^*, \bar{x}^*, y^*)\) is a stationary point of problem (3) satisfying stationary condition (7).

Proof

See Appendix “Proof of Theorem 2”. \(\square \)

In Theorem 2, we assume the dual variable \(\{y^{k}\}\) has a limit point \(y^*\). Since by (38) we have \(\lambda ^k+\beta ^kz^{k}+y^{k}=0\), the “true” multiplier \(\tilde{\lambda }^{k+1}:=\lambda ^k+\beta ^kz^{k}\) also has a limit point. We note that the existence of a limit point can be ensured by the existence of a bounded dual subsequence, which is known as the sequentially bounded constraint qualification (SBCQ) [43]. More specifically, in the context of smooth nonlinear problems, the constant positive linear dependence (CPLD) condition proposed by Qi and Wei [54] also guarantees that the sequence of dual variables has a bounded subsequence. Therefore, we view our assumption on \(y^*\) as analogous to a constraint qualification in the KKT condition for smooth problems, and it does not restrict the class of problems to which our algorithm is applicable.

We also give some comments regarding the predetermined bound \([\underline{\lambda },\overline{\lambda }]\) on the outer-level dual variable \(\lambda \). In principle, the bound should be chosen large enough at the beginning of the algorithm. Otherwise \(\lambda ^k\) will probably stay at \(\underline{\lambda }\) or \(\overline{\lambda }\) all the time; in this case, the outer-level ALM effectively reduces to the penalty method, which usually requires \(\beta ^k\) to go to infinity, because, in general, exact penalization does not hold for a quadratic penalty function. In contrast, a proper choice of the dual variable can compensate, so that exactness can be achieved asymptotically even when the penalty function is not sharp at the origin. In terms of convergence analysis, one may notice that the choice of \(\lambda \) is actually not that important: if we set \(\lambda ^k=0\) for all k, the analysis can still go through. This is because in the framework of ALM, the dual variable \(\lambda \) is closely related to local optimal solutions. While we study global convergence, it is not clear which local solution the algorithm will converge to, so the role of \(\lambda \) is not significant. It seems difficult to establish the uniform boundedness of dual variables without the projection step, especially when there are nonconvex constraints.

In Sect. 5, we will show our algorithm inherits some nice local convergence properties of ALM, where \(\lambda \) does play an important role, and in Sect. 6, we will demonstrate that keeping \(\lambda \) indeed enables the algorithm to converge faster than the penalty method.

4.3 Iteration complexity

In this subsection, we provide an iteration complexity analysis of the proposed algorithm. In view of (7), our goal is to give a complexity bound on the number of ADMM iterations for finding an \(\epsilon \)-stationary solution \((x^K,\bar{x}^K, y^K)\) in the sense that there exist \(d_1, d_2, d_3\) such that

$$\begin{aligned}&d_1\in \nabla f(x^K) + A^\top y^K+ N_\mathcal {X} (x^K), \end{aligned}$$
(19a)
$$\begin{aligned}&d_2 \in B^\top y^K + N_{\bar{\mathcal {X}}} (\bar{x}^K), \end{aligned}$$
(19b)
$$\begin{aligned}&d_3 = Ax^K+B\bar{x}^K, \end{aligned}$$
(19c)
$$\begin{aligned}&\max \{\Vert d_1\Vert ,\Vert d_2\Vert ,\Vert d_3\Vert \} \le \epsilon . \end{aligned}$$
(19d)

In order to illustrate the main result in a concise and clear way, we slightly modify the outer-level Algorithm  2 as follows.

[Algorithm 3: Modified outer-level updates used for the iteration complexity analysis]

In Algorithm 3, we choose some tolerance \(\epsilon >0\) and apply the stopping criterion (14) with \(\epsilon ^k_1=\epsilon ^k_2 = 2\epsilon ^k_3 = \epsilon \) for the k-th inner-level ADMM. For ease of analysis, we multiply the outer-level penalty \(\beta ^k\) by some \(\gamma >1\) in each outer iteration, instead of checking the improvement in primal feasibility. Moreover, we add the following technical assumption.

Assumption 4

There exists some \(\overline{L}\in \mathbb {R}\) such that \(L_{\rho ^k} (x^0, \bar{x}^0,z^0, y^0) \le \overline{L}\) for all \(k\in \mathbb {Z}_{++}\).

Remark 1

This assumption can be satisfied if ADMM can make significant progress in reducing \(\Vert z^k\Vert \) or, equivalently, \(\Vert Ax^k + B\bar{x}^k\Vert \). Another simple way to guarantee it is as follows: suppose a feasible point \((x, \bar{x})\) is known a priori, i.e., \((x, \bar{x})\in \mathcal {X}\times \bar{\mathcal {X}}\) and \(Ax+B\bar{x}=0\); then initializing the k-th inner ADMM with \((x^0, \bar{x}^0, z^0, y^0) = (x, \bar{x},0, -\lambda ^{k})\) guarantees that \(L_{\rho ^k} (x^0, \bar{x}^0, z^0,y^0)\le \overline{L}\), where \(\overline{L} = \max _{x\in \mathcal {X}} f(x)\).
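Indeed, a direct check from the definition (12) confirms this: with \(z^0=0\), \(Ax^0+B\bar{x}^0=0\), \(x^0\in \mathcal {X}\), and \(\bar{x}^0\in \bar{\mathcal {X}}\), every term of \(L_{\rho ^k}\) involving z or the residual vanishes, so

$$\begin{aligned} L_{\rho ^k}(x^0,\bar{x}^0,0,-\lambda ^k) = f(x^0) \le \max _{x\in \mathcal {X}} f(x) = \overline{L}, \end{aligned}$$

independently of k, \(\lambda ^k\), \(\beta ^k\), and \(\rho ^k\).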

Theorem 3

Under Assumptions 1–4, Algorithm  3 finds an \(\epsilon \)-stationary solution \((x^K,\bar{x}^K, y^K)\) of (3) in the sense of (19) in no more than \(\mathcal {O}\left( 1/\epsilon ^4\right) \) inner ADMM iterations. Furthermore, if \(\hat{\lambda }^{k}:=\lambda ^k + \beta ^k z^k\) is bounded, then the iteration complexity can be improved to \(\mathcal {O}\left( 1/\epsilon ^3\right) .\)

Proof

See Appendix “Proof of Theorem 3”. \(\square \)

We acknowledge that \(\{\hat{\lambda }^k\}_k\) may not be bounded for some applications. The second part of Theorem 3 (as well as Theorem 4 to be presented next) aims to reasonably justify the performance of the proposed algorithm under the boundedness condition.

4.4 Extension to multi-block problems

In this subsection, we discuss the extension of the two-level framework to the more general class of multi-block problems (4). In particular, we are interested in the case where Conditions 1 and 2 are not satisfied. As we mentioned earlier, Jiang et al. [31] proposed to solve the following perturbed problem of (4):

$$\begin{aligned} \min _{x_1,\ldots , x_p, z}~&\sum _{i=1}^p f_i(x_i) + g(x_1,\ldots ,x_p)+\lambda ^\top z+\frac{\beta }{2}\Vert z\Vert ^2\\ ~\mathrm {s.t.}~&\sum _{i=1}^pA_ix_i +z = b,~ x_i\in \mathcal {X}_i~\forall i\in [p]. \nonumber \end{aligned}$$
(20)

for \(\lambda =0\), where the \(f_i\)’s are lower semi-continuous, and \(f_p\) and g are Lipschitz differentiable. Notice that we change h and B in (4) to \(f_p\) and \(A_p\) for ease of presentation. The iteration complexity of this one-level workaround is \(\mathcal {O}\left( 1/\epsilon ^4\right) \) when the dual variable is bounded, and \(\mathcal {O}\left( 1/\epsilon ^6\right) \) otherwise. In contrast, we can apply our two-level framework to the multi-block problem (4) as well: with some initial guess \(\lambda \) and moderate \(\beta \), we solve (20) approximately using ADMM, and then we update \(\lambda \) and \(\beta \). We define dual residuals for each block variable similarly to (14a)–(14b), and an \(\epsilon \)-stationary solution as a primal-dual pair whose primal residual (\(\Vert \sum _{i=1}^p A_ix_i-b\Vert \)) and dual residuals (with respect to each primal block) are less than some \(\epsilon >0\). An extension of the two-level framework is presented in Algorithm  4 below.

[Algorithm 4: Two-level algorithm for the multi-block problem (4)]

Theorem 4

Under Assumption 4, Algorithm  4 finds an \(\epsilon \)-stationary solution of (4) in no more than \(\mathcal {O}(1/\epsilon ^6)\) ADMM iterations. Furthermore, if \(\hat{\lambda }^{k}:=\lambda ^k + \beta ^k z^k\) is bounded, then the iteration complexity can be improved to \(\mathcal {O}\left( 1/\epsilon ^4\right) .\)

Proof

See Appendix “Proof of Theorem 4”. \(\square \)

Although the proposed algorithm invokes a series of ADMM with varying outer-level dual variables and penalties, Theorem 4 suggests that its iteration complexity for finding a stationary solution is no worse than that of the single-looped ADMM variant proposed in [31]. In Sect. 5, local convergence results are presented as an alternative perspective to help us understand the behavior of the proposed algorithm.

5 Local convergence

We show in this section that the proposed algorithm inherits some nice local convergence properties of the augmented Lagrangian method. The analysis builds on the classic local convergence of ALM [4], and our purpose is to provide some quantitative justification for the fast convergence of the two-level algorithm, which will be presented in Sect. 6.

To begin with, we note that the inner-level problem (10) solved by ADMM is closely related to the problem

$$\begin{aligned} \min _{x\in \mathcal {X}, \bar{x}\in \bar{\mathcal {X}}} f(x)- \langle \lambda ^k, Ax+B\bar{x}\rangle +\frac{\beta ^k}{2}\Vert Ax+B\bar{x}\Vert ^2. \end{aligned}$$
(21)

It is straightforward to verify that \((x^k, \bar{x}^k)\) is a stationary point of (21) in the sense that

$$\begin{aligned} 0 \in&\nabla f(x^k) + A^\top (-\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)) +N_{\mathcal {X}}(x^k), \end{aligned}$$
(22a)
$$\begin{aligned} 0 \in&B^\top (-\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)) +N_{\bar{\mathcal {X}}}(\bar{x}^k), \end{aligned}$$
(22b)

if and only if \((x^k,\bar{x}^k,z^k,y^k)\) is a stationary point of (10) satisfying (11) with \(z^k = -Ax^k-B\bar{x}^k\) and \(y^k = -\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)\). In addition, an approximate stationary solution of (10) can be mapped to an approximate solution of (21).

Lemma 1

Let \((x^k, \bar{x}^k, z^k, y^k)\) be a \((d_1^k,d_2^k,d_3^k)\)-stationary point of (10) in the sense of (13). Then \((x^k, \bar{x}^k)\) is a \((\tilde{d}_1^k, \tilde{d}_2^k)\)-stationary point of (21), i.e.,

$$\begin{aligned} \tilde{d}_1^k \in&\nabla f(x^k) + A^\top (-\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)) +N_{\mathcal {X}}(x^k), \end{aligned}$$
(23a)
$$\begin{aligned} \tilde{d}_2^k \in&B^\top (-\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)) +N_{\bar{\mathcal {X}}}(\bar{x}^k), \end{aligned}$$
(23b)

where \( \tilde{d}_1^k = d_1^k + \beta ^k A^\top d_3^k\), and \(\tilde{d}_2^k = d_2^k + \beta ^k B^\top d_3^k\).

Proof

By (13c) and (13d), we have \(y^k = -\lambda ^k +\beta ^k (Ax^k+B\bar{x}^k-d_3^k)\); plugging this equality into (13a)–(13b) yields the result. \(\square \)
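In more detail (spelling out the substitution for (23a); the case of (23b) is identical with B in place of A), since \(z^k = d_3^k - Ax^k - B\bar{x}^k\) by (13d),

$$\begin{aligned} A^\top y^k = A^\top \bigl (-\lambda ^k - \beta ^k z^k\bigr ) = A^\top \bigl (-\lambda ^k + \beta ^k(Ax^k + B\bar{x}^k)\bigr ) - \beta ^k A^\top d_3^k, \end{aligned}$$

so adding \(\beta ^k A^\top d_3^k\) to both sides of the inclusion (13a) gives (23a) with \(\tilde{d}_1^k = d_1^k + \beta ^k A^\top d_3^k\).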

Thus we will mainly focus on problem (21) and its approximate stationarity system (23) in this section. We add the following assumptions on problem (3).

Assumption 5

The set \(\mathcal {X}=\{x\in \mathbb {R}^{n_1} : h(x)=0\}\) is compact with \(h:\mathbb {R}^{n_1}\rightarrow \mathbb {R}^{p}\) being second-order continuously differentiable, the objective f is second-order continuously differentiable over some open set containing \(\mathcal {X}\), and \(\bar{\mathcal {X}}\) is a convex set with nonempty interior in \(\mathbb {R}^{n_2}\). The matrix B has full column rank.

Remark 2

Any inequality constraint in \(\mathcal {X}\) can be converted to the form \(h(x)=0\) by adding the squares of additional slack variables. The second-order continuous differentiability of f and h is standard for establishing local convergence of the augmented Lagrangian method. In addition, we explicitly require B to have full column rank, which can be justified by the reformulation (2).
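For instance (a standard device, stated here only for completeness), a single inequality \(g_j(x)\le 0\) appearing in the description of \(\mathcal {X}\) is replaced by

$$\begin{aligned} h_j(x,s_j) := g_j(x) + s_j^2 = 0 \end{aligned}$$

with a new scalar slack variable \(s_j\); this preserves the feasible set in x and keeps \(h_j\) second-order continuously differentiable whenever \(g_j\) is.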

Definition 1

Let \(x^* \in \mathcal {X}=\{x|h(x)=0\}\) and \(\nabla h(x^*) = [\nabla h_1(x^*),\ldots , \nabla h_p(x^*)]\in \mathbb {R}^{n_1 \times p}\).

  1. 1.

    The tangent cone of \(\mathcal {X}\) at \(x^*\):

    $$\begin{aligned} T_{\mathcal {X}}(x^*)=\left\{ d\in \mathbb {R}^{n_1}~|~ \exists x^k\in \mathcal {X}, x^k\rightarrow x^*, \frac{x^k-x^*}{\Vert x^k-x^*\Vert }\rightarrow \frac{d}{\Vert d\Vert }\right\} . \end{aligned}$$
  2. 2.

    The cone of the first-order feasible variation of \(\mathcal {X}\) at \(x^*\):

    $$\begin{aligned} V_{\mathcal {X}}(x^*)=\{d \in \mathbb {R}^{n_1}: \nabla h(x^*)^\top d=0\}. \end{aligned}$$
  3. 3.

    We say that \(x^*\) is quasiregular if \(T_{\mathcal {X}}(x^*)=V_{\mathcal {X}}(x^*)\).

Assumption 6

Problem (3) has a feasible solution \((x^*, \bar{x}^*)\), where \(\bar{x}^*\in \textrm{Int}~\bar{\mathcal {X}}\) and all equality constraints have linearly independent gradient vectors. In addition, \((x^*, \bar{x}^*)\), together with some dual multipliers \(\lambda ^*\in (\underline{\lambda }, \overline{\lambda })\) and \(\mu ^*\in \mathbb {R}^p\), satisfy

$$\begin{aligned}&\nabla f(x^*) - A^\top \lambda ^* + \nabla h(x^*)\mu ^* = 0, ~~B^\top \lambda ^* = 0, \end{aligned}$$
(24a)
$$\begin{aligned}&u^\top \left( \nabla ^2 f(x^*)+\sum _{i=1}^p \mu ^*_i \nabla ^2 h_i(x^*)\right) u >0, \nonumber \\&\quad \quad ~\forall (u,v)\ne 0 ~\mathrm {s.t.}~ Au+Bv =0, ~\nabla h(x^*)^\top u = 0. \end{aligned}$$
(24b)

Moreover, there exists \(R>0\) such that x is quasiregular for all \(x\in B_{R}(x^*) \cap \mathcal {X}\).

Remark 3

Assumption 6 can be regarded as a second-order sufficient condition at a local minimizer \((x^*,\bar{x}^*)\) of problem (3), and B having full column rank is necessary for (24b) to hold. The quasiregularity assumption can be satisfied by a wide range of constraint qualifications.

The quasiregularity condition bridges the normal cone stationarity condition to the well-known KKT condition.

Proposition 2

If \(x^k\in \mathcal {X}\) is quasiregular, \(\bar{x}^k \in \textrm{Int}~\bar{\mathcal {X}}\), and \((x^k, \bar{x}^k)\) satisfies condition (23) with some \(\tilde{d}_1^k\) and \(\tilde{d}_2^k\), then there exists some \(\mu ^k \in \mathbb {R}^p\) such that \((x^k, \bar{x}^k)\) satisfies the approximate KKT condition of problem (21), i.e., \(h(x^k)=0\),

$$\begin{aligned} \tilde{d}_1^k&= \nabla f(x^k) + A^\top (-\lambda ^k +\beta ^k(Ax^k+B\bar{x}^k)) + \nabla h(x^k)\mu ^k, \end{aligned}$$
(25a)
$$\begin{aligned} \tilde{d}_2^k&= B^\top (-\lambda ^k + \beta ^k(Ax^k+B\bar{x}^k)). \end{aligned}$$
(25b)

Proof

The claim uses the fact that the normal cone \(N_{\mathcal {X}}(x)\) is the polar cone of the tangent cone \(T_{\mathcal {X}}(x)\), and that \(N_{\bar{\mathcal {X}}}(\bar{x}) = \{0\}\) for \(\bar{x}\in \textrm{Int}~\bar{\mathcal {X}}\). The existence of \(\mu ^k\) follows from Farkas’ Lemma [2, Prop 4.3.12]. \(\square \)

Proposition 3

Suppose Assumption 5 holds, and let \((x^*, \bar{x}^*, \mu ^*, \lambda ^*)\) be defined as in Assumption 6. There exist positive \(\underline{\beta }\) and \(\delta \) such that for all \(s=(\lambda , \beta , \tilde{d}_1,\tilde{d}_2)\) belonging to the set

$$\begin{aligned} S := \left\{ s=(\lambda , \beta , \tilde{d}_1,\tilde{d}_2)~|~\left( \frac{\Vert \lambda -\lambda ^*\Vert ^2}{\beta ^2} + \Vert \tilde{d}_1\Vert ^2 + \Vert \tilde{d}_2\Vert ^2\right) ^{1/2} \le \delta , \beta \ge \underline{\beta } \right\} , \end{aligned}$$

there exist unique continuously differentiable mappings x(s), \(\bar{x}(s)\), \(\mu (s)\), and \(\tilde{\lambda }(s) = \lambda -\beta [Ax(s)+B\bar{x}(s)]\) defined in the interior of S satisfying

$$\begin{aligned}&\nabla f[x(s)] - A^\top \tilde{\lambda }(s) +\nabla h[x(s)] \mu (s) = \tilde{d}_1, ~B^\top \tilde{\lambda }(s) = \tilde{d}_2,~ h[x(s)]= 0; \end{aligned}$$
(26)
$$\begin{aligned}&\left( x(\lambda ^*, \beta ,0,0), \bar{x}(\lambda ^*, \beta ,0,0), \mu (\lambda ^*, \beta ,0,0), \tilde{\lambda }(\lambda ^*, \beta ,0,0)\right) = (x^*, \bar{x}^*, \mu ^*, \lambda ^*); \end{aligned}$$
(27)
$$\begin{aligned}&\bar{x}(s)\in \textrm{Int}~\bar{\mathcal {X}}, \Vert x(s)-x^*\Vert \le R. \end{aligned}$$
(28)

Moreover, there exists \(M>0\) such that for any \(s\in S\), we have

$$\begin{aligned}&\max \{\Vert x(s)-x^*\Vert , \Vert \bar{x}(s)-\bar{x}^*\Vert , \Vert \tilde{\lambda }(s) - \lambda ^*\Vert \} \nonumber \\ \le&M(\Vert \lambda - \lambda ^*\Vert ^2/\beta ^2 + \Vert \tilde{d}_1\Vert ^2+\Vert \tilde{d}_2\Vert ^2)^{1/2}. \end{aligned}$$
(29)

Proof

See Appendix “Proof of Proposition 3”. \(\square \)

Proposition 4

Suppose Assumptions 5 and 6 hold. Let M and S be defined as in Proposition 3. Suppose for some \((\beta ^k, \lambda ^k)\) with \(\beta ^k \ge M\), ADMM finds a \((d_1^k, d_2^k, d_3^k)\)-stationary solution \((x^k, \bar{x}^k, z^k, y^k)\) satisfying (13) such that

  1. 1.

    \(s^k = (\lambda ^k, \beta ^k, \tilde{d}_1^k, \tilde{d}_2^k)\in S\), where \(\tilde{d}_1^k = d_1^k + \beta ^k A^\top d_3^k\), and \(\tilde{d}_2^k = d_2^k + \beta ^k B^\top d_3^k\);

  2. 2.

    \((x^k, \bar{x}^k )= (x(s^k), \bar{x}(s^k))\);

  3. 3.

    there exists a positive constant \(\eta < \beta ^k/M\) such that

    $$\begin{aligned} \left( \Vert A\Vert +\Vert B\Vert +\frac{1}{M} \right) (\Vert d_1^k\Vert +\Vert d_2^k\Vert +\Vert d_3^k\Vert )\le \frac{\eta }{\beta ^k} \Vert Ax^k+B\bar{x}^k\Vert . \end{aligned}$$
    (30)

Denote \(\hat{\lambda }^k := \lambda ^k + \beta ^k z^k\). Then we have

$$\begin{aligned} \Vert \hat{\lambda }^k - \lambda ^*\Vert \le \left( \frac{M}{\beta ^k} +\frac{M\eta (M+\beta ^k)}{\beta ^k(\beta ^k-M\eta )} \right) \Vert \lambda ^k - \lambda ^*\Vert . \end{aligned}$$
(31)

Proof

See Appendix “Proof of Proposition 4”. \(\square \)

Theorem 5

Suppose Assumptions 5 and 6 hold. Let \(\underline{\beta }\), \(\delta \), M, and S be defined as in Proposition 3. Suppose the three conditions in Proposition 4 are satisfied for all iterates \(k\in \mathbb {Z}_{+}\), and the initial penalty \(\beta ^0 > \frac{M}{\varrho }(1 +\eta + \varrho \eta )\) for some \(\varrho \in (0,1)\). Then the following results hold:

  1. the sequence \(\{\lambda ^k\}_{k\in \mathbb {Z}_{++}}\) stays inside the interior of \([\underline{\lambda },\overline{\lambda }]\), i.e., \(\lambda ^{k+1} = \hat{\lambda }^k = \lambda ^k +\beta ^k z^k\);

  2. the dual variable \(\lambda ^k\) converges to \(\lambda ^*\) at least at a linear rate, i.e.,

     $$\begin{aligned} \lim _{k\rightarrow +\infty }\frac{\Vert \lambda ^{k+1}-\lambda ^*\Vert }{\Vert \lambda ^{k}-\lambda ^*\Vert } \le \varrho < 1, ~\text {and}~\lim _{k\rightarrow +\infty }\frac{\Vert \lambda ^{k+1}-\lambda ^*\Vert }{\Vert \lambda ^{k}-\lambda ^*\Vert } =0 \text {~if~} \beta ^k\rightarrow +\infty ; \end{aligned}$$

  3. \(\max \{\Vert x^k-x^*\Vert , \Vert \bar{x}^k-\bar{x}^*\Vert \} \le \varrho \Vert \lambda ^k-\lambda ^*\Vert \le \varrho ^{k+1} \Vert \lambda ^0-\lambda ^*\Vert .\)

Proof

The coefficient on the right-hand side of (31) is less than \(\varrho \) if \(\beta ^k > \frac{M}{\varrho }(1 +\eta + \varrho \eta )\), and converges to 0 if \(\beta ^k\rightarrow +\infty \); thus the first two parts of the theorem are proved. Part 3 follows from (29) and the same derivation as in Proposition 4. \(\square \)
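For the reader's convenience, the coefficient on the right-hand side of (31) combines over a common denominator as

$$\begin{aligned} \frac{M}{\beta ^k} +\frac{M\eta (M+\beta ^k)}{\beta ^k(\beta ^k-M\eta )} = \frac{M(\beta ^k-M\eta ) + M\eta (M+\beta ^k)}{\beta ^k(\beta ^k-M\eta )} = \frac{M(1+\eta )}{\beta ^k-M\eta }, \end{aligned}$$

which is strictly less than \(\varrho \) exactly when \(\varrho \beta ^k > M(1+\eta ) + \varrho M\eta \), i.e., when \(\beta ^k > \frac{M}{\varrho }(1+\eta +\varrho \eta )\), and which tends to 0 as \(\beta ^k\rightarrow +\infty \) (recall \(\beta ^k - M\eta > 0\) by condition 3 of Proposition 4).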

Theorem 5 suggests that if we have a good initial point (inside the set S defined in Proposition 3) and each inner ADMM locates the approximate stationary solution specified by the implicit function theorem (as in Proposition 4), then the two-level algorithm exhibits local linear or superlinear convergence in its outer level. The results are consistent with our empirical observations presented in Sect. 6, where usually only a few outer-level updates are needed before convergence.

6 Examples

We present some applications of the two-level algorithm. All programs are written in Julia 1.1.0 with the JuMP package 0.18 [13] and run on a 64-bit laptop with a 6-core 2.6 GHz Intel Core i7 processor and 16 GB of RAM. All nonlinear constrained problems are solved by the interior point solver IPOPT (version 3.12.8) [64] with the linear solver MA27.

6.1 Nonlinear network flow problem

We consider a specific class of network flow problems covered by the motivating formulation (1). Suppose a connected graph \(G(\mathcal {V}, \mathcal {E})\) is given, where some nodes have demands for a certain commodity that must be satisfied by supply nodes. Each node i keeps local variables \([p_i;x_i; \{x_{ij}\}_{j\in \delta (i)}; \{y_{ij}\}_{j\in \delta (i)}]\) \(\in \mathbb {R}^{2|\delta (i)|+2}\). Variable \(p_i\) is the production at node i, and \((x_i, x_{ij}, y_{ij})\) determine the flow from node i to node j through \(p_{ij} = g_{ij}(x_i, x_{ij}, y_{ij})\), where \(g_{ij}: \mathbb {R}^3\rightarrow \mathbb {R}\). For example, in an electric power network or a natural gas network, the variables \((x_i, x_{ij}, y_{ij})\) are usually related to electric voltages or gas pressures of local utilities. Moreover, for each \((i,j)\in \mathcal {E}\), the nodal variables \((x_i,x_j, x_{ij}, y_{ij})\) are coupled in a nonlinear fashion through \(h_{ij}(x_i, x_j, x_{ij},y_{ij}) =0\), where \(h_{ij}:\mathbb {R}^4\rightarrow \mathbb {R}\); this coupling plays the role of physical laws on nodal potentials. We consider the problem

$$\begin{aligned} \min \quad&\sum _{i\in \mathcal {V}}f_i(p_i) \end{aligned}$$
(32a)
$$\begin{aligned} \mathrm {s.t.} \quad&p_i - d_i = \sum _{j\in \delta (i)} p_{ij}\quad \forall i \in \mathcal {V}, \end{aligned}$$
(32b)
$$\begin{aligned}&p_{ij} = g_{ij}(x_i, x_{ij}, y_{ij})\quad \forall (i,j)\in \mathcal {E}, \end{aligned}$$
(32c)
$$\begin{aligned}&h_{ij}(x_i, x_j, x_{ij},y_{ij})=0\quad \forall (i,j)\in \mathcal {E}, \end{aligned}$$
(32d)
$$\begin{aligned}&x_i \in [\underline{x}_i, \overline{x}_i]\quad \forall i\in \mathcal {V}. \end{aligned}$$
(32e)

In (32), the generation cost of each node, denoted by \(f_i(\cdot )\), is a function of its production level \(p_i\), and the goal is to minimize the total generation cost over the network. Each node is associated with a demand \(d_i\) and has to satisfy the injection balance constraint (32b); the nodal variable \(x_i\) is bounded in \([\underline{x}_i, \overline{x}_i]\). Formulation (32) covers a wide range of problems and falls into the class of GNF problems studied in [60]. Suppose the network is partitioned into a few subregions, and \((i,j)\) is an edge crossing two subregions with i (resp. j) in region 1 (resp. 2). In order to facilitate parallel implementation, we replace constraint (32d) by the following constraints with additional variables:

$$\begin{aligned}&h_{ij}(x^1_{i}, x^1_{j}, x_{ij}, y_{ij}) =0,~h_{ji}(x^2_{j}, x^2_{i}, x_{ji}, y_{ji}) =0, \end{aligned}$$
(33a)
$$\begin{aligned}&x^1_i = \bar{x}_i,~x^2_i = \bar{x}_i, ~x^1_j = \bar{x}_j,~x^2_j = \bar{x}_j; \end{aligned}$$
(33b)

similarly, we replace \(p_{ij}\) and \(p_{ji}\) in (32c) by

$$\begin{aligned}&p_{ij} = g_{ij}(x_i^1, x_{ij}, y_{ij}),~p_{ji} = g_{ji}(x_j^2, x_{ji}, y_{ji}). \end{aligned}$$
(34)

Notice that \((x_i^1, x_j^1, x_{ij}, y_{ij})\) are controlled by region 1 and \((x_i^2, x_j^2, x_{ji}, y_{ji})\) are controlled by region 2. After incorporating constraints (33)–(34) for all crossing edges \((i,j)\) into problem (32), the resulting problem is in the form of (3) and ready for our two-level algorithm. We consider the case where the coupling constraints are given by \(p_{ij} = \frac{a_i}{|\delta (i)|}x_i + b_{ij} x_{ij}+ c_{ij} y_{ij}\) and \(h_{ij}(x_i,x_j, x_{ij}, y_{ij}) =x_{ij}^2+y_{ij}^2 - x_ix_j\). Constraint (32c) is linear with parameters \((a_i, b_{ij}, c_{ij})\), while the nonconvex constraint (32d) restricts \((x_i,x_j, x_{ij}, y_{ij})\) to the surface of a rotated second-order cone.
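As a small illustration of these two coupling functions (a minimal sketch with our own function names and values, not part of the implementation described in this section), both constraints are cheap to evaluate, and \(h_{ij}=0\) indeed places \((x_i,x_j,x_{ij},y_{ij})\) on the surface of a rotated second-order cone:

```julia
# Sketch of the edge couplings in Section 6.1 (illustrative names and values).
# Linear flow model (32c): p_ij = (a_i / |δ(i)|) x_i + b_ij x_ij + c_ij y_ij
flow(a_i, deg_i, b_ij, c_ij, x_i, x_ij, y_ij) = (a_i / deg_i) * x_i + b_ij * x_ij + c_ij * y_ij

# Nonconvex coupling (32d): h_ij = x_ij^2 + y_ij^2 - x_i x_j (rotated SOC surface)
coupling(x_i, x_j, x_ij, y_ij) = x_ij^2 + y_ij^2 - x_i * x_j

# A point on the surface: 0.6^2 + 0.8^2 = 1.0 = 1.25 * 0.8
@assert isapprox(coupling(1.25, 0.8, 0.6, 0.8), 0.0; atol = 1e-12)
```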

We use the underlying network topology from [75] to generate our test networks. Each network is partitioned into two, three, or four subregions. The graph information and the centralized objective values from IPOPT are recorded in the first three columns of Table 2. The column “LB” records the objective value obtained by relaxing the constraint (32d) to \(h_{ij}(x_{i}, x_{j}, x_{ij}, y_{ij})\le 0\); this relaxation makes problem (32) convex and provides a lower bound on the global optimal value. Partition information is given in the last two columns.

Table 2 Network information

We compare our algorithm with PDD in [58] as well as the proximal ADMM-g proposed in [31] (which solves problem (5) instead). We set an absolute tolerance \(\epsilon =1.0e-5\), and initialize \((x_i, x_j, x_{ij}, y_{ij})\) with (1, 1, 1, 0) and \(p_i\) with the initial value provided in [75]. For our two-level algorithm, we choose \(\omega =0.75\), \(\gamma =1.5\), and \(\beta ^1 = 1000\). Each component of \(\lambda \) is restricted to \([-10^6, 10^6]\). The stopping criterion (14) suggests that \(\epsilon _1^k\) and \(\epsilon ^k_2\) should be of the order \(\mathcal {O}(\rho ^k\epsilon _3^k)\). Motivated by this observation, we terminate the inner-level ADMM when \(\Vert Ax^t+B\bar{x}^t +z^t\Vert \le \max \{\epsilon , \sqrt{m}/(k\cdot \rho ^k)\}\), where m is the dimension of the vector and \(\rho ^k\) is the inner ADMM penalty at outer iteration k. For PDD, as suggested in [58, Section V.B], we terminate the inner level of PDD when the relative gap between two consecutive augmented Lagrangian values is less than \(\max \{\epsilon ,100\epsilon \times (2/3)^k\}\); at the end of each inner-level rBSUM, the primal feasibility is checked and the penalty is updated with the same \(\omega \) and \(\gamma \). Notice that the parameters used in the proposed algorithm and in PDD are matched in our experiments. For proximal ADMM-g, we choose \(\beta = 1/\epsilon ^2\) and \(\rho = 3/\epsilon ^2\); additional proximal terms \(\frac{1}{2}\Vert x-x^t\Vert ^2_H\) and \(\frac{1}{2}\Vert \bar{x}-\bar{x}^t\Vert ^2_H\) are added to the subproblem updates, where \(H = \frac{0.01}{\epsilon }I\). All three algorithms terminate when \(\Vert Ax^k+B\bar{x}^k\Vert \le \sqrt{m}\times \epsilon \). The termination logic of the two-level algorithm is sketched below; test results are presented in Table 3.
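The following sketch summarizes this logic: the inner loop enforces the ADMM stopping rule \(\Vert Ax^t+B\bar{x}^t+z^t\Vert \le \max \{\epsilon , \sqrt{m}/(k\cdot \rho ^k)\}\), and the outer loop performs the dual and penalty updates. The names admm_pass! and state are our own placeholders (the JuMP/IPOPT subproblem solves are abstracted away), and the penalty-increase test is shown only schematically with the same \(\omega \) and \(\gamma \).

```julia
using LinearAlgebra

# Sketch of the two-level outer loop used for the experiments in Section 6.1.
# admm_pass! performs one inner ADMM sweep and refreshes state.Ax, state.Bxbar, state.z;
# state.Bxbar stands for the product B*x̄, and state.ρ is the inner ADMM penalty.
function two_level!(state; ϵ = 1.0e-5, ω = 0.75, γ = 1.5, β = 1000.0, λbound = 1.0e6, kmax = 100)
    m = length(state.z)
    r_prev = Inf
    for k in 1:kmax
        # Inner level: run ADMM until ‖Ax + Bx̄ + z‖ ≤ max(ϵ, √m / (k ρ_k)).
        while norm(state.Ax + state.Bxbar + state.z) > max(ϵ, sqrt(m) / (k * state.ρ))
            admm_pass!(state, β)
        end
        r = norm(state.Ax + state.Bxbar)               # outer primal residual
        r ≤ sqrt(m) * ϵ && return state                 # overall termination test
        # Outer level: dual update with components clipped to ±λbound,
        # and penalty growth when feasibility did not improve by the factor ω.
        state.λ .= clamp.(state.λ .+ β .* state.z, -λbound, λbound)
        r > ω * r_prev && (β *= γ)
        r_prev = r
    end
    return state
end
```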

Table 3 Comparison with PDD [58], proximal ADMM-g [31]

The number of outer-level updates (ALM multiplier updates for PDD and the two-level algorithm) and the total number of inner-level updates (rBSUM iterations for PDD and ADMM iterations for the two-level algorithm) are reported in the columns “Outer” and “Inner”, respectively. We see that both the proposed algorithm and PDD converge in all test cases, and both take around 10–30 outer-level iterations to drive the constraint violation \(\Vert Ax+B\bar{x}\Vert \) close to zero. PDD converges fast for the three cases of network 300; however, for most cases it requires more total inner and outer iterations to converge than the proposed algorithm. This performance is consistent with the analysis in [58], where the inner-level rBSUM algorithm needs to run long enough to guarantee that each block variable achieves stationarity. The objective values and duality gaps of the solutions generated by the three algorithms are recorded in the columns “Obj” and “Gap (%)”. Both the proposed algorithm and PDD achieve near global optimality, while the proposed algorithm finds solutions of higher quality than PDD at termination. The algorithm running time (model building time excluded) is recorded in the last column “Time (s)”. We emphasize that, under similar algorithmic settings, the proposed two-level algorithm generally converges faster and shows better scalability than the other two algorithms.

Even with a sufficiently large penalty on the slack variable z, proximal ADMM-g does not achieve the desired primal feasibility for cases 300-4, 1354-3, and 1354-4 within 1000 iterations; for the other cases, it usually takes more time than the proposed algorithm. We point out that ADMM-g usually finds sub-optimal solutions, and the duality gap can be as large as 42%. We believe this happens because problem (5) requires the introduction of large \(\beta (\epsilon )\) and \(\rho (\epsilon )\), which alter the structure of the original problem (3) and result in solutions of poor quality. Moreover, such large parameters also cause numerical issues for the IPOPT solver and slow down the overall convergence, which is why ADMM-g takes a long time even when the number of iterations is relatively small for the first four test cases. We also tried a smaller penalty of order \(\mathcal {O}(1/\epsilon )\), in which case ADMM-g cannot achieve the desired feasibility level.

6.2 Minimization over compact manifold

We consider the following problem

$$\begin{aligned} \min \quad&\sum _{i=1}^{n_p-1}\sum _{j=i+1}^{n_p} \big ((x_i-x_j)^2 +(y_i-y_j)^2+(z_i-z_j)^2\big )^{-\frac{1}{2}} \end{aligned}$$
(35a)
$$\begin{aligned} \mathrm {s.t.}\quad&x_i^2+y_i^2+z_i^2=1,\quad \forall i \in [n_p]. \end{aligned}$$
(35b)

Problem (35) is obtained from the benchmark set COPS 3.0 [11] of nonlinear optimization problems. The same problem is used in [70] to test algorithms that preserve spherical constraints through curvilinear search. We compare the solutions and computation time of our distributed algorithm with those obtained from the centralized IPOPT solver. Each test problem is first solved in a centralized manner; the objective value and total running time are recorded in the second and third columns of Table 4. Using additional variables to break the couplings in the objective (35a), we divide each test problem into three subproblems with the same number of variables, constraints, and objective terms. For our two-level algorithm, we choose \(\gamma = 2\) and \(\omega = 0.5\); the initial penalty \(\beta ^1\) is set to 100 for \(n_p \in \{60,90\}\), 200 for \(n_p \in \{120,180\}\), and 500 for \(n_p\in \{240, 300\}\). The initial point for IPOPT is set to \((x_i,y_i,z_i)=(0.2, 0.3, 0.1)\) for all \(i\in [n_p]\). We set the bounds on each component of \(\lambda \) to \(\pm 10^6\). The inner-level ADMM terminates when \(\Vert Ax^t+B\bar{x}^t+z^t\Vert \le \sqrt{3n_p}/(2500k)\), where k is the current outer-level index; the outer level terminates when \(\Vert Ax^k+B\bar{x}^k\Vert \le \sqrt{3n_p}\times 1.0e-6\).
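For reference, the objective (35a) and the feasibility residual of the spherical constraints (35b) are straightforward to evaluate; a minimal sketch with our own function names is given below.

```julia
# Objective (35a): sum of inverse pairwise distances of n_p points (x_i, y_i, z_i).
function pairwise_energy(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    np  = length(x)
    obj = 0.0
    for i in 1:np-1, j in i+1:np
        obj += ((x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)^(-0.5)
    end
    return obj
end

# Maximum violation of the unit-sphere constraints (35b).
sphere_violation(x, y, z) = maximum(abs.(x .^ 2 .+ y .^ 2 .+ z .^ 2 .- 1.0))
```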

Table 4 Comparison of centralized and distributed solutions

The centralized solutions are of slightly better quality than the distributed ones, but the proposed algorithm significantly reduces the running time in all but one case (\(n_p=90\)) while ensuring feasibility. In addition, as indicated in Table 4, the numbers of inner- and outer-level iterations stay stable across all test cases, which suggests that the proposed algorithm scales well with problem size. In view of the discussion in Sect. 4.2, we also compare with the penalty method, in which \(\lambda ^k=0\) for all k, to demonstrate the effect of the outer-level dual variable. Without updating \(\lambda \), the penalty method requires more inner/outer updates and substantially longer time.

6.3 A multi-block problem: robust tensor PCA

In this section, we use the robust tensor PCA problem considered in [31] to illustrate that the two-level framework can be generalized to the multi-block problem (4), and that, when Conditions 1 and 2 are satisfied, the resulting two-level algorithm can potentially accelerate one-level ADMM. In particular, given an estimate R of the CP-rank, the problem of interest is cast as

$$\begin{aligned} \min _{A,B,C,\mathcal {Z}, \mathcal {E}, \mathcal {B}} \Vert \mathcal {Z}-\llbracket A, B, C\rrbracket \Vert ^2 +\alpha \Vert \mathcal {E}\Vert _1 + \alpha _N \Vert \mathcal {B}\Vert _F^2 \quad \mathrm {s.t.}~~ \mathcal {E}+\mathcal {Z}+\mathcal {B} = \mathcal {T}, \end{aligned}$$
(36)

where \(A \in \mathbb {R}^{I_{1} \times R}, B \in \mathbb {R}^{I_{2} \times R}, C \in \mathbb {R}^{I_{3} \times R}\), and \(\llbracket A, B, C \rrbracket \) denotes the sum of column-wise outer products of A, B, and C. We denote the mode-i unfolding of tensor \(\mathcal {Z}\) by \(Z_{(i)}\), the Khatri-Rao product of matrices by \(\odot \), the Hadamard product by \(\circ \), and the soft shrinkage operator by \(\textbf{S}\). We implement the two-level framework as in Algorithm 5.

Algorithm 5 (two-level framework for problem (36))
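Two of the operators used in Algorithm 5 admit short self-contained implementations; the following is a sketch with our own function names (tensor toolboxes provide equivalents).

```julia
using LinearAlgebra

# Soft shrinkage operator S_τ (proximal operator of the ℓ1 norm), applied elementwise.
soft_shrink(X, τ) = sign.(X) .* max.(abs.(X) .- τ, 0.0)

# Khatri-Rao product A ⊙ B: column-wise Kronecker product of A ∈ R^{I×R} and B ∈ R^{J×R}.
function khatri_rao(A::AbstractMatrix, B::AbstractMatrix)
    size(A, 2) == size(B, 2) || error("A and B must have the same number of columns")
    return reduce(hcat, [kron(A[:, r], B[:, r]) for r in 1:size(A, 2)])
end
```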

We first perform ADMM-g in steps 3–10. We note that there are some modifications to the ADMM-g described in [31]: since our two-level framework requires the introduction of an additional slack variable \(\mathcal {S}\), steps 6–8 have an additional term \(S^{k}_{(1)}\), and \(S_{(1)}^{k+1}\) is then updated in step 9 via a gradient step as in ADMM-g; moreover, during the update of \(\mathcal {B}\), we also add a proximal term with coefficient \(\delta _6/2\). When the residual \(\Vert \mathcal {Z}^{k+1}+\mathcal {E}^{k+1}+\mathcal {B}^{k+1} +\mathcal {S}^{k+1}-\mathcal {T}\Vert _F\) is small enough, which serves as an indicator of the convergence of ADMM-g, we multiply the penalty \(\beta \) by some \(\gamma \) as long as \(\beta <1.0e+6\), and update the outer-level dual variable \(\varLambda \) as in step 12, where the projection step is omitted.
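This outer-level bookkeeping can be summarized by the following sketch (variable names are ours; the dual step mirrors the update \(\lambda \leftarrow \lambda + \beta z\) of the two-level framework, up to the sign convention for the slack, and the projection of \(\varLambda \) is omitted as noted above).

```julia
using LinearAlgebra

# Sketch of the outer-level check in Algorithm 5, run after an inner ADMM-g pass.
# Z, E, B, S are the current tensor iterates (as arrays), T is the data tensor,
# Λ is the outer dual variable, β the outer penalty, γ the penalty growth factor.
function outer_check!(Λ, Z, E, B, S, T, β; γ = 1.5, tol = 1.0e-5, β_cap = 1.0e6)
    if norm(Z + E + B + S - T) ≤ tol          # Frobenius norm of the inner residual
        Λ .+= β .* S                           # dual update Λ ← Λ + β S (no projection)
        β < β_cap && (β *= γ)                  # grow the penalty while below the cap
    end
    return β
end
```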

We experiment on tensors with dimensions \(I_1 = 30\), \(I_2 = 50\), and \(I_3 = 70\), which match the largest instances tested in [31]; the initial estimate R is given by \(R_{CP}+ \lceil 0.2*R_{CP} \rceil \). In our implementation, we set \(\gamma =1.5\), \(c=3\), and the initial \(\beta \) to 2; the inner-level ADMM-g terminates if the residual \(\Vert \mathcal {Z}^{k+1}+\mathcal {E}^{k+1}+\mathcal {B}^{k+1} +\mathcal {S}^{k+1}-\mathcal {T}\Vert _F\) is less than \(\max \{1e-5, 1e-3/K_{\text {out}}\}\), where \(K_{\text {out}}\) is the current outer-level iteration count. All other parameters, the generation of problem data, and the initialization follow the description in [31, Section 5]. For each value of the CP rank, we generate 10 cases and let ADMM-g and the proposed two-level algorithm perform 2000 (inner) iterations. We calculate the geometric mean \(r^{\text {Geo}}_k\) of the primal residuals \(r_k=\Vert \mathcal {Z}^{k}+\mathcal {E}^{k}+\mathcal {B}^{k}-\mathcal {T}\Vert _F\) over the 10 cases and plot \(\lg r^{\text {Geo}}_k\) as a function of the iteration count k in Fig. 1. We also calculate the geometric mean \(e^{\text {Geo}}_k\) of the relative errors \(\Vert \mathcal {Z}^k-\mathcal {Z}_{\textrm{true}}\Vert _F/\Vert \mathcal {Z}_{\textrm{true}}\Vert _F\) over the 10 cases, where \(\mathcal {Z}_{\textrm{true}}\) is the generated true low-rank tensor, and plot \(e^{\text {Geo}}_k\) in Fig. 2. For our two-level algorithm, the primal residual decreases relatively slowly during the first few inner ADMM-g passes; however, as we update the outer-level dual variable \(\varLambda \) and the penalty \(\beta \), \(r^{\text {Geo}}_k\) drops significantly faster than that of ADMM-g and achieves feasibility with high precision in around 500 inner iterations. The relative error \(e^{\text {Geo}}_k\) of the two-level algorithm converges slightly more slowly than that of ADMM-g, but it eventually catches up and reaches the same level of optimality. The results suggest that our proposed two-level algorithm not only ensures convergence for a wider range of applications where ADMM may fail, but can also accelerate ADMM on problems where convergence is already guaranteed.
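The curves in Figs. 1 and 2 aggregate the 10 random cases via a geometric mean at each iteration; a one-line sketch of this reduction (assuming a matrix whose columns hold the per-case residual histories):

```julia
using Statistics

# Geometric mean across cases at each iteration: rows = iterations, columns = cases.
geo_mean(R::AbstractMatrix) = vec(exp.(mean(log.(R), dims = 2)))
```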

Fig. 1 Comparison of infeasibility \(\lg r^{\text {Geo}}_k\)

Fig. 2 Comparison of relative errors \(e^{\text {Geo}}_k\)

7 Conclusion

This paper proposes a two-level distributed algorithm to solve the nonconvex constrained optimization problem (3). We identify some limitations of the standard ADMM algorithm, which in general cannot guarantee convergence when parallelization of constrained subproblems is considered. To overcome these difficulties, we propose a novel yet concise distributed reformulation, which enables us to separate the underlying complications into two levels. The inner level utilizes multi-block ADMM to facilitate parallel implementation, while the outer level uses the classic ALM to guarantee convergence to feasible solutions. Global convergence, local convergence, and iteration complexity of the proposed two-level algorithm are established, and we demonstrate that the underlying algorithmic framework can be extended to more complicated nonconvex multi-block problems (4). In comparison with other existing algorithms capable of solving the same class of nonconvex constrained programs, the proposed algorithm exhibits advantages in speed, scalability, and robustness. Thus, for general nonconvex constrained multi-block problems, the two-level algorithm can serve as an alternative to the workaround proposed in [31] when Condition 1 or 2 fails, and can potentially accelerate ADMM on problems where slow convergence is frequently encountered.