A linear algebra perspective on the random multi-block ADMM: the QP case

Embedding randomization procedures in the Alternating Direction Method of Multipliers (ADMM) has recently attracted increasing interest as a remedy to the fact that the direct multi-block generalization of ADMM is not necessarily convergent. Even if, in practice, the introduction of such techniques can mitigate the diverging behaviour of the multi-block extension of ADMM, from the theoretical point of view they ensure only convergence in expectation, which may not be a good indicator of robustness and efficiency. In this work, analysing the strongly convex quadratic programming case from a linear algebra perspective, we interpret the block Gauss–Seidel sweep performed by the multi-block ADMM in the context of the inexact Augmented Lagrangian Method. Using the proposed analysis, we outline an alternative to the techniques present in the literature which, supported by stronger theoretical guarantees, ensures the convergence of the multi-block generalization of ADMM.


Introduction
In this work we consider the solution of the Quadratic Programming (QP) problem

min_{x ∈ R^d} (1/2) x^T H x + g^T x   s.t.   Ax = b,   (QP)

where H ∈ R^{d×d} is Symmetric Positive Definite (SPD in short) and A ∈ R^{m×d} (d ≥ m) has full rank.
Recently, problem (QP) has been widely used as a sample problem for the convergence analysis of the n-block generalization of the Alternating Direction Method of Multipliers (ADMM) [6,11,22,49,55]. In particular, in [11], a counterexample in the form of problem (QP) has been given to show that the direct n-block extension of ADMM is not necessarily convergent when solving non-separable convex minimization problems. This counterexample has motivated a series of very recent works, including [8,10,12,14,19,29,32-35,41,42,44,53,54], where the authors analyse modifications of ADMM which ensure its convergence when n ≥ 3. In particular, in [12,44,53] a series of randomization procedures has been introduced which is able to guarantee the convergence in expectation of the n-block generalization of ADMM. Since then, such techniques have been proposed as a possible remedy to the fact that the deterministic direct n-block extension of ADMM is not necessarily convergent.
The ADMM [6,22] was originally proposed in [28] and, in its n-block version, it embeds an n-block Gauss–Seidel (GS) decomposition [5,27] into each iteration of the Augmented Lagrangian Method (ALM) [36,50]: the primal variables, partitioned into n blocks, are cyclically updated and then a dual-ascent-type step for the dual variables is performed.
Adopting a purely linear-algebraic approach, in the particular case of problem (QP), ALM and ADMM can be simply interpreted in terms of matrix splitting techniques (see [31,56]) for the solution of the corresponding Karush-Kuhn-Tucker (KKT) linear system (see Sects. 3 and 6).
Even if in the numerical linear algebra community the study of matrix splitting techniques for the solution of linear systems arising from saddle point problems is a well-established line of research (see [2, Sec. 8] for an overview), this connection seems to be only partially exploited in the works [12,44,53]. Moreover, despite the fact that the analogies between ADMM and GS+ALM are apparent, to the best of our knowledge very few works perform a precise investigation in this direction (even in the simple case when the problem is given by Eq. (QP)).
Indeed, even if it is natural to view ADMM as an approximate version of the ALM, as reported in [22,23], there were no known results quantifying this interpretation until the very recent work [15]: here the authors investigate the connection of the block symmetric Gauss–Seidel method [31, Sec. 4.1.1] with the inexact proximal ALM, which represents a somewhat different setting from the one investigated here.
Broadly speaking, this work aims to depict a precise picture of the synergies occurring between GS and ALM in order to give rise to ADMM and, in turn, to shed new light on the hidden machinery which controls its convergence.
For the reasons explained above, our starting point is an analysis of the ALM from an inexact point of view, specifically tailored for problem (QP). Indeed, inexact ALMs (iALM) have attracted the attention of many researchers in recent years and we refer to [57, Sec. 1.4] for a very recent literature review. We mention explicitly the works [38,39,43,45], where the iALM is analysed for solving linearly constrained convex programming problems, a framework very similar to the one analysed here. To the best of our knowledge, our approach does not have any evident analogy with the previously mentioned papers.
On the other hand, the connections of the ALM with monotone operators/splitting methods are well understood [21,51] and our analysis resembles this line of research more closely: we use, in essence, a matrix splitting of the augmented KKT matrix of (QP) to represent the ALM/iALM iterations. It is not surprising that, as a result of this line of reasoning, we are able to relate the convergence of ALM/iALM (and their rate of convergence to an ε-accurate primal-dual solution) to the spectral radius ρ of the iteration map of a fixed point problem (see Eq. (9)).
A careful check of the literature reveals some analogies between our approach and the inexact Uzawa method [1]. Indeed, the ALM can be interpreted as the Uzawa method applied to the augmented KKT system of problem (QP) and, in the context of the inexact Uzawa method, it is empirically well documented [25] and theoretically well understood [7,16-18,24] that a fixed number of Successive Over-Relaxation (SOR) [26,58] steps per inner solve (typically 10) is needed in order to reproduce the convergence rate of the exact algorithm.
All the inexactness criteria developed in the previously mentioned works are characterized by a summability condition or a relative error condition based on the residual previously computed.
A first important by-product of our analysis is that we are able to prove the convergence of the iALM without imposing any summability condition on the sequence {η^k}_k which controls the amount of inexactness of the iALM at the k-th iteration (see Theorem 9), also in the case when the source of inexactness is modelled using a random variable (see Lemma 10). A second important advantage of our approach is that we are able to give explicit bounds for the rate of convergence of the iALM in relation to the speed at which the sequence {η^k}_k converges to zero.
Beyond the previously mentioned advantages of our analysis, we identify the main contribution of this work in establishing an explicit link between the accuracy required to ensure convergence and the specific solver used to address the minimization step in the ALM, which, in the case of problem (QP), is equivalent to the solution of an SPD linear system. Using explicit error-reduction bounds for the SOR method [47] and its Randomly Shuffled version [48], we are able to prove that the inexactness criterion η^k = R^{k+1} (with R < 1 suitably user-defined) can be satisfied by performing a constant number of iterations (see Theorem 17). Moreover, observing that the GS decomposition is a particular case of the SOR decomposition, we are able to connect the very well known convergence issues [49,55] of the direct n-block extension of ADMM (and of its randomized versions [12,44,53]) to the fact that one GS sweep per iALM-step may not be sufficient to deliver the accuracy needed for convergence. Finally, as an interesting result of our analysis, we are able to propose a simple numerical strategy aiming to mitigate, if not to eliminate entirely, the convergence issues of ADMM (see Sect. 6): this proposal, due to its solid theoretical guarantees of convergence, could be considered as a competitive alternative to the techniques introduced to date [12,44,53]. We provide some preliminary computational evidence of this fact (see Sect. 7).

Test problems
In order to showcase the developed theory, in the remainder of this work we will consider the following test problems (all the numerical results presented are obtained using Matlab R2020b):

Problem 1 H is the Kernel Matrix associated with the radial basis function for the dataset heart_scale from [9] (270 instances, 13 features). In particular, we consider h = 0.5 and g a random vector. For the constraints, we choose A = e^T, where e is the vector of all ones, and b = 1.
Problem 2 Following [11], we consider H = hI_{3×3} with h = 0.05 and g a random vector. For the constraints we consider the 3×3 matrix from [11] and b a random vector (rank(A) = 3).

Notation
In the following, as is customary in the optimization community, we will use superscripts to denote particular elements of a sequence (scalars, vectors, matrices). In particular, for scalars, this choice could locally clash with the power operation on a scalar, but the meaning will always be clear from the context. To denote the whole sequence, instead, we will use subscripts, e.g., {η^k}_k ⊂ R, {x^k}_k ⊂ R^d and so on. Given a vector v ∈ R^d, ‖v‖ denotes the Euclidean norm whereas, given a matrix H ∈ R^{d×d}, ‖H‖ denotes the 2-norm. Moreover, κ₂(H) := ‖H‖‖H^{-1}‖ will be used for the condition number in 2-norm. Finally, in the following, we will freely use a series of standard definitions from linear algebra, e.g., those of minimal polynomial, diagonalizability and similarity, and we refer the interested reader to [37].

Augmented Lagrangian and KKT
If we consider the Augmented Lagrangian

L_β(x, μ) := (1/2) x^T H x + g^T x − μ^T(Ax − b) + (β/2)‖Ax − b‖²,

the corresponding KKT conditions are

H x + g − A^T μ + βA^T(Ax − b) = 0   and   Ax − b = 0.

Multiplying by β the second KKT condition, we obtain the system

𝒜 [x, μ]^T = [βA^T b − g, βb]^T,   𝒜 := [[H_β, −A^T], [βA, 0]],   (1)

where H_β := H + βA^T A. As will become clear in Sect. 3, the main reason for rearranging the KKT conditions as in (1) is that we will be able to interpret the Augmented Lagrangian Method of Multipliers (ALM) as a stationary method corresponding to a particular splitting of the matrix 𝒜.
Theorem 1 states the existence of a unique solution of problem (1):

Theorem 1 The matrix 𝒜 is invertible for all β > 0.
Proof Observe that

𝒜 = [[I, 0], [βAH_β^{-1}, I]] [[H_β, −A^T], [0, βAH_β^{-1}A^T]],

so that det(𝒜) = det(H_β) det(βAH_β^{-1}A^T). The non-singularity follows by using the fact that H_β is SPD and A is of full rank, so that βAH_β^{-1}A^T is SPD as well. See also [2, Sec. 3] for different factorizations of saddle point matrices.
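As a quick numerical sanity check of Theorem 1, the following Python sketch (NumPy instead of the paper's Matlab; the instance data are random and purely illustrative, not one of the paper's test problems) builds the augmented KKT matrix of (1) and verifies that it is invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, beta = 6, 3, 1.0
M = rng.standard_normal((d, d))
H = M.T @ M + np.eye(d)              # SPD Hessian (illustrative data)
A = rng.standard_normal((m, d))      # full row rank with probability one
H_beta = H + beta * A.T @ A

# augmented KKT matrix of system (1): [[H_beta, -A^T], [beta*A, 0]]
K = np.block([[H_beta, -A.T],
              [beta * A, np.zeros((m, m))]])

# Theorem 1: the KKT matrix is invertible for every beta > 0
print(np.linalg.matrix_rank(K) == d + m)
```

The check holds for any β > 0 and any full-row-rank A, since the Schur complement βAH_β^{-1}A^T is SPD.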

Let us define:
Definition 2 (ε-accurate primal-dual solution) We say that [x, μ]^T is an ε-accurate primal-dual solution for problem (QP) if

max{‖Hx + g − A^Tμ‖, ‖Ax − b‖} ≤ ε.

If [x, μ]^T is a random variable, we say that it is an expected ε-accurate primal-dual solution for problem (QP) if

max{E(‖Hx + g − A^Tμ‖), E(‖Ax − b‖)} ≤ ε.

The Augmented Lagrangian method of Multipliers (ALM)
The general form of ALM is given by

x^{k+1} = argmin_x L_β(x, μ^k),   μ^{k+1} = μ^k − β(Ax^{k+1} − b),

which, for problem (QP), reads as

H_β x^{k+1} = A^T μ^k + βA^T b − g,   μ^{k+1} = μ^k − β(Ax^{k+1} − b).   (2)

It is important to observe that the iterates [x^{k+1}, μ^{k+1}]^T produced by (2) are dual feasible, i.e., Hx^{k+1} + g − A^Tμ^{k+1} = 0. It is well known that ALM can be derived by applying the Proximal Point Method to the dual of problem (QP), see [51, Sec. 6.1], but in this particular case it can also be recast in an operator splitting framework (see [51, Sec. 7], [21]): indeed, the ALM scheme can be interpreted as a fixed point iteration obtained from a splitting decomposition for the KKT linear algebraic system (1) (see [30,56] and [2, Sec. 8]). Writing

𝒜 = M_β − N_β,   M_β := [[H_β, 0], [βA, I]],   N_β := [[0, A^T], [0, I]],

we can write Eq. (2) as

M_β [x^{k+1}, μ^{k+1}]^T = N_β [x^k, μ^k]^T + [βA^Tb − g, βb]^T,

i.e., as a fixed point iteration of the form

[x^{k+1}, μ^{k+1}]^T = G_β [x^k, μ^k]^T + M_β^{-1}[βA^Tb − g, βb]^T,   where G_β := M_β^{-1}N_β.

The following Theorem 3 (see [13, Sec. 2] for a similar result) is the cornerstone to prove the convergence of the ALM (see Eq. (2)) and of its inexact version (see Eq. (8)).
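The ALM update and its dual-feasibility property are easy to reproduce numerically. A minimal Python sketch (illustrative random data, not the paper's test problems; the rows of A are orthonormalized only to keep the example well conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, beta = 6, 3, 5.0
M = rng.standard_normal((d, d))
H = M.T @ M + np.eye(d)                              # SPD Hessian (illustrative)
A = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # full row rank (orthonormal rows)
g = rng.standard_normal(d)
b = rng.standard_normal(m)
H_beta = H + beta * A.T @ A

x, mu = np.zeros(d), np.zeros(m)
for _ in range(500):
    # minimization step of (2): H_beta x^{k+1} = A^T mu^k + beta A^T b - g
    x = np.linalg.solve(H_beta, A.T @ mu + beta * A.T @ b - g)
    # dual ascent step
    mu = mu - beta * (A @ x - b)
    # the ALM iterates are dual feasible at every iteration (up to round-off)
    assert np.linalg.norm(H @ x + g - A.T @ mu) < 1e-8

print(np.linalg.norm(A @ x - b) < 1e-8)   # primal feasibility is reached in the limit
```

The in-loop assertion mirrors the observation above: with an exact inner solve, dual infeasibility is zero at every iteration, and only the primal residual has to converge.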

Theorem 3
The eigenvalues λ of G_β are such that λ ∈ [0, 1) for all β > 0 and, moreover, ρ_β := ρ(G_β) decreases when β increases.

Proof If λ is an eigenvalue of G_β with eigenvector [u, v]^T, the eigenvalue equation N_β[u, v]^T = λM_β[u, v]^T reads

A^T v = λ H_β u,   (1 − λ)v = λβAu.   (3)

The proof is structured into three parts.
Part 1: If λ is an eigenvalue of G_β, then λ ≠ 1. By contradiction, suppose that λ = 1; then from (3) we have H_βu − A^Tv = 0 and βAu = 0, i.e., 𝒜[u, v]^T = 0, which leads to an absurd since 𝒜 is invertible for β > 0 (see Theorem 1).
Part 2: If [u, v]^T is an eigenvector of G_β, then u ≠ 0. By contradiction, if u = 0, then from the second equation in (3) we obtain (1 − λ)v = 0 and hence an absurd using Part 1.
Part 3: If v = 0, multiplying by u^T the first equation in (3), we obtain λu^TH_βu = 0, which leads to λ = 0 since H_β is SPD. If v ≠ 0, combining the two equations in (3) we obtain

λ u^T H_β u = (λβ/(1 − λ)) u^T A^T A u.   (4)

If in Eq. (4) u^TA^TAu/u^Tu = 0, reasoning as before and using Part 1, we obtain λ = 0.
Instead, if in Eq. (4) we have u^TA^TAu/u^Tu ≠ 0, solving (4) for λ ≠ 0 yields

λ = u^THu / (u^THu + βu^TA^TAu) ∈ (0, 1),

from which it also follows that the non-zero eigenvalues, and hence ρ_β, decrease when β increases.

Lemma 4 The matrix G_β is diagonalizable for all β > 0.

Proof The proof is divided into two parts. Part 1: We prove that the matrix I − βAH_β^{-1}A^T is invertible. To prove this fact, it is enough to prove that βAH_β^{-1}A^T does not have unitary eigenvalues. Using the Woodbury formula and defining C := βAH^{-1}A^T, we have βAH_β^{-1}A^T = C(I + C)^{-1}.

Thesis follows by observing that the eigenvalues of βAH_β^{-1}A^T = C(I + C)^{-1} are of the form λ/(1 + λ), with λ ≥ 0 an eigenvalue of C, and hence none of them is unitary.
Part 2: We prove that the minimal polynomial of G_β factorizes into distinct linear factors.
The proof of this fact follows by observing that the minimal polynomials of the blocks on the diagonal of G_β, namely the null matrix and the symmetric matrix I − βAH_β^{-1}A^T, factorize into distinct linear factors since these blocks are diagonalizable (see [37, Cor. 3.3.10]). Moreover, since the matrix I − βAH_β^{-1}A^T is invertible, the two minimal polynomials do not have common factors. Hence, the product of such minimal polynomials (which coincides with their lowest common multiple (lcm)) is the minimal polynomial of the whole matrix. Indeed, this last implication holds for generic block upper triangular matrices. To prove it, let us consider a block upper triangular matrix F with b diagonal blocks F_{ii}, i = 1, ..., b. Let us denote, moreover, by m_i(x) the minimal polynomial of the block F_{ii} and by m(x) the minimal polynomial of the whole matrix F. We have lcm(m_i(x)) | m(x) because m(F_{ii}) = 0. Moreover, by direct computation, one can check that, defining s(x) := ∏_{i=1}^{b} m_i(x), it holds s(F) = 0. If the polynomials m_i(x) are pairwise relatively prime (i.e., they do not have common factors), then s(x) = lcm(m_i(x)) and hence s(x) = m(x). The final part of the statement, i.e., the diagonalizability of G_β, follows by observing that, if the minimal polynomial of a given matrix factorizes into distinct linear factors, then the matrix is diagonalizable (see, once more, [37, Cor. 3.3.10]).
Remark 1 (Non-unitary step length) The framework presented until now allows us to consider also the case of non-unitary dual steps. Indeed, considering the splitting

[[H_β, −A^T], [γβA, 0]] = [[H_β, 0], [γβA, I]] − [[0, A^T], [0, I]],

it is easy to see that the corresponding ALM-type update is

H_β x^{k+1} = A^T μ^k + βA^T b − g,   μ^{k+1} = μ^k − γβ(Ax^{k+1} − b).

Moreover, using the techniques from Theorem 3 and Lemma 4, it can be proved that its convergence and rate of convergence depend on the spectral radius of I − γβAH_β^{-1}A^T. Hence, the choice of the parameter γ could be used, in principle, to further improve the rate of convergence of ALM. In the following we consider γ = 1.

Lemma 5 There exists a constant M > 0 such that ‖G_β^k‖ ≤ M ρ_β^k for all k ∈ N (such a constant exists since G_β is diagonalizable, see Lemma 4).
Definition 6 In the following, [x, μ]^T denotes the unique solution of the linear system (1) (see Theorem 1 for existence and uniqueness). Moreover, we define the error and the residual at step k as

e^k := [x^k, μ^k]^T − [x, μ]^T   and   d^k := 𝒜[x^k, μ^k]^T − [βA^Tb − g, βb]^T,

so that, in particular, d^k = 𝒜e^k.

Lemma 7
The ALM in (2) converges for all β > 0. Moreover, we have, for all k ∈ N,

‖d^k‖ ≤ M κ₂(𝒜) ρ_β^k ‖d^0‖.

Proof From direct computation, we have e^k = G_β^k e^0 and hence d^k = 𝒜 G_β^k 𝒜^{-1} d^0, where we used 𝒜e^k = d^k. Thesis follows by passing to the norms and using Lemma 5.
In Fig. 1 we report the behaviour of the condition number in 2-norm of the matrices 𝒜, H_β (respectively κ₂(𝒜), κ₂(H_β)) and the spectral radius ρ_β for different values of β. The results obtained in Fig. 1 confirm the statement regarding ρ_β in Theorem 3, i.e., ρ_β decreases when β increases. And indeed, using Lemma 7, we can observe that the convergence of ALM can be significantly sped up by increasing the value of β, which corresponds to a decrease of ρ_β. On the other hand, the potential speed-up resulting from considering large values of β comes at the cost of solving an increasingly ill-conditioned linear system involving H_β (see the first equation in (2) and the behaviour of κ₂(H_β) in Fig. 1). Indeed, when β is large, the matrix H_β is dominated by the term βA^TA (see [2, Sec. 8.1] and references therein for more details) and, if A^TA is singular, the condition number of the matrix H_β progressively degrades when β increases (see the behaviour of κ₂(H_β) for Problem 1 in the upper panel of Fig. 1).
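The trade-off just described (ρ_β decreases with β while κ₂(H_β) deteriorates) is easy to observe numerically. A minimal Python sketch, with random illustrative data and m < d so that A^TA is singular:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 6, 2                           # m < d, so A^T A is singular
M = rng.standard_normal((d, d))
H = M.T @ M + np.eye(d)
A = rng.standard_normal((m, d))
betas = [0.1, 1.0, 10.0, 100.0]

def rho(beta):
    H_beta = H + beta * A.T @ A
    # nontrivial diagonal block of G_beta: I - beta * A H_beta^{-1} A^T
    S = np.eye(m) - beta * A @ np.linalg.solve(H_beta, A.T)
    return np.abs(np.linalg.eigvals(S)).max()

rhos = [rho(be) for be in betas]
conds = [np.linalg.cond(H + be * A.T @ A) for be in betas]
# rho_beta decreases as beta grows, while kappa_2(H_beta) deteriorates
print([round(r, 3) for r in rhos], [round(c, 1) for c in conds])
```

Each non-zero eigenvalue of G_β has the form u^THu/(u^THu + βu^TA^TAu), so the monotone decrease of ρ_β in β is guaranteed, while the growth of κ₂(H_β) reflects the singularity of A^TA.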
The following Lemma 8 states the worst case complexity of ALM.

Lemma 8 Define C := (M κ₂(𝒜)/β)‖d^0‖. Then k ≥ log_{ρ_β}(ε/C) iterations of the ALM are sufficient to deliver an ε-accurate primal-dual solution.

Proof Observe that we have

‖Ax^k − b‖ ≤ (1/β)‖d^k‖ ≤ (M κ₂(𝒜)/β) ρ_β^k ‖d^0‖ = C ρ_β^k,

where in the last inequality we used Lemma 7. Since, as observed at the beginning of this section, the iterates [x^k, μ^k]^T produced by the ALM are dual feasible, we have Hx^k + g − A^Tμ^k = 0, and hence k ≥ log_{ρ_β}(ε/C) iterations of the ALM are sufficient to deliver an ε-accurate primal-dual solution.
In Fig. 2, we show the behaviour of the quantities involved in the proof of Lemma 8 (the notation used in the legend is consistent with that used in Lemma 8, except for the fact that the numerical value of the constant C used there is normalized using M, i.e., in Fig. 2 we report C ≡ C/M). As Lemma 8 states and Fig. 2 shows, the function Cρ_β^k is an upper bound for the quantity ‖Ax^k − b‖. In this example, in order to further highlight the dependence of ρ_β on β, we choose different values of β (β = 0.1 and β = 5) such that, for Problem 1 and Problem 2, we obtain ρ_β ≈ 0.05. Let us point out that the results reported in Fig. 2 are obtained solving the linear system in (2) to high accuracy (with a direct method, using Matlab's "backslash" operator) and, since the iterates must be dual feasible, the residuals ‖Hx^k + g − A^Tμ^k‖ are close to machine precision.

Inexact ALM (iALM)
In this section we study in detail the iALM for problem (QP). The reader may see [57, Sec. 1.4] for a recent survey on this subject. In particular, we assume that the first equation in (2) is not solved exactly, i.e., x^{k+1} is such that

H_β x^{k+1} = A^T μ^k + βA^T b − g + r^k.   (7)

In our framework, the iALM reads as

H_β x^{k+1} = A^T μ^k + βA^T b − g + r^k,   μ^{k+1} = μ^k − β(Ax^{k+1} − b),   (8)

and (8) can be alternatively written as the following inexact fixed point iteration (see [4] and [46, Sec. 12.2] for more details on this topic):

[x^{k+1}, μ^{k+1}]^T = G_β [x^k, μ^k]^T + M_β^{-1}([βA^Tb − g, βb]^T + [r^k, 0]^T).   (9)

Contrary to what was observed for the exact ALM (see the beginning of Sect. 3), the iterates produced by (8) are not dual feasible since

Hx^{k+1} + g − A^Tμ^{k+1} = r^k,

i.e., the error r^k introduced in the solution of the first equation in (8) can be interpreted as a measure of the violation of the dual feasibility condition. In Sect. 5.1.2 we will consider the point x^{k+1} in (7) as the result of a randomized procedure and, for this reason, we are going to present this section assuming that {r^k}_k in (9) is a sequence of random variables (and hence all the generated {[x^k, μ^k]^T}_k are random variables). Moreover, all the results presented here can be easily restated in a deterministic framework by substituting the "almost sure (a.s.) convergence" with "convergence" and dropping the expectation operator. For a review of the probabilistic concepts we use in the following, see [52, Ch. 2].
The following Theorem 9 addresses the convergence of the iALM, using the inexact fixed point formulation in (9), under the condition that r^k, i.e., the error introduced in the solution of the first equation in (8), converges a.s. to zero.

Theorem 9 Let β > 0. If lim_{j→∞} ‖r^j‖ = 0 a.s., then the iALM in (8) converges a.s. to the solution of the linear system (1) and the following inequality holds a.s. for every k ∈ N:

‖d^k‖ ≤ M κ₂(𝒜) ρ_β^k ‖d^0‖ + M‖𝒜‖‖M_β^{-1}‖ Σ_{j=0}^{k−1} ρ_β^{k−1−j} ‖r^j‖.   (10)

Proof If [x, μ]^T is a solution of (1), then it satisfies the fixed point equation

[x, μ]^T = G_β [x, μ]^T + M_β^{-1}[βA^Tb − g, βb]^T.   (11)

Subtracting (11) from (9) we obtain

e^{k+1} = G_β e^k + M_β^{-1}[r^k, 0]^T,   (12)

and hence e^k = G_β^k e^0 + Σ_{j=0}^{k−1} G_β^{k−1−j} M_β^{-1}[r^j, 0]^T. Passing to the norms in (12) and using Lemma 5, we have

‖e^k‖ ≤ M ρ_β^k ‖e^0‖ + M‖M_β^{-1}‖ Σ_{j=0}^{k−1} ρ_β^{k−1−j} ‖r^j‖.   (13)

The a.s. convergence to zero of {‖e^k‖}_k follows from (13) observing that, if lim_{k→∞} ‖r^k‖ = 0 a.s., then lim_{k→∞} Σ_{j=0}^{k−1} ρ_β^{k−1−j}‖r^j‖ = 0 a.s. (this is a particular case of the Toeplitz Lemma, see [46] for the deterministic case and [40] and references therein for the probabilistic case). The second part of the statement follows by observing that d^k = 𝒜e^k and that ‖e^0‖ ≤ ‖𝒜^{-1}‖‖d^0‖.
Lemma 10 Suppose E(‖r^j‖) ≤ R^{j+1} for all j ∈ N and R < 1. Then the iterates of the iALM in (8) converge a.s. to the solution of the linear system (1). Moreover, if R < ρ_β, then O(log_{ρ_β} ε) iterations are sufficient to produce an expected ε-accurate primal-dual solution; else, if ρ_β ≤ R, then O(log_R ε) iterations are sufficient (given that ε is sufficiently small).
Proof If E(‖r^j‖) ≤ R^{j+1} for all j ∈ N, then Σ_{j=0}^∞ E(‖r^j‖) < ∞ and hence, using [52, Th. 2.1.3], we have lim_{j→∞} ‖r^j‖ = 0 a.s. Using now Theorem 9, we have that ‖d^k‖ converges a.s. to zero.
Case R < ρ_β. Using Eq. (10) and the hypothesis E(‖r^j‖) ≤ R^{j+1}, we can bound E(‖d^k‖) by a constant multiple of ρ_β^k; here we define C₁ := (1 + ‖A^T‖). Using (14) and (15), and defining the constant C accordingly, we obtain that k ≥ log_{ρ_β}(ε/C) iterations of iALM are sufficient to produce an expected ε-accurate primal-dual solution.
Case ρ_β ≤ R. Using (14), we can bound E(‖d^k‖) by a constant multiple of kR^k. Hence, using (15) and defining C := max{C₁C₂, (1/β)C₂}, we obtain that, to produce an expected ε-accurate primal-dual solution, it suffices to perform k + log_R k ≥ log_R(ε/C) iterations of iALM. The last part of the statement follows by observing that lim_{k→∞} (k + log_R k)/k = 1.
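A small Python experiment in the spirit of Lemma 10 (illustrative random data; the inexactness r^k is injected artificially with ‖r^k‖ = R^{k+1}, and A is taken with orthonormal rows only for conditioning) confirms that a geometrically decaying inexactness suffices for convergence of both residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, beta, R = 6, 3, 10.0, 0.5
M = rng.standard_normal((d, d))
H = M.T @ M + np.eye(d)
A = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # orthonormal rows (conditioning)
g = rng.standard_normal(d)
b = rng.standard_normal(m)
H_beta = H + beta * A.T @ A

x, mu = np.zeros(d), np.zeros(m)
for k in range(200):
    chi = A.T @ mu + beta * A.T @ b - g
    # inexact minimization step: H_beta x^{k+1} = chi + r^k with ||r^k|| = R^{k+1}
    r = rng.standard_normal(d)
    r *= R ** (k + 1) / np.linalg.norm(r)
    x = np.linalg.solve(H_beta, chi + r)
    mu = mu - beta * (A @ x - b)

# both residuals vanish at a geometric rate (Lemma 10)
print(np.linalg.norm(A @ x - b) < 1e-6,
      np.linalg.norm(H @ x + g - A.T @ mu) < 1e-6)
```

Note that the dual residual after the update equals r^k exactly, so it decays like R^{k+1}, while the primal residual decays at rate max{ρ_β, R}.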
Before concluding this section, let us state the following Corollary 11, which will be used later: under the hypotheses of Lemma 10, the sequence {E(‖d^k‖)}_k converges to zero geometrically (up to a polynomial factor) with rate max{ρ_β, R}.

Proof Using (10), the thesis follows by observing that Σ_{j=0}^{k−1} ρ_β^{k−1−j} R^{j+1} ≤ k max{ρ_β, R}^k for all k.

The solution of the linear system
In this section, given [x^k, μ^k], we suppose that the linear system

H_β x^{k+1} = A^T μ^k + βA^T b − g =: χ^k   (18)

is solved using an iterative solver. Despite the fact that any iterative solver can be used for the solution of the SPD system in (18), we will focus our attention only on the block Successive Over-Relaxation method (SOR) [26,58] and its Randomly Shuffled version (RSSOR) [48]. Indeed, these choices will allow us to clearly interpret the Random n-block ADMM as an iALM, see Sect. 6. Since r^k in the first equation of (8) is the (possibly deterministic) residual associated with the linear system (18), i.e., r^k = H_β x^{k+1} − χ^k, one would be tempted to think that the increasing accuracy condition for the a.s. convergence to zero of the expected residual ‖r^k‖ in Lemma 10, i.e., E(‖r^k‖) ≤ R^{k+1}, requires that the expected number of iterations of the chosen iterative solver increases as the iALM iterations proceed. In this section we will show that this is not the case if R > ρ_β. For the remainder of this work, let us define {η^k}_k → 0 as the forcing sequence such that E(‖r^k‖) ≤ η^k for all k ∈ N. We use, moreover, the following inequalities: given B ∈ R^{d×d} SPD, if we order the eigenvalues of B as λ₁(B) ≥ ... ≥ λ_d(B), then

‖y‖²_B ≤ λ₁(B)‖y‖²   (19)   and   ‖y‖² ≤ (1/λ_d(B))‖y‖²_B.   (20)

For the sake of completeness, before presenting our results, we deliver a brief survey on the block SOR method, based on [31,48,56].

A brief survey on SOR and randomly shuffled SOR
Let B ∈ C^{d×d} and consider the linear system

By = χ.   (21)

We can express the matrix B as the sum of block-matrices

B = D − L − U,   (22)

where D is block-diagonal and L, U are, respectively, strictly block lower and strictly block upper triangular. Let us suppose now that the block-diagonal matrix D is invertible. The fixed point problem corresponding to Eq. (21) can be written as y = D^{-1}((L + U)y + χ) and the SOR method is defined as

(D − ωL) y^{j+1} = ((1 − ω)D + ωU) y^j + ωχ.   (23)

The Gauss–Seidel (GS) method is recovered for ω = 1. It is important to point out at this stage that, to interpret the block Gauss–Seidel sweep performed by the multi-block ADMM in the context of the inexact Augmented Lagrangian Method as in Sect. 6, we will need the results contained in this section only in the particular case ω = 1 (which corresponds precisely to the Gauss–Seidel case). On the other hand, we prefer to state all the theory in the more general framework of SOR (0 < ω < 2), as the results presented in Sect. 6 hold also for relaxation parameters different from ω = 1. Despite the fact that the detailed study of such generalizations of the multi-block ADMM falls outside the scope of this work, it is important to note that they might be of great practical interest due to the enhanced rate of convergence of SOR w.r.t. Gauss–Seidel when suitably selected relaxation parameters are chosen.
Observe that Eq. (23) can be written alternatively as

y^{j+1} = L_ω y^j + ω(D − ωL)^{-1}χ,   (24)

and for this reason, usually, the successive over-relaxation matrix is defined as L_ω := (D − ωL)^{-1}((1 − ω)D + ωU). The following corollary of the Ostrowski–Reich Theorem states the convergence of the block SOR iteration:

Corollary 12 [56, Cor. 3.14] Let B ∈ C^{n×n} and D, L, U be defined as in (22). If D is positive definite, then the block SOR method in (23) is convergent for all y^0 if and only if 0 < ω < 2 and B is positive definite.
In this work we are going to deal just with symmetric matrices and, for this reason, we denote the factor U in (22) by L^T. It is worth noting, moreover, that using the equality (1 − ω)D + ωL^T = (D − ωL) − ωB, we can further rewrite the SOR iteration in (23) as

y^{j+1} = (I − ω(D − ωL)^{-1}B) y^j + ω(D − ωL)^{-1}χ.   (25)

In [48], a Randomly Shuffled version of SOR (RSSOR) has been introduced and studied: it is obtained by considering P_j as a random permutation matrix (with uniform distribution and independent from the current guess y^j) and applying the SOR splitting to the linear system P_j B P_j^T P_j y = P_j χ, i.e., considering the decomposition P_j B P_j^T = D_{P_j} − L_{P_j} − L_{P_j}^T. The RSSOR is defined as

y^{j+1} = (I − ω P_j^T (D_{P_j} − ωL_{P_j})^{-1} P_j B) y^j + ω P_j^T (D_{P_j} − ωL_{P_j})^{-1} P_j χ.   (26)

Moreover, let us observe that, after defining Q_{P_j} := ω P_j^T (D_{P_j} − ωL_{P_j})^{-1} P_j, (26) can be written as a function of the random variables P_1, ..., P_j, i.e., y^{j+1} = (I − Q_{P_j}B)y^j + Q_{P_j}χ.   (27)

Before concluding this section, let us point out that the main idea behind RSSOR is related to the fact that, although the spectral distribution of the matrix PBP^T does not depend on the particular permutation matrix P, the spectrum of the lower triangular part D_P − L_P does depend on it. As a result, the spectral radius of the matrix L_ω^P := I − ωP^T(D_P − ωL_P)^{-1}PB is also affected by the particular choice of P. To further highlight the aforementioned dependence and to strengthen the reader's intuition in this regard, in Fig. 3 we report ρ(L_1^P) for all the permutation matrices P and for 100 randomly generated matrices of the form B = R^TR + I ∈ R^{7×7} (R is generated using Matlab's function "rand").
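A compact Python sketch of the RSSOR iteration (26) with ω = 1, i.e., Randomly Shuffled Gauss–Seidel, on a random SPD matrix of the same form used for Fig. 3 (the data are illustrative, and we use NumPy instead of the paper's Matlab):

```python
import numpy as np

rng = np.random.default_rng(4)
d, omega = 7, 1.0                      # omega = 1 is the (shuffled) Gauss-Seidel case
M = rng.standard_normal((d, d))
B = M.T @ M + np.eye(d)                # SPD, so SOR converges for 0 < omega < 2
chi = rng.standard_normal(d)
y_star = np.linalg.solve(B, chi)

def rssor_sweep(y, perm):
    # apply the SOR splitting to (P B P^T)(P y) = P chi for a random permutation P
    Bp = B[np.ix_(perm, perm)]
    Dp = np.diag(np.diag(Bp))
    Lp = -np.tril(Bp, -1)              # Bp = Dp - Lp - Lp^T
    yp = np.linalg.solve(Dp - omega * Lp,
                         ((1 - omega) * Dp + omega * Lp.T) @ y[perm] + omega * chi[perm])
    out = np.empty_like(y)
    out[perm] = yp                     # map back to the original ordering
    return out

y = np.zeros(d)
for _ in range(1000):
    y = rssor_sweep(y, rng.permutation(d))
print(np.linalg.norm(y - y_star) < 1e-6)
```

Each sweep draws a fresh uniform permutation, reorders B, performs one SOR step in the permuted ordering and maps the iterate back, exactly as prescribed by (26).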

Rate of convergence of SOR
This section is based on [48]. If B is SPD and is partitioned as in (22), the linear system in (21) can be transformed as

D^{-1/2}BD^{-1/2}(D^{1/2}y) = D^{-1/2}χ   (28)

(D is SPD since B is SPD) and hence the coefficient matrix can be decomposed as

D^{-1/2}BD^{-1/2} = I − D^{-1/2}LD^{-1/2} − (D^{-1/2}LD^{-1/2})^T,   (29)

where D^{-1/2}LD^{-1/2} and (D^{-1/2}LD^{-1/2})^T are, respectively, strictly lower triangular and strictly upper triangular. For the above explained reasons, in this section we will suppose that B = I − L − L^T. Observe, moreover, that the SOR method applied to the system in (28) with the splitting (29) coincides exactly with (24); hence, supposing that the diagonal of B is the identity simplifies the presentation without loss of generality.
The following Theorem 13 gives a precise bound for the rate of convergence of the SOR method:

Theorem 13 [48, Th. 1] Let B be an SPD matrix; then the SOR method (25) converges for 0 < ω < 2 in the energy norm associated with B according to the bound (30).

The rate of convergence stated in (30) depends on the dimension d of the problem and this feature is not desirable for large scale problems.
One of the main advantages of the RSSOR consists in the fact that the expected error reduction factor is independent of the dimension of the problem, as stated in the following:

Theorem 14 [48, Th. 4] The expected squared energy norm error of the RSSOR iteration converges exponentially with the bound (31), for any ω ∈ (0, 2).
As already pointed out, Eq. (31) does not exhibit any dependence on the dimension of the problem and, for this reason, the Randomly Shuffled version of SOR should be considered for large scale problems. Moreover, the following corollary addresses the convergence of the iterates to the solution of the linear system:

Corollary 15 lim_{j→∞} ‖y − y^j‖²_B = 0 a.s.

Proof Using (31), we have that Σ_{j=0}^∞ E(‖y − y^j‖²_B) < ∞. Thesis follows by applying [52, Th. 2.1.3].

Using SOR in iALM
We are ready to analyse the behaviour of the SOR method in the framework of the iALM (8). In particular, we are going to present our results for the RSSOR method (see Eq. (26)), but analogous techniques/results apply/hold for the non-randomized version (25). This choice is mainly driven by reasons of timeliness: in the next Sect. 6 we are able to interpret the recently introduced Randomized ADMM (RADMM) as a particular case of iALM where the linear system (18) is solved (inexactly) using RSSOR with ω = 1 (which will be denoted, in the following, as Randomly Shuffled Gauss–Seidel (RSGS)). For this reason, in this section, we apply the results presented in Sect. 4 in their probabilistic form, considering {r^k}_k and {[x^k, μ^k]^T}_k as sequences of random variables.
Of course, the same results as presented here hold, with simple modifications, for the deterministic ADMM and the classical GS method.
In order to use the rate of convergence in (31), we write H_β = D − L − L^T and transform the linear system in (18) as follows:

D^{-1/2}H_βD^{-1/2}(D^{1/2}x) = D^{-1/2}χ^k.   (32)

Consider, moreover, the random variable j(k) defined in (34) as the smallest number of RSSOR sweeps delivering the accuracy required at the k-th iALM iteration, where H_β x^{k+1} = χ^k and {x̃^{k+1,j}}_j is the random sequence generated by the RSSOR method in (26) to approximate x^{k+1}, i.e., the solution of problem (32).
The following Lemma 16 will be useful to state the main result of this section:

Lemma 16 Let us suppose that the RSSOR in Eq. (26) is used for the solution of the linear system (32) with y^0 = D^{1/2}x^k =: x̃^{k+1,0}. If the random variable (P^{k+1,0}, ..., P^{k+1,j}) is independent from {[x^k, μ^k]^T}_k for every j, k ∈ N (beyond the standard assumptions required on the P^{k+1,j} by RSSOR), then the conditional expected error satisfies the analogue of the bound (31).

Proof Let us observe that, using (27), we can write x̃^{k+1,j} = g(P^{k+1,0}, ..., P^{k+1,j}, x^k, μ^k), where g is a deterministic function.
Using the fact that, if the random variable Y is independent from X, the conditional expectation given X can be computed by freezing X (see the Freezing Lemma, [20, Example 5.1.5]), and using (31), we obtain the conditional bound. Moreover, using the conditional Jensen's inequality on the left hand-side of the previous bound (see [3, Th. 34.4]) and then passing to square roots, the thesis follows by considering the expectation on both sides of the resulting inequality and using the properties of the conditional expectation [3, Th. 34.4].
We are now ready to state the following Theorem 17, which summarizes the properties of the iALM in (8) when each sub-problem is solved using RSSOR:

Theorem 17 Let η^k = R^{k+1} with R < 1 and, for every k, let j(k) be the smallest index j such that E(‖r^{k,j}‖) ≤ η^k, where {r^{k,j} := H_β x̃^{k+1,j} − χ^k}_j is the sequence of random residuals generated by RSSOR initialized using x̃^{k+1,0} = x^k. Then, there exists j̄ ∈ N such that j̄ ≥ j(k) for all k. Moreover, an expected ε-accurate primal-dual solution of problem (QP) can be obtained in O(log_R ε) iALM iterations.
Proof Using (19) in (33) and the linearity of the expectation, and then using (20), we obtain a bound on E(‖r^{k,j}‖) in which we define x^{k+1,j} := D^{-1/2} x̃^{k+1,j} for j ≥ 1. Using the definition of r^{k+1,j} in this bound and observing that, by the second equation in (8) together with the hypothesis η^k = R^{k+1} and Eq. (17), there exists a constant C > 0 bounding the initial (warm-started) residual, we obtain (35). From (35), we obtain the first part of the statement observing that j̄(k) ≥ j(k) for all k. The last part of the statement follows by observing that with this choice of η^k the hypotheses of Lemma 10 are satisfied.
In the upper panels of Fig. 4 we report the quantities analysed in the proof of Lemma 10 (also in this case the notation used in the legend is consistent with that used in Lemma 10, except for the fact that the numerical constant C used there is normalized using M, i.e., in Fig. 4 we report C ≡ C/M). The expectation E(‖d^k‖) is approximated using the empirical mean over 15 iALM simulations, whereas, for each fixed k and j, E(‖r^{k,j}‖) is approximated using the empirical mean of E(E(‖r^{k,j}‖ | x^k, μ^k)) over 15 trajectories for [x^k, μ^k]^T and 15 simulations of the RSGS step. In the lower panels, we report, for each iALM step and for each simulation, the box-plots of the obtained j(k) (see Eq. (34)). As Theorem 17 states and Fig. 4 confirms, j(k) is bounded from above uniformly over all the iALM iterations (the choice of the parameters β and R is reported on top of the figure).
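The bounded behaviour of j(k) can also be reproduced with a small Python simulation (illustrative random data, not the paper's test problems; A has orthonormal rows so that, plausibly, ρ_β < R; the cap of 5000 inner sweeps is only a safeguard):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, beta, R = 6, 3, 10.0, 0.95
M = rng.standard_normal((d, d))
H = M.T @ M + np.eye(d)
A = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # orthonormal rows
g = rng.standard_normal(d)
b = rng.standard_normal(m)
H_beta = H + beta * A.T @ A

x, mu, sweeps = np.zeros(d), np.zeros(m), []
for k in range(300):
    chi = A.T @ mu + beta * A.T @ b - g
    j = 0
    # randomly shuffled GS sweeps, warm-started at x^k, until ||r^{k,j}|| <= eta^k = R^{k+1}
    while np.linalg.norm(H_beta @ x - chi) > R ** (k + 1) and j < 5000:
        p = rng.permutation(d)
        Hp = H_beta[np.ix_(p, p)]
        xp = np.linalg.solve(np.tril(Hp), -np.triu(Hp, 1) @ x[p] + chi[p])
        x = x.copy()
        x[p] = xp
        j += 1
    sweeps.append(j)
    mu = mu - beta * (A @ x - b)

# the number of inner sweeps per outer iteration stays bounded (cf. Theorem 17)
print(max(sweeps), np.linalg.norm(A @ x - b) < 1e-2)
```

Thanks to the warm start x̃^{k+1,0} = x^k, the initial inner residual shrinks at roughly the same geometric rate as the forcing sequence η^k = R^{k+1}, so the number of sweeps needed per outer iteration does not grow with k.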

Interpreting (random)ADMM as an iALM
Given a block partition of x, i.e., x = [x_1^T, …, x_n^T]^T with x_i ∈ R^{d_i}, the n-block ADMM (see [11] and references therein) is defined as in (36). If we apply the iterative method (36) to problem (QP), splitting H_β as H_β = D − L − L^T (D block diagonal, L strictly block lower triangular), it is possible to rewrite (36) in the compact form (37) (see [12,53]). Since Eq. (37) can be written alternatively as (38), we can observe that the first equation in (38) is precisely one step of the SOR method with ω = 1 (see Eq. (23)), i.e., ADMM performs exactly one GS iteration for the solution of the linear system H_β x = A^T μ_k + β A^T b − g.

Let us point out that in [11] it has been proved that the n-block extension of ADMM is not always convergent, since there exist examples where the spectral radius of G_ADMM in Eq. (37) satisfies ρ(G_ADMM) > 1. The analysis performed in Sects. 4 and 5 reveals a simple strategy to remedy this: performing enough GS iterations to fulfil the requirements on the residuals ensures convergence. Indeed, as proved in Sect. 5 (deterministic case), a constant number of SOR iterations per iALM step is sufficient to guarantee that the produced residuals satisfy the sufficient conditions for convergence. To further underpin this claim, in Fig. 5 we report the behaviour of d_k, Ax_k − b and Hx_k + g − A^T μ_k for ADMM and for iALM&GS where, at each inner iteration, 10 GS sweeps are performed. For the particular case of Problem 2 when β = 1 and all the blocks have size one, we have ρ(G_ADMM) = 1.0148 > 1 and ADMM is not convergent (see the upper panel in Fig. 5). On the contrary, performing more than one GS sweep (lower panel of Fig. 5) is enough to observe a convergent behaviour of all the residuals.

Exactly the same observation can be made for the RADMM [12,53]: this method is obtained by considering a block permutation matrix P_k, which selects the order in which the block equations are solved, and then splitting the matrix P_k H_β P_k^T as in (39) (the random permutation matrix is selected independently of the iterate x_k and uniformly at random among all possible block-permutation matrices). In more detail, if we consider the iterative method (40) to solve problem (QP) using the splitting (39), we can write (40) in the fixed-point form (41) and hence as (42). The first equation in (42) coincides exactly with one iteration of the RSSOR with ω = 1 (see Eq. (26)) for the solution of the linear system H_β x = A^T μ_k + β A^T b − g. On the other hand, as proved in Theorem 17, the number of RSSOR sweeps per iALM step sufficient to obtain an expected residual which ensures the a.s. convergence is uniformly bounded above by a constant. We find that this is a noteworthy improvement over the results obtained in [12,44,53]. Indeed, in those works only the convergence in expectation of the iterates produced by (42) has been proved, i.e., the convergence of the expected iterates. To be precise, using the notation introduced in (41), the authors prove that ρ_β := ρ(G_β) < 1, where G_β is the expectation of G_β^P with P drawn uniformly at random from a specific subset P of all permutation matrices (P is the subset of block permutation matrices with blocks of order n in [12,53] and, in [44], P is the subset of the permutation matrices obtained as P = P_1 P_2, where P_1 is a block permutation matrix with blocks of order n and P_2 is a permutation corresponding to a partition of d elements into n groups). Overall, as already pointed out in [44, Sec. 2.2.4], the convergence in expectation may not be a good indicator of the robustness and effectiveness of RADMM, as there may exist problems characterized by a high variance V(G_β^P) of the random variable G_β^P. We find that switching from convergence in expectation to a.s. convergence with a provable expected worst-case complexity, as stated in Theorem 17, could be beneficial for the solution of such problems.
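To make the fixed-point interpretation concrete, the following sketch (in Python/NumPy) assembles the iteration matrix of the one-sweep scheme and of its J-sweep variant, in which the x-update is replaced by J GS sweeps on H_β x = A^T μ_k + β A^T b − g. The small SPD H and full-rank A below are placeholders chosen purely for illustration, not the paper's Problem 2 data.

```python
import numpy as np

def ialm_gs_matrix(H, A, beta, J):
    """Iteration matrix of iALM with J Gauss-Seidel sweeps per outer step.

    J = 1 recovers the multi-block ADMM fixed-point form (blocks of size one):
        x^{k+1}  = (D - L)^{-1} (L^T x^k + A^T mu^k + beta A^T b - g)
        mu^{k+1} = mu^k - beta (A x^{k+1} - b)
    """
    d, m = H.shape[0], A.shape[0]
    Hb = H + beta * A.T @ A                    # augmented Hessian H_beta
    M = np.tril(Hb)                            # D - L: lower-triangular part
    U = np.triu(Hb, 1)                         # -L^T: strict upper part
    T = np.linalg.solve(M, -U)                 # GS iteration matrix M^{-1} L^T
    TJ = np.linalg.matrix_power(T, J)          # effect of J sweeps on the error
    S = (np.eye(d) - TJ) @ np.linalg.solve(Hb, A.T)
    # block iteration matrix acting on [x; mu]
    return np.block([[TJ,             S],
                     [-beta * A @ TJ, np.eye(m) - beta * A @ S]])

def rho(G):
    """Spectral radius."""
    return max(abs(np.linalg.eigvals(G)))

H = 0.05 * np.eye(3)                                       # placeholder SPD Hessian
A = np.array([[1., 1., 1.], [1., 1., 2.], [1., 2., 2.]])   # placeholder full-rank A
print(rho(ialm_gs_matrix(H, A, 1.0, 1)))    # one sweep: plain 3-block ADMM
print(rho(ialm_gs_matrix(H, A, 1.0, 50)))   # many sweeps: close to exact ALM
```

As J grows, T^J vanishes (GS converges on SPD systems) and the matrix approaches that of the exact ALM, whose spectral radius is below one for any β > 0.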
Even in this case, to further underpin the previous claim, we report in Fig. 6 the behaviour of d_k, Ax_k − b and Hx_k + g − A^T μ_k for RADMM and for iALM&RSGS where, at each inner iteration, 10 RSGS sweeps are performed. As is clear from the comparison between the upper panels of Figs. 5 and 6 (and expected from the results obtained in [12,53]), the introduction of a randomization procedure in the ADMM scheme mitigates the divergence in the case of Problem 2. At the same time, analogously to what was observed in Fig. 5 for the deterministic case, the benefits of performing more than one RSGS sweep per iALM step are evident (lower panel of Fig. 6).
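A randomly shuffled GS (RSGS) sweep of the kind performed at each inner iteration can be sketched as follows: each sweep draws a fresh uniformly random permutation of the (size-one) blocks and performs one GS pass in that order (ω = 1). The system matrix and right-hand side below are arbitrary placeholders, not taken from the paper.

```python
import numpy as np

def rsgs_sweep(Hb, x, rhs, rng):
    """One randomly shuffled Gauss-Seidel sweep for Hb x = rhs.

    Each coordinate is updated exactly once, in a freshly drawn
    uniformly random order (size-one blocks, omega = 1)."""
    for i in rng.permutation(len(x)):
        # exact minimization of the quadratic along coordinate i
        x[i] = (rhs[i] - Hb[i] @ x + Hb[i, i] * x[i]) / Hb[i, i]
    return x

rng = np.random.default_rng(0)
Hb = np.array([[4., 1., 1.], [1., 3., 1.], [1., 1., 5.]])  # placeholder SPD matrix
rhs = np.array([1., 2., 3.])
x = np.zeros(3)
for _ in range(60):
    x = rsgs_sweep(Hb, x, rhs, rng)
```

Since each coordinate update exactly minimizes the underlying quadratic along that coordinate, every full sweep decreases the energy regardless of the drawn order, which is why the shuffled sweeps still converge on SPD systems.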

Numerical results
In this section we briefly present a series of numerical results aiming to guide practitioners in the selection of some of the parameters involved in an effective implementation of iALM&RSGS. As test set we introduce and use suitable generalizations of Problem 2. The choice of this test set is motivated by the fact that it represents, in some sense, a set of pathological examples for which the convergence in expectation might not deliver satisfactory performance. To define our test set, let us introduce the matrix A (parametrized by a constant c). We consider problem (QP) where H = h I_{d×d} ∈ R^{d×d} with h = 0.05 and b, g random vectors. Clearly, when m = d = 3 and c = 1, we recover Problem 2. In Fig. 7 we plot ρ(G_ADMM) for different values of c (x-axis of the figure) and m, when d = 1000, β = 1 and all the blocks are of order one (which will be precisely the setting used for the numerical results presented later). As Fig. 7 confirms, for all the considered values of c and m we have ρ(G_ADMM) > 1, a feature which endows the selected class of problems with meaningful pathologies, suitable for testing the quality and robustness of our proposal when compared to RADMM.
As is clear from Eq. (42), the dominant computational cost per RADMM step is the solution of a block-lower triangular system performed during the GS sweep. For this reason, in the remainder of this section, we fix a prescribed number of GS sweeps and measure the performance of iALM&RSGS when compared to RADMM.
In Figs. 8 and 9 we report Ax_k − b and Hx_k + g − A^T μ_k for RADMM and for iALM&RSGS when the maximum allowed number of RSGS sweeps is 5000. For the sake of brevity, we present computational results only for selected representative values of m and c (m = 400 and c = 1, 100), but a similar behaviour is observed for all the values of m and c considered in Fig. 7. In particular, in Fig. 8 we present the comparison of RADMM (denoted by J = 1) with iALM&RSGS when J > 1 RSGS sweeps are performed per iALM iteration (J = 5, 25). In Fig. 9, instead, we compare RADMM with iALM&RSGS when the RSGS for the linear system (18) is stopped as soon as the observed residual satisfies r_k < P R^{k+1} (see Lemma 10), with R = 0.999 and P a constant depending on the initial residual r_0. In accordance with the theoretical analysis carried out in Sect. 6, the numerical results presented in Figs. 8 and 9 confirm that performing more than one RSGS sweep per iALM iteration consistently outperforms RADMM in the reduction of the dual residual Hx_k + g − A^T μ_k. Moreover, as the comparison between Figs. 8 and 9 shows, the first strategy (a fixed number of RSGS sweeps per iALM iteration) is preferable in general, since it yields a faster primal/dual residual reduction.
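The second stopping strategy can be sketched as an inner loop that keeps sweeping until the inner residual drops below the geometric tolerance P R^{k+1}. This is a schematic illustration only: the system matrix, P and R below are placeholders (the paper's experiments use R = 0.999 and P tied to r_0), and, for brevity, the right-hand side is kept fixed, whereas in iALM it is updated with the multiplier at every outer step.

```python
import numpy as np

def inner_solve(Hb, rhs, x0, tol, max_sweeps, rng):
    """Randomly shuffled GS sweeps until ||Hb x - rhs|| < tol."""
    x = x0.copy()
    for sweeps in range(1, max_sweeps + 1):
        for i in rng.permutation(len(x)):
            x[i] = (rhs[i] - Hb[i] @ x + Hb[i, i] * x[i]) / Hb[i, i]
        if np.linalg.norm(Hb @ x - rhs) < tol:
            break
    return x, sweeps

Hb = np.array([[4., 1., 1.], [1., 3., 1.], [1., 1., 5.]])  # placeholder SPD matrix
rhs = np.array([1., 2., 3.])
rng = np.random.default_rng(0)
P = np.linalg.norm(rhs)          # crude proxy for the initial residual r_0
R = 0.9                          # placeholder; smaller than the paper's R = 0.999
x = np.zeros(3)
for k in range(5):
    # inner tolerance shrinks geometrically with the outer counter k
    x, used = inner_solve(Hb, rhs, x, P * R ** (k + 1), 200, rng)
```

The warm start (each inner solve resumes from the previous x) is what keeps the number of sweeps per outer iteration bounded even as the tolerance tightens.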

Conclusions
In this work we studied the inexact Augmented Lagrangian Method (iALM) for the solution of problem (QP). Using a splitting-operator perspective, we proved that if the amount of introduced inexactness (which can also be modelled as a random variable) decreases (in expectation) according to a suitably chosen R^k with R < 1, then we are able to give an explicit asymptotic rate of convergence for the iALM (see Lemma 10). Moreover, even if the above-mentioned condition requires an increasing accuracy in the linear systems to be solved at each iteration, we proved that when these linear systems are solved using the Successive Over-Relaxation method (SOR) or its Randomly Shuffled version (RSSOR), the number of iterations sufficient to satisfy the convergence requirements can be uniformly bounded from above (see Sect. 5). Finally, using the developed theory and interpreting the n-block (Random) Alternating Direction Method of Multipliers ((R)ADMM) as an iALM which performs exactly one (RS)SOR sweep to obtain the approximate solutions of the inner linear systems, we provided computational evidence that the well-known convergence issues of the n-block (R)ADMM can be remedied if more than one (RS)SOR sweep per iALM iteration is permitted (see Sect. 7).

Fig. 2 Behaviour of the quantities analysed in Lemma 8 (logarithmic scale on y-axis)

Fig. 4 Upper panels: Behaviour of the quantities analysed in Lemma 10 (logarithmic scale on y-axis), approximated using the empirical mean over 15 simulations of iALM. Lower panels: Box-plots of the j(k)'s (see Eq. (34)) obtained in each simulation of iALM when RSSOR is used for the solution of (18)

Fig. 7 Spectral radius of ADMM matrices for the generalization of Problem 2 (logarithmic scale on y-axis)


Fig. 8 Random ADMM vs iALM&RSGS(J) for selected values of m and c (logarithmic scale on both axes)

Fig. 9 RADMM vs iALM&RSGS for selected values of m and c (logarithmic scale on both axes)