Managing randomization in the multi-block alternating direction method of multipliers for quadratic optimization

The Alternating Direction Method of Multipliers (ADMM) has gained a lot of attention for solving large-scale and objective-separable constrained optimization. However, the two-block variable structure of the ADMM still limits the practical computational efficiency of the method, because one big matrix factorization is needed at least once even for linear and convex quadratic programming. This drawback may be overcome by enforcing a multi-block structure of the decision variables in the original optimization problem. Unfortunately, the multi-block ADMM, with more than two blocks, is not guaranteed to be convergent. On the other hand, two positive developments have been made: first, if in each cyclic loop one randomly permutes the updating order of the multiple blocks, then the method converges in expectation for solving any system of linear equations with any number of blocks. Secondly, such a randomly permuted ADMM also works for equality-constrained convex quadratic programming even when the objective function is not separable. The goal of this paper is twofold. First, we add more randomness into the ADMM by developing a randomly assembled cyclic ADMM (RAC-ADMM) where the decision variables in each block are randomly assembled. We discuss the theoretical properties of RAC-ADMM and show when random assembling helps and when it hurts, and develop a criterion to guarantee that it converges almost surely. Secondly, using the theoretical guidance on RAC-ADMM, we conduct multiple numerical tests on solving both randomly generated and large-scale benchmark quadratic optimization problems, which include continuous, and binary graph-partition and quadratic assignment, and selected machine learning problems. Our numerical tests show that the RAC-ADMM, with a variable-grouping strategy, could significantly improve the computation efficiency on solving most quadratic optimization problems.


Introduction
In this paper we consider the linearly constrained convex minimization model with an objective function that is the sum of multiple separable functions and a coupled quadratic function: min where H ∈ R n×n is a symmetric positive semidefinite matrix, vector c ∈ R n and the problem parameters are the matrix A = [A 1 , . . . , A p ], A i ∈ R m×d i , i = 1, 2, . . . , p with p i=1 d i = n and the vector b ∈ R m . The constraint set X is the Cartesian product of possibly non-convex real, closed, nonempty sets, X = X 1 × · · · × X p , where x i ∈ X i ⊆ R d i .
Problem (1) naturally arises from applications such as machine and statistical learning, image processing, portfolio management, tensor decomposition, matrix completion or decomposition, manifold optimization, data clustering and many other problems of practical importance. To solve problem (1), we consider in particular a randomly assembled multi-block and cyclic alternating direction method of multipliers (RAC-ADMM), a novel algorithm with which we hope to mitigate the problem of slow convergence and divergence issues of the classical alternating direction method of multipliers (ADMM) when applied to problems with cross-block coupled variables.
ADMM was originally proposed in 1970's [31,32] and after a long period without too much attention it has recently gained in popularity for a broad spectrum of applications [28,41,44,57,67]. Problems successfully solved by ADMM range from classical linear programming (LP), semidefinite programming (SDP) and quadratically constrained quadratic programming (QCQP) applied to partial differential equations, mechanics, image processing, statistical learning, computer vision and similar problems (for examples see [10,39,45,53,58,70]) to emerging areas such as deep learning [71], medical treatment [81] and social networking [2]. ADMM is shown to be a good choice for problems where high accuracy is not a requirement but a "good enough" solution is needed to be found fast.
Cyclic multi-block ADMM (C-ADMM) is an iterative algorithm that embeds a Gaussian-Seidel decomposition into each iteration of the augmented Lagrangian method (ALM) [36,59]. It consists of a cyclic update of the blocks of primal variables, x i ∈ X i , x = (x 1 , . . . , x p ), and a dual ascent type update of the variable y ∈ R m , i.e., . , x k p ; y k )}, . . .
Another variant of multi-block ADMM was suggested in [5], where the authors introduce the distributed multi-block ADMM (D-ADMM) for separable problems. The method creates a Dantzig-Wolfe-Benders decomposition structure and sequentially solves a "master" problem followed by solving distributed multi-block "slave" problems. It converts the multi-block problem into an equivalent two-block problem via variable splitting [6] and performs a separate augmented Lagrangian minimization over x i . The method assumes that the objective function is separable across blocks, f (x) = i f i (x i ) + c T x, and is not provably working for solving problems with non-separable objective functions.
(4) Because of the variable splitting, the distributed ADMM approach based on (4) increases the number of variables and constraints in the problem, which in turn makes the algorithm not very efficient for large p in practice.
The classical two-block ADMM (Eq. 2 with p = 2) and its convergence have been extensively studied in the literature (e.g. [20,22,31,35,54]. However, the two-block variable structure of the ADMM still limits the practical computational efficiency of the method, because one factorization of a large matrix is needed at least once even for linear and convex quadratic programming (e.g., [45,65]). This drawback may be overcome by enforcing a multi-block structure of the decision variables in the original optimization problem. Indeed, due to the simplicity and practical implications of a direct extension of ADMM to the multi-block variant (2), an active research recently has been going on in developing ADMM variants with provable convergence and competitive numerical efficiency and iteration simplicity (e.g. [17,35,37,58]), and on proving global convergence under some special conditions (e.g. [13,24,46,47]). Unfortunately, in general the Cyclic multi-block ADMM, with more than two blocks, is not guaranteed to be convergent even for solving a single system of linear equations, which settled a long-standing open question [15].
Moreover, in contrast to the work on separable convex problems, little work has been done on understanding properties of the multi-block ADMM for (1) with a nonseparable convex quadratic or even non-convex objective function. One of the rare works that addresses coupled objectives is [17] where authors describe convergence properties for non-separable convex minimization problems. A good description of the difficulties of obtaining a rigorous proof is given in [23]. For solving non-convex problems, a rigorous analysis of ADMM is by itself a very hard problem, with only a couple of works being done for generalized, but still limited (by an objective function), separable problems. For examples see [38,40,76,77,82].
Randomization is commonly used to reduce information and computation complexity for solving large-scale optimization problems. Typical examples include Q-Learning or Reinforced Learning, Stochastic Gradient Descent (SGD) for Deep Learning, Randomized Block-Coordinate-Descent (BCD) for convex programming, and so on. Randomization of ADMM has recently become a matter of interest as well. In [68] the authors devised randomly permuted multi-block ADMM (RP-ADMM) algorithm, in which on every cyclic loop the blocks are solved or updated in a randomly permuted order. Surprisingly the algorithm eliminated the divergence example constructed in [15], and RP-ADMM was shown to converge linearly in expectation for solving any square system of linear equations with any number of blocks. Subsequently, in [17] the authors focused on solving the linearly constrained convex optimization with coupled convex quadratic objective, and proved the convergence in expectation of RP-ADMM for the non separable multi-block convex quadratic programming, which is a much broader class of computational problems.
x k+1 σ p = arg min . , x σ p , y k )}, The main goal of the work proposed in this paper is twofold. First, we add more randomness into the ADMM by developing a randomly assembled cyclic ADMM (RAC-ADMM) where the decision variables in each block are randomly assembled. In contrast to RP-ADMM in which the variables in each block are fixed and unchanged, RAC-ADMM randomly assembles new blocks at each cyclic loop. It can be viewed as a decomposition-coordination procedure that decomposes the problem in a random fashion and combines the solutions to small local sub-problems to find the solution to the original large-scale problem. RAC-ADMM, in-line with RP-ADMM, admits multiple blocks with possibly cross-block coupled variables and updates the blocks in the cyclic order. The idea of re-constructing block variables at each cyclic loop was first mentioned in [51], where the authors present a framework for solving discrete optimization problems which decomposes a problem into sub-problems by randomly (without replacement) grouping variables into subsets. Each subset is then used to construct a sub-problem by considering variables outside the subset as fixed, and the sub-problems are then solved in a cyclic fashion. Subsets are constructed once per iteration. The algorithm presented in that paper is a variant of the block coordinate descent (BCD) method with an addition of methodology to handle a small number of special constraints, which can be seen as a special case of RAC-ADMM. In the current paper we discuss the theoretical properties of RAC-ADMM and show when the additional random assembling helps and when it hurts.
Secondly, using the theoretical guidance on RAC-ADMM, we conduct multiple numerical tests on solving both randomly generated and benchmark quadratic optimization problems, which include continuous, and binary graph-partitioning and quadratic assignment problems, and selected machine learning problems such as linear regression, LASSO, elastic-net, and support vector machine. Our numerical tests show that RAC-ADMM, with a systematic variable-grouping strategy (designate a set of variables always belonging to a same block), could significantly improve the computation efficiency on solving most quadratic optimization problems.
The current paper is organized as follows. In the next section we present RAC-ADMM algorithm and present theoretical results with respect to convergence. Next we discuss the notion of special grouping, thus selecting variables in less-random fashion by analyzing a problem structure, and the use of partial Lagrangian, approaches, which improve convergence speed of the algorithm. In Sect. 3, we present a solver, RACQP, we built that uses RAC-ADMM to address linearly constrained quadratic problems. The solver is implemented in Matlab [50] and the source code available online [61]. The solver's performance is investigated in Sect. 4, where we compare RACQP with commercial solvers, Gurobi [34] and Mosek [55], and the academic OSQP which is a ADMM-based solver developed by [65]. We also consider machine learning problems and compare our general purpose solver with tailored heuristic solutions, Glmnet [30,64] and LIBSVM [14]. The summary of our contributions with concluding remarks is given in Sect. 5.

RAC-ADMM
In this section we describe our randomly assembled cyclic alternating direction method of multipliers (RAC-ADMM). We start by presenting the algorithm, then analyze its convergence for linearly constrained quadratic problems, and finalize the section by introducing accelerated procedures that improve the convergence speed of RAC-ADMM by means of a grouping strategy of highly coupled variables and a partial Lagrangian approach. Note that although our analysis of convergence is restricted to quadratic and/or special classes of problems, it serves as a good indicator of the convergence of the algorithm in more general case.

The algorithm
RAC-ADMM is an algorithm that is applied to solve convex problems (1). The algorithm addresses equality and inequality constraints separately, with the latter converted into equalities using slack variables, s: where matrix A eq ∈ R m e ×n and vector b eq ∈ R m e describe equality constraints and matrix A ineq ∈ R m i ×n and the vector b ineq ∈ R m i describe inequality constraints.
Primal variables x ∈ X are in constraint set X ⊆ R n which is the Cartesian product of possibly non-convex real, closed, nonempty sets, and slack variables s ∈ R m i + . The augmented Lagrangian function used by RAC-ADMM is then defined by L β (x; s; y eq ; y ineq ) with dual variables y ∈ R m e and z ∈ R m i , and penalty parameter β > 0. In (6) we keep inequality and equality constraint matrices separate so to underline a separate slack variable update step of (8) which has a close form solution described in more details in Sect. 3. RAC-ADMM is an iterative algorithm that embeds a Gaussian-Seidel decomposition into each iteration of the augmented Lagrangian method (ALM). It consists of a cyclic update of randomly constructed blocks † of primal variables, x i ∈ X i , followed by the update of slack variables s and a dual ascent type update for Lagrange multipliers y eq and y ineq : Randomly (without replacement) assemble primal variables x † into p blocks, x i , i = 1, . . . , p, then solve : x k+1 1 = arg min . † Structure of a problem, if known, can be used to guide grouping as described in Sect. 2.3.1 (8) Randomly assembled cyclic alternating direction method of multipliers (RAC-ADMM), can be seen as a generalization of cyclic ADMM, i.e. cyclic multi-block ADMM is a special case of RAC-ADMM in which the blocks are constructed at each iteration using a deterministic rule and optimized following a fixed block order. Using the same analogy, RP-ADMM can be seen as a special case of RAC-ADMM, in which blocks are constructed using some predetermined rule and kept fixed at each iteration, but sub-problems (i.e. blocks minimizing primal variables) are solved in a random order.
The main advantage of RAC-ADMM over other multi-block ADMM variants is in its potential to significantly reduce primal and, especially, dual residuals, which is a common obstacle for applying multi-block ADMMs. Intuitively, switching variables between the blocks increases chances of finding descent directions which favor RAC-ADMM. The following example further explains such intuition.

Example 1 Consider the problem min
x,y,u,v Starting from the origin with two blocks, if we group (x, u) and (y, v) then we cannot minimize further. But if we group (x, y) and (u, v) then we can find the problem is unbounded from below. Thus, re-grouping the variables gives RAC-ADMM higher chances of finding (better) descent direction(s), which consequently leads to a better performance for dual residuals.
To illustrate this feature we ran a simple experiment in which we fix the number of iterations and check the final residuals among the aforementioned multi-block ADMM variants. In Table 1 we show performance of the ADMMs when solving a simple quadratic problem with a single constraint, represented by a regularized Markowitz min-variance problem (defined in Sect. 4.1.3). Figure 1 gives the insight in evolution of the both residuals with iterations. From the figure, it is noticeable that both D-ADMM (Eq. 4) and RP-ADMM (Eq. 5) suffer from a very slow convergence speed, with the main difference that the latter gives a slightly lower error on dual residual. Multi-block Cyclic-ADMM (Eq. 2) does not converge to a KKT point for any k, but oscillates around a very weak solution. RAC-ADMM converges to the KKT solution very quickly with both residual errors below 10 −8 in less than 40 iterations.

Convergence of RAC-ADMM
This section concerns with convergence properties of RAC-ADMM when applied to unbounded (i.e. x ∈ R n ) linearly-equality constrained quadratic optimization problems. To simplify the notation, we use A = A eq and b = b eq . min Convergence analysis of problems that include inequalities (bounds on variables and/or inequality constraints) is still an open question and will be addressed in our subsequent work.

Preliminaries
Double Randomness Interpretation Let R AC(n, p) denote all possible updating combinations for RAC with n variables and p blocks, and let σ R AC ∈ R AC(n, p) denote one specific updating combination for RAC-ADMM. Then the total number of updating combinations for RAC-ADMM is given by where s ∈ Z + denotes size of each block with p · s = n.
RAC-ADMM could be viewed as a double-randomness procedure based on RP-ADMM with different block compositions. Let σ R P ∈ R P( p) denote an updating combinations of RP-ADMM with p blocks where the variable composition in each block is fixed. Clearly, the total number of updating combinations for RP-ADMM is given by | R P( p) | = p! the total number of possible updating orders of the p blocks. Then, one may consider RAC-ADMM first randomly chooses a block composition and then applies RP-ADMM. Let υ i ∈ ϒ(n, p) denote one specific block composition or partition of n decision variables into p blocks, where ϒ(n, p) is the set of all possible block compositions. Then, the total number of all possible block compositions is given by For convenience, in what follows let R P( p),υ i denote all possible updating orders with a fixed block composition υ i . To further illustrate the relations of RP-ADMM and RAC-ADMM, consider the following simple example. Example 2 Let n = 6, p = 3, so | R P(6,3) | = 3! = 6, and the total number of block compositions or partitions is 15: RAC-ADMM could be viewed as if, at each cyclic loop, the algorithm first selects a block composition υ i uniformly random from all possible 15 block compositions ϒ(n, p), and then performs RP-ADMM with the chosen specific block composition υ i . In other words, RAC-ADMM then randomly selects σ ∈ R P( p),υ i , which leads to a total of 90 possible updating combinations.

RAC-ADMM as a linear transformation
Recall that the augmented Lagrangian function for (9) is given by Consider one specific update order generated by RAC, σ R AC ∈ R AC(n, p) . Note that we use σ instead σ R AC when there is no confusion. One possible update combination generated by RAC, σ = [σ 1 , . . . , σ p ], where σ i is an index vector of size s, is as follows, For convenience, we follow the notation in [17] and [68,69] to describe the iterative scheme of RAC-ADMM in a matrix form. Let L σ ∈ R n×n be s × s block matrix defined with respect to σ i rows and σ j columns as and let R σ be defined as By setting z := (x; y), RAC-ADMM could be viewed as a linear system mapping iteration Define the matrix Q by Notice that for any block structure υ i any update order within this fixed block structure and because L T σ = Lσ , matrix Q υ i is symmetric for all i, and Finally, the expected mapping matrix M is given by or, by direct computation, where S = H +β A T A.

Expected convergence of RAC-ADMM
With the preliminaries defined, we are now ready to show that RAC-ADMM converges in expectation under the following assumption:

Assumption 1 Assume that for any block of indices σ i that generated by RAC-ADMM
where σ i is the index vector describing indices of primal variables of the block i.
Theorem 2 Suppose that Assumption (1) holds, and that RAC-ADMM (8) is employed to solve problem (9). Then the expected output converges to some KKT point of (9).
Theorem 2 suggests that the expected output converges to some KKT point of (9). Such convergence in expectation criteria has been widely used in many randomized algorithms, including convergence analysis for RP-BCD and RP-ADMM (e.g. [16,68]), and stochastic quasi-newton methods (e.g. [12]). It is worth mentioning that if the optimization problem is strictly convex (H > 0), we are able to prove that the expected mapping matrix has spectrum that is strictly less than 1 (Corollary 1). Although convergence in expectation is widely used in many literature, it is still a relatively weak convergence criteria. This is why in Sect. 2.2.4 we propose a sufficient condition for almost surely convergence of RAC-ADMM. The section also provides an example showing a problem with ρ(M) < 1 which does not converge. Rather it oscillates almost surely (Example 3). To the best of our knowledge, this is the first example showing that even if a randomized optimization algorithm has expected spectrum radius strictly less than 1, the algorithm may still oscillate-to construct an example with expected spectrum radius equals to 1 that does not converge is an easy task. Consider for example a sequence {x t , t ≥ 0} with x t = −1 and x t = 1, chosen with equal probabilities (prob=1/2). Then, the sequence does not converge with probability 1. However, under the such example, the expected spectrum of this mapping procedure ρ(M) actually equals to 1, which implies that the sequence may not converge. Despite the fact that such example exists for RAC-ADMM, in all the numerical tests provided in Sect. 4, RAC-ADMM converges to the KKT point of the optimization problem under few iterations. Such strong numerical evidences imply that in practice, our algorithm does not require taking expectation over many iterations to converge.
The proof of Theorem 2 follows the proof structure of [17,68,69] to show that under Assumption 1: The proof builds on Theorem 2 from [17], which describes RP-ADMM convergence in expectation under specific conditions put on matrices H and A, and Weyl's inequality, which gives the upper bound on maximum eigenvalue and the lower bound on minimum eigenvalue of a sum of Hermitian matrices. Proofs for items (2) and (3) are identical to proofs given in [17,Section 3.2], so here the focus in on proving item (1). The following lemma completes the proof of expected convergence of RAC. To prove Lemma 1, we first show that for any block structure υ i , the following proposition holds: Proposition 1 Q υ i S is positive semi-definite and symmetric, and Intuitively, a different block structure of RAC-ADMM iteration could be viewed as relabeling variables and performing RP-ADMM procedure as described in [17].
Where e i is the row vector with ith element equal to 1. Notice P υ i is orthogonal matrix for any υ i , i.e. P υ i P T υ i = I . For any fixed block structure υ i , with an update order within σ R P ∈ R P ( p), the following equality holds where L σ R P ,S,υ i is the construction of L following update order σ R P ∈ R P ( p) and block structure υ i with respect to S, and L σ R P ,S,υ 1 is the construction of L following update order σ R P ∈ R P ( p) and block structure υ 1 , with coefficient matrixS, and Then by the definition of Q matrix (Eq. 12), we get Considering the eigenvalues of Q υ i ,S S, and from [17], under Assumption (1),Q υ 1 ,S is positive definite, and which implies Q υ i is positive definite, and .
Notice that by definition of Q, we have and since S is positive definite and symmetric, we could write S = B T B, so Let λ 1 (A) denote the maximum eigenvalue of matrix A, then as all B Q υ i B T are Hermitian matrices, by Weyl's theorem, we have and as λ 1 (Q υ i S) < 4 3 for each i, which completes the proof of Lemma 1, and thus establishes that RAC-ADMM is guaranteed to converge in expectation.
When the problem is strongly convex (H 0), we introduce the following corollary. Note that there are random sequences converging in expectation where their spectrumradius equal to one. Therefore, for solving strongly non-separable convex quadratic optimization, the expected convergence rate of RAC-ADMM is proved to be linear, which result is stronger than just "convergence in expectation".

Convergence speed of RAC-ADMM versus RP-ADMM
Following is a corollary to show that on average or in expectation, RAC-ADMM outperforms RP-ADMM with a fixed block composition in sense of spectral radius of mapping matrix.

Corollary 2
Under Assumption 1, with H = 0 so that S = β A T A, where A ∈ R n×n is a non-singular matrix, there exists some RP-ADMM (with specific block compositions), such that expected spectral radius of RAC-ADMM mapping matrix is (weakly) smaller than expected spectral radius of that of RP-ADMM.
Proof We prove the corollary in solving linear system with A non singular, with null objective function. In this setup, the expected output converges to the unique primal dual optimal solution to (9).
Notice in this setup, we have By calculation, we could characterize λ as roots of quadratic polynomial [69], for all possible block structure. Define τ υ i as the the smallest eigenvalue with respect to Q υ i S, andτ υ i as the largest eigenvalue with respect to Q υ i S. Similarly, τ as the smallest eigenvalue with respect to Q S, andτ the largest eigenvalue of Q S. Consider the following two cases: x is monotone decreasing with respect to x, the above implies that and as τ λ * ≥ τ , the above equation implies which is impossible, as by Weyl's theorem, We Specifically, which is impossible, as by Weyl's theorem,

Variance of RAC-ADMM
Convergence in expectation may not be a good indicator of convergence for solving all problems, as there may exist a problem for which RAC-ADMM is not stable or possesses greater variance. In order to give another probabilistic measure on performance of RAC-ADMM, this section introduces convergence almost surely (a.s.) as an indicator of the algorithm convergence. Convergence almost surely as a measure for stability has been used in linear control systems for quite some time, and is based on the mean-square stability criterion for stochastically varying systems [19]. The criterion establishes conditions for asymptotic convergence of covariance of the system states (e.g. variables).
This section builds on those results and establishes sufficient condition for RAC-ADMM to converge almost surely when applied to solve (9). The condition utilizes the Kronecker product of the mapping matrix, which captures the dynamics of the second moments of the random sequences generated by RAC-ADMM algorithm, and the expectation over the products of mapping matrices that provides the bounds on the variance of the distance between the KKT point and the random sequence generated by our algorithm.

Theorem 3
Suppose that Assumption 1 holds, and that RAC-ADMM (8) is employed to solve problem (9). Then the output of RAC-ADMM converges almost surely to some KKT point of (9) if Proof Let z = [x; y] ∈ R N denote the KKT point of (9), then, at k + 1th iteration we have Define d k = z k −z, and There exists a linear operator T s.t.
which then implies that randomized ADMM converges almost surely.
To illustrate the stability issues with RAC-ADMM, consider the following example.
(a) (b) Fig. 2 Effect of variance on convergence for problem (15). Evaluation of x k 1 (optimal x * 1 = 0) Example 3 Consider the following problem  Fig. 2, showing that convergence in expectation may not be a sufficient indicator from this particular example.
Indeed, if we apply Theorem 3, we find out that RAC-ADMM does not converge almost surely, while RP-ADMM does for solving this example: what explains the results shown in Fig. 2. In fact, RP-ADMM converges almost surely for all 15 block compositions of this example.

Variance reduction in RAC-ADMM
The previous section described sufficient condition for the almost sure convergence of RAC-ADMM algorithm. This section address controlability of the algorithm. More precisely we ask, given a linearly constrained quadratic problem (LCQP) (Eq. 9), what means do we have at our disposal to control convergence of a LCQP-how to bound the covariance and how to improve the convergence rate.

Detecting and utilizing a structure in LCQP
Although some problem types inherit a known structure (e.g. network-flow problems), in general, the structure is not known. There are many sophisticated techniques used to detect a structure of a matrix one can use and apply towards improving performance of RAC-ADMM. Although such elaborate methods have a potential of detecting hidden structure of Hessian and Jacobian matrices almost perfectly, using them or developing our own is beyond the scope of this paper. Instead, we adopt a simple matrix partitioning approach outlined in [27].
In general, for RAC-ADMM we are interested in a structure of a constraint matrix, which can be detected using the following simple approach. Given a constraint matrix A (describing equalities, inequalities or both), a desirable structure such as one shown in (16) can be derived by applying a graph partitioning method.
The outline of the process is as follows: 1. Build a graph representation of matrix A: Each row i and column j is a vertex; vertices are connected with edges if a i, j = 0. 2. Partition the graph using a graph partitioning algorithm or solver, for example [43]. 3. Recreate A as a block matrix from the graph partitions.
Using graph partitioning as a procedure for decomposing a problem in a way that maximizes the modularity has been studied for a while. Although the problem itself is an NP-hard integer program, many efficient algorithms have been proposed that achieve near-optimal performance with good computational scalability [1,7,56]. In addition, successful implementations such as those used by GCG (generic solver for mixed integer programs, part of SCIP Optimization Suite [63]) exist. Currently our solver RACQP, described in Sect. 3, does not implement an automatic routine for modularity detection, but we plan to add it.

Smart grouping
Smart-grouping is a pre-processing method in which we use block structure of constraint matrix A to pre-group certain variables as a single "super-variable" (a group of variables which always stay together in one block). Following the block structure shown in (16), we make one super-variablex i for each group x i , i = 1, . . . , v. Primal variables x v+1 stay shared and are randomly assigned to sub-problems to complement super-variables to which they are coupled with via block-matrices W i , i = 1, . . . , v. More than one super-variable can be assigned to a single sub-problem, dependent upon the maximum size of a sub-problem, if defined. Note that matrix partitioning based on H + A T A may result in a better grouping, but is unpractical and thus not considered as a viable approach.

Partial Lagrangian
The idea of smart-grouping described in the previous section can be further extended by the means of the partial Lagrangian approach. Consider a LCQP (6) having the constraint matrix A structure as shown in (16). Now consider the scenario in which we split the matrix A such that the block W = [W 1 , . . . , W v+1 ] is admitted by the augmented Lagrangian while the rest of the constraints (blocks V i ) are solved exactly as a part of a sub-problem, i.e. a sub-problem i is solved as where J is a set of indices of super-variablesx j constituting sub-problem i at any given iteration. The partial augmented Lagrangian is defined with There are two advantages of the partial Lagrangian approach. First, the rank of the constraint matrix used for the global constraints (matrix W) is lower than the rank of A, and the empirical results (Sect. 4) suggest a strong correlation between a rank of a matrix and the stability of the algorithm and its rate of convergence. Next, local constraints (matrices V i ) imply there is a feasibility region in which x i exist, and that region may not be infinite. In other words, even when the variables themselves are unbounded (i.e. x ∈ R n ), local constraints may put implicit bounds on maximum variation of values of x i .
Empirical results of the partial Lagrangian applied to mixed integer problems (Sect. 4.2) show the approach to be very useful. In such a scenario, local constraints are sets of rules that relate integer variables, while constraints between continuous variables are left global. In the case of a problems where such straight separation does not exist, or when problems are purely integer, a problem structure is let to guide the local/global constraints decision.
Although shown to be useful, the partial Lagrangian method suffers form being a mostly heuristic approach that depends on quality of solution methods applied to subproblems-in the case of continuous problems, a simple barrier based methodology can be applied, but for the mixed integers problems (MIP), sub-problems require a more complex solution (e.g. an external MIP solver).

Example 4
To illustrate the usefulness of the smart grouping and partial Lagrange approaches, consider the following experiments done on selected instances taken from the Mittelmann LP test set [73] augmented with a diagonal quadratic objective to form a standard LCQP (9)).
For each instance, a constraint matrix (A eq , A ineq or A = [A eq ; A ineq ]) was subjected to graph-partitioning procedure outlined in Sect. 2.3.1, and then solved using the smart grouping ("s_grp") and the partial Lagrangian approach ("partial_L"). Table 2 reports on the number of iterations required by RAC-ADMM algorithm to find a solution satisfying the primal/dual residual tolerance of = 10 −4 . If the solution was not found the reason is noted ("time limit" for exceeding sub-problem maximum runtime and "iter. limit" for exceeding maximum number of iterations). Fields showing "divergence" or "oscillation" mark experiments for which RAC-ADMM algorithm experienced an unstable behavior. The baseline for the comparison is the default approach (sub-problems created at random) shown in column "Default RAC".
The partial Lagrangian approach has a potential to help stability and rate of convergence. However, before generalizing, one needs to consider the following: stability (i.e. convergence) of RAC-ADMM algorithm, is a function, among other factors, of mapping operators (matrices M σ , Eq. 11) which are in turn functions, among other factors, of the constraint matrix of a problem being solved. In the case of partial Lagrangian methodology, this matrix is the matrix W, meaning that if W produces an unstable system (e.g. conditions set by Theorem 3 not met), no implicit bounding can help to stabilize it. At the same time, a "correct" W can stabilize a problem which is unstable in its original form. Consider a problem that is not convergent in its original formulation. Amending A by moving "bad" rows to sub-problems, thus constructing W that produces mapping matrices satisfying ρ(E[M σ ⊗ M σ ]) < 1 makes such a problem stable. Theoretical work on structures of W and conditions that stabilize and improve RAC-ADMM are works in progress.
Using smart grouping alone does not make RAC-ADMM unstable, but in some cases increases the number of iterations needed to satisfy feasibility tolerance, a consequence of having less randomness as described by Corollary 2.

RAC-ADMM quadratic programming solver
In this section we outline the implementation of the RAC-ADMM algorithm for linearly constrained quadratic problems as defined below: where symmetric positive semidefinite matrix H ∈ R n×n and vector c ∈ R n define the quadratic objective while matrix A eq ∈ R m×n and the vector b eq ∈ R m describe equality constraints and matrix A ineq ∈ R s×n and the vector b ineq ∈ R s describe inequality constraints. Primal variables x ∈ X , can be integer or continuous, thus the constraint set X is the Cartesian product of nonempty sets X i ⊆ R or X i ⊆ Z,  Introducing auxiliary variables s andx, results in the following equivalent of (17): where the augmented Lagrangian, L β (x; s; y eq ; y ineq ; z), is given as RAC-ADMM, or simply RAC, quadratic programming (RACQP) solver admits continuous, binary and mixed integer problems. Algorithm 1 outlines the solver: the solution vector is initialized to −∞ at the beginning of the algorithm, and the main RAC-ADMM loop described (lines 2-24). The main loop calls different procedures to optimize blocks of x (lines 4-16), followed by updates of slack and then dual variables.
Types of the block optimizing procedure being called to update the blocks depend on the structure of the problem being solved. The default, multi-block implementation for continuous problems is based on the Cholesky factorization, with a specialized one-block variant for very sparse problems that solves the iterates using the LDL factorization. Continuous problems that exhibit a structure (see Sect. 2.3.1) can be addressed using the partial Lagrangian approach. In such a case, sub-problems are solved using either a simple interior point method based methodology, or, when sub-problems include only equality constraints, by employing Cholesky for solving KKT conditions. In addition to the aforementioned methods, the solver supports calls to external solver(s) and specialized heuristic solution to handle hard sub-problem instances.
Binary and mixed integer problems require specialized optimization techniques (e.g. branch-and-bound), that require implementations which are beyond the scope of this paper, so we have decided to delegate optimizing of the blocks with mixed variables to an external solver. Mixed integer problems are addressed by using the partial Lagrangian to solve for primal variables and a simple procedure that helps to escape local optima, as described by Algorithm 2.
Note that Algorithms given in this section are pseudo-algorithms which describe functionality of the solver rather than actual implementation. The implementation can be downloaded from [61].

Algorithm 1 RACQP
Require: Problem model (Eq. 18), run-time parameters † with termination criteria † † Ensure: The optimal solution x * or the best solution found before termination criteria met for all vectors ω i ∈ of block indices do Solve x ω i blocks 5: Prepare Q ω i ,ω i , q ω i andq following Eqs. (20) and (22) 6: x if (sub-problem is mixed integer) then 8: An external solver 9: else if (partial Lagrangian (26) used and sub-problem includes inequalities) then 10: Interior point method based procedure or an external solver 11: else if (partial Lagrangian used) then 12: Cholesky factorization and back substitution solving KKT conditions 13: else 14: Cholesky factorization and back substitution 15: end if 16: end for 17: Update slack variables 18: if (bounds on x not addressed by the partial Lagrangian) then Update dual variables for split variables 21: Update dual variables for equality constraints 23: Update dual variables for inequality constraints 24: end while 25: return x † Number of groups p, penalty parameter β, initial point x 0 , pre-grouped vars set V. † † Termination criteria may include maximum run-time, number of attempts to find a better solution, solution quality and so on.

Solving continuous problems
For the continuous QP problems, we consider (18) where X are possible simple lower and upper bounds on each individual variable: Continuous problems are solved as described by Algorithm 1, which repeats three steps until termination criteria is met: first update or optimize primal variables x in the RAC fashion, then updatex and s in close forms and finally update dual variables y eq , y ineq and z.
Step 1 Update primal variables x Let ω i ∈ be a vector of indices of a block i, i = 1, . . . , p, where p is the number of blocks. The set of vectors is randomly generated (with smart grouping when applicable as described in Sect. 2.3.2) at each iteration of the Algorithm 1 (lines . Let x ω i be a sub-vector of x constructed of components of x with indices ω i , and let x −ω i be the sub-vector of x with indices not chosen by ω i . Algorithm 1 uses either Cholesky factorization or partial Lagrangian to solve each block of variables x ω i while holding x −ω i fixed. By rewriting (19) to reflect the sub-vectors, we get using Cholesky factorization and back substitution. The linear term resulting from Q, q, is given asq A square sub-matrix H ω i ,ω i and column sub-matrix A ω i are constructed by extracting ω i rows and columns from H and A respectively. When p = 1, i.e. we are solving a problem using a single-block approach, then we solve the block utilizing LDL factorization to avoid calculating A T A. Although the factorization can be relatively expensive if the problem size is large as we then factorize a large matrix, the factorization is done only once and re-used in each iteration of the algorithm. From (19), we find minimizer x by solving we can express the equivalent condition to (23) with We factorize the left hand side of the above expression and use the resulting matrices to find x by back substitution at each iteration of the algorithm. For single-block RACQP, LDL approach described above replaces lines 3-16 in Algorithm 1. Furthermore, if H is diagonal, one can rewrite the system as Then we can factorize matrix (I +β A(H +β I) −1 A T ) to solve the system, which would be extremely effective when the number of constraints is very small and/or sparse, since (H +β I) −1 is diagonal and it does not change sparsity of A. Partial Lagrangian approach to solving x blocks, described in Sect. 2.3.3, uses the same implementation as Cholesky approach described above, with additional steps that build local constraints which reflect free and fixed components of x, x ω i and x −ω i respectively. The optimization problem of partial Lagrangian is formulated as ineq describing local objective, equality and inequality constraints, respectively. Note that partial Lagrangian procedure is used by both continuous and mixed integer problems. In the case of the former we set X = R n , while when we solve the latter we let X i ⊆ R and implicitly enforce the bounds. The blocks are solved by either an external solver (e.g. Gurobi) or by using Cholesky to solve KKT conditions when x ω i is unbounded.
Step 2 Update auxiliary variablesx With all variables butx fixed, from augmented Lagrangian (19) we find that the optimal vector l ≤x ≤ u can be found by solving the optimization problem arg min The problem is separable andx has a closed form solution given bŷ Step 3 Update slack variables s Similarly to the previous step, with all variables but s fixed, the optimal vector s is found by solving arg min The problem is separable and s has a closed form solution given by

Termination criteria for continuous problems
Termination criteria for continuous problems include maximum run-time limit settings, maximum number of iterations and primal-dual solution (found up to some tolerance). RACQP terminates when at least one criterion is met. For primal-dual solution criterion RACQP uses the optimality conditions of problem (18) to define primal and dual relative residuals at iteration k, and set RACQP to terminate when the residuals become smaller than some tolerance Note that the aforementioned residuals are similar to those used in [10,65] with relative and absolute residual tolerance ( abs , rel ) set to be equal.

Mixed integer problems
For mixed integer problems we tackle (17) without introducingx, where augmented Lagrangian, L β (x; s; y eq ; y ineq ), is given by where slack variables s ≥ 0, and Mixed integer problems (MIP) are addressed by using the partial Lagrangian to solve for primal variables and a simple procedure that helps to escape local optima, as shown in Algorithm 2. Note that MIP and continuous problems share the same main algorithm (Algorithm 1), but the former ignores the update tox as the bounds on x are explicitly set through X , and thusx = x always. RACQP-MIP Solver, outlined in Algorithm 2, consists of a sequence of steps that work on improving the current (or initial) solution which is then "destroyed" to be possibly improved again. This solve-perturb-solve sequence (lines 2-13) is repeated until termination criteria is met. The criteria for RACQP-MIP is usually set to be maximum run-time, maximum number of attempts to find a better solution, or a solution quality (assuming primal feasibility is met within some > 0). The algorithm can be seen as a variant of a neighborhood search technique usually associated with meta-heuristic algorithms for combinatorial optimization.
† † RACQP-MIP termination criteria (e.g. maximum run-time, number of attempts to find a better solution, solution quality and so on).
After being stuck at some local optimum solution, the algorithm finds a new initial point x 0 by perturbing the best known solution x best and continues from there. The new initial point does not need to be feasible, but in some cases it may be beneficial to be constructed that way. To detect a local optimum we use a simple approach that counts number of times a "feasible" solution is found without improvement in objective value. A solution is considered to be feasible if > 0. Perturbation (line 11) can done, for example by choosing a random number (chosen from a truncated exponential distribution) of components of x best and assigning them new values, or a more sophisticated approach can be used (see Sect. 4.2 for some implementation details). Parameters of permutation are encapsulated in a generic term κ.

Computational studies
The Alternating Direction Method of Multipliers (ADMM) has nowadays gained a lot of attention for solving many problems of practical importance (e.g. large-scale machine learning and signal processing, image processing, portfolio-management, to name a few). Unfortunately, the two most popular approaches, namely the twoblock classical ADMM and the variable-splitting multi-block [10], both characterized by convergence speed and scaling issues somehow hindered a wide acceptance of ADMM as the solution method of choice for ML problems. RAC-ADMM offers the multi-block solution that may help to overcome the problem of ADMM acceptance.
The goal of this section is twofold: (1) to show that RAC-ADMM is a versatile algorithm that can be directly applied to a wide range of LCQP problems and compete with commercial solvers and (2) get an insight on specific ML problems and devise a RAC-ADMM based solution that outperforms or matches the performance of the best tailored solution method(s) in both solution time and quality. To address the former, in Sects. 4.1 and 4.2 we compare RACQP with the state of the art commercial solvers, Gurobi [34] and Mosek [55], and the academic OSQP which is a ADMMbased solver developed by [65]. To address the latter, we focus on Linear Regression (Elastic-Net) and Support Vector Machine (SVM), machine learning algorithm used for classification and regression analysis, and in Sect. 4.1.7 compare RACQP with glmnet [30,64] and LIBSVM [14].
We conduct multiple numerical tests, solving randomly constructed problems and problems from benchmark test sets. Data we collect include run-time, number of iterations until termination criteria is met and quality of a solution, defined differently for continuous, mixed-integer and machine learning problems (described in corresponding subsections). Note that in some sections, due to space concerns we report on a subset of instances. Experiments using larger sets are available together with RACQP solver code online [61] in "demo" directory.
The experiments were done on MacBook Pro with 2.8 GHZ Intel Core i7 and 16Gb memory running macOS High Sierra, v 10.13.2 (Sect. 4.1.7) and 16-core Intel Xeon CPU E5-2650 machine with 96Gb memory running Debian linux 3.16.0-4-amd64 (all other sections).

Continuous problems
The section starts with the analysis of the l 2 regularized regularized Markowitz meanvariance model applied to 2018 CSRP Quarterly Stock data [78] followed by randomly generated convex quadratic problems (QP) with coupled blocks. Next three sets of benchmark problems are addressed: relaxed QAPLIB [60] (binary constraint on variables removed), Maros and Meszaros Convex QP [72], and the Mittelmann LP test set [73] expanded to QP by adding a diagonal Hessian to the problem model.
The goal of the section is to show that the multi-block ADMM approach adapted by RACQP can significantly reduce solution time compared to commercial solvers and two-block ADMM (used by OSQP) for most of the problems we addressed. Results obtained in this section are all done with a single RACQP run, using fixed random number generator seed. Performance of the solver when subjected to different seeds is described in Sect. 4.1.8. The run-time settings applied to solvers to produce results reported in this section, unless noted otherwise, are shown in Table 3.
Authors are aware that either commercial solver can be tuned for maximum performance by adjusting run-time parameters to fit a specific problem structure, which is the same with RACQP and OSQP but to the much smaller extent. In addition, the latter do not have the access to a large number of real-world instances used by the former to fine-tune algorithms to exploit "known" problem structures nor manpower to build heuristics and/or preconditioners that boost solver performance. However, in order to create a more "fair" working conditions, we decided to let Mosek and Gurobi use their default settings, except for disabling multi-threading support and aforementioned optimality termination criteria (Table 3). Although allowing the solvers to execute presolve routines seems to be unfair to RACQP (which does not implement any presolving technique except for a very simple row scaling), disabling it would be even more unfair to the opposing solvers as their performance heavily depends on finesses of the presolve algorithm(s). Multi-threading is disabled for Mosek and Gurobi because both RACQP and OSQP are single-threaded, and leaving it on would be unfair. Finally, to make RACQP and OSQP comparison more fair, and because our target is to compare two ADMM variants, RAC-ADMM and operator splitting two-block ADMM, rather than solvers' implementations, the advanced option that OSQP uses to post-process results, "Polish results", was turned off. Note that such an option is relatively easy to implement and a variant of thereof will be added to a future RACQP version. For continuous problems described in this section, performance is measured in terms of run-time, number of iterations and quality of solution, expressed via primal and dual residuals. Terminating a run after residual(s) have been met (Table 3, rows 2-4) is one way of ensuring quality of a solution. However, this criteria could be misleading. To start with, some solvers use absolute residuals as termination criteria (e.g. Gurobi), some depend on relative residuals (e.g. Mosek, RACQP), and some are adjustable like QSQP.
Next, solvers usually scale problems (e.g. row and column scaling of a constraint matrix) to avoid numerical problems and make matrices with favorable condition numbers. Residuals are then calculated and checked against these scaled models, meaning that a solver may prematurely terminate unless the results are periodically re-scaled and residuals recalculated on the actual model-a "good" scaled solution can actually have a very bad "actual" residual. As each solver performs different scaling (and algorithms are not usually known as it is case with Gurobi and Mosek), direct comparison of residuals reported by the solvers is not possible.
To circumvent the issue, we re-calculate primal and dual residuals using the solutions (primal and dual variables), returned by the solvers as follows: r prim := max(r A eq , r A ineq , r bounds ) where A = [A eq ; A ineq ], y * is a vector of dual variables related to equality and inequality constraints, y * bounds is a vector of dual variables related to primal variable bounds, and x * is a vector of primal variables. Residuals due to equality and inequality constraints and bounds are defined with ).
Note that Gurobi does not provide dual variables for bounds (l ≤ x ≤ u) directly. To get around we convert the bounds into inequality constraints, what makes Gurobi to produce the dual variables. This introduces negligible run-time cost as the additional constraints are discovered as bounds during presolve phase and consequently removed. The initial point x 0 for all instances addressed by RACQP is max(0, l).

Choosing RACQP solver working mode
To address differences in problem structure, the following simple rules are used to decide on the RACQP solver mode: 1. If H is non-diagonal and A is non-structural or the problem is large, use multi-block mode (Eq. 21). 2. If H is non-diagonal and A is structural, which implies that A has non-zero entries that follow some pattern and problem structure is easy to detect, use multi-block mode with smart-grouping as described in Sect. 2.3.2. 3. If H is diagonal, m << n or H and A are very sparse, and the problem is of moderate size, use single-block mode (group all primal variables x together in one block) with localized equality constraints for the sub-problem and apply (Eq. 25). 4. If H is non-diagonal, both H and A are very sparse, and the problem is of moderate size, use single-block ADMM. If only a subset of primal variables is bounded, solve the block using an external solver (e.g. Gurobi or Mosek) with localized bounds. Otherwise, solve the block using (Eq. 24).

Regularized Markowitz mean-variance model
The Markowitz mean-variance model describes N assets characterized by a random vector of returns R = (R 1 , . . . , R N ) with known expected value m i of each random variable R i and covariance σ i j for all pairs of random variables R i and R j . Given some portfolio asset x = (x 1 , . . . , x N ), where x i is the fraction of resources invested in asset i, an investor chooses a portfolio x, satisfying two objectives: expected value of the portfolio return m x = E(R x ) = m, x is maximized and portfolio risk, measured by variance σ 2 x = Var(R x ) = x, Vx , V = (σ i j ) is minimized [25]. The problem of finding the optimal portfolio can be formulated as a quadratic optimization problem, where τ ≥ 0 is risk tolerance parameter, and e is a vector of all ones. The above problem formulation includes the regularization term with parameter κ.
The raw data was collected by the Center for Research in Security Price (CRSP), and provided through Wharton Research Data Services [78] covering daily prices of 4628 assets from Jan 01 to Dec 31, 2018, and monthly prices for 7958 stocks from Jan 31 to Dec 31, 2018. Missing data was filled using the yearly average price. The model uses risk tolerance parameter τ = 1, and is regularized with κ = 10 −5 . For the formulation (29), because Hessian (V) is dense and non-diagonal, the multi-block ADMM is used, following the rules on choosing the RACQP solver mode (rule 1, Sect. 4.1.1). The number of groups p is 50, and the augmented Lagrangian penalty parameter β = 1. Default run settings (Table 3) are used by all solvers, except for OSQP that had max iteration number set to 20,000.
The performance comparison between the solvers, given in Table 4, shows that multi-block RAC finds the solution of high quality in a fraction of time needed by the commercial solvers. In addition, the results show that OSQP requires many iterations to converge to a solution meeting primal/dual tolerance criteria ( = 10 −5 ), confirming the slow convergence issue of a 2-block ADMM approach. Low-rank re-formulation Noting that the number of observations k is not large and that the covariance matrix V is of low rank and thus can be expressed as and R ∈ R k×N , with rows corresponding to time series observations, and columns corresponding to different assets, we reformulate the problem as Since the Hessian of (31) is diagonal, and number of constraints is relatively small, the problem is solved using the single-block ADMM (rule 3, Sect. 4.1.1). Run-time settings are identical to those used for the regular model described previously, with the exception of the augmented Lagrangian penalty parameter which is set to β = 0.1. The performance comparison between the solvers, given in Table 5, shows that RACQP is also competitive in low-rank formulation of the problem. Run-time is given in seconds.

Randomly generated linearly constrained quadratic problems (LCQP)
In this section we analyze RACQP performance for different problem structures and run-time settings (number of blocks p, penalty parameter β, tolerance ). In order to have more control over problem structure we generate synthetic problem instances starting with a simple one row Markowitz-like problem to multi-row problems of large sizes. Note that although we compare RACQP with Gurobi and Mosek on randomly generated instances, which may be considered to be unfair to the latter, our goal is not to diminish the importance of barrier type solution methods those solvers utilize, but to show that multi-block ADMM can be an approach to argument these methods when instances are large and/or dense. In this section we solve linearly constrained quadratic problems LCQP, described by (17), with x ∈ R n . Similarly to [80] we construct a positive definite Hessian matrix H from a random (∼ U (0, 1)) matrix U ∈ R n×n and a normalized diagonal matrix V ∈ R n + whose elements were chosen from a log-uniform distribution to have a specific condition number: where parameters η ∈ (0, 1) and ζ ≥ 0 induce different types of orientation bias. For convenience we normalize matrix H and construct vector c as a random vector (∼ U (0, 1)). Jacobian matrices A eq and A ineq are constructed in a way that the desired sparsity is met and a i. j ∼ N (0, 1) for both matrices. Our analysis of LCQP is based on extensive experimentation using different problem structure embedded in the matrix H, by varying its orientation, condition number and the random seed used to construct H (and vector c).
Markowitz-like Problem Instances RACQP implementation allows solving optimization problems by multi-block ADMM. A question that arises is the optimal number of blocks p (i.e. sub-problems) to use. The optimal number, it turns out, is related to structure and density of both Hessian and Jacobian matrices. For any H that is not a block matrix, and a dense A, as is the case with the Markowitz model, the number of blocks is related to the problem size-having more blocks leads to having more iterations before the process meets the tolerance on residual error and more sub-problems to construct and solve. However, a sub-problem of a smaller size can be constructed and solved in less time than a larger sub-problem. Total time (t T ) is thus a function of opposing arguments. To show this interdependence, we solve simple Markowitz-like problem instances, with randomly generated Q and c, and with A eq = e T , b = 1, and x ∈ R n + (inequity constraints are not used). Following (29), we add a regularization term to the objective function with κ = 10 −5 . Table 6 presents the aggregate results collected over a set of experiments (10 for each group size) using random problems constructed using (32). The reason for constructing problems in such a way is to emulate a real-world situation when a problem model (Hessian, Jacobian, x upper and lower bounds) do not change, but coefficients do. The results confirm that there exist a "right" number of blocks which minimizes overall run-time. For now, choosing that number is based on experience, but we are working on formalizing the procedure. In addition to run-time cost per iteration, Table 6    reports number of iterations until convergence (k) for different number of blocks. It is interesting to observe is that k is very mildly affected by the choice of p, if tolerance is kept the same. This leads to another interesting question on how much a change in affect run-time. Table 7 gives an answer to this question. The table lists RACQP performance over the same problem set, but with different residual tolerances. As expected, results show that the number of iterations increases as the tolerance gets tighter.
General LCQP Building on the results from the previous section, we expand the QP model to include general equality and inequality constraints with unbounded variables x. We analyze RACQP when solving sparse problems (dense problems are covered in the next section where we address relaxed QAP) for problems of size n = 6000 and n = 9000. The number of rows in both constraint matrices is equal (m = m eq = m ineq ), and set to be a function of a problem size, m = r · n, with r = {0.1, 0.5}. The number of blocks used by RACQP is related to size of a block, p n = n/b size , with the optimal block size b size empirically determined to be 60. The penalty parameter β = 1 was found to produce the best results. Tables 8 and 9 give comparative analysis of performance of the solvers with respect to run-time and primal/dual residuals. Although both OSQP and RACQP did well in terms of primal and dual residuals, the results show that multi-block RACQP converges to solutions much faster (4-10×) then OSQP. Both solvers outperform Gurobi and Mosek in run-time, even though the tolerance on residual error is set to the same value . Another observation is that Mosek produces solutions of inferior quality to all aforementioned solvers-dual residuals are of 10 −3 and 10 −4 levels, far below the requested threshold. Investigation of the log files produced by Mosek reveled two problems: (1) Mosek terminates as soon as primal or dual or complementary gap residual criteria is met (unlike the other solvers which terminate when all the residual criteria are met); (2) residuals are not periodically checked on a re-scaled model, resulting in a large discrepancy between internally evaluated residuals (scaled data) and the actual one.

Relaxed QAP
As of this section we continue the study of RACQP but, instead of randomly generating problems, we use benchmark test sets compiled by other authors which reflect real-world problems. We start by addressing large scale instances from the QAPLIB benchmark library [60] compiled by [11] and hard problems of large size, described in [21]. The quadratic assignment problem (QAP) problem is a binary problem, but for the purpose of more realistic comparison between the solvers, we relax it to a continuous problem. The numerical tests solving the binary problem formulation will be given later in Sect. 4.2.3. The quadratic assignment problem belongs to a class of combinatorial optimization problems that arise from problems of practical interest. The QAP objective is to assign n facilities to n locations in such a way that the assignment cost is minimized. The assignment cost is the sum, over all pairs, of a weight or flow between a pair of facilities multiplied by the distance between their assigned locations. Mathematically, the QAP can be presented as follows: where x i j is the entry of the permutation matrix X ∈ R r ×r . To make the problem convex and be admitted by Cholesky factorization, we make H ∈ R n×n strict diagonally dominant, H =Ĥ + d · I, whereĤ = (A ⊗ B) and d = max( n i=1,i = jĥ i, j ) + δ, Table 9 Primal and dual residuals comparison between the solvers for LCQP Problem size with δ being some small positive number and n = r 2 . The "flow" matrix A ∈ R r ×r and the "distance" matrix B ∈ R r ×r . For QAP we apply a method for variance reduction as described in Sect. 2.3 since the assignment constraints are highly structured and observable. We group variables following a simple reasoning-given that the permutation matrix X is doubly stochastic, each row (or column) can be seen as a single super-variable, an integer representing a permutation order. Thus, it makes sense to make one super-variable, x i for each row i of X, so that each super-variable is of size r . For each of the experiments shown we set number of groups p = r (thus we solve for one super-variable per block), and penalty parameter β to the best we found by running multiple experiments with different parameter values. We found that β = r offered the best run-time.
The results showing performance of solvers on a selected set of large QAP instances are summarized in Tables 10 and 11. The instances were chosen in such a way to cover a variety of problem densities (Hessian) and sizes. Table 10 shows run-time and number of iterations. Note that any comparison between barrier based solvers (Gurobi and Mosek) and ADMM solvers (RACQP, OSQP) is not possible, as the solution methods are completely different, but giving the number of iterations allow us to compare performances within each class of the solvers.
Similarly to results presented previously, RACQP is the fastest solver. Solution quality (primal and dual residual tolerance) is achieved in a fraction of time required by the other solvers. The average speedup is 214×, 86× and 83× with respect to Gurobi, Mosek and OSQP respectively. OSQP, although performing a similar number of iterations as RACQP does, is much slower-splitting a large problem into two parts (OSQP executes 2-block ADMM) still leaves two large matrices to solve. On the positive side, OSQP finds better solutions (primal residual smaller by the order of magnitude). Mosek is the worst performing solver-run-time-wise it is close to OSQP, only one returned solution satisfies the dual residual (tai125e01). The other instances report the dual to be as low as 10 −1 . Gurobi found the best solutions, except for tai125e01 and tho150 instances, when max run-time limit (3h) was reached.

Maros and Meszaros convex QP
The Maros and Meszaros test set [72] is a collection of convex quadratic programming examples from a variety of sources [49] of the following form with H ∈ R n×n symmetric positive definite, A ∈ R m×n , b ∈ R m and l, u ∈ R n , meaning that some of components of l and u may be −∞ and +∞ respectively. Constant c 0 is assumed to be |c 0 | < ∞. As in the previous section, only a subset of instances is used in experiments. The instances were chosen in such a way to cover a variety of problem models (density, size) but also to point to strengths and weaknesses of ADMM-based algorithms. Problem Table 10 Relaxed QAP [11,21] instances Run-time and iteration count comparison between the solvers Table 11 Relaxed QAP [11,21] instances. Primal and dual residuals comparison between the solvers Instance name sizes n range from 4 · 10 3 to almost 10 5 with the number of constraints m up to 10 5 . The Hessian matrices are either diagonal, with number of non-zero diagonal elements less or equal to n, or symmetric with no nonzero diagonal elements. The constraint matrices A ∈ R m×n are very sparse across the problems; for most of the instances density is below 10 −3 . In addition to being sparse, the Jacobian matrices are not block separable.
RACQP mode was set to a single-block mode according to the rules 3 and 4 of Sect. 4.1.1, with β = 1 for all instances except for CONT* and UBH1 which use β = 350 and β = 12,000 respectively. Residual tolerance of = 10 −4 was used in producing the results, reported in Tables 12 and 13. The tolerance is lower than the default one (10 −5 ) because ADMM methods had hard time converging on CONT* and CVXQP* instances for tighter residuals (max number of iterations limit is 4000). Positive-definite instances (i.e. H 0) are marked with † .
Overall, for solving sparse and Hessian-diagonal problems, both Gurobi and Mosek seem more robust than OSQP and RACQP, probably due to the linear programming structure. The latter two are of the comparable performance. The results, in terms of the gap are of similar quality, and run-time is approximately the same, except for a couple of instances, where self-adjusting methodology used by OSQP for penalty parameter estimation, gives OSQP speed advantage. Also, some of the run-time variation can also be contributed to different languages used to implement solvers; OSQP is implemented in c/c++ while RACQP uses Matlab.
RACQP solved more instances than OSQP, which in addition to not being able to meet primal/dual residuals for 25% of instances, it also could not find a feasible solution for HUES-MOD and HUETIS instances. Mosek residual issue reported in the previous section continues to persists on these problem instances. For example AUG2DQP instance solution has dual residual of 5.5 · 10 −2 , the value that does not meet the requested tolerance.

Convex QP based on the Mittelmann LP test set
In this section we report on the performance of solvers when applied to very large quadratic problems. Instances are taken from the Mittelmann LP test set [73] augmented with a diagonal Hessian H to form a standard LCQP (17). The results are shown in Tables 14 and 15. Residual tolerance was set to 10 −4 (OSQP could not solve any instance but i_n13 when default tolerance of 10 −5 was used, and RACQP had hard time with nug30). Other default termination criteria apply (Table 3). For all instances the number of blocks was set to p = 200 and penalty parameter, to β = 5 except for nug30 that used β = 50.
RACQP solved very large (n > 750,000) quadratic problems to the required accuracy ( = 10 −5 ) very fast. The results were obtained using different solution strategies: multi-block Cholesky factorization approach for wide15, square15 and long15 instances, and the partial Lagrangian approach for nug30 (localized lower and upper bound of sub-problem primal variables). The best set of parameters were found by a brute-force approach, which implies that additional research work needs to be done to identify algebraic methods to characterize instances so that run-time parameters can be chosen automatically. RACQP was unable to find a solution satisfying both

Table 14
Convex QP based on the Mittelmann LP test set [73] Instance name  Table 15 Convex QP based on the Mittelmann LP test set [73] Instance name OSQP solved only one instance (i_n13) within given run-time and number of iterations limitations, while Gurobi solved all the instances to a high precision, regardless of having termination criteria, Table 3, set to = 10 −5 . Mosek did not find a single solution meeting the residual criteria, due to the aforementioned scaling and termination criteria issue.

Selected machine learning problems
In this section we apply RAC method and RP method to few selected machine learning (ML) problems related to convex quadratic optimization, namely Linear Regression (Elastic-Net) and Support Vector Machine (SVM). To solve the former we apply a specialized implementation of RAC-ADMM (available for download at [61]), while for the latter we use RACQP solver. RAC-ADMM is compared with the specialized methods, Glmnet [30,64] for Elastic-Net and LIBSVM [14] for SVM problems. The results show that our general-purpose solver matches and under certain circumstances exceeds the performance of those tailored methods. Linear Regression using Elastic Net For a classical linear regression model, with observed features X ∈ R n× p and labels y ∈ R n , where n is number of observations and p is number of features, one solves the following unconstrained optimization problem with P λ,α (β) = λ{ 1−α 2 β 2 2 + α β 1 } used for Elastic Net model. By adjusting α and λ, one could obtain different models: for ridge regression, α = 0, for lasso α = 1, and for classic linear regression, λ = 0. For the problem to be solved by ADMM, we use variable splitting and reformulate the problem as follows Note that in (35) we follow the standard machine learning Elastic Net notation in which β is the decision variable in the optimization formulation, rather than x.
Let c = − 1 n X T y, A = X √ n , and let γ denote the augmented Lagrangian penalty parameter with respect to constraint β − z, and ξ be the dual with respect to constraint β − z = 0. The augmented Lagrangian could then be written as We apply RAC-ADMM algorithm by partitioning β into multi-blocks, but solve z as one block. For any given β k+1 , optimizer z * k+1 has the closed form solution.
where ξ i is the dual variable with respect to constraint β i − z i = 0, and S(a, b) is soft-threshold operation [29].
In order to solve classic linear regression directly, X T X must be positive definite which can not be satisfied for p > n. However, RAC-ADMM only requires each subblock X T sub X sub to be positive definite, so, as long as block size s < n, RAC-ADMM can be used to solve the classic linear regression.
It is worth pointing out that although the objective function here is non-smooth, we could still reformulate the problem as convex quadratic programming with bounded constraints. To see this, consider the case where α = 1, and elastic net problem becomes a classic lasso regression, with (35) becomes Optimization problem 35 is equivalent as Let ξ be the dual with respect to constraint β − z + z = 0. The optimal (z ) k and (z ) k at each iteration satisfies (z ) k − (z ) k = 1 γ S(ξ k (i) − γ β k+1 (i), λ), which equals to z k when we solves problem (36). Essentially, the non-smooth objective here could be reformulated as a smooth convex quadratic programming with non-negative constraints. Using the absolute value update is just a compact way to implement the algorithm, so that the expected convergence is guaranteed from our theorem.
We compare our solver with Glmnet [30,64] and Matlab lasso implementation on synthetic data (sparse and dense problems) and benchmark regression data from LIBSVM [14].

Synthetic Data
The data set for dense problems X is generated uniform randomly with n = 10, 000, p = 50, 000, with zero sparsity, while for the ground truth β * we use standard Gaussian and set sparsity of β * to 0.1. Due to the nature of the problem, estimation requires lower feasibility precision, so we fix number of iterations to 10 and 20. Glmnet solver benefits from having a diminishing sequence of λ, but given that many applications (e.g. see [3]) require a fixed λ value , we decided to use fixed λ for all solvers. Note that the computation time of RAC-ADMM solver is invariant regardless of whether λ is decreasing or fixed. Dense problem, n = 10, 000, p = 50, 000 Table 16 reports on the average cross-validation run-time and the average absolute l 2 loss for all possible pairs (α, λ) with parameters chosen from α = {0, 0.1, . . . , 1} and λ = {1, 0.01}. Without specifying, RAC-ADMM solver run-time parameters were identical across the experiments, with augmented Lagrangian penalty parameter γ = 0.1λ for sparsity < 0.995, γ = λ for sparsity > 0.995, and block size s == 100. Large scale sparse data set X is generated uniform randomly with {n = 40, 000, p = 4, 000, 000}, using sparsity = 0.998. For ground truth β * , the standard Gaussian with sparsity β * = 0.5 and fixed λ. Noticing from the previous experiment that increasing a step size from 10 to 20 didn't significantly improve prediction error, we fix number of iteration to 10. Table 17, report on the average cross-validation run time and the average absolute l 2 loss for all possible pairs (α, λ) with parameters chosen from α = {0, 0.1, . . . , 1} and λ = {1, 0.01}. The table also shows the best l 2 loss for each solver. Because it took more than 10, 000 seconds for Matlab lasso to solve even one estimation, the table reports only comparison between glmnet and RAC.
Experimental results on synthetic data show that RAC-ADMM solver outperforms significantly all other solvers in total time while being competitive in absolute l 2 loss. Further RAC-ADMM speedups could be accomplished by fixing block-structure (RP-ADMM). In terms of run-time, for dense problem, RAC-ADMM is 3 times faster compared with Matlab lasso and 7 times faster compared with glmnet. RP-ADMM is 6 times faster compared with Matlab lasso, and 14 times faster compared with glmnet. For sparse problem, RAC-ADMM is more than 30 times faster compared with Matlab lasso, and 3 times faster compared with glmnet. RP-ADMM is 4 times faster compared with glmnet.
Following Corollary 2 RP-ADMM is slower that RAC-ADMM when convergence is measured in number of iterations, and experimental evidence (Table 1) show that it also suffers from slow convergence to a high precision level on L1-norm of equality constraints. However, the benefit of RP-ADMM is that it could store pre-factorized sub-block matrices, as block structure is fixed at each iteration,in contrast to RAC-ADMM which requires reformulation of sub-blocks at each iteration, what it turn makes each iteration more time-wise costly . In many machine learning problems, including regression, due to the nature of problem, a less precision level is required.  This makes RP-ADMM an attractive approach, as it could converge within fewer steps and potentially be faster than RAC-ADMM. In addition, while performing simulations we observed that increasing number of iteration does not significantly improve performance of prediction. In fact, absolute l 2 loss remains similar even when number of iteration is increased to 100. This further gives an advantage to RP-ADMM, as it benefits the most when number of iteration is relatively small.

LIBSVM Benchmark instances
Regression data E2006-tfidf feature size is 150,360 with number of training and testing data points of 16,087 and 3308 respectively. The null training error of test set is 221.8758. Following the findings from the section on synthetic problems and noticing that this dataset is sparse (density=0.991), this setup uses fixed number of iterations to 10, and vary λ = {1, 0.01} and α = {0, 0.1, 0.2, . . . , 1}. The training set is used to predict β * , and the model error (ME) of test set is compared across different solvers. Table 18 shows the performance of OSQP and Matlab lasso for α = 1 and λ = 0.01, and Table 19 compares compare RAC-ADMM with glmnet. The reason for splitting the results in two tables is related to inefficiency of factorizing a big matrix by OSQP solver and Matlab lasso implementation. Each solver requires more than than 1000 seconds to solve the problem for even 10 iterations, making them impractical to use. On the other hand, glmnet, which uses a cyclic coordinate descent algorithm on each variable, performs significantly faster than OSQP and Matlab lasso. However, glmnet can still be inefficient, as a complete cycle through all p variables requires O( pN ) operations [30]. Results given in Table 19 are the averages over run-time and training error collected from experiments with α = {0, 0.1, . . . , 1}. The results show that RAC-ADMM is faster than glmnet for all different parameters and that it achieves the best training model error, 22.0954, among all the solvers. In terms of run-time, RAC-ADMM is 14 times faster than OSQP, 38 times faster than Matlab lasso, and 4 times faster than glmnet. RP-ADMM is 28, 18 and 8 times faster than OSQP, Matlab lasso and glmnet, respectively.
For log1pE2006 benchmark , feature size is 4,272,227, number of training data is 16,087 and number of testing data is 3308. The null training error of test set is 221.8758 and sparsity of data is 0.998. Similarly to the previous benchmark, the performance results are split into two tables. Table 20 shows the performance of OSQP and Matlab lasso, while Table 21 compares RAC-ADMM and glmnet.
The results show that RAC-ADMM and RP-ADMM are still competitive and are of same level as glmnet with respect to model error, and all outperform OSQP and Matlab. In terms of run-time, RAC-ADMM is 12 times faster than OSQP, and 5 times faster than glmnet. RP-ADMM is 16 and 7 times faster than OSQP and glmnet, respectively. Support Vector Machine A Support Vector Machine (SVM) is a machine learning method for classification, regression, and other learning tasks. The method learns a mapping between the features x i ∈ R r , i = 1, . . . n and the target label y i ∈ {−1, 1} of a set of data points using a training set and constructs a hyperplane w T φ(x) + b that separates the data set. This hyperplane is then used to predict the class of further data points. The objective uses Structural Risk Minimization principle which aims to minimize the empirical risk (i.e. misclassification error) while maximizing the confidence interval (by maximizing the separation margin) [74,75].
Training an SVM is a convex optimization problem, with multiple formulations, such as C-support vector classification (C-SVC), υ-support vector classification (υ-SVC), -support vector regression ( -SVR), and many more. As our goal is to compare RACQP, a general QP solver, with specialized SVM software and not to compare SVM methods themselves, we decided on using C-SVC [8,18], with the dual problem formulated as min is a kernel function, and regularization parameter C > 0. The optimal w satisfies w = n i=1 y i z i φ(x i ), and the bias term b is calculated using the support vectors that lie on the margins (i.e. 0 < z i < C) as b i = w T φ(x i ) − y i . To avoid numerical stability issues, b is then found by averaging over b i . The decision function is defined We compare RACQP with LIBSVM [14], due its popularity, and with Matlab-SVM, due to its ease of use. These methods implement specialized approaches to address the SVM problem (e.g. LIBSVM uses a Sequential Minimal Optimization, SMO, type decomposition method [9,26]), while our approach solves the optimization problem (37) directly.
The LIBSVM benchmark library provides a large set of instances for SVM, and we selected a representative subset: training data sets with sizes ranging from 20,000 to 580,000; number of features from eight to 1.3 million. We use the test data sets when provided, otherwise, we create test data by randomly choosing 30% of testing data and report cross-validation accuracy results.
In Table 22 we report on model training run-time and accuracy, defined as (num. correctly predicted data)/(total testing data size)×100%. RAC-ADMM parameters were as follows: max block size s = 100, 500, and 1000 for small, medium and large instances, respectively and augmented Lagrangian penalty β = 0.1 p, where p is the number of blocks, which in this case is found to be p = n/s with n being the size of training data set. In the experiments we use Gaussian kernel, . Kernel parameters σ and C were estimated by running a grid-check on cross-validation. We tried different pairs (C, σ ) and picked those that returned the best cross-validation accuracy (done using randomly choose 30% of train data) when instances were solved using RAC-ADMM. Those pairs were then used to solve the instances with LIBSVM and Matlab. The pairs were chosen from a relatively coarse grid, σ, C ∈ {0.1, 1, 10} because the goal of this experiment is to compare RAC-ADMM with heuristic implementations rather than to find the best classifier. Termination criteria were either primal/dual residual tolerance ( p = 10 −1 and d = 10 −0 ) or maximum number of iterations, k = 10, whichever occurs the first. Dual residual was set to such a low value because empirical observations showed that restricting the dual residual does not significantly increase accuracy of the classification but effects run-time disproportionately. Maximum run-time was limited to 10 hours for mid-size problems, and unlimited for the large ones. Run-time is shown in seconds, unless noted otherwise.
The results show that RACQP produces classification models of competitive quality as models produced by specialized software implementations in a much shorter time. RACQP is in general faster than LIBSVM (up to 27×) except for instances where ratio of number of observations n with respect to number of features r is very large. It is noticeable that while producing (almost) identical results as LIBSVM, the Matlab implementation is significantly slower.
For small and mid-size instances (training test size < 100K) we tried, the difference in accuracy prediction is less than 2%, except for problems where test data sets are much larger than the training sets. In the case of "rcv1.binary" instance test data set is 5× larger than the training set, and for "cod_rna" instance is 4× larger. In both cases RACQP outperforms LIBSVM (and Matlab) in accuracy, by 20% and 9%, respectively.
All instances except for "news20.binary" have n >> r and the choice of the Gaussian kernel is the correct one. For instances where the number of features is larger than the number of observations, linear kernel is usually the better choice as the separability of the model can be exploited [79] and problem solved to similar accuracy in a fraction of time required to solve it with the non-linear kernel. The No test set provided, using 30% of randomly chosen data from the training set. Reporting cross-validation accuracy results reason we used the Gaussian kernel on "news20.binary' instance is that we wanted to show that RACQP is only mildly affected by the feature set size. Instances of similar sizes but different number of features are all solved by RACQP in approximately the same time, which is in contrast with LIBSVM and Matlab that are both affected by the feature space size. LIBSVM slows down significantly while Matlab, in addition to slowing down could not solve "news.binary"-the implementation of fitcsvm() function that invokes Matlab-SVM algorithm requires full matrices to be provided as the input which in the case of "news.binary" requires 141.3GB of main memory.
"Skin_nonskin" benchmark instance "marks" a point where our direct approach starts showing weaknesses-LIBSVM is 5× faster than RACQP because of the finetuned heuristics which exploit very small feature space (with respect to number of observations). The largest instance we addressed is "covtype.binary", with more than half of million observations and the (relatively) small feature size ( p = 54). For this instance, RACQP continued slowing down proportionately to the increase in problem size, while LIBSVM experienced a large hit in run-time performance, requiring almost two days to solve the full size problem. This indicates that the algorithms employed by LIBSVM are put to the limit and specialized algorithms (and implementations) are needed to handle large-scale SVM problems. RACQP accuracy is lower than that of LIBSVM, but can be improved by tightining residual tolerances under the cost of increased run-time.
For large-size problems RACQP performance degraded, but the success with the mid-size problems suggests that a specialized "RAC-SVM" algorithm could be developed to address very large problems. Such a solution could merge RAC-ADMM algorithm with heuristic techniques to (temporarily) reduce the size of the problem (e.g. [42]), smart kernel approximation techniques, probabilistic approach(es) to shrinking the support vector set (e.g. [62]), and similar.

Changing random seed for RACQP
When it comes to algorithms that are stochastic in nature, as RAC-ADMM is, the question that always comes onto mind is about robustness of the algorithm. More precisely, how much is RAC-ADMM sensitive to variations in problem data for a given problem model, and to variations arising from differences in sub-problems due to from randomness of block building procedure (Algorithm 1, line 3). The answer to the former question has been provided in Sect. 4.1.3, and this section tackles the latter.
To answer the question on RACQP sensitivity to sub-problem structure we subject RACQP to different random seeds-each sub-problem is solving minimization problem defined by Lagrangian (Eq. 20), which is, in turn, a function of blocks of primal variables constructed using a stochastic process, following the procedure outlined in Sect. 3.1, Step 1. This stochastic process is guided by values drawn from a pseudo random number generator, which is initialized using a random seed number. For different seeds the generator produces different sequences of numbers, what in turn produces different sub-problems addressed by RACQP. Table 23 shows results over a selected set of instances chosen to represent each problem type addressed so far. The table aggregates statistical data collected by solving each instances using ten different seeds per primal/dual tolerance . Note that CuteR instances (Sect. 4.1.5) are not included in the analysis as all the instances are solved using a single-block approach. The results show that RACQP is a robust algorithm and that using a single run was a correct choice to make, at least when it comes to problem instances reported in this section. To generalize the claim about RACQP robustness with respect to randomness of block building scheme would require much more experiments and theoretical analysis, what we delegate to our future work.

Binary and mixed integer problems
The RAC-ADMM multi-block approach can be applied directly to binary (and mixed integer) problems without any adaptation. However, when dealing with combinatorial problems, a divide-and-conquer approach does not necessary lead to a good solution, because solver may get stuck in some local optima. To mitigate this problem RACQP, we introduce additional randomness into the implementation: a simple perturbation scheme shown in Algorithm 2 that helps the solver to "escape" the local optimum and to continue search for another one (and possibly find the global optimum). Thus, in addition to the run-time parameters used for continuous problems, for MIP we need to specify perturbation parameters such as probability distribution to use when choosing how many variables are perturbed (N p ) and the parameters thereof. As a default, RACQP implements truncated exponential distribution, N p ∼ Exp(λ) with parameter λ = 0.4n, minimum number of variables N p,min = 2, and maximum number of variables N p,max = n, based on the observation that for most of the problems "good" solutions tend to be grouped. Variables are chosen at random, and in the general case, perturbation is done by assigning "new" values (within bounds) to the chosen variables. Default number of trials before perturbation, N trial = min(2, 0.005n). For all binary problems presented in this section the primal residual error was zero, i.e. the problems were solved to feasibility.
As the default solver for sub-problems, RACQP uses Gurobi, but any other solver that admits mixed integer quadratic problem would suffice. The results reported in this section are based on Gurobi 7.5, and may be outdated. However, since we use Gurobi as the sub-solver, we expect RACQP to implicitly gain by the improvements made to Gurobi. Gurobi was ran using its default run-time settings (e.g. presolve option was turned on).
In [66] the authors present a mixed integer quadratic solver, MIQPs, which uses OSQP solver for solving sub-problems resulting from branch-and-bound strategy. Since the solver is built for small and medium size problems that occur in embedded applications, we do not include it in our current study. However, given that MIQPs showed a promising numerical performance (3× faster than Gurobi) even though being implemented in Python, it would be interesting to use it within RACQP as the external solver for MIP (Algorithm 1, line 8) instead of our default solver (Gurobi) and compare performance. We defer this comparison to future work.
To solve MIP problems RACQP uses the partial Lagrangian approach, described in Sect. 2.3.3, to handle bounds on variables X i , x i ∈ X i . Additionally, depending on a problem structure, equality and inequality constraints can also be moved to the local constraint set. Our experiments show that moving some (as it done for QAP), or  Ten experiments per instance per all constraints (e.g. graph cut problems) to a local set is beneficial in terms of block sizes, run-time, and overall solution quality. By using local constraints we help the sub-solver (e.g. Gurobi) reduce the size of the problem and tighten its formulation (using presolve and cutting plane algorithms).
Rather than solving the binary QP problem exactly, our goal is to find a (randomized or deterministic) algorithm that could find a better solution under a fixed solution time constraint. Our preliminary tests show that solving a large-scale problem using RAC-ADMM based approach can lead to a very good quality solution for an integer problem in a very limited time.
The quality of solutions is given in a form of a gap between the objective value of the optimal solution x * opt and the objective value of the solution found by a solver S, x * S : For the instances for which the optimal solution remains unknown (e.g. QAPLIB and GSET instances), we use the best known results from the literature. Note that for maximization problems (e.g. Max-Cut, Max-Bisection) gap is the negative of (38). All binary problems are solved with primal residual equal to zero (i.e. the solutions are feasible and integer).

Randomness helps
We start the analysis of RACQP for binary problems with a short example showing that having blocks that are randomly constructed at each iteration, as done by RAC-ADMM, is the main feature that makes RACQP work well for combinatorial problems, without a need for any special adaptation of the algorithm for the discrete domain.
RAC-ADMM can be easily adapted to execute classical ADMM or RP-ADMM algorithms, so here we compare these three ADMM variants when applied to combinatorial problems. We use a small size problem (n = 1000) and construct a problem using (32) applied to problem of Markowitz type (39), with κ = 10 −5 and a positive integer number r ∈ Z + , r ∈ (1, n) that defines how many stocks from a given portfolio must be chosen. A typical evaluation of the algorithms is shown in Fig. 3. Results show that RAC-ADMM is much better suited for binary optimization problems than either cyclic ADMM or RP-ADMM or distributed-ADMM, which is not surprising since more randomness is adapted into the algorithm making it more likely to escape local optima. All the algorithms are quick to find a local optimum, but besides RAC-ADMM stay at that first found point, while RAC-ADMM continues to find local optima, which could be better or worse than previously found. Because of this behavior, one can keep track

Markowitz portfolio selection
Similarly to the section on continuous problems, we compare RACQP performance with that of Gurobi on Markowitz cardinality constrained portfolio selection problem (39) using real data coming from CRSP 2018 [78]. In the experiments, we set r = n/2 with all other settings identical to those used in Sect. 4.1.2, including V and m, estimated from CRSP 2018 data. The default perturbation RACQP settings with β = 0.05, p = 100 were used in the experiments. Gap is measured from the "Optimal" objective values of the solutions found by Gurobi in about 1 hour run-time after relaxing MIPGAP parameter to 0.1. From the results (Table 24) it is noticeable that RACQP finds relatively good solutions (gap 10 −2 −10 −4 ) in a very short time, in some cases even before Gurobi had time to finalize root relaxation step of its binary optimization procedure. Maximal allowed run-time of 1 min was far too short for Gurobi to find any solution, so it returned a heuristic ones. Note that those solutions (third column of the table) are extremely weak, suggesting that a RAC-ADMM based solution could be implemented and used instead.

Low-rank Markowitz portfolio selection model
Similarly to (31) we formulate the model for low-rank covariance matrix V as   CRSP 2018 data [78]. Max run-time 1 min and solve the model for CRSP 2018 data. We use β = 0.5, p = 50. RACQP gap was measured from the optimal solution returned by Gurobi. In Table 25 we report on the best solutions found by RACQP with max run-time limited to 60 seconds. Results are hard to compare. When Hessian is diagonal and the number of constraints are small, as the case for this data, Gurobi has a very easy time solving the problems (monthly and daily data)-it finds good heuristic points to start with, and solves problems at a root node after a couple of hundreds of simplex iterations. On the other hand, RACQP, which does not directly benefit from diagonal Hessian, needs to execute multiple iterations of ADMM. Even though the problems are small and solved very quickly, the overhead of preparing the sub-problems and initializing Gurobi to solve sub-problems accumulates to the point of overwhelming RACQP run-time. In that light, for the rest of this section we consider problems where Hessian is a non-diagonal matrix, and address the problems that are hard to solve directly by Gurobi (and possibly other MIP QP solvers).

QAPLIB
The binary quadratic assignment problem (QAP) is known to be NP-hard and that binary instances of larger sizes (dimension of the permutation matrix r > 40) are considered to be intractable and cannot be solved exactly (though some instances of a large size with special structure have been solved). Currently, the only practical solutions for solving large QAP instances are heuristic methods. For binary QAP we apply the same method for variance reduction as we did for relaxed QAP (Sect. 4.1.4). We group variables following the structure of constraints, which is dictated by the permutation matrix X ∈ {0, 1} r ×r (see Eq. 33 for QAP problem formulation)-we construct one super-variable, x i for each row i of X . Next we make the use of the partial Lagrangian, and split constraints into the local constraint set consisting of (33) (a) and the global constraint set consisting of (33) (b), so that the partial Lagrangian is At each iteration, we update the ith block by solving Next, continuing on the discussion on perturbation from the previous section, we turn the feature on and set parameters as follows: number of super-variables to perturb is drawn from truncated exponential distribution, N p ∼ Exp(λ) with parameter λ = 0.4r , minimum number of variables N p,min = 2 and maximum number of variables N p,max = r . The number of trials before perturbation N trial is set to its default value. Note that we do not perturb single variables (x i, j ), rather super-variables that we choose at random. If a super-variable x i has value of '1' at one location, and '0' on all other entries, then we randomly swap location of '1' within the super variable (thus keeping the row-wise constraint on X for row i satisfied). If the super-variable is not feasible (number of '1' = 1), we flip values of a random number of variables that make x i . The initial point is a random feasible vector. The penalty parameter is a function of the problem size, β = n, while the number of blocks depends on the permutation matrix size and it is p = r /2 . The summary of the QAPLIB benchmark [60] results is given in Table 26. Out of 133 total instances the benchmark includes, RACQP found the optimal solution (or the best known from literature as not all instances have proven optimal solution) for 18 instances within 10 min of run-time. For the rest of the instances, RACQP returned solutions with an average gap of μ = 0.07. Gurobi solved only three instances to optimality. The average gap of the unsolved instances is μ = 12.15, which includes heuristic solutions returned when root relaxation step was not finalized (20 instances). Removing those outliers results in the average gap of μ = 5.57. Table 27 gives detailed information on 21 large instances from QAPLIB data set. The most important takeaway from the table is that Gurobi can not even start solving very large problems as it can not finalize the root relaxation step within given maximum run time, while RACQP can.

Maximum cut problem
The maximum-cut (Max-Cut) problem consists of finding a partition of the nodes of a graph G = (V , E), into two disjoint sets V 1 and V 2 (V 1 ∩ V 2 = ∅, V 1 ∪ V 2 = V ) in such a way that the total weight of the edges that have one endpoint in V 1 and the other in V 2 is maximized. The problem has numerous important practical applications, and is one of Karp's 21 NP-complete problems. A standard formulation of the problem is max i, j w i, j (1 − y i y j ), which can be re-formulated into where h i, j = w i, j and h i,i = − 1 2 ( n j=1 w i, j + n j=1 w j,i ). We use the Gset benchmark from [33], and compare the results of our experiments with the optimal solutions (found by Gurobi) and the best known solutions from the literature [4,48]. For perturbation we use default parameters and perform perturbation by choosing a random number of variables and negating their values, i.e. x i = 1 − x i . The number of blocks is equal for all instances, p = 4, and the initial point is set to zero (x 0 = 0) for all the experiments. Note that as the max-cut problem is unconstrained, the penalty parameter β is not used (and RACQP is doing a randomly assembled cyclic BCD).
In contrast to continuous sparse problems (rule 4, Sect. 4.1.1), sparse binary problems benefit from using a randomized multi-block approach, as shown in Table 28. The table compares RACQP and Gurobi results collected from experiments on Gset instances for three different maximum run-time limit settings, 10, 30 and 60 minutes. RACQP again outperforms Gurobi, overall, it finds better solutions when run-time is limited. Although Gurobi does better on a few problems, on average RACQP is better. Note that for large(r) problems (n ≥ 5000) RACQP keeps improving, which can be explained by the difference in number of perturbations-for smaller problems, good points have already being visited and a chance to find a better one are small. Adaptively changing perturbation parameters could help, but this topic is out of scope of this work.

Maximum bisection problem
The maximum bisection problem is a variant of the Max-Cut problem that involves partitioning the vertex set V of a graph G = (V , E) into two disjoint sets V 1 and V 2 of equal cardinality (i.e. V 1 ∩ V 2 = ∅, V 1 ∪ V 2 = V , |V 1 | = |V 2 |) such that the total weight of the edges whose endpoints belong to different subsets is maximized. The problem formulation follows (41) with the addition of a constraint e T x = n/2 , where n is the graph size.
For Max-Bisection, at each iteration we update the ith block by solving x k+1 ω i = arg min where d i is the size of block i, x ω i is a sub-vector of x constructed of components of x with indices ω i ∈ , and b ω i = n/2 − e T x −ω i with x −ω i being the sub-vector of x with indices not chosen by ω i . Solving the sub-problems directly has shown to be very time consuming. However, noticing that Gurobi, while solving the problem as whole, makes a good use of cuts for this type of problems (matrix Q structure), we decided to reformulate the sub-problems as follows Note that r can be also defined as a bounded continuous or integer variable, but because the optimal value is zero and because Gurobi makes good use of binary cuts, we decided to define r as binary.
As in the previous section, we use Gset benchmark library and compare the results of our experiments with the best known solutions for max-bisection problems found in the literature [48]. The experimental setup is identical to that of Max-Cut experiments except for the use of the penalty parameter β = 0.005 and the initial point x 0 which is Table 28 Max-Cut, GSET instances Gap between best known results and RACQP/Gurobi objective values a feasible random vector. Perturbation is done with a simple swap-an equal number of variables with values "1" and "0" is chosen and the new value set to be the negation of the old value.
The results are shown in Table 29. Compared to the unconstrained max-cut problem, RACQP seems to have less trouble solving max-bisection problem-adding a single constraint boosted its performance by up to 2×. Gurobi performance on the other worsened. Overall, RACQP outperforms Gurobi, finding better solutions when runtime is limited. Both Gurobi and RACQP continue gaining on solution quality (gap gets smaller) with longer time limits.

Summary
In this paper, we introduced a novel randomized algorithm, randomly assembled multi-block and cyclic alternating direction method of multipliers (RAC-ADMM), for solving continuous and binary convex quadratic problems. We provided a theoretical proof of the performance of our algorithm for solving linear-equality constrained continuous convex quadratic programming, including the expected convergence of the algorithm and sufficient condition for almost surely convergence of the algorithm. We further provided open source code of our solver, RACQP, and numerical results on demonstrating the efficiency of our algorithm.
We conducted multiple numerical tests on solving synthetic, real-world, and benchmark quadratic optimization problems, which include continuous and binary problems. We compare RACQP with Gurobi, Mosek and OSQP for cases that do not require high accuracy, but a strictly improved solution in shortest possible run-time. Computational results show that RACQP, except for a couple of instances with a special structure, finds solutions of a very good quality in a much shorter time than the compared solvers.
In addition to general linearly constrained quadratic problems we applied RACQP to few selected machine learning problems, Linear Regression, LASSO, Elastic-Net, and SVM. Our solver matches the performance of the best tailored methods such as Glmnet and LIBSVM, and often gives better results than that of tailored methods. In addition, our solver uses much less computation memory space than other ADMM based method do, so that it is suitable in real applications with big data.
The following is a quick summary of the pros and cons of RACQP, implementation of RAC-ADMM, for solving quadratic problems, and suggests the future research: -RACQP is remarkably effective for solving continuous and binary convex QP problems when the Hessian is non-diagonal, the constraint matrix are unstructured, or the number of constraints are small. These findings are demonstrated by solving Markowitz portfolio problems with real or random data, and randomly generated sparse convex QP problems. -RACQP, coupled with smart-grouping and a partial augmented Lagrangian, is equally effective when the structure of the constraints is known. This finding is supported by solving continuous and binary bench-mark Quadratic Assignment, Max-Cut, and Max-Bisection problems. However, efficiently deciding on group-  Gap between best known results and RACQP/Gurobi objective values ing strategy is also challenging. We plan to build an "automatic-smart-grouping" method as a pre-solver for unknown structured problem data. -Computational studies done on binary problems show that RAC-ADMM approach to solving problems offers an advantage over the traditional direct approach (solving the problem as whole) when finding a good quality solution for a large-scale integer problem in a very limited time. However, exact binary QP solvers, such as Gurobi, are needed, because our binary RACQP relies on solving many small or medium sized binary sub-problems. Of course, we plan to explore more high efficiency solvers for medium-sized binary problems for RACQP. -The ADMM-based approach, either RACQP or OSQP, is less competitive when the Hessian of the convex quadratic objective is diagonal and the constraints are sparse but structured such as a network-flow type. We believe in this case both Gurobi and Mosek can utilize more efficient Cholesky factorization that is commonly used by interior-point algorithms for solving linear programs; see more details in Sect. 3.1.
In contrary, RACQP has considerable overhead cost of preparing block data and initialization time of the sub-problem solver, and the time spent on solving diagonal sub-problems was an order of magnitude shorter than time needed to prepare data. This, together with the divergence problem of multi-block ADMM, hints that there must be something connected to the problem structure that makes such instances hard for the ADMM-based approach. We plan on conducting additional research to identify problem instances that are well-suited and those that are unsuitable for ADMM. -There are still many other open questions regarding RAC-ADMM. For example, there is little work on how to optimally choose run-time parameters to work with RAC-ADMM, including penalty parameter β, number of blocks, and so for.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.