1 Introduction

In recent years, convex programs have become increasingly popular for solving a wide range of problems in machine learning and other fields, ranging from theoretical modeling, e.g., latent variable graphical model selection (Chandrasekaran et al. 2012), low-rank feature extraction [e.g., matrix decomposition (Candès et al. 2011) and matrix completion (Candès and Recht 2009)], subspace clustering (Liu et al. 2012), and kernel discriminant analysis (Ye et al. 2008), to real-world applications, e.g., face recognition (Wright et al. 2009), saliency detection (Shen and Wu 2012), and video denoising (Ji et al. 2010). Most of these problems can be (re)formulated as the following linearly constrained separable convex program:

$$\begin{aligned} \mathop {\min }\limits _{\mathbf{x}_1,\ldots ,\mathbf{x}_n} \sum \limits _{i=1}^n f_i(\mathbf{x}_i),\quad s.t.\quad \sum \limits _{i=1}^n\mathcal {A}_i(\mathbf{x}_i)=\mathbf{b}, \end{aligned}$$
(1)

where \(\mathbf{x}_i\) and \(\mathbf{b}\) could be either vectors or matrices, \(f_i\) is a closed proper convex function, and \(\mathcal {A}_i:\mathbb {R}^{d_i}\rightarrow \mathbb {R}^{m}\) is a linear mapping. Without loss of generality, we may assume that none of the \(\mathcal {A}_i\)’s is a zero mapping, the solution to \(\sum \nolimits _{i=1}^n \mathcal {A}_i(\mathbf{x}_i)=\mathbf{b}\) is non-unique, and the mapping \(\mathcal {A}(\mathbf{x}_1,\ldots ,\mathbf{x}_n)\equiv \sum \nolimits _{i=1}^n\mathcal {A}_i(\mathbf{x}_i)\) is onto.

1.1 Exemplar problems in machine learning

In this subsection, we present some examples of machine learning problems that can be formulated as the model problem (1).

1.1.1 Latent low-rank representation

Low-rank representation (LRR) (Liu et al. 2010, 2012) is a recently proposed technique for robust subspace clustering and has been applied to many machine learning and computer vision problems. However, LRR works well only when the number of samples exceeds the dimension of the samples, which may not be satisfied when the data dimension is high. Liu and Yan (2011) therefore proposed latent LRR to overcome this difficulty. The mathematical model of latent LRR is as follows:

$$\begin{aligned} \min \limits _{\mathbf{Z},\mathbf{L},\mathbf{E}}\Vert \mathbf{Z}\Vert _* + \Vert \mathbf{L}\Vert _* + \mu \Vert \mathbf{E}\Vert _{1}, \quad s.t. \quad \mathbf{X}= \mathbf{X}\mathbf{Z}+ \mathbf{L}\mathbf{X}+ \mathbf{E}, \end{aligned}$$
(2)

where \(\mathbf{X}\) is the data matrix, each column being a sample vector, \(\Vert \cdot \Vert _*\) is the nuclear norm (Fazel 2002), i.e., the sum of singular values, and \(\Vert \cdot \Vert _1\) is the \(\ell _1\) norm (Candès et al. 2011), i.e., the sum of absolute values of all entries. Latent LRR decomposes the data into a principal feature \(\mathbf{X}\mathbf{Z}\) and a salient feature \(\mathbf{L}\mathbf{X}\), up to sparse noise \(\mathbf{E}\).

1.1.2 Nonnegative matrix completion

Nonnegative matrix completion (NMC) (Xu et al. 2011) is a recently proposed technique for dimensionality reduction, text mining, collaborative filtering, clustering, etc. It can be formulated as:

$$\begin{aligned} \min \limits _{\mathbf{X},\mathbf{e}}\Vert \mathbf{X}\Vert _* + \frac{1}{2\mu }\Vert \mathbf{e}\Vert ^2,\quad s.t. \quad \mathbf{b} = \mathcal {P}_{\varOmega }(\mathbf{X}) + \mathbf{e}, \ \mathbf{X}\ge 0, \end{aligned}$$
(3)

where \(\mathbf{b}\) is the vector of observed entries of the matrix \(\mathbf{X}\), contaminated by noise \(\mathbf{e}\), \(\varOmega \) is an index set, \(\mathcal {P}_{\varOmega }\) is a linear mapping that selects those elements whose indices are in \(\varOmega \), and \(\Vert \cdot \Vert \) is the Frobenius norm. NMC aims to recover the nonnegative low-rank matrix \(\mathbf{X}\) from the observed noisy data \(\mathbf{b}\).
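For concreteness, \(\mathcal {P}_{\varOmega }\) and its adjoint simply select and scatter entries; a minimal NumPy sketch (the index set and matrix sizes here are illustrative, not from the original):

```python
import numpy as np

def P_Omega(X, omega):
    # Select the entries of X whose (row, col) indices are in Omega, as a vector.
    rows, cols = omega
    return X[rows, cols]

def P_Omega_adj(b, omega, shape):
    # Adjoint: scatter the observed vector b back into a zero matrix.
    rows, cols = omega
    X = np.zeros(shape)
    X[rows, cols] = b
    return X

# Observe three entries of a 4 x 4 matrix (illustrative indices).
omega = (np.array([0, 1, 3]), np.array([2, 0, 3]))
X = np.arange(16.0).reshape(4, 4)
b = P_Omega(X, omega)  # array([ 2.,  4., 15.])
```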

To see that the NMC problem can be reformulated as (1), we introduce an auxiliary variable \(\mathbf{Y}\) and rewrite (3) as

$$\begin{aligned} \min \limits _{\mathbf{X},\mathbf{Y},\mathbf{e}}\Vert \mathbf{X}\Vert _* + \chi _{\ge 0}(\mathbf{Y}) + \frac{1}{2\mu }\Vert \mathbf{e}\Vert ^2,\quad s.t. \quad \begin{pmatrix} \mathcal {P}_{\varOmega }(\mathbf{X})\\ \mathbf{X}\end{pmatrix} - \begin{pmatrix} \mathbf{0}\\ \mathbf{Y}\end{pmatrix} + \begin{pmatrix} \mathbf{e}\\ \mathbf{0}\end{pmatrix} =\begin{pmatrix} \mathbf{b}\\ \mathbf{0}\end{pmatrix}, \end{aligned}$$
(4)

where

$$\begin{aligned} \chi _{\ge 0}(\mathbf{Y})=\left\{ \begin{array}{ll} 0, &{} \text{ if } \mathbf{Y}\ge 0,\\ +\infty , &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$

is the characteristic function of the set of nonnegative matrices.

1.1.3 Group sparse logistic regression with overlap

Besides the unsupervised learning models shown above, many supervised machine learning problems can also be written in the form of (1). For example, using the logistic function as the loss function in group LASSO with overlap (Jacob et al. 2009; Deng et al. 2011), one obtains the following model:

$$\begin{aligned} \min \limits _{\mathbf{w},b}\frac{1}{s}\sum \limits _{i=1}^{s}\log \left( 1+\exp \left( -y_i(\mathbf{w}^T\mathbf{x}_i+b)\right) \right) +\mu \sum \limits _{j=1}^t\Vert \mathbf{S}_j\mathbf{w}\Vert , \end{aligned}$$
(5)

where \(\mathbf{x}_i\) and \(y_i, i=1,\ldots ,s\), are the training data and labels, respectively, and \(\mathbf{w}\) and \(b\) parameterize the linear classifier. \(\mathbf{S}_j, j=1,\ldots ,t\), are selection matrices, each row of which has exactly one entry equal to 1 while all the others are 0. The groups of entries, \(\mathbf{S}_j\mathbf{w}, j=1,\ldots ,t\), may overlap each other. This model can also be considered as an extension of the group sparse logistic regression problem (Meier et al. 2008) to the case of overlapping groups.

Introducing \(\bar{\mathbf{w}}=(\mathbf{w}^T,b)^T, \bar{\mathbf{x}}_i=(\mathbf{x}_i^T,1)^T, \mathbf{z}=(\mathbf{z}_1^T,\mathbf{z}_2^T,\ldots ,\mathbf{z}_t^T)^T\), and \(\bar{\mathbf{S}}=(\mathbf{S},\mathbf{0})\), where \(\mathbf{S}=(\mathbf{S}_1^T,\ldots ,\mathbf{S}_t^T)^T\), (5) can be rewritten as

$$\begin{aligned} \min \limits _{\bar{\mathbf{w}},\mathbf{z}}\frac{1}{s}\sum \limits _{i=1}^{s}\log \left( 1+\exp \left( -y_i(\bar{\mathbf{w}}^T\bar{\mathbf{x}}_i)\right) \right) +\mu \sum \limits _{j=1}^t\Vert \mathbf{z}_j\Vert , \quad s.t. \quad \mathbf{z}= \bar{\mathbf{S}}\bar{\mathbf{w}}, \end{aligned}$$
(6)

which is a special case of (1).
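For concreteness, the selection matrices \(\mathbf{S}_j\), the stacked \(\mathbf{S}\), and \(\bar{\mathbf{S}}\) can be built directly from the group index sets; a minimal NumPy sketch with illustrative groups:

```python
import numpy as np

def selection_matrix(group, p):
    # S_j: one 1 per row, picking out the coordinates of the group.
    S = np.zeros((len(group), p))
    S[np.arange(len(group)), group] = 1.0
    return S

# Two overlapping groups over p = 5 variables (illustrative).
groups = [[0, 1, 2], [2, 3, 4]]
p = 5
S = np.vstack([selection_matrix(g, p) for g in groups])   # S = (S_1^T,...,S_t^T)^T
S_bar = np.hstack([S, np.zeros((S.shape[0], 1))])         # bar S = (S, 0)
z = S_bar @ np.ones(p + 1)  # z = bar S * bar w, stacking the groups S_j w
```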

1.2 Related work

Although the general theory of convex programs is fairly complete nowadays, e.g., most of them can be solved by the interior point method (Boyd and Vandenberghe 2004), it may not lead to efficient algorithms for the large scale problems that are typical in machine learning. For example, when using CVX, an interior point based toolbox, to solve nuclear norm minimization problems [i.e., one of the \(f_i\)’s is the nuclear norm of a matrix, e.g., (2) and (3)], such as matrix completion (Candès and Recht 2009), robust principal component analysis (Candès et al. 2011), and low-rank representation (Liu et al. 2010, 2012), the complexity of each iteration is \(O(q^6)\), where \(q\times q\) is the matrix size. Such a complexity is unbearable for large scale computing.

To address the scalability issue, first order methods are often preferred. The accelerated proximal gradient (APG) algorithm (Beck and Teboulle 2009; Toh and Yun 2010) is popular due to its guaranteed \(O(K^{-2})\) convergence rate, where \(K\) is the iteration number. However, APG is basically for unconstrained optimization. For constrained optimization, the constraints have to be added to the objective function as penalties, resulting in only approximate solutions. The alternating direction method (ADM) (Fortin and Glowinski 1983; Boyd et al. 2011; Lin et al. 2009a) has regained a lot of attention recently and is also widely used. It is especially suitable for separable convex programs like (1) because it fully utilizes the separable structure of the objective function. Unlike APG, ADM can solve (1) exactly. Another first order method is the split Bregman method (Goldstein and Osher 2008; Zhang et al. 2011), which is closely related to ADM (Esser 2009) and is influential in image processing.

An important reason that first order methods are popular for solving large scale convex programs in machine learning is that the convex functions \(f_i\)’s are often matrix or vector norms or characteristic functions of convex sets, which enables the following subproblems [called the proximal operation of \(f_i\) (Rockafellar 1970)]

$$\begin{aligned} \text{ prox }_{f_i,\sigma }(\mathbf{w})=\mathop {\hbox {argmin }}\limits _{\mathbf{x}_i} f_i(\mathbf{x}_i)+\frac{\displaystyle \sigma }{\displaystyle 2}\Vert \mathbf{x}_i-\mathbf{w}\Vert ^2 \end{aligned}$$
(7)

to have closed form solutions. For example, when \(f_i\) is the \(\ell _1\) norm, \(\text{ prox }_{f_i,\sigma }(\mathbf{w})=\mathcal {T}_{\sigma ^{-1}}(\mathbf{w})\), where \(\mathcal {T}_{\varepsilon }(x)=\text{ sgn }(x)\max (|x|-\varepsilon ,0)\) is the soft-thresholding operator (Goldstein and Osher 2008); when \(f_i\) is the nuclear norm, the optimal solution is \(\text{ prox }_{f_i,\sigma }(\mathbf{W})=\mathbf{U}\mathcal {T}_{\sigma ^{-1}} (\mathbf{\Sigma })\mathbf{V}^T\), where \(\mathbf{U}\mathbf{\Sigma }\mathbf{V}^T\) is the singular value decomposition (SVD) of \(\mathbf{W}\) (Cai et al. 2010); and when \(f_i\) is the characteristic function of the nonnegative cone, the optimal solution is \(\text{ prox }_{f_i,\sigma }(\mathbf{w})=\max (\mathbf{w},0)\). Since subproblems like (7) have to be solved in each iteration when first order methods are applied to separable convex programs, the availability of closed form solutions greatly facilitates the optimization.
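These three proximal operators are one-liners in practice; a minimal NumPy sketch (with `sigma` playing the role of \(\sigma \) in (7)):

```python
import numpy as np

def prox_l1(w, sigma):
    # prox of the l1 norm: soft-thresholding with threshold 1/sigma.
    eps = 1.0 / sigma
    return np.sign(w) * np.maximum(np.abs(w) - eps, 0.0)

def prox_nuclear(W, sigma):
    # prox of the nuclear norm: soft-threshold the singular values (SVT).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - 1.0 / sigma, 0.0)) @ Vt

def prox_nonneg(w, sigma=None):
    # prox of the characteristic function of the nonnegative cone:
    # the projection max(w, 0), independent of sigma.
    return np.maximum(w, 0.0)
```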

However, when applying ADM to solve (1) with non-unitary linear mappings (i.e., \(\mathcal {A}_i^{\dag }\mathcal {A}_i\) is not the identity mapping, where \(\mathcal {A}_i^{\dag }\) is the adjoint operator of \(\mathcal {A}_i\)), the resulting subproblems may not have closed form solutions and hence need to be solved iteratively, making the optimization process awkward. Some work (Yang and Yuan 2013; Lin et al. 2011) has addressed this issue by linearizing the quadratic term \(\Vert \mathcal {A}_i(\mathbf{x}_i)-\mathbf{w}\Vert ^2\) in the subproblems; such a variant of ADM is called the linearized ADM (LADM). Deng and Yin (2012) further proposed a generalized ADM that includes both ADM and LADM as special cases and proved its globally linear convergence by imposing strong convexity on the objective function or full-rankness on some linear operators.

Nonetheless, most of the existing theories on ADM and LADM are for the two-block case, i.e., \(n=2\) in (1) (Fortin and Glowinski 1983; Boyd et al. 2011; Lin et al. 2011; Deng and Yin 2012). The number of blocks is restricted to two because the proofs of convergence for the two-block case are not applicable to the multi-block case, i.e., \(n>2\) in (1). Actually, a naive generalization of ADM or LADM to the multi-block case may diverge [see (15) and Chen et al. 2013]. Unfortunately, multi-block convex programs often occur in practice, e.g., robust principal component analysis with dense noise (Candès et al. 2011), latent low-rank representation (Liu and Yan 2011) [see (2)], and problems with extra convex set constraints [see (3) and (26)–(27)]. So it is desirable to design practical algorithms for the multi-block case.

Recently, He and Yuan (2013) and Tao (2014) considered the multi-block LADM and ADM, respectively. To safeguard convergence, He and Yuan (2013) proposed LADM with Gaussian back substitution (LADMGB), which destroys the sparsity or low-rankness of the iterates during iterations when dealing with sparse representation and low-rank recovery problems, while Tao (2014) proposed ADM with parallel splitting, whose subproblems may not be easily solvable. Moreover, both developed their theories with the penalty parameter fixed, resulting in the difficulty of tuning an optimal penalty parameter that fits different data and data sizes. This has been identified as an important issue (Deng and Yin 2012).

1.3 Contributions and differences from prior work

To propose an algorithm that is more suitable for convex programs in machine learning, in this paper we aim at combining the advantages of He and Yuan (2013), Tao (2014), and Lin et al. (2011), i.e., combining LADM, parallel splitting, and adaptive penalty. Hence we call our method LADM with parallel splitting and adaptive penalty (LADMPSAP). With LADM, the subproblems will have forms like (7) and hence can be easily solved. With parallel splitting, the sparsity and low-rankness of iterates can be preserved during iterations when dealing with sparse representation and low-rank recovery problems, saving both the storage and the computation load. With adaptive penalty, the convergence can be faster and it is unnecessary to tune an optimal penalty parameter. Parallel splitting also makes the algorithm highly parallelizable, making LADMPSAP suitable for parallel or distributed computing, which is important for large scale machine learning. When all the component objective functions have bounded subgradients, we prove convergence results that are stronger than the existing theories on ADM and LADM. For example, the penalty parameter can be unbounded and the sufficient and necessary conditions of the global convergence of LADMPSAP can be obtained as well. We also propose a simple optimality measure and prove the convergence rate of LADMPSAP in an ergodic sense under this measure. Our proof is simpler than those in He and Yuan (2012) and Tao (2014) which relied on a complex optimality measure. When a convex program has extra convex set constraints, we further devise a practical version of LADMPSAP that converges faster thanks to better parameter analysis. Finally, we generalize LADMPSAP to cope with more difficult \(f_i\)’s, whose proximal operation (7) is not easily solvable, by further linearizing the smooth components of \(f_i\)’s. Experiments testify to the advantage of LADMPSAP in speed and numerical accuracy.

Note that Goldfarb and Ma (2012) also proposed a multiple splitting algorithm for convex optimization. However, they only considered a special case of our model problem (1), i.e., all the linear mappings \(\mathcal {A}_i\)’s are identity mappings. With their simpler model problem, linearization is unnecessary and a faster convergence rate, \(O(K^{-2})\), can be achieved. In contrast, in this paper we aim at proposing a practical algorithm for efficiently solving more general problems like (1).

We also note that Hong and Luo (2012) used the same linearization technique for the smooth components of \(f_i\)’s, but they only considered a special class of \(f_i\)’s, namely those whose non-smooth component is a sum of \(\ell _1\) and \(\ell _2\) norms or whose epigraph is polyhedral. Moreover, for parallel splitting (Jacobi update) Hong and Luo (2012) had to incorporate a postprocessing step to guarantee convergence, interpolating between an intermediate iterate and the previous iterate. Third, Hong and Luo (2012) still focused on a fixed penalty parameter. Again, our method can handle more general \(f_i\)’s, does not require postprocessing, and allows for an adaptive penalty parameter.

A more general splitting/linearization technique can be found in Zhang et al. (2011). However, the authors only proved that any accumulation point of the iteration is a Karush–Kuhn–Tucker (KKT) point and did not investigate the convergence rate. There was no evidence that the iteration could converge to a unique point. Moreover, the authors only studied the case of a fixed penalty parameter.

Although dual ascent with dual decomposition (Boyd et al. 2011) can also solve (1) in a parallel way, it may break down when some \(f_i\)’s are not strictly convex (Boyd et al. 2011), which typically happens in sparse or low-rank recovery problems where the \(\ell _1\) norm or the nuclear norm is used. Even if it works, since \(f_i\) is not strictly convex, dual ascent becomes dual subgradient ascent (Boyd et al. 2011), which is known to converge at a rate of \(O(K^{-1/2})\), slower than our \(O(K^{-1})\) rate. Moreover, dual ascent requires choosing a good step size for each iteration, which is less convenient than ADM based methods.

1.4 Organization

The remainder of this paper is organized as follows. We first review LADM with adaptive penalty (LADMAP) for the two-block case in Sect. 2. Then we present LADMPSAP for the multi-block case in Sect. 3. Next, we propose a practical version of LADMPSAP for separable convex programs with convex set constraints in Sect. 4. We further extend LADMPSAP to proximal LADMPSAP for programs with more difficult objective functions in Sect. 5. We compare the advantage of LADMPSAP in speed and numerical accuracy with other first order methods in Sect. 6. Finally, we conclude the paper in Sect. 7.

This paper is an extension of our prior work Lin et al. (2011) and Liu et al. (2013).

2 Review of LADMAP for the two-block case

We first review LADMAP (Lin et al. 2011) for the two-block case of (1). It consists of four steps:

  1. Update \(\mathbf{x}_1\):

    $$\begin{aligned} \mathbf{x}_1^{k+1}=\mathop {\hbox {argmin }}\limits _{\mathbf{x}_1} f_1(\mathbf{x}_1) +\frac{\sigma _1^{(k)}}{2}\left\| \mathbf{x}_1-\mathbf{x}_1^{k}+\mathcal {A}_1^{\dag }\left( \tilde{\varvec{\lambda }}_1^{k}\right) /\sigma _1^{(k)}\right\| ^2, \end{aligned}$$
    (8)
  2. Update \(\mathbf{x}_2\):

    $$\begin{aligned} \mathbf{x}_2^{k+1}=\mathop {\hbox {argmin }}\limits _{\mathbf{x}_2} f_2(\mathbf{x}_2) +\frac{\sigma _2^{(k)}}{2}\left\| \mathbf{x}_2-\mathbf{x}_2^{k}+\mathcal {A}_2^{\dag }\left( \tilde{\varvec{\lambda }}_2^{k}\right) /\sigma _2^{(k)}\right\| ^2, \end{aligned}$$
    (9)
  3. Update \(\varvec{\lambda }\):

    $$\begin{aligned} \varvec{\lambda }^{k+1} = \varvec{\lambda }^k + \beta _k\left( \sum \limits _{i=1}^2\mathcal {A}_i\left( \mathbf{x}_i^{k+1}\right) -\mathbf{b}\right) , \end{aligned}$$
    (10)
  4. Update \(\beta \):

    $$\begin{aligned} \beta _{k+1} = \min (\beta _{\max },\rho \beta _k), \end{aligned}$$
    (11)

where \(\varvec{\lambda }\) is the Lagrange multiplier, \(\beta _k\) is the penalty parameter, \(\sigma _i^{(k)}=\eta _i\beta _k\) with \(\eta _i > \Vert \mathcal {A}_i\Vert ^2\) (\(\Vert \mathcal {A}_i\Vert \) is the operator norm of \(\mathcal {A}_i\)),

$$\begin{aligned} \tilde{\varvec{\lambda }}_1^{k}&= \varvec{\lambda }^k + \beta _k\left( \mathcal {A}_1\left( \mathbf{x}_1^{k}\right) +\mathcal {A}_2\left( \mathbf{x}_2^{k}\right) -\mathbf{b}\right) ,\end{aligned}$$
(12)
$$\begin{aligned} \tilde{\varvec{\lambda }}_2^{k}&= \varvec{\lambda }^k + \beta _k\left( \mathcal {A}_1\left( \mathbf{x}_1^{k+1}\right) +\mathcal {A}_2\left( \mathbf{x}_2^{k}\right) -\mathbf{b}\right) , \end{aligned}$$
(13)

and \(\rho \) is an adaptively updated parameter [see (20)]. Please refer to (Lin et al. 2011) for details. Note that the latest \(\mathbf{x}_1^{k+1}\) is immediately used to compute \(\mathbf{x}_2^{k+1}\) [see (13)]. So \(\mathbf{x}_1\) and \(\mathbf{x}_2\) have to be updated alternately, hence the name alternating direction method.
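For matrix mappings \(\mathcal {A}_i(\mathbf{x}_i)=\mathbf{A}_i\mathbf{x}_i\), the four steps above can be coded directly. A schematic NumPy sketch (not the authors' code), assuming the proximal operators of \(f_1\) and \(f_2\) in the sense of (7) are supplied, and simplifying the adaptive rule (20) to a constant factor:

```python
import numpy as np

def ladmap(A1, A2, b, prox_f1, prox_f2, x1, x2,
           beta=1.0, beta_max=1e4, rho=1.5, n_iter=300):
    # prox_fi(w, sigma) solves argmin_x f_i(x) + (sigma/2)||x - w||^2.
    lam = np.zeros_like(b)
    # eta_i > ||A_i||^2 (squared operator norm).
    eta1 = 1.02 * np.linalg.norm(A1, 2) ** 2
    eta2 = 1.02 * np.linalg.norm(A2, 2) ** 2
    for _ in range(n_iter):
        # (12) and (8): the x1-subproblem is a single prox step.
        lt1 = lam + beta * (A1 @ x1 + A2 @ x2 - b)
        x1 = prox_f1(x1 - A1.T @ lt1 / (eta1 * beta), eta1 * beta)
        # (13) and (9): x2 already sees the new x1 (alternating update).
        lt2 = lam + beta * (A1 @ x1 + A2 @ x2 - b)
        x2 = prox_f2(x2 - A2.T @ lt2 / (eta2 * beta), eta2 * beta)
        # (10)-(11): multiplier and penalty updates.
        lam = lam + beta * (A1 @ x1 + A2 @ x2 - b)
        beta = min(beta_max, rho * beta)
    return x1, x2, lam
```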

3 LADMPSAP for the multi-block case

In this section, we extend LADMAP to multi-block separable convex programs (1). We also provide the sufficient and necessary conditions for global convergence when the subgradients of the objective functions are all bounded. We further prove the convergence rate in an ergodic sense.

3.1 LADM with parallel splitting and adaptive penalty

Contrary to intuition, the multi-block case is fundamentally different from the two-block one. For the multi-block case, it is very natural to generalize the two-block LADMAP in a straightforward way, with

$$\begin{aligned} \tilde{\varvec{\lambda }}_i^{k}=\varvec{\lambda }^k + \beta _k\left( \sum \limits _{j=1}^{i-1}\mathcal {A}_j\left( \mathbf{x}_j^{k+1}\right) +\sum \limits _{j=i}^{n}\mathcal {A}_j\left( \mathbf{x}_j^{k}\right) -\mathbf{b}\right) ,\quad i=1,\ldots ,n. \end{aligned}$$
(14)

Unfortunately, we were unable to prove the convergence of such a naive LADMAP using the same proof as for the two-block case. This is because their Fejér monotone inequalities (see Remark 4) cannot be the same. That is why He et al. had to introduce an extra Gaussian back substitution step (He et al. 2012; He and Yuan 2013) to correct the iterates. Actually, the above naive generalization of LADMAP may be divergent (which is even worse than converging to a wrong solution), e.g., when applied to the following problem:

$$\begin{aligned} \min \limits _{\mathbf{x}_1,\ldots ,\mathbf{x}_n} \sum \limits _{i=1}^n \Vert \mathbf{x}_i\Vert _1,\quad s.t.\quad \sum \limits _{i=1}^n \mathbf{A}_i\mathbf{x}_i=\mathbf{b}, \end{aligned}$$
(15)

where \(n\ge 5\) and \(\mathbf{A}_i\) and \(\mathbf{b}\) are a Gaussian random matrix and vector, respectively, whose entries are independently drawn from the standard Gaussian distribution. Chen et al. (2013) also analyzed the naively generalized ADM for the multi-block case and showed that even for three blocks the iteration could still be divergent. They also provided sufficient conditions, which basically require that the linear mappings \(\mathcal {A}_i\) be mutually orthogonal (\(\mathcal {A}_i^{\dag }\mathcal {A}_j=0, i\ne j\)), to ensure the convergence of the naive ADM.
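This phenomenon can be probed numerically. The following toy sketch generates a random instance of (15) and runs the naive generalization (14) with soft-thresholding as the proximal operator; it is only a probe under illustrative settings (fixed \(\beta \), random data), not a proof of divergence:

```python
import numpy as np

def soft(w, eps):
    # Soft-thresholding, the prox of the l1 norm.
    return np.sign(w) * np.maximum(np.abs(w) - eps, 0.0)

def naive_multiblock_ladm(A, b, beta=1.0, n_iter=500):
    # Naive Gauss-Seidel generalization (14) of LADM to n blocks.
    n = len(A)
    eta = [1.02 * np.linalg.norm(Ai, 2) ** 2 for Ai in A]
    x = [np.zeros(Ai.shape[1]) for Ai in A]
    lam = np.zeros_like(b)
    for _ in range(n_iter):
        for i in range(n):
            # x[j], j < i, have already been updated in this sweep.
            lt = lam + beta * (sum(A[j] @ x[j] for j in range(n)) - b)
            sigma = eta[i] * beta
            x[i] = soft(x[i] - A[i].T @ lt / sigma, 1.0 / sigma)
        lam = lam + beta * (sum(A[j] @ x[j] for j in range(n)) - b)
    return np.linalg.norm(sum(A[j] @ x[j] for j in range(n)) - b)

rng = np.random.default_rng(0)
n, m, d = 5, 40, 10
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = rng.standard_normal(m)
print(naive_multiblock_ladm(A, b))  # the feasibility residual may fail to vanish
```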

Fortunately, by modifying \(\tilde{\varvec{\lambda }}_i^k\) slightly we are able to prove the convergence of the corresponding algorithm. More specifically, our algorithm for solving (1) consists of the following steps:

  1. Update \(\mathbf{x}_i\)’s in parallel:

    $$\begin{aligned} \mathbf{x}_i^{k+1}=\mathop {\hbox {argmin }}\limits _{\mathbf{x}_i} f_i(\mathbf{x}_i) +\frac{\displaystyle \sigma _i^{(k)}}{\displaystyle 2}\left\| \mathbf{x}_i-\mathbf{x}_i^{k}+\mathcal {A}_i^{\dag }\left( \hat{\varvec{\lambda }}^{k}\right) / \sigma _i^{(k)}\right\| ^2,\quad i=1,\ldots ,n, \end{aligned}$$
    (16)
  2. Update \(\varvec{\lambda }\):

    $$\begin{aligned} \varvec{\lambda }^{k+1} = \varvec{\lambda }^k + \beta _k\left( \sum \limits _{i=1}^n\mathcal {A}_i\left( \mathbf{x}_i^{k+1}\right) -\mathbf{b}\right) , \end{aligned}$$
    (17)
  3. Update \(\beta \):

    $$\begin{aligned} \beta _{k+1} = \min (\beta _{\max },\rho \beta _k), \end{aligned}$$
    (18)

where \(\sigma _i^{(k)}=\eta _i\beta _k\),

$$\begin{aligned} \hat{\varvec{\lambda }}^{k}=\varvec{\lambda }^k + \beta _k\left( \sum \limits _{i=1}^n\mathcal {A}_i\left( \mathbf{x}_i^{k}\right) -\mathbf{b}\right) , \end{aligned}$$
(19)

and

$$\begin{aligned} \rho = \left\{ \begin{array}{ll} \rho _0, &{}\quad \text{ if } \ \beta _{k}\max \left( \left\{ \sqrt{\eta _i}\left\| \mathbf{x}_i^{k+1} -\mathbf{x}_i^k\right\| ,i=1,\ldots ,n\right\} \right) /\left\| \mathbf{b}\right\| < \varepsilon _2,\\ 1, &{} \quad \text{ otherwise }, \end{array}\right. \end{aligned}$$
(20)

with \(\rho _0 > 1\) being a constant and \(0 < \varepsilon _2\ll 1\) being a threshold. Indeed, we replace \(\tilde{\varvec{\lambda }}_i^k\) with \(\hat{\varvec{\lambda }}^k\) defined in (19), which is independent of \(i\); the rest of the algorithm, including the scheme (18) and (20) for updating the penalty parameter, is inherited from Lin et al. (2011), except that the \(\eta _i\)’s have to be made larger (see Theorem 1). As the \(\mathbf{x}_i\)’s are now updated in parallel and \(\beta _k\) changes adaptively, we call the new algorithm LADM with parallel splitting and adaptive penalty (LADMPSAP).

3.2 Stopping criteria

Some existing work (e.g., Liu et al. 2010; Favaro et al. 2011) proposed stopping criteria out of intuition only, which may not guarantee that the correct solution is approached. Recently, Lin et al. (2009a) and Boyd et al. (2011) suggested that the stopping criteria can be derived from the KKT conditions of a problem. Here we also adopt such a strategy. Specifically, the iteration terminates when the following two conditions are met:

$$\begin{aligned}&\left\| \sum \limits _{i=1}^n\mathcal {A}_i\left( \mathbf{x}_i^{k+1}\right) - \mathbf{b}\right\| /\Vert \mathbf{b}\Vert < \varepsilon _1,\end{aligned}$$
(21)
$$\begin{aligned}&\beta _{k}\max \left( \left\{ \sqrt{\eta _i}\left\| \mathbf{x}_i^{k+1} -\mathbf{x}_i^k\right\| ,i=1,\ldots ,n\right\} \right) /\Vert \mathbf{b}\Vert < \varepsilon _2. \end{aligned}$$
(22)

The first condition measures the feasibility error. The second condition is derived by comparing the KKT conditions of problem (1) with the optimality condition of subproblem (23). The rules (18) and (20) for updating \(\beta \) are actually suggested by the above stopping criteria, so that the two errors are well balanced.

For better reference, we summarize the proposed LADMPSAP algorithm in Algorithm 1. For fast convergence, we suggest setting \(\beta _0=\alpha m\varepsilon _2\), where \(\alpha >0\) and \(\rho _0 > 1\) are chosen such that \(\beta _k\) increases steadily along with the iterations.

[Algorithm 1: LADMPSAP]
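For matrix mappings \(\mathcal {A}_i(\mathbf{x}_i)=\mathbf{A}_i\mathbf{x}_i\), Algorithm 1 can be rendered in a few lines. Below is a minimal NumPy sketch (not the authors' code), where `prox[i]` implements the proximal operation (7) of \(f_i\) and is assumed given:

```python
import numpy as np

def ladmpsap(A, b, prox, x, beta, beta_max, rho0=1.5,
             eps1=1e-3, eps2=1e-4, max_iter=1000):
    # A: list of matrices A_i; x: list of initial blocks.
    # prox[i](w, sigma) solves argmin_x f_i(x) + (sigma/2)||x - w||^2.
    n = len(A)
    eta = [1.02 * n * np.linalg.norm(Ai, 2) ** 2 for Ai in A]  # eta_i > n||A_i||^2
    lam = np.zeros_like(b)
    nb = np.linalg.norm(b)
    for _ in range(max_iter):
        resid = sum(A[i] @ x[i] for i in range(n)) - b
        lam_hat = lam + beta * resid                       # (19), shared by all blocks
        x_new = [prox[i](x[i] - A[i].T @ lam_hat / (eta[i] * beta),
                         eta[i] * beta) for i in range(n)]  # (16), in parallel
        resid_new = sum(A[i] @ x_new[i] for i in range(n)) - b
        lam = lam + beta * resid_new                       # (17)
        # Stopping criteria (21)-(22).
        crit1 = np.linalg.norm(resid_new) / nb
        crit2 = beta * max(np.sqrt(eta[i]) * np.linalg.norm(x_new[i] - x[i])
                           for i in range(n)) / nb
        x = x_new
        if crit1 < eps1 and crit2 < eps2:
            break
        beta = min(beta_max, (rho0 if crit2 < eps2 else 1.0) * beta)  # (18), (20)
    return x, lam
```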

3.3 Global convergence

In the following, we always use \((\mathbf{x}_1^*,\ldots ,\mathbf{x}_n^*,\varvec{\lambda }^*)\) to denote the KKT point of problem (1). For the global convergence of LADMPSAP, we have the following theorem, where we denote \(\{\mathbf{x}_i^k\}=\{\mathbf{x}_1^k,\ldots ,\mathbf{x}_n^k\}\) for simplicity.

Theorem 1

(Convergence of LADMPSAP) If \(\{\beta _k\}\) is non-decreasing and upper bounded and \(\eta _i > n\Vert \mathcal {A}_i\Vert ^2, i=1,\ldots ,n\), then the sequence \(\{(\{\mathbf{x}_i^k\},\varvec{\lambda }^k)\}\) generated by LADMPSAP converges to a KKT point of problem (1).

3.4 Enhanced convergence results

Theorem 1 is a convergence result for general convex programs (1), where the \(f_i\)’s are general convex functions and hence \(\{\beta _k\}\) needs to be bounded. Actually, almost all the existing theories on ADM and LADM even assume a fixed \(\beta \). For adaptive \(\beta _k\), it is more convenient if a user need not specify an upper bound on \(\{\beta _k\}\), because imposing a large upper bound is essentially the same as allowing \(\{\beta _k\}\) to be unbounded. Since many machine learning problems choose the \(f_i\)’s as matrix/vector norms, which result in bounded subgradients, we find that the boundedness assumption can be removed. Moreover, we can further prove the sufficient and necessary condition for global convergence.

Theorem 2

(Sufficient condition for global convergence) If \(\{\beta _k\}\) is non-decreasing and \(\sum \nolimits _{k=1}^{+\infty }\beta _k^{-1}=+\infty , \eta _i > n\Vert \mathcal {A}_i\Vert ^2, \partial f_i(\mathbf{x})\) is bounded, \(i=1,\ldots ,n\), then the sequence \(\{\mathbf{x}_i^k\}\) generated by LADMPSAP converges to an optimal solution to (1).

Remark 1

Theorem 2 does not claim that \(\{\varvec{\lambda }^k\}\) converges to a point \(\varvec{\lambda }^\infty \). However, as we are more interested in \(\{\mathbf{x}_i^k\}\), such a weakening is harmless.

We also have the following result on the necessity of \(\sum \nolimits _{k=1}^{+\infty }\beta _k^{-1}=+\infty \).

Theorem 3

(Necessary condition for global convergence) If \(\{\beta _k\}\) is non-decreasing, \(\eta _i > n\Vert \mathcal {A}_i\Vert ^2, \partial f_i(\mathbf{x})\) is bounded, \(i=1,\ldots ,n\), then \(\sum \nolimits _{k=1}^{+\infty }\beta _k^{-1}=+\infty \) is also a necessary condition for the global convergence of \(\{\mathbf{x}_i^k\}\) generated by LADMPSAP to an optimal solution to (1).

With the above analysis, when all the subgradients of the component objective functions are bounded we can remove \(\beta _{\max }\) in Algorithm 1.

3.5 Convergence rate

The convergence rate of ADM and LADM in the traditional sense is an open problem (Goldfarb and Ma 2012). Although Hong and Luo (2012) claimed that they proved the linear convergence rate of ADM, their assumptions are actually quite strong. They assumed that the non-smooth part of \(f_i\) is a sum of \(\ell _1\) and \(\ell _2\) norms or that its epigraph is polyhedral. Moreover, the convex constraint sets should all be polyhedral and bounded. So although their results are encouraging, for general convex programs the convergence rate is still a mystery. Recently, He and Yuan (2012) and Tao (2014) proved an \(O(1/K)\) convergence rate of ADM and of ADM with parallel splitting in an ergodic sense, respectively. Namely, \(\frac{1}{K}\sum \nolimits _{k=1}^{K}\mathbf{x}_i^k\) violates an optimality measure by \(O(1/K)\). Their proof is lengthy and is for a fixed penalty parameter only.

In this subsection, based on a simple optimality measure we give a simple proof of the convergence rate of LADMPSAP. For simplicity, we denote \(\mathbf{x}=(\mathbf{x}_1^T,\ldots ,\mathbf{x}_n^T)^T, \mathbf{x}^*=((\mathbf{x}_1^*)^T,\ldots ,(\mathbf{x}_n^*)^T)^T\), and \(f(\mathbf{x})=\sum \nolimits _{i=1}^n f_i(\mathbf{x}_i)\). We first have the following proposition.

Proposition 1

\(\tilde{\mathbf{x}}\) is an optimal solution to (1) if and only if there exists \(\alpha > 0\), such that

$$\begin{aligned} f(\tilde{\mathbf{x}})-f(\mathbf{x}^*)+\sum \limits _{i=1}^n\left\langle \mathcal {A}_i^{\dag }(\varvec{\lambda }^*), \tilde{\mathbf{x}}_i-\mathbf{x}_i^*\right\rangle +\alpha \left\| \sum \limits _{i=1}^n \mathcal {A}_i(\tilde{\mathbf{x}}_i)-\mathbf{b}\right\| ^2 =0. \end{aligned}$$
(24)

Since the left hand side of (24) is always nonnegative and becomes zero only when \(\tilde{\mathbf{x}}\) is an optimal solution, we may use its magnitude to measure how far a point \(\tilde{\mathbf{x}}\) is from an optimal solution. Note that in the unconstrained case, as in APG (Beck and Teboulle 2009), one may simply use \(f(\tilde{\mathbf{x}})-f(\mathbf{x}^*)\) to measure the optimality. But here we have to deal with the constraints. Our criterion is simpler than that in (He and Yuan 2012; Tao 2014), which has to compare \((\{\mathbf{x}_i^k\},\varvec{\lambda }^k)\) with all \((\mathbf{x}_1,\ldots ,\mathbf{x}_n,\varvec{\lambda }) \in \mathbb {R}^{d_1}\times \ldots \times \mathbb {R}^{d_n}\times \mathbb {R}^{m}\).

Then we have the following convergence rate theorem for LADMPSAP in an ergodic sense.

Theorem 4

(Convergence rate of LADMPSAP) Define \(\bar{\mathbf{x}}^K=\sum \nolimits _{k=0}^K \gamma _k \mathbf{x}^{k+1}\), where \(\gamma _k=\beta _k^{-1}/\sum \nolimits _{j=0}^K \beta _j^{-1}\). Then the following inequality holds for \(\bar{\mathbf{x}}^K\):

$$\begin{aligned} \begin{array}{rl} &{}f(\bar{\mathbf{x}}^K) - f(\mathbf{x}^*) + \sum \limits _{i=1}^n \left\langle \mathcal {A}_i^{\dag }(\varvec{\lambda }^*),\bar{\mathbf{x}}_i^K-\mathbf{x}_i^*\right\rangle +\frac{\displaystyle \alpha \beta _0}{\displaystyle 2}\left\| \sum \limits _{i=1}^n \mathcal {A}_i\left( \bar{\mathbf{x}}_i^K\right) - \mathbf{b}\right\| ^2 \\ &{}\quad \le C_0/\left( 2\sum \limits _{k=0}^K\beta _k^{-1}\right) , \end{array} \end{aligned}$$
(25)

where

$$\begin{aligned} \alpha ^{-1}=(n+1) \max \Bigg (1, \Bigg \{\frac{\displaystyle \Vert \mathcal {A}_i\Vert ^2}{\displaystyle \eta _i-n\Vert \mathcal {A}_i\Vert ^2}, i=1,\ldots , n\Bigg \}\Bigg ) \end{aligned}$$

and

$$\begin{aligned} C_0=\sum \limits _{i=1}^n\eta _i\left\| \mathbf{x}_i^{0}-\mathbf{x}_i^*\right\| ^2 +\beta _0^{-2}\left\| \varvec{\lambda }^{0}-\varvec{\lambda }^*\right\| ^2. \end{aligned}$$

Theorem 4 means that \(\bar{\mathbf{x}}^K\) is within \(O\left( 1/\sum \nolimits _{k=0}^K\beta _k^{-1}\right) \) of being an optimal solution. This theorem holds for both bounded and unbounded \(\{\beta _k\}\). In the bounded case, \(O\left( 1/\sum \nolimits _{k=0}^K\beta _k^{-1}\right) \) is simply \(O(1/K)\). Theorem 4 also hints that \(\sum \nolimits _{k=0}^K\beta _k^{-1}\) should approach infinity to guarantee the convergence of LADMPSAP, which is consistent with Theorem 3.

4 Practical LADMPSAP for convex programs with convex set constraints

In real applications, we are often faced with convex programs with convex set constraints:

$$\begin{aligned} \min \limits _{\mathbf{x}_1,\ldots ,\mathbf{x}_n} \sum \limits _{i=1}^n f_i(\mathbf{x}_i), \ s.t. \ \sum \limits _{i=1}^n\mathcal {A}_i(\mathbf{x}_i)=\mathbf{b},\quad \ \mathbf{x}_i\in X_i,\quad i=1,\ldots ,n, \end{aligned}$$
(26)

where \(X_i\subseteq \mathbb {R}^{d_i}\) is a closed convex set. In this section, we consider extending LADMPSAP to solve the more complex convex set constrained model (26). We assume that the projections onto the \(X_i\)’s are all easily computable. For many convex sets used in machine learning such an assumption is valid, e.g., when the \(X_i\)’s are nonnegative cones or positive semi-definite cones. In the following, we discuss how to solve (26) efficiently. For simplicity, we assume \(X_i\ne \mathbb {R}^{d_i}, \forall i\). Finally, we assume that \(\mathbf{b}\) is an interior point of \(\sum \nolimits _{i=1}^n\mathcal {A}_i(X_i)\).
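For example, the projections onto the nonnegative cone and onto the positive semi-definite cone both have simple closed forms; a minimal NumPy sketch:

```python
import numpy as np

def proj_nonneg(X):
    # Projection onto the nonnegative cone: clip negative entries.
    return np.maximum(X, 0.0)

def proj_psd(X):
    # Projection onto the PSD cone (Frobenius norm, symmetric input):
    # keep only the nonnegative part of the eigenvalues.
    Xs = (X + X.T) / 2.0
    w, V = np.linalg.eigh(Xs)
    return (V * np.maximum(w, 0.0)) @ V.T
```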

We introduce auxiliary variables \(\mathbf{x}_{n+i}\) to convert \(\mathbf{x}_i\in X_i\) into \(\mathbf{x}_i=\mathbf{x}_{n+i}\) and \(\mathbf{x}_{n+i}\in X_i, i=1,\ldots ,n\). Then (26) can be reformulated as:

$$\begin{aligned} \min \limits _{\mathbf{x}_1,\ldots ,\mathbf{x}_{2n}} \sum \limits _{i=1}^{2n} f_i(\mathbf{x}_i),\quad s.t.\quad \sum \limits _{i=1}^{2n}\hat{\mathcal {A}}_i(\mathbf{x}_i) =\hat{\mathbf{b}}, \end{aligned}$$
(27)

where

$$\begin{aligned} f_{n+i}(\mathbf{x})\equiv \chi _{X_i}(\mathbf{x})=\left\{ \begin{array}{ll} 0,&{} \quad \text{ if } \mathbf{x}\in X_i,\\ +\infty , &{}\quad \text{ otherwise }, \end{array} \right. \end{aligned}$$

is the characteristic function of \(X_i\),

$$\begin{aligned} \begin{array}{c} \hat{\mathcal {A}}_i(\mathbf{x}_i)= \left( \begin{array}{c} \mathcal {A}_i(\mathbf{x}_i)\\ 0\\ \vdots \\ \mathbf{x}_i\\ \vdots \\ 0 \end{array} \right) ,\qquad \hat{\mathcal {A}}_{n+i}(\mathbf{x}_{n+i})=\left( \begin{array}{c} 0\\ 0\\ \vdots \\ -\mathbf{x}_{n+i}\\ \vdots \\ 0 \end{array} \right) , \quad \text{ and } \quad \hat{\mathbf{b}}=\left( \begin{array}{c} \mathbf{b}\\ 0\\ \vdots \\ 0\\ \vdots \\ 0 \end{array} \right) , \end{array} \end{aligned}$$
(28)

where \(i=1,\ldots ,n\).

The adjoint operator \(\hat{\mathcal {A}}_i^{\dag }\) is

$$\begin{aligned} \hat{\mathcal {A}}_i^{\dag }(\mathbf{y})=\mathcal {A}_i^{\dag }(\mathbf{y}_{1})+\mathbf{y}_{i+1},\quad \hat{\mathcal {A}}_{n+i}^{\dag }(\mathbf{y})=-\mathbf{y}_{i+1}, \quad i=1,\ldots ,n, \end{aligned}$$
(29)

where \(\mathbf{y}_{i}\) is the \(i\)th sub-vector of \(\mathbf{y}\), partitioned according to the sizes of \(\mathbf{b}\) and \(\mathbf{x}_i, i=1,\ldots ,n\).
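The stacked operators in (28) and their adjoints (29) need never be formed as explicit matrices. A minimal sketch that applies them blockwise, assuming matrix mappings and an illustrative list `sizes = [m, d_1, ..., d_n]` of block sizes:

```python
import numpy as np

def apply_A_hat(A, i, xi, sizes):
    # hat A_i(x_i): A_i(x_i) in slot 0, x_i in slot i+1, zeros elsewhere.
    y = [np.zeros(s) for s in sizes]
    y[0] = A[i] @ xi
    y[i + 1] = xi
    return y

def apply_A_hat_adj(A, i, y):
    # (29): hat A_i^dag(y) = A_i^dag(y_1) + y_{i+1}.
    return A[i].T @ y[0] + y[i + 1]

def apply_A_hat_n_adj(i, y):
    # (29): hat A_{n+i}^dag(y) = -y_{i+1}.
    return -y[i + 1]
```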

Then LADMPSAP can be applied to solve problem (27). The Lagrange multiplier \(\varvec{\lambda }\) and the auxiliary multiplier \(\hat{\varvec{\lambda }}\) are respectively updated as

$$\begin{aligned} \varvec{\lambda }^{k+1}_{1}&= \varvec{\lambda }^{k}_{1} + \beta _k\left( \sum \limits _{i=1}^n \mathcal {A}_i\left( \mathbf{x}_i^{k+1}\right) -\mathbf{b}\right) ,\quad \varvec{\lambda }^{k+1}_{i+1}=\varvec{\lambda }^{k}_{i+1} + \beta _k\left( \mathbf{x}_i^{k+1}-\mathbf{x}_{n+i}^{k+1}\right) , \end{aligned}$$
(30)
$$\begin{aligned} \hat{\varvec{\lambda }}^{k}_{1}&= \varvec{\lambda }^{k}_{1} + \beta _k\left( \sum \limits _{i=1}^n \mathcal {A}_i\left( \mathbf{x}_i^{k}\right) -\mathbf{b}\right) ,\quad \hat{\varvec{\lambda }}^{k}_{i+1}=\varvec{\lambda }^{k}_{i+1} + \beta _k\left( \mathbf{x}_i^{k}-\mathbf{x}_{n+i}^{k}\right) , \end{aligned}$$
(31)

and \(\mathbf{x}_i\) is updated as (see 16)

$$\begin{aligned} \mathbf{x}_i^{k+1}&= \mathop {\hbox {argmin }}\limits _{\mathbf{x}} f_i(\mathbf{x})+\frac{\displaystyle \eta _i\beta _k}{\displaystyle 2}\left\| \mathbf{x}-\mathbf{x}_i^k +\left[ \mathcal {A}_i^{\dag }\left( \hat{\varvec{\lambda }}^k_{1}\right) +\hat{\varvec{\lambda }}^k_{i+1}\right] /(\eta _i\beta _k)\right\| ^2,\ \end{aligned}$$
(32)
$$\begin{aligned} \mathbf{x}_{n+i}^{k+1}&= \mathop {\hbox {argmin }}\limits _{\mathbf{x}\in X_i} \frac{\displaystyle \eta _{n+i}\beta _k}{\displaystyle 2} \left\| \mathbf{x}-\mathbf{x}_{n+i}^k-\hat{\varvec{\lambda }}^k_{i+1}/(\eta _{n+i}\beta _k)\right\| ^2\nonumber \\&= \pi _{X_i} \left( \mathbf{x}_{n+i}^k+\hat{\varvec{\lambda }}^k_{i+1}/(\eta _{n+i}\beta _k)\right) , \end{aligned}$$
(33)

where \(\pi _{X_i}\) is the projection onto \(X_i\) and \(i=1,\ldots ,n\).

As for the choice of the \(\eta _i\)’s, although we can simply apply Theorem 1 to assign their values as \(\eta _i> 2n (\Vert \mathcal {A}_i\Vert ^2+1)\) and \(\eta _{n+i} > 2n, i=1,\ldots ,n\), such choices are too pessimistic. As the \(\eta _i\)’s are related to the magnitudes of the differences between \(\mathbf{x}_i^{k+1}\) and \(\mathbf{x}_i^k\), we had better provide tighter estimates of the \(\eta _i\)’s in order to achieve faster convergence. Actually, we have the following better result.

Theorem 5

For problem (27), if \(\{\beta _k\}\) is non-decreasing and upper bounded and the \(\eta _i\)’s are chosen as \(\eta _i > n \Vert \mathcal {A}_i\Vert ^2 + 2\) and \(\eta _{n+i} > 2, i=1,\ldots ,n\), then the sequence \(\{(\{\mathbf{x}_i^k\},\varvec{\lambda }^k)\}\) generated by LADMPSAP converges to a KKT point of problem (27).

Finally, we summarize LADMPSAP for problem (27) in Algorithm 2, which is a practical algorithm for solving (26).

[Algorithm 2: practical LADMPSAP for (26)]

Remark 2

Analogs of Theorems 2 and 3 also hold for Algorithm 2, although the \(\partial f_{n+i}\)’s are unbounded, thanks to our assumptions that all \(\partial f_i, i=1,\ldots ,n\), are bounded and that \(\mathbf{b}\) is an interior point of \(\sum \nolimits _{i=1}^n\mathcal {A}_i(X_i)\), which yield an analog of Proposition 4. Consequently, \(\beta _{\max }\) can also be removed if all \(\partial f_i, i=1,\ldots ,n\), are bounded.

Remark 3

Since Algorithm 2 is an application of Algorithm 1 to problem (27), only with refined parameter estimation, its convergence rate in an ergodic sense is also \(O\left( 1/\sum \nolimits _{k=0}^K \beta _k^{-1}\right) \), where \(K\) is the number of iterations.

5 Proximal LADMPSAP for even more general convex programs

In LADMPSAP we have assumed that the subproblems (16) are easily solvable. In many machine learning problems, the functions \(f_i\)’s are matrix or vector norms or characteristic functions of convex sets, so this assumption often holds. Nonetheless, it is not always true, e.g., when \(f_i\) is the logistic loss function [see (6)]. So in this section we aim at generalizing LADMPSAP to solve even more general convex programs (1).

We are interested in the case that \(f_i\) can be decomposed into two components:

$$\begin{aligned} f_i(\mathbf{x}_i)=g_i(\mathbf{x}_i)+h_i(\mathbf{x}_i), \end{aligned}$$
(34)

where both \(g_i\) and \(h_i\) are convex, \(g_i\) is \(C^{1,1}\):

$$\begin{aligned} \left\| \nabla g_i(\mathbf{x})- \nabla g_i(\mathbf{y})\right\| \le L_i \left\| \mathbf{x}-\mathbf{y}\right\| ,\quad \forall x,y\in \mathbb {R}^{d_i}, \end{aligned}$$
(35)

and \(h_i\) may not be differentiable but its proximal operation is easily solvable. For brevity, we call \(L_i\) the Lipschitz constant of \(\nabla g_i\).

Recall that in each iteration of LADMPSAP, we have to solve subproblem (16). Since now we do not assume that the proximal operation (7) of \(f_i\) is easily solvable, we may have difficulty in solving subproblem (16). By (34), we write (16) as

$$\begin{aligned} \mathbf{x}_i^{k+1}=\mathop {\hbox {argmin }}\limits _{\mathbf{x}_i} h_i(\mathbf{x}_i)+g_i(\mathbf{x}_i) +\frac{\displaystyle \sigma _i^{(k)}}{\displaystyle 2}\left\| \mathbf{x}_i-\mathbf{x}_i^{k} +\mathcal {A}_i^\dag \left( \hat{\varvec{\lambda }}^{k}\right) /\sigma _i^{(k)}\right\| ^2,\quad i=1,\ldots ,n. \end{aligned}$$
(36)

Since \(g_i(\mathbf{x}_i)+\frac{\displaystyle \sigma _i^{(k)}}{\displaystyle 2}\left\| \mathbf{x}_i -\mathbf{x}_i^{k}+\mathcal {A}_i^\dag (\hat{\varvec{\lambda }}^{k})/\sigma _i^{(k)}\right\| ^2\) is \(C^{1,1}\), we may also linearize it at \(\mathbf{x}_i^k\) and add a proximal term. Such an idea leads to the following updating scheme for \(\mathbf{x}_i\):

$$\begin{aligned} \mathbf{x}_i^{k+1}&= \mathop {\hbox {argmin }}\limits _{\mathbf{x}_i} h_i(\mathbf{x}_i)+g_i\left( \mathbf{x}_i^k\right) +\frac{\displaystyle \sigma _i^{(k)}}{\displaystyle 2} \left\| \mathcal {A}_i^\dag \left( \hat{\varvec{\lambda }}^{k}\right) /\sigma _i^{(k)}\right\| ^2\nonumber \\&\quad +\,\left\langle \nabla g_i\left( \mathbf{x}_i^k\right) +\mathcal {A}_i^\dag \left( \hat{\varvec{\lambda }}^{k}\right) ,\mathbf{x}_i-\mathbf{x}_i^k \right\rangle +\frac{\displaystyle \tau _i^{(k)}}{\displaystyle 2}\left\| \mathbf{x}_i-\mathbf{x}_i^{k}\right\| ^2\nonumber \\&= \mathop {\hbox {argmin }}\limits _{\mathbf{x}_i} h_i(\mathbf{x}_i)+\frac{\tau _i^{(k)}}{2} \left\| \mathbf{x}_i-\mathbf{x}_i^k+\frac{1}{\tau _i^{(k)}}\left[ \mathcal {A}_i^{\dag }\left( \hat{\varvec{\lambda }}^k\right) +\nabla g_i\left( \mathbf{x}_i^k\right) \right] \right\| ^2, \end{aligned}$$
(37)

where \(i=1,\ldots ,n\). The choice of \(\tau _i^{(k)}\) is presented in Theorem 6, i.e., \(\tau _i^{(k)}=T_i+\beta _k \eta _i\), where \(T_i \ge L_i\) and \(\eta _i > n\Vert \mathcal {A}_i\Vert ^2\) are both positive constants.
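In code, (37) is again a single proximal step, now with the gradient of \(g_i\) folded into the prox center; a minimal sketch for one block, where `grad_gi` and `prox_hi` (the proximal operator of \(h_i\), with `prox_hi(w, tau)` solving \(\min _{\mathbf{x}} h_i(\mathbf{x})+\frac{\tau }{2}\Vert \mathbf{x}-\mathbf{w}\Vert ^2\)) are assumed supplied:

```python
import numpy as np

def proximal_ladmpsap_step(xi, Ai, lam_hat, grad_gi, prox_hi, Ti, eta_i, beta):
    # tau_i^(k) = T_i + beta_k * eta_i, with T_i >= L_i (Theorem 6).
    tau = Ti + beta * eta_i
    # Fold the linearized smooth part and the multiplier term into the center w.
    w = xi - (Ai.T @ lam_hat + grad_gi(xi)) / tau
    return prox_hi(w, tau)  # update (37)
```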

By our assumption on \(h_i\), the above subproblems are easily solvable. The updates of the Lagrange multiplier \(\varvec{\lambda }\) and the penalty \(\beta \) still go as (17) and (18), respectively, but with

$$\begin{aligned} \rho = \left\{ \begin{array}{ll} \rho _0, &{}\quad \text{ if } \ \max \left( \left\{ \Vert \mathcal {A}_i\Vert ^{-1}\left\| \nabla g_i\left( \mathbf{x}_i^{k+1}\right) -\nabla g_i\left( \mathbf{x}_i^k\right) - \tau _{i}^{(k)}\left( \mathbf{x}_i^{k+1}-\mathbf{x}_i^k\right) \right\| , i=1,\ldots ,n\right\} \right) /\Vert \mathbf{b}\Vert < \varepsilon _2,\\ 1, &{}\quad \text{ otherwise }. \end{array}\right. \end{aligned}$$
(38)

The iteration terminates when the following two conditions are met:

$$\begin{aligned}&\left\| \sum \limits _{i=1}^n\mathcal {A}_i(\mathbf{x}_i^{k+1}) - \mathbf{b}\right\| /\Vert \mathbf{b}\Vert < \varepsilon _1, \end{aligned}$$
(39)
$$\begin{aligned}&\max \left( \left\{ \Vert \mathcal {A}_i\Vert ^{-1}\left\| \nabla g_i(\mathbf{x}_i^{k+1})-\nabla g_i(\mathbf{x}_i^k)- \tau _{i}^{(k)}(\mathbf{x}_i^{k+1}-\mathbf{x}_i^k)\right\| , i=1,\ldots ,n\right\} \right) /\Vert \mathbf{b}\Vert < \varepsilon _2. \end{aligned}$$
(40)

These two conditions are also deduced from the KKT conditions.

We call the above algorithm proximal LADMPSAP and summarize it in Algorithm 3.

[Algorithm 3: proximal LADMPSAP]

As for the convergence of proximal LADMPSAP, we have the following theorem.

Theorem 6

(Convergence of proximal LADMPSAP) If \(\{\beta _k\}\) is non-decreasing and upper bounded and \(\tau _i^{(k)}=T_i+\beta _k \eta _i\), where \(T_i \ge L_i\) and \(\eta _i > n\Vert \mathcal {A}_i\Vert ^2\) are both positive constants, \(i=1,\ldots ,n\), then \(\{(\{\mathbf{x}_i^k\},\varvec{\lambda }^k)\}\) generated by proximal LADMPSAP converges to a KKT point of problem (1).

We further have the following convergence rate theorem for proximal LADMPSAP in an ergodic sense.

Theorem 7

(Convergence rate of proximal LADMPSAP) Define \(\bar{\mathbf{x}}_i^K=\sum \nolimits _{k=0}^K \gamma _k \mathbf{x}_i^{k+1}\), where \(\gamma _k=\beta _k^{-1}/\sum \nolimits _{j=0}^K \beta _j^{-1}\). Then the following inequality holds for \(\bar{\mathbf{x}}_i^K\):

$$\begin{aligned}&\sum \limits _{i=1}^n \left( f_i\left( \bar{\mathbf{x}}_i^{K}\right) -f_i\left( \mathbf{x}_i^*\right) +\left\langle \mathcal {A}_i^{\dag } (\varvec{\lambda }^*),\bar{\mathbf{x}}_i^{K}-\mathbf{x}_i^*\right\rangle \right) +\frac{\alpha \beta _0}{2} \left\| \sum \limits _{i=1}^n\mathcal {A}_i\left( \bar{\mathbf{x}}_i^{K}\right) -\mathbf{b}\right\| ^2\nonumber \\&\quad \le C_0/\left( 2\sum \limits _{k=0}^K\beta _k^{-1}\right) , \end{aligned}$$
(42)

where

$$\begin{aligned} \alpha ^{-1}=(n+1) \max \left( 1, \left\{ \frac{\displaystyle \Vert \mathcal {A}_i\Vert ^2}{\displaystyle \eta _i-n\Vert \mathcal {A}_i\Vert ^2}, i=1,\ldots ,n\right\} \right) \end{aligned}$$

and

$$\begin{aligned} C_0=\sum \limits _{i=1}^n\beta _0^{-1}\tau _i^{(0)} \Vert \mathbf{x}_i^{0}-\mathbf{x}_i^*\Vert ^2+\beta _0^{-2}\Vert \varvec{\lambda }^{0}-\varvec{\lambda }^*\Vert ^2. \end{aligned}$$

When there are extra convex set constraints, \(\mathbf{x}_i\in X_i, i=1,\ldots ,n\), we can also introduce auxiliary variables as in Sect. 4 and obtain analogs of Theorems 5 and 4.

Theorem 8

For problem (27), where \(f_i\) is as described at the beginning of Sect. 5, if \(\{\beta _k\}\) is non-decreasing and upper bounded and \(\tau _i^{(k)}=T_i+\eta _i\beta _k\), where \(T_i \ge L_i, T_{n+i}=0, \eta _i>n\Vert \mathcal {A}_i\Vert ^2 + 2\), and \(\eta _{n+i} > 2, i=1,\ldots ,n\), then \(\{(\{\mathbf{x}_i^k\},\varvec{\lambda }^k)\}\) generated by proximal LADMPSAP converges to a KKT point of problem (27). The convergence rate in an ergodic sense is also \(O\left( 1/\sum \nolimits _{k=0}^K \beta _k^{-1}\right) \), where \(K\) is the number of iterations.

6 Numerical results

In this section, we test the performance of LADMPSAP on three specific examples of problem (1), i.e., latent low-rank representation [see (2)], nonnegative matrix completion [see (3)], and group sparse logistic regression with overlap [see (6)].

6.1 Solving latent low-rank representation

We first solve the latent LRR problem (Liu and Yan 2011) (2). In order to test LADMPSAP and related algorithms with data whose characteristics are controllable, we follow (Liu et al. 2010) to generate synthetic data, which are parameterized as (\(s, p, d, \tilde{r}\)), where \(s, p, d\), and \(\tilde{r}\) are the number of independent subspaces, the number of points in each subspace, and the ambient and intrinsic dimensions, respectively. The number of scalar variables and constraints is \((sp)\times d\).

As first order methods are popular for solving convex programs in machine learning (Boyd et al. 2011), here we compare LADMPSAP with several conceivable first order algorithms, including APG (Beck and Teboulle 2009), naive ADM, naive LADM, LADMGB, and LADMPS. Naive ADM and naive LADM are the straightforward generalizations of ADM and LADM, respectively, from two blocks to multiple blocks, as discussed in Sect. 3.1. Naive ADM is applied to solve (2) after rewriting the constraint of (2) as \(\mathbf{X}= \mathbf{X}\mathbf{P} + \mathbf{Q}\mathbf{X}+ \mathbf{E}, \mathbf{P}=\mathbf{Z}, \mathbf{Q}=\mathbf{L}\). For LADMPS, \(\beta _k\) is fixed in order to show the effectiveness of the adaptive penalty. The parameters of APG and ADM are the same as those in (Lin et al. 2009b) and (Liu and Yan 2011), respectively. For LADM, we follow the suggestions in (Yang and Yuan 2013) to fix its penalty parameter \(\beta \) at \(2.5/\min (d,sp)\), where \(d\times sp\) is the size of \(\mathbf{X}\). For LADMGB, as there is no suggestion in He and Yuan (2013) on how to choose a fixed \(\beta \), we simply set it the same as that in LADM. The rest of the parameters are the same as those suggested in He et al. (2012). We fix \(\beta = \sigma _{\max }(\mathbf{X})\min (d,sp)\varepsilon _2\) in LADMPS and set \(\beta _0 = \sigma _{\max }(\mathbf{X})\min (d,sp)\varepsilon _2\) and \(\rho _0 = 10\) in LADMPSAP. For LADMPSAP, we also set \(\eta _Z=\eta _L=1.02\times 3\sigma _{\max }^2(\mathbf{X})\), where \(\eta _Z\) and \(\eta _L\) are the parameters \(\eta _i\)’s in Algorithm 1 for \(\mathbf{Z}\) and \(\mathbf{L}\), respectively. For the stopping criteria, \(\Vert \mathbf{X}\mathbf{Z}^k+\mathbf{L}^k\mathbf{X}+ \mathbf{E}^k-\mathbf{X}\Vert /\Vert \mathbf{X}\Vert \le \varepsilon _1\) and \(\max (\Vert \mathbf{Z}^{k}-\mathbf{Z}^{k-1}\Vert ,\Vert \mathbf{L}^{k}-\mathbf{L}^{k-1}\Vert ,\Vert \mathbf{E}^{k} -\mathbf{E}^{k-1}\Vert )/\Vert \mathbf{X}\Vert \le \varepsilon _2\), with \(\varepsilon _1=10^{-3}\) and \(\varepsilon _2=10^{-4}\), are used for all the algorithms. For the parameter \(\mu \) in (2), we empirically set it as \(\mu = 0.01\). To measure the relative errors in the solutions, we run LADMPSAP for 2,000 iterations with \(\rho _0 = 1.01\) to obtain the estimated ground truth solution (\(\mathbf{Z}^*, \mathbf{L}^*, \mathbf{E}^*\)). The experiments are run and timed on a notebook computer with an Intel Core i7 2.00 GHz CPU and 6 GB memory, running Windows 7 and Matlab 7.13.

Table 1 shows the results of the compared algorithms. We can see that LADMPS and LADMPSAP are faster and more numerically accurate than LADMGB, and LADMPSAP is even faster than LADMPS thanks to the adaptive penalty. Moreover, naive ADM and naive LADM have relatively poorer numerical accuracy, possibly due to converging to wrong solutions. The numerical accuracy of APG is also worse than those of LADMPS and LADMPSAP because it only solves an approximate problem, obtained by adding the constraint to the objective function as a penalty. Note that although we do not require \(\{\beta _k\}\) to be bounded, this does not imply that \(\beta _k\) will grow unboundedly. As a matter of fact, when LADMPSAP terminates the final values of \(\beta _k\) are \(21.1567, 42.2655\), and \(81.4227\) for the three data settings, respectively.

Table 1 Comparisons of APG, naive ADM (nADM), naive LADM (nLADM), LADMGB, LADMPS, and LADMPSAP on the latent LRR problem (2)

We then test the performance of the above six algorithms on the Hopkins155 database (Tron and Vidal 2007), which consists of 156 sequences, each having 39–550 data vectors drawn from two or three motions. For computational efficiency, we preprocess the data by projecting them to be 5-dimensional using PCA. We test all algorithms with \(\mu = 2.4\), which is the best parameter for LRR on this database (Liu et al. 2010). Table 2 shows the results on the Hopkins155 database. We can also see that LADMPSAP is faster than the other methods in comparison. In particular, LADMPSAP is faster than LADMPS, which uses a fixed \(\beta \). This testifies to the advantage of using an adaptive penalty.

Table 2 Comparisons of APG, naive ADM (nADM), naive LADM (nLADM), LADMGB, LADMPS, and LADMPSAP on the Hopkins155 database

6.2 Solving nonnegative matrix completion

This subsection evaluates the performance of the practical LADMPSAP proposed in Sect. 4 for solving nonnegative matrix completion (Xu et al. 2011) (3).

We first evaluate the numerical performance on synthetic data to demonstrate the superiority of practical LADMPSAP over the conventional LADM (Yang and Yuan 2013). The nonnegative low-rank matrix \(\mathbf{X}_0\) is generated by truncating the singular values of a randomly generated matrix. As LADM cannot handle the nonnegativity constraint, it actually solves the standard matrix completion problem, i.e., (3) without the nonnegativity constraint. For LADMPSAP, we follow the conditions in Theorem 5 to set the \(\eta _i\)’s and set the rest of the parameters the same as those in Sect. 6.1. The stopping tolerances are set as \(\varepsilon _1 = \varepsilon _2 = 10^{-5}\). The numerical comparison is shown in Table 3, where the relative nonnegative feasibility (FA) is defined as (Xu et al. 2011):

$$\begin{aligned} \text{ FA } {:}= \Vert \min (\hat{\mathbf{X}},0)\Vert /\Vert \mathbf{X}_0\Vert , \end{aligned}$$

in which \(\mathbf{X}_0\) is the ground truth and \(\hat{\mathbf{X}}\) is the computed solution. It can be seen that the numerical performance of LADMPSAP is much better than that of LADM, which again verifies the efficiency of our proposed parallel splitting and adaptive penalty scheme for enhancing ADM/LADM type algorithms.

Table 3 Comparisons on the NMC problem (3) with synthetic data, averaged on \(10\) runs

We then consider the image inpainting problem, which is to fill in the missing pixel values of a corrupted image. As the pixel values are nonnegative, the image inpainting problem can be formulated as the NMC problem. To prepare a low-rank image, we also truncate the singular values of a \(1{,}024 \times 1{,}024\) grayscale image “man” to obtain an image of rank 40, shown in Fig. 1a, b. The corrupted image is generated from the original image (all pixels have been normalized to the range of [0, 1]) by sampling 20 % of the pixels uniformly at random and adding Gaussian noise with mean zero and standard deviation 0.1.

Fig. 1 Image inpainting by FPCA, LADM and LADMPSAP

Besides LADM, here we also consider another recently proposed method, fixed point continuation with approximate SVD [FPCA (Ma et al. 2011)], on this problem. Similar to LADM, the code of FPCA can only solve the standard matrix completion problem without the nonnegativity constraint. This time we set \(\varepsilon _1 = 10^{-3}\) and \(\varepsilon _2 = 10^{-1}\) as the thresholds for the stopping criteria. The recovered images are shown in Fig. 1c–e and the quantitative results are in Table 4. One can see that on our test image both the qualitative and the quantitative results of LADMPSAP are better than those of FPCA and LADM. Note that LADMPSAP is faster than FPCA and LADM even though the latter two do not handle the nonnegativity constraint.

Table 4 Comparisons on the image inpainting problem. “PSNR” stands for “Peak Signal to Noise Ratio”, measured in decibels (dB)

6.3 Solving group sparse logistic regression with overlap

In this subsection, we apply proximal LADMPSAP to solve the problem of group sparse logistic regression with overlap (5).

The Lipschitz constant of the gradient of the logistic loss with respect to \(\bar{\mathbf{w}}\) can be proven to satisfy \(L_{\bar{w}}\le \frac{1}{4s}\Vert \bar{\mathbf{X}}\Vert _2^2\), where \(\bar{\mathbf{X}}=(\bar{\mathbf{x}}_1,\bar{\mathbf{x}}_2,\ldots ,\bar{\mathbf{x}}_s)\). Thus (6) can be directly solved by Algorithm 3.
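For reference, the logistic term in (6), i.e., the \(g\) component in the splitting (34), its gradient, and the above Lipschitz bound can be computed as in the following sketch (\(\bar{\mathbf{X}}\) stores the \(\bar{\mathbf{x}}_i\)'s as columns; this is an illustration, not the code used in the experiments):

```python
import numpy as np

def logistic_g(w_bar, X_bar, y):
    # g(bar w) = (1/s) sum_i log(1 + exp(-y_i * bar w^T bar x_i)) and its gradient.
    s = y.size
    z = -y * (X_bar.T @ w_bar)
    loss = np.mean(np.logaddexp(0.0, z))            # numerically stable log(1+e^z)
    grad = -(X_bar @ (y / (1.0 + np.exp(-z)))) / s  # -(1/s) sum_i y_i sigmoid(z_i) x_i
    return loss, grad

def logistic_lipschitz_bound(X_bar):
    # L_{bar w} <= ||bar X||_2^2 / (4 s), the bound used above.
    s = X_bar.shape[1]
    return np.linalg.norm(X_bar, 2) ** 2 / (4.0 * s)
```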

6.3.1 Synthetic data

To assess the performance of proximal LADMPSAP, we simulate data with \(p=9t+1\) variables, covered by \(t\) groups of ten variables with an overlap of one variable between two successive groups: \(\{1,\ldots ,10\}, \{10,\ldots ,19\}, \ldots , \{p-9,\ldots ,p\}\). We randomly choose \(q\) groups to be the support of \(\mathbf{w}\). If the chosen groups have overlapping variables with the unchosen groups, the overlapping variables are removed from the support of \(\mathbf{w}\). So the size of the support of \(\mathbf{w}\) may be less than \(10q\). \(\mathbf{y}=(y_1,\ldots ,y_s)^T\) is chosen as \((1,-1,1,-1,\ldots )^T\). \(\mathbf{X}\in \mathbb {R}^{p\times s}\) is generated as follows. For \(\mathbf{X}_{i,j}\), if \(i\) is in the support of \(\mathbf{w}\) and \(y_j=1\), then \(\mathbf{X}_{i,j}\) is generated uniformly on \([0.5, 1.5]\); if \(i\) is in the support of \(\mathbf{w}\) and \(y_j=-1\), then \(\mathbf{X}_{i,j}\) is generated uniformly on \([-1.5, -0.5]\); if \(i\) is not in the support of \(\mathbf{w}\), then \(\mathbf{X}_{i,j}\) is generated uniformly on \([-0.5, 0.5]\). Then the rows whose indices are in the support of \(\mathbf{w}\) are statistically different from the remaining rows of \(\mathbf{X}\), and hence can be considered informative rows. We use model (6) to select the informative rows for classification, where \(\mu =0.1\). If the ground truth support of \(\mathbf{w}\) is recovered, then the two groups of data are linearly separable by considering only the coordinates in the support of \(\mathbf{w}\).

We compare proximal LADMPSAP with a series of ADM based methods, including ADM, LADM, LADMPS, and LADMPSAP, in which the subproblems for \(\mathbf{w}\) and \(b\) have to be solved iteratively, e.g., by APG (Beck and Teboulle 2009). We terminate the inner loop by APG when the norm of the gradient of the objective function of the subproblem is less than \(10^{-6}\). As for the outer loop, we choose \(\varepsilon _1=2\times 10^{-4}\) and \(\varepsilon _2=2\times 10^{-3}\) as the thresholds to terminate the iterations.

For ADM, LADM, and LADMPS, which use a fixed penalty \(\beta \), as we do not find any suggestion on its choice in the literature (the choice suggested in Yang and Yuan (2013) is for the nuclear norm regularized least squares problem only), we try multiple choices of \(\beta \) and choose the one that results in the fastest convergence. For LADMPSAP, we set \(\beta _0=0.2\) and \(\rho _0=5\). For proximal LADMPSAP we set \(T_1=\frac{1}{4s}\Vert \bar{\mathbf{X}}\Vert _2^2, \eta _1=2.01\Vert \bar{\mathbf{S}}\Vert _2^2, T_2=0, \eta _2=2.01, \beta _0=1\), and \(\rho _0=5\). To measure the relative errors in the solutions, we iterate proximal LADMPSAP 2,000 times and regard its output as the ground truth solution \((\bar{\mathbf{w}}^*,\mathbf{z}^*)\).

Table 5 shows the comparison among the related algorithms. The ground truth support of \(\mathbf{w}\) is recovered by all the compared algorithms. We can see that ADM, LADM, LADMPS, and LADMPSAP are much slower than proximal LADMPSAP because of the time-consuming subproblem computation, although they need much smaller numbers of outer iterations. Their numerical accuracies are also inferior to that of proximal LADMPSAP. We can also see that LADMPSAP is faster and more numerically accurate than ADM, LADM, and LADMPS. This again testifies to the effectiveness of using an adaptive penalty.

Table 5 Comparisons among ADM, LADM, LADMPS, LADMPSAP, and proximal LADMPSAP (pLADMPSAP) on the group sparse logistic regression with overlap problem. The quantities include the computing time (in seconds), number of outer iterations, and relative errors

6.3.2 Pathway analysis on breast cancer data

Then we consider the pathway analysis problem using the breast cancer gene expression data set (Vijver and He 2002), which consists of 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic). We follow Jacob et al. (2009) and use the canonical pathways from MSigDB (Subramanian et al. 2005) to generate the overlapping gene sets, which contains 639 groups of genes, 637 of which involve genes from our study. The statistics of the 637 gene groups are summarized as follows: the average number of genes in each group is 23.7, the largest gene group has 213 genes, and 3,510 genes appear in these 637 groups with an average appearance frequency of about four. We follow Jacob et al. (2009) to restrict the analysis to the 3,510 genes and balance the data set by using three replicates of each metastasis patient in the training set. We use model (6) to select genes, where \(\mu =0.08\). We want to predict whether a tumor is metastatic (\(y_i=1\)) or non-metastatic (\(y_i=-1\)).

We compare proximal LADMPSAP with the active set method adopted in Jacob et al. (2009), LADM, and LADMPSAP. In both LADMPSAP and proximal LADMPSAP, we set \(\beta _0=0.8\) and \(\rho _0=1.1\). For LADM, we try multiple choices of \(\beta \) and choose the one that results in the fastest convergence. In LADM and LADMPSAP, we terminate the inner loop by APG when the norm of the gradient of the objective function of the subproblem is less than \(10^{-6}\). The thresholds for terminating the outer loop are all chosen as \(\varepsilon _1=10^{-3}\) and \(\varepsilon _2=6\times 10^{-3}\). For the three LADM based methods, we first solve (6) to select genes. Then we use the selected genes to re-train a traditional logistic regression model and use the model to predict the test samples. As in Jacob et al. (2009), we partition the whole data set into three subsets to do the experiment three times. Each time we select one subset as the test set and the other two as the training set (i.e., there are \((78+217)\times 2/3=197\) samples for training). It is worth mentioning that Jacob et al. (2009) only kept the 300 genes that are the most correlated with the output in the pre-processing step. In contrast, we use all the 3,510 genes in the training phase.

Table 6 shows that proximal LADMPSAP is more than ten times faster than the active set method used in Jacob et al. (2009), although it computes with a more than ten times larger training set. Proximal LADMPSAP is also much faster than LADM and LADMPSAP because it has no inner loop for solving subproblems. The prediction error and the sparseness at the pathway level of proximal LADMPSAP are also competitive with those of the other methods in comparison.

Table 6 Comparisons among the Active Set method (Jacob et al. 2009), LADM, LADMPSAP, and proximal LADMPSAP (pLADMPSAP) on the pathway analysis

7 Conclusions

In this paper, we propose the linearized alternating direction method with parallel splitting and adaptive penalty (LADMPSAP) for efficiently solving linearly constrained multi-block separable convex programs, which are abundant in machine learning. LADMPSAP fully utilizes the property that the proximal operations of the component objective functions and the projections onto convex sets are easily solvable, which is usually satisfied by machine learning problems, making each of its iterations cheap. It is also highly parallel, making it appealing for parallel or distributed computing. Numerical experiments testify to the advantages of LADMPSAP over other possible first order methods.

Although LADMPSAP is inherently parallel, when solving the proximal operations of component objective functions we will still face basic numerical algebraic computations. So for particular large scale machine learning problems, it will be interesting to integrate the existing distributed computing techniques [e.g., parallel incomplete Cholesky factorization (Chang et al. 2007; Chang 2011) and caching factorization techniques (Boyd et al. 2011)] with our LADMPSAP in order to effectively address the scalability issues.