In Sect. 2.1 we formalize the block structure of the problem, establish notation that will be used in the rest of the paper and list assumptions. In Sect. 2.2 we propose two parallel block coordinate descent methods and comment in some detail on the steps.
Block structure, notation and assumptions
The block structure of (1) is given by a decomposition of \(\mathbf {R}^N\) into \(n\) subspaces as follows. Let \(U\in \mathbf {R}^{N\times N}\) be a column permutation of the \(N\times N\) identity matrix and further let \(U= [U_1,U_2,\ldots ,U_n]\) be a decomposition of \(U\) into \(n\) submatrices, with \(U_i\) being of size \(N\times N_i\), where \(\sum _i N_i = N\).
Proposition 1
(Block decomposition) Any vector \(x\in \mathbf {R}^N\) can be written uniquely as
$$\begin{aligned} x = \sum \limits _{i=1}^n U_i x^{(i)}, \end{aligned}$$
(5)
where \(x^{(i)} \in \mathbf {R}^{N_i}\). Moreover, \(x^{(i)}=U_i^T x\).
Proof
Noting that \(UU^T=\sum _i U_i U_i^T\) is the \(N\times N\) identity matrix, we have \(x=\sum _i U_i U_i^T x\). Let us now show uniqueness. Assume that \(x =\sum _i U_i x_1^{(i)} = \sum _i U_i x_2^{(i)}\), where \(x_1^{(i)},x_2^{(i)}\in \mathbf {R}^{N_i}\). Since
$$\begin{aligned} U_j^T U_i = {\left\{ \begin{array}{ll} N_j\times N_j \quad \text {identity matrix,} &{} \text { if } i=j,\\ N_j\times N_i \quad \text {zero matrix,}&{} \text { otherwise,} \end{array}\right. } \end{aligned}$$
(6)
for every \(j\) we get \(0 = U_j^T (x-x) = U_j^T \sum _i U_i (x_1^{(i)}-x_2^{(i)}) = x_1^{(j)}-x_2^{(j)}\).
In view of the above proposition, from now on we write \(x^{(i)}\mathop {=}\limits ^{\text {def}}U_i^T x \in \mathbf {R}^{N_i}\), and refer to \(x^{(i)}\) as the \(i\)th block of \(x\). The definition of partial separability in the introduction is with respect to these blocks. For simplicity, we will sometimes write \(x = (x^{(1)},\ldots ,x^{(n)})\).
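For concreteness, here is a minimal NumPy sketch of the decomposition (5), using an illustrative partition of \(N=5\) coordinates into \(n=3\) blocks (the partition and all numerical values are assumptions made for the example only):

```python
import numpy as np

N = 5
blocks = [[0, 2], [1], [3, 4]]            # illustrative partition of the N coordinates
I = np.eye(N)
U = [I[:, idx] for idx in blocks]         # U_i: N x N_i column submatrices of a permutation of I

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_blocks = [U_i.T @ x for U_i in U]       # x^{(i)} = U_i^T x

# Proposition 1: x is recovered as the sum of its blocks, x = sum_i U_i x^{(i)}
assert np.allclose(x, sum(U_i @ x_i for U_i, x_i in zip(U, x_blocks)))
```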
Projection onto a set of blocks
For \(S\subset {[n]}\) and \(x\in \mathbf {R}^N\) we write
$$\begin{aligned} x_{[S]} \mathop {=}\limits ^{\text {def}}\sum \limits _{i\in S} U_i x^{(i)}. \end{aligned}$$
(7)
That is, given \(x\in \mathbf {R}^N\), \(x_{[S]}\) is the vector in \(\mathbf {R}^N\) whose blocks \(i\in S\) are identical to those of \(x\), but whose other blocks are zeroed out. In view of Proposition 1, we can equivalently define \(x_{[S]}\) block-by-block as follows
$$\begin{aligned} (x_{[S]})^{(i)} = {\left\{ \begin{array}{ll}x^{(i)}, \qquad &{} i\in S,\\ 0 \;(\in \mathbf {R}^{N_i}), \qquad &{}\text {otherwise.}\end{array}\right. } \end{aligned}$$
(8)
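A short sketch of (7)–(8) in the same spirit, representing each block by its set of coordinate indices (the partition is again illustrative):

```python
import numpy as np

def project_onto_blocks(x, blocks, S):
    """Return x_[S]: blocks i in S are copied from x, all other blocks are zeroed out."""
    out = np.zeros_like(x)
    for i in S:
        out[blocks[i]] = x[blocks[i]]
    return out

blocks = [[0, 2], [1], [3, 4]]
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(project_onto_blocks(x, blocks, S={0, 2}))   # [1. 0. 3. 4. 5.]
```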
Inner products
The standard Euclidean inner product in spaces \(\mathbf {R}^N\) and \(\mathbf {R}^{N_i}\), \(i\in {[n]}\), will be denoted by \(\langle \cdot , \cdot \rangle \). Letting \(x,y \in \mathbf {R}^N\), the relationship between these inner products is given by
$$\begin{aligned} \langle x , y \rangle \overset{(5)}{=} \left\langle \sum \limits _{j=1}^n U_j x^{(j)} , \sum \limits _{i=1}^n U_i y^{(i)}\right\rangle = \sum \limits _{j=1}^n \sum \limits _{i=1}^n \langle U_i^T U_j x^{(j)} , y^{(i)} \rangle \overset{(6)}{=} \sum \limits _{i=1}^n \langle x^{(i)} , y^{(i)} \rangle . \end{aligned}$$
For any \(w\in \mathbf {R}^n\) and \(x,y\in \mathbf {R}^N\) we further define
$$\begin{aligned} \langle x , y \rangle _w \mathop {=}\limits ^{\text {def}}\sum \limits _{i=1}^n w_i \langle x^{(i)} , y^{(i)} \rangle . \end{aligned}$$
(9)
For vectors \(z=(z_1,\ldots ,z_n)^T \in \mathbf {R}^n\) and \(w = (w_1,\ldots ,w_n)^T \in \mathbf {R}^n\) we write \(w\odot z \mathop {=}\limits ^{\text {def}}(w_1 z_1, \ldots , w_n z_n)^T\).
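A brief sketch of the weighted inner product (9) and of \(w\odot z\), in the same block representation (all values illustrative):

```python
import numpy as np

def inner_w(x, y, blocks, w):
    """<x, y>_w = sum_i w_i <x^{(i)}, y^{(i)}>; with w = (1, ..., 1) this is the standard <x, y>."""
    return sum(w[i] * np.dot(x[idx], y[idx]) for i, idx in enumerate(blocks))

hadamard = lambda w, z: w * z             # w ⊙ z: componentwise product in R^n

blocks = [[0, 2], [1], [3, 4]]
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.ones(5)
w = np.array([1.0, 2.0, 0.5])
print(inner_w(x, y, blocks, w))           # 1*(1+3) + 2*2 + 0.5*(4+5) = 12.5
```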
Norms
Spaces \(\mathbf {R}^{N_i}\), \(i \in {[n]}\), are equipped with a pair of conjugate norms: \(\Vert t\Vert _{(i)} \mathop {=}\limits ^{\text {def}}\langle B_i t , t \rangle ^{1/2}\), where \(B_i\) is an \(N_i\times N_i\) positive definite matrix and \(\Vert t\Vert _{(i)}^* \mathop {=}\limits ^{\text {def}}\max _{\Vert s\Vert _{(i)}\le 1} \langle s , t \rangle = \langle B_i^{-1}t , t \rangle ^{1/2}\), \(t\in \mathbf {R}^{N_i}\). For \(w\in \mathbf {R}^n_{++}\), define a pair of conjugate norms in \(\mathbf {R}^N\) by
$$\begin{aligned} \Vert x\Vert _w= & {} \left[ \sum \limits _{i=1}^n w_i \Vert x^{(i)}\Vert ^2_{(i)}\right] ^{1/2}, \nonumber \\ \Vert y\Vert _w^* \mathop {=}\limits ^{\text {def}}\max _{\Vert x\Vert _w\le 1} \langle y , x \rangle= & {} \left[ \sum \limits _{i=1}^n w_i^{-1} ( \Vert y^{(i)}\Vert _{(i)}^*)^2\right] ^{1/2}. \end{aligned}$$
(10)
Note that these norms are induced by the inner product (9) and the matrices \(B_1,\ldots ,B_n\). Often we will use \(w=L\mathop {=}\limits ^{\text {def}}(L_1,L_2,\ldots ,L_n)^T\in \mathbf {R}^n\), where the constants \(L_i\) are defined below.
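A sketch of the norm pair (10) under the simplifying assumption \(B_i=I\) for all \(i\), so that \(\Vert \cdot \Vert _{(i)}\) and \(\Vert \cdot \Vert _{(i)}^*\) both reduce to the Euclidean norm on \(\mathbf {R}^{N_i}\):

```python
import numpy as np

def block_norms(x, blocks):
    """(||x^{(1)}||_(1), ..., ||x^{(n)}||_(n)) with B_i = I (Euclidean block norms)."""
    return np.array([np.linalg.norm(x[idx]) for idx in blocks])

def norm_w(x, blocks, w):
    """||x||_w = [sum_i w_i ||x^{(i)}||_(i)^2]^{1/2}."""
    return np.sqrt(np.sum(w * block_norms(x, blocks) ** 2))

def dual_norm_w(y, blocks, w):
    """||y||_w^* = [sum_i w_i^{-1} (||y^{(i)}||_(i)^*)^2]^{1/2}."""
    return np.sqrt(np.sum(block_norms(y, blocks) ** 2 / w))

blocks = [[0, 2], [1], [3, 4]]
w = np.array([1.0, 2.0, 0.5])
x, y = np.arange(1.0, 6.0), np.ones(5)
# conjugacy of the pair: <x, y> <= ||x||_w * ||y||_w^*
assert np.dot(x, y) <= norm_w(x, blocks, w) * dual_norm_w(y, blocks, w)
```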
Smoothness of \(f\)
We assume throughout the paper that the gradient of \(f\) is block Lipschitz, uniformly in \(x\), with positive constants \(L_1,\ldots ,L_n\), i.e., that for all \(x\in \mathbf {R}^N\), \(i\in {[n]}\) and \(t\in \mathbf {R}^{N_i}\),
$$\begin{aligned} \Vert \nabla _i f(x+U_i t)-\nabla _i f(x)\Vert _{(i)}^* \le L_i \Vert t\Vert _{(i)}, \end{aligned}$$
(11)
where \(\nabla _i f(x) \mathop {=}\limits ^{\text {def}}(\nabla f(x))^{(i)} = U^T_i \nabla f(x) \in \mathbf {R}^{N_i}\). An important consequence of (11) is the following standard inequality [9]:
$$\begin{aligned} f(x+U_i t) \le f(x) + \langle \nabla _i f(x) , t \rangle + \tfrac{L_i}{2}\Vert t\Vert _{(i)}^2. \end{aligned}$$
(12)
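A small numerical illustration of (11)–(12), assuming \(f(x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2\) and \(B_i=I\); in this case \(\nabla _i f(x)=A_i^T(Ax-b)\) and one may take \(L_i=\Vert A_i\Vert _2^2\), where \(A_i\) collects the columns of \(A\) belonging to block \(i\) (the data are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
blocks = [[0, 2], [1], [3, 4]]
A = rng.standard_normal((m, 5))
b = rng.standard_normal(m)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
x = rng.standard_normal(5)

for i, idx in enumerate(blocks):
    A_i = A[:, idx]
    L_i = np.linalg.norm(A_i, 2) ** 2           # block Lipschitz constant of ∇_i f
    grad_i = A_i.T @ (A @ x - b)                # ∇_i f(x)
    t = rng.standard_normal(len(idx))
    x_plus = x.copy()
    x_plus[idx] += t                            # x + U_i t
    # inequality (12): f(x + U_i t) <= f(x) + <∇_i f(x), t> + (L_i / 2) ||t||^2
    assert f(x_plus) <= f(x) + grad_i @ t + 0.5 * L_i * (t @ t) + 1e-10
```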
Separability of \(\Omega \)
We assume that \(\Omega : \mathbf {R}^N\rightarrow \mathbf {R}\cup \{+\infty \}\) is (block) separable, i.e., that it can be decomposed as follows:
$$\begin{aligned} \Omega (x)=\sum \limits _{i=1}^n \Omega _i(x^{(i)}), \end{aligned}$$
(13)
where the functions \(\Omega _i:\mathbf {R}^{N_i}\rightarrow \mathbf {R}\cup \{+\infty \}\) are convex and closed.
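Two standard separable choices in the form (13) are \(\Omega (x)=\lambda \Vert x\Vert _1\) and an indicator of simple block-wise constraints (both reappear in the discussion of Step 5 below); a sketch, with illustrative parameters:

```python
import numpy as np

def omega_l1(x_i, lam=0.1):
    """Ω_i(x^{(i)}) = lam * ||x^{(i)}||_1, so that Ω(x) = lam * ||x||_1."""
    return lam * np.sum(np.abs(x_i))

def omega_box(x_i, lo=0.0, hi=1.0):
    """Indicator of the box lo <= x^{(i)} <= hi (+inf outside): simple block constraints."""
    return 0.0 if np.all((x_i >= lo) & (x_i <= hi)) else np.inf

def omega(x, blocks, omega_i):
    """Ω(x) = sum_i Ω_i(x^{(i)}) as in (13)."""
    return sum(omega_i(x[idx]) for idx in blocks)
```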
Strong convexity
In one of our two complexity results (Theorem 18) we will assume that either \(f\) or \(\Omega \) (or both) is strongly convex. A function \(\phi :\mathbf {R}^N\rightarrow \mathbf {R}\cup \{+\infty \}\) is strongly convex with respect to the norm \(\Vert \cdot \Vert _w\) with convexity parameter \(\mu _{\phi }(w) \ge 0\) if for all \(x,y \in {{\mathrm{dom}}}\phi \),
$$\begin{aligned} \phi (y)\ge \phi (x) + \langle \phi '(x) , y-x \rangle + \tfrac{\mu _{\phi }(w)}{2}\Vert y-x\Vert _w^2, \end{aligned}$$
(14)
where \(\phi '(x)\) is any subgradient of \(\phi \) at \(x\). The case with \(\mu _\phi (w)=0\) reduces to convexity. Strong convexity of \(F\) may come from \(f\) or \(\Omega \) (or both); we write \(\mu _f(w)\) (resp. \(\mu _\Omega (w)\)) for the (strong) convexity parameter of \(f\) (resp. \(\Omega \)). It follows from (14) that
$$\begin{aligned} \mu _{F}(w) \ge \mu _{f}(w)+ \mu _{\Omega }(w). \end{aligned}$$
(15)
The following characterization of strong convexity will be useful:
$$\begin{aligned} \phi (\lambda x+ (1-\lambda ) y) \le \lambda \phi (x) + (1-\lambda )\phi (y) - \tfrac{\mu _\phi (w)\lambda (1-\lambda )}{2}\Vert x-y\Vert _w^2, \nonumber \\ x,y \in {{\mathrm{dom}}}\phi ,\; \lambda \in [0,1]. \end{aligned}$$
(16)
It can be shown using (12) and (14) that \(\mu _f(w)\le \tfrac{L_i}{w_i}\) for all \(i\in {[n]}\).
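Indeed, applying (14) with \(\phi =f\) and \(y=x+U_i t\), so that \(\Vert y-x\Vert _w^2=w_i\Vert t\Vert _{(i)}^2\), and combining the result with (12) gives
$$\begin{aligned} f(x) + \langle \nabla _i f(x) , t \rangle + \tfrac{\mu _f(w) w_i}{2}\Vert t\Vert _{(i)}^2 \le f(x+U_i t) \le f(x) + \langle \nabla _i f(x) , t \rangle + \tfrac{L_i}{2}\Vert t\Vert _{(i)}^2, \end{aligned}$$
whence \(\mu _f(w)\, w_i \le L_i\).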
Algorithms
In this paper we develop and study two generic parallel coordinate descent methods. The main method is PCDM1; PCDM2 is its “regularized” version which explicitly enforces monotonicity. As we will see, both of these methods come in many variations, depending on how Step 3 is performed.
Let us comment on the individual steps of the two methods.
Step 3. At the beginning of iteration \(k\) we pick a random set (\(S_k\)) of blocks to be updated (in parallel) during that iteration. The set \(S_k\) is a realization of a random set-valued mapping \(\hat{S}\) with values in \(2^{[n]}\); more precisely, the sets \(S_k\) are iid random sets with the distribution of \(\hat{S}\). For brevity, in this paper we refer to such a mapping by the name sampling. We limit our attention to uniform samplings, i.e., random sets having the following property: \(\mathbf {P}(i \in \hat{S})\) is independent of \(i\). That is, the probability that a block gets selected is the same for all blocks. Although we give an iteration complexity result covering all such samplings (provided that each block has a chance to be updated, i.e., \(\mathbf {P}(i \in \hat{S}) > 0\)), there are interesting subclasses of uniform samplings (such as doubly uniform and nonoverlapping uniform samplings; see Sect. 4) for which we give better results.
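As an illustration, here is a minimal sketch of drawing realizations \(S_k\) of one such uniform sampling, the \(\tau \)-nice sampling discussed below and in Sect. 4, for which \(\mathbf {P}(i\in \hat{S})=\tau /n\) for every block \(i\) (blocks are indexed from 0 in the code):

```python
import numpy as np

def tau_nice_sampling(n, tau, rng):
    """One realization S_k: a subset of {0, ..., n-1} of cardinality tau,
    with every such subset equally likely."""
    return set(rng.choice(n, size=tau, replace=False).tolist())

rng = np.random.default_rng(0)
n, tau = 10, 3
S_k = tau_nice_sampling(n, tau, rng)      # blocks to be updated in parallel at iteration k
```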
Step 4. For \(x\in \mathbf {R}^N\) we define
$$\begin{aligned} h(x) \mathop {=}\limits ^{\text {def}}\arg \min _{h \in \mathbf {R}^N} H_{\beta ,w}(x,h), \end{aligned}$$
(17)
where
$$\begin{aligned} H_{\beta ,w}(x,h) \mathop {=}\limits ^{\text {def}}f(x) + \langle \nabla f(x) , h \rangle + \tfrac{\beta }{2}\Vert h\Vert _w^2 + \Omega (x+h), \end{aligned}$$
(18)
and \(\beta >0\), \(w=(w_1,\ldots ,w_n)^T \in \mathbf {R}^n_{++}\) are parameters of the method that we will comment on later. Note that in view of (5), (10) and (13), \(H_{\beta ,w}(x,\cdot )\) is block separable:
$$\begin{aligned} H_{\beta ,w}(x,h) = f(x) + \sum \limits _{i=1}^n \left\{ \langle \nabla _i f(x) , h^{(i)} \rangle + \tfrac{\beta w_i}{2}\Vert h^{(i)}\Vert _{(i)}^2 + \Omega _i(x^{(i)} + h^{(i)})\right\} . \end{aligned}$$
Consequently, we have \(h(x) = (h^{(1)}(x),\ldots , h^{(n)}(x)) \in \mathbf {R}^N\), where
$$\begin{aligned} h^{(i)}(x) = \arg \min _{t\in \mathbf {R}^{N_i}} \{\langle \nabla _i f(x) , t \rangle + \tfrac{\beta w_i}{2}\Vert t\Vert _{(i)}^2 + \Omega _i(x^{(i)}+t)\}. \end{aligned}$$
We mentioned in the introduction that besides (block) separability, we require \(\Omega \) to be “simple”. By this we mean that the above optimization problem leading to \(h^{(i)}(x)\) is “simple” (e.g., it has a closed-form solution). Recall from (8) that \((h(x_k))_{[S_k]}\) is the vector in \(\mathbf {R}^N\) identical to \(h(x_k)\) except for blocks \(i \notin S_k\), which are zeroed out. Hence, Step 4 of both methods can be written as follows:
$$\begin{aligned} \hbox {In parallel for}\, i\in S_k\, \hbox {do}: \; x_{k+1}^{(i)} \leftarrow x_k^{(i)} + h^{(i)}(x_k). \end{aligned}$$
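For illustration, here is a sketch of this update under the assumptions \(B_i=I\) and \(\Omega _i(\cdot )=\lambda \Vert \cdot \Vert _1\), for which the block subproblem defining \(h^{(i)}(x)\) is solved in closed form by soft-thresholding; the gradient oracle grad_block below is a placeholder supplied by the user, not part of the methods as stated:

```python
import numpy as np

def soft_threshold(v, r):
    return np.sign(v) * np.maximum(np.abs(v) - r, 0.0)

def block_update(x, i, blocks, grad_block, beta, w, lam):
    """h^{(i)}(x) = argmin_t <∇_i f(x), t> + (beta*w_i/2)||t||^2 + lam*||x^{(i)} + t||_1,
    solved by soft-thresholding (B_i = I assumed)."""
    idx = blocks[i]
    g = grad_block(x, i)                                  # ∇_i f(x)
    step = 1.0 / (beta * w[i])
    return soft_threshold(x[idx] - step * g, step * lam) - x[idx]

def step4(x, S_k, blocks, grad_block, beta, w, lam):
    """Step 4: update the blocks i in S_k (written sequentially here;
    the loop over i is executed in parallel in the methods)."""
    x_new = x.copy()
    for i in S_k:
        x_new[blocks[i]] += block_update(x, i, blocks, grad_block, beta, w, lam)
    return x_new
```

With \(\Omega _i=0\) the same subproblem gives the block gradient step \(h^{(i)}(x)=-\nabla _i f(x)/(\beta w_i)\).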
Parameters \(\beta \) and \(w\) depend on \(f\) and \(\hat{S}\) and stay constant throughout the algorithm. We are not ready yet to explain why the update is computed via (17) and (18) because we need technical tools, which will be developed in Sect. 4, to do so. Here it suffices to say that the parameters \(\beta \) and \(w\) come from a separable quadratic overapproximation of \(\mathbf {E}[f(x+h_{[\hat{S}]})]\), viewed as a function of \(h\in \mathbf {R}^N\). Since expectation is involved, we refer to this by the name Expected Separable Overapproximation (ESO). This novel concept, developed in this paper, is one of the main tools of our complexity analysis. Section 5 motivates and formalizes the concept, answers the why question, and develops some basic ESO theory.
Section 6 is devoted to the computation of \(\beta \) and \(w\) for partially separable \(f\) and various special classes of uniform samplings \(\hat{S}\). Typically we will have \(w_i=L_i\), while \(\beta \) will depend on easily computable properties of \(f\) and \(\hat{S}\). For example, if \(\hat{S}\) is chosen as a subset of \({[n]}\) of cardinality \(\tau \), with each subset chosen with the same probability (we say that \(\hat{S}\) is \(\tau \)-nice) then, assuming \(n>1\), we may choose \(w=L\) and \(\beta =1+ \tfrac{(\omega -1)(\tau -1)}{n-1}\), where \(\omega \) is the degree of partial separability of \(f\). More generally, if \(\hat{S}\) is any uniform sampling with the property \(|\hat{S}|=\tau \) with probability 1, then we may choose \(w=L\) and \(\beta =\min \{\omega ,\tau \}\). Note that in both cases \(w=L\) and that the latter \(\beta \) is always larger than (or equal to) the former one. This means, as we will see in Sect. 7, that we can give better complexity results for the former, more specialized, sampling. We analyze several more options for \(\hat{S}\) than the two just described, and compute parameters \(\beta \) and \(w\) that should be used with them (for a summary, see Table 4).
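For reference, a tiny helper computing the two values of \(\beta \) quoted above (with \(w=L\) in both cases):

```python
def beta_tau_nice(omega, tau, n):
    """beta = 1 + (omega - 1)(tau - 1)/(n - 1) for the tau-nice sampling (n > 1)."""
    return 1.0 + (omega - 1.0) * (tau - 1.0) / (n - 1.0)

def beta_tau_uniform(omega, tau):
    """beta = min{omega, tau} for any uniform sampling with |S| = tau with probability 1."""
    return min(omega, tau)

# e.g. n = 1000, omega = 10, tau = 16: beta_tau_nice ≈ 1.135, while min{omega, tau} = 10.
```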
Step 5. The reason why, besides PCDM1, we also consider PCDM2 is the following: in some situations we are not able to analyze the iteration complexity of PCDM1 (non-strongly-convex \(F\) where monotonicity of the method is not guaranteed by other means than by directly enforcing it through Step 5). Let us remark that this issue arises for general \(\Omega \) only. It does not exist for \(\Omega =0\), \(\Omega (\cdot ) = \lambda \Vert \cdot \Vert _1\) and for \(\Omega \) encoding simple constraints on individual blocks; in these cases one does not need to consider PCDM2. Even in the case of general \(\Omega \) we sometimes get monotonicity for free, in which case there is no need to enforce it. Let us stress, however, that we do not recommend implementing PCDM2 as this would introduce too much overhead; in our experience PCDM1 works well even in cases when we can only analyze PCDM2.