1 Introduction

In this paper, we consider a block-coordinate incremental gradient algorithm, hereafter called BIG, for minimizing a finite-sum function

$$\begin{aligned} \underset{\mathbf {w} \in {{\mathbf {R}}^n}}{\text {minimize }} \quad f(\mathbf {w})= \sum _{h=1}^H f_h(\mathbf {w}). \end{aligned}$$
(1)

We study its convergence when \(f:{\mathbf {R}}^n\rightarrow {\mathbf {R}}\) is a continuously differentiable function.

Problem (1) is a well-known optimization problem that arises in many practical applications, including regularized empirical risk minimization (ERM), where \(f_h\) represents the loss function of the \(h\)-th data block and which constitutes a standard approach when training machine learning models (see e.g. Bertsekas 2011; Bottou et al. 2018; Goodfellow et al. 2016 and references therein). We focus on the case where both the number of components H and the dimension of the space n are very large, which arises in machine learning training problems when tackling Big Data applications by means of over-parametrized models such as Deep Networks. Indeed, one of the main issues when solving problem (1) through standard batch methods, namely methods that use all the terms \(f_h\) at each iteration, is the high computational effort needed for each objective function and gradient evaluation. The per-iteration cost depends on the size of H and n, so that when both of them are large there is an incentive to use methods with a cheaper per-iteration cost that exploit the structure of the objective function to reduce the computational burden and avoid slowing down the optimization process.

In order to overcome this computational burden, problem (1) has been mainly tackled by means of online algorithms, namely methods that at each iteration k use one or more terms \(f_h\) of the objective function to compute an update of the current solution \(\mathbf {w}^k\). The reason for the great success of online methods lies mainly in the different balance between per-iteration cost and expected per-iteration improvement in the objective function, particularly in the Big Data setting when the size of H becomes very large [see e.g. the comments in Bottou et al. (2018)]. Online methods can be roughly divided into two kinds: incremental gradient (IG) methods, where the order in which the elements \(f_h\) are considered is fixed a priori and not changed over the iterations; and stochastic gradient (SG) methods, where the elements \(f_h\) are chosen according to some probability distribution. IG methods can be applied only to finite-sum functions, while SG methods also apply to functions with infinitely many terms \(f_h\) (e.g. functions representing expected values). Concerning the convergence theory, incremental methods can be considered and analysed as deterministic methods, while stochastic frameworks are usually analysed by means of probabilistic tools. The former methods and their convergence have been deeply investigated in, e.g. Bertsekas (1996), Bertsekas (2015), Bertsekas and Tsitsiklis (2000) and Solodov (1998), the latter in, e.g. Bertsekas and Tsitsiklis (2000), Bottou (2010) and Robbins and Monro (1951).

As pointed out in Palagi and Seccia (2020), even though online methods can effectively tackle optimization problems where the dimension H is very large, they still suffer when the search space n becomes large as well, namely when the number of variables increases. This is often the case in applications where deep learning models are employed (e.g. image recognition), where the number of parameters to be estimated can exceed hundreds of millions (Simonyan and Zisserman 2014). An efficient way to tackle optimization problems with a large number of variables n is represented by Block-Coordinate Descent (BCD) methods, which update at each iteration only a subset of the variables, keeping the others fixed at their current values. By exploiting the structure of the objective function (e.g. fixing some variables may make the subproblem convex or allow for parallel updates) and thanks to the lower cost of computing a block component of the gradient, these methods lend themselves well to efficient implementations and can greatly improve optimization performance and reduce the computational effort (see e.g. Bertsekas and Tsitsiklis 1989; Buzzi et al. 2001; Grippo et al. 2016; Palagi and Seccia 2020). Their convergence has already been analysed in many works under different assumptions on both the block selection rule and the properties of the update (Beck and Tetruashvili 2013; Bertsekas and Tsitsiklis 1989; Grippo and Sciandrone 1999; Lu and Xiao 2015; Nesterov 2012; Wright 2015).

In order to leverage both the sample decomposition with respect to the elements \(f_h\) composing the objective function, typical of online methods, and the block-wise decomposition with respect to the variables, typical of BCD frameworks, an effective solution is represented by block-coordinate online methods. These methods aim to reduce the per-iteration cost by operating along two lines: updating only a subset of the variables \(\mathbf {w}\), as in BCD methods, and using only a subset (i.e. a mini-batch) of the components \(f_h\), as in online methods. The behaviour of block-coordinate online methods has already been investigated in Wang and Banerjee (2014), where the strongly convex case is considered and a geometric rate of convergence in expectation has been established. Moreover, in Zhao et al. (2014) and Chauhan et al. (2017) the effectiveness of this approach has been tested on strongly convex sparse problems such as LASSO and sparse logistic regression, respectively. In Bravi and Sciandrone (2014), a two-block decomposition method is applied to training a neural network where the objective function is assumed to be convex with respect to one of the block components (the output weights), so that exact optimization can be used for the convex block update, while the other block (the hidden weights) is updated using an incremental gradient update. In Palagi and Seccia (2020), the layered structure of a deep neural network has been exploited to define a block layer incremental gradient (BLInG) algorithm, which uses an incremental approach for updating the weights of each single layer. The numerical effectiveness of embedding block-coordinate modifications in online frameworks has already been tested and has turned out to be promising (Chauhan et al. 2017; Palagi and Seccia 2020; Wang and Banerjee 2014; Zhao et al. 2014).

In this paper, we present a block-coordinate incremental gradient method (BIG), which generalizes the BLInG algorithm presented in Palagi and Seccia (2020) for the deep network training problem, and we focus on its convergence analysis. Since the selection of both the elements \(f_h\) and the blocks of variables is fixed a priori and not changed over the iterations, BIG can be analysed as a gradient method with deterministic errors. Thus, building on the deterministic convergence results for gradient methods with errors reported in Bertsekas and Tsitsiklis (2000) and Solodov (1998), we prove convergence of BIG towards stationary points and towards an \(\epsilon \)-approximate solution, respectively, when a diminishing and a bounded away from zero stepsize is employed. We do not report numerical results, which can be found in Palagi and Seccia (2020), where the optimization problem arising in training deep neural networks is considered. Overall, the numerical results in Palagi and Seccia (2020) suggest the effectiveness of BIG in exploiting the finite-sum objective function and the inherent block layer structure of deep neural networks.

The paper is organized as follows: in Sect. 2, preliminary results on the convergence theory of gradient methods with errors are recalled and the convergence analysis of IG is provided following standard analysis of gradient methods with errors from Bertsekas and Tsitsiklis (2000) and Solodov (1998). In Sect. 3, we show how BIG can be regarded as a gradient method with errors and prove its convergence properties. In Sect. 4, we discuss numerical performance of BIG when compared with its non-decomposed counterpart IG. Finally, in Sect. 5 conclusions are drawn and in the “Appendix” supporting material is provided.

Notation. We use boldface to denote vectors, e.g. \(\mathbf {w}=(w_1,\ldots ,w_n)\), and \(\Vert \cdot \Vert \) to denote the Euclidean norm of a vector. Given a set of indexes \(\ell \subseteq \{1,\ldots ,n\}\), we denote by \(\mathbf {w}_{\ell }\) the subvector of \(\mathbf {w}\) made up of the components \(i\in \ell \), namely \(\mathbf {w}_{\ell }=({w}_{i})_{i\in \ell }\in {\mathbf {R}}^{|\ell |}\). The gradient of the function is denoted by \(\nabla f(\mathbf {w}) \in {\mathbf {R}}^n\) and, given a subvector \(\mathbf {w}_{\ell }\) of \(\mathbf {w}\), we use the short notation \(\nabla _{\ell } f(\mathbf {w}) \in {\mathbf {R}}^{|\ell |}\) to denote the partial gradient with respect to the block \(\mathbf {w}_{\ell }\), i.e. \(\nabla _{\mathbf {w}_{\ell }} f(\mathbf {w})\).

Given a partition \(\mathcal{L}= \{\ell _1,\ldots ,\ell _L\}\) of the indexes \(\{1,\ldots ,n\}\), namely \(\cup _{i=1}^L \ell _i= \{1,\ldots ,n\}\) and \(\ell _i\cap \ell _j=\emptyset \) for all \(i\ne j\), a vector \(\mathbf {w}\) can be written, by reordering its components, as \(\mathbf {w}=(\mathbf {w}_{\ell _1},\ldots ,\mathbf {w}_{\ell _L})\) and correspondingly \(\nabla f(\mathbf {w})=(\nabla f(\mathbf {w})_{\ell _1},\ldots ,\nabla f(\mathbf {w})_{\ell _L})\). Further, we use the notation \([\cdot ]_{\ell }\) to define a vector in \({\mathbf {R}}^n\) where all the components are set to zero except those corresponding to the block \(\ell \), namely given a vector \(\mathbf {w}\in {\mathbf {R}}^n\) the vector \([\mathbf {w}]_{\ell }\) is defined component-wise as

$$\begin{aligned} ( [\mathbf {w}]_{\ell })_k = {\left\{ \begin{array}{ll} {w}_{k} &{} \text { if } k\in \ell \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Thanks to this notation, we have \(\mathbf {w}=\sum _{i=1}^L[\mathbf {w}]_{\ell _i}\) and \(\nabla f(\mathbf {w})=\sum _{i=1}^L[\nabla f(\mathbf {w})]_{\ell _i}\). Moreover note that \([\mathbf {w}]_{\ell _i}\in {\mathbf {R}}^n\) while \(\mathbf {w}_{\ell _i}\in {\mathbf {R}}^{|\ell _i|}\).
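
For illustration, the following minimal NumPy sketch (with an arbitrary example partition of our own choosing) builds \([\mathbf {w}]_{\ell }\) and verifies the decomposition \(\mathbf {w}=\sum _{i=1}^L[\mathbf {w}]_{\ell _i}\).

```python
import numpy as np

def block_embed(w, block):
    """Return [w]_block: the vector in R^n equal to w on the indexes in `block` and zero elsewhere."""
    z = np.zeros_like(w)
    z[block] = w[block]
    return z

# Example with n = 5 and the partition {l1, l2} = {{0, 1}, {2, 3, 4}}.
w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
partition = [np.array([0, 1]), np.array([2, 3, 4])]
assert np.allclose(w, sum(block_embed(w, li) for li in partition))
```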

2 Background on Gradient method with errors

In this section, we report two main results concerning the convergence of gradient methods with errors, which will be useful for proving the convergence properties of BIG in Sect. 3. In particular, we consider two results, one concerning the adoption of a diminishing stepsize and the other a stepsize bounded away from zero, respectively from Bertsekas and Tsitsiklis (2000) and Solodov (1998). To the best of the authors' knowledge, these two results are among the most significant: the former is among those with the least restrictive assumptions when a diminishing stepsize is employed (as highlighted by the authors, neither convexity of the function nor boundedness conditions on the function or on the generated sequence \(\{\mathbf {w}^k\}\) are required to prove convergence), while the latter is the first convergence result for incremental methods with stepsizes bounded away from zero. After having introduced and briefly discussed these two results, in the remainder of this section we recall their implications for the standard incremental gradient method.

In both of the next two propositions, it is assumed that the function f is continuously differentiable with an M-Lipschitz continuous gradient, that is

$$\begin{aligned} \Vert \nabla f(\mathbf {u})-\nabla f(\mathbf {v}) \Vert \le M \Vert \mathbf {u}-\mathbf {v}\Vert \quad \forall \mathbf {u},\mathbf {v} \in {\mathbf {R}}^n. \end{aligned}$$
(2)

We start with the work of Bertsekas and Tsitsiklis (2000), where a diminishing stepsize is considered. The main idea is that a gradient method with errors, where the error is proportional to the stepsize, converges to a stationary point provided that the stepsize goes to zero, but not too fast.

Proposition 1

(Proposition 1 in Bertsekas and Tsitsiklis 2000) Let f be continuously differentiable over \({\mathbf {R}}^n\) satisfying (2). Let \(\{\mathbf {w}^k\}\) be a sequence generated by the method

$$\begin{aligned} \mathbf {w}^{k+1}=\mathbf {w}^k+\alpha ^k(\mathbf {d}^k+\mathbf {e}^k) \end{aligned}$$

where \(\mathbf {d}^k\) is a descent direction satisfying for some positive scalars \(c_1\) and \(c_2\) and all k,

$$\begin{aligned} c_1 \Vert \nabla f(\mathbf {w}^k)\Vert ^2\le - \nabla f(\mathbf {w}^k)^T\mathbf {d}^k \qquad \Vert \mathbf {d}^k\Vert \le c_2(1+\Vert \nabla f(\mathbf {w}^k)\Vert ) \end{aligned}$$

and \(\mathbf {e}^k\) is an error vector satisfying for all k,

$$\begin{aligned} \Vert \mathbf {e}^k\Vert \le \alpha ^k(p+q\Vert \nabla f(\mathbf {w}^k)\Vert ) \end{aligned}$$
(3)

where p and q are positive scalars. Assume that the stepsize \(\alpha ^k\) is chosen according to a diminishing stepsize condition, that is

$$\begin{aligned} \sum _{k=0}^\infty \alpha ^k =\infty \qquad \sum _{k=0}^\infty (\alpha ^k)^2 <\infty . \end{aligned}$$
(4)

Then either \(\lim _{k\rightarrow \infty }f(\mathbf {w}^k)= -\infty \) or \(\{f(\mathbf {w}^k)\}\) converges to a finite value and \(\lim _{k\rightarrow \infty }\nabla f(\mathbf {w}^k)=0\). Furthermore, every accumulation point of \(\{\mathbf {w}^k\}\) is a stationary point of f.
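
For instance, the classical choice \(\alpha ^k=\alpha ^0/(k+1)\) with \(\alpha ^0>0\) satisfies the diminishing stepsize condition (4), since

$$\begin{aligned} \sum _{k=0}^\infty \frac{\alpha ^0}{k+1} =\infty \qquad \sum _{k=0}^\infty \left( \frac{\alpha ^0}{k+1}\right) ^2 =(\alpha ^0)^2\,\frac{\pi ^2}{6} <\infty . \end{aligned}$$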

On the other hand, when it comes to the case of stepsizes bounded away from zero, Solodov proves in Solodov (1998) that a gradient method with errors has at least one accumulation point \(\bar{\mathbf {w}}\) that is an \(\epsilon \)-approximate solution, with the tolerance \(\epsilon \) depending linearly on the limiting value of the stepsize \(\bar{\alpha }>0\).

Proposition 2

(Proposition 2.2 in Solodov 1998) Let f be continuously differentiable over a bounded set D and satisfy condition (2). Let \(\{\mathbf {w}^k\}\subset D\) be a sequence generated by

$$\begin{aligned} \mathbf {w}^{k+1} = \mathbf {w}^k-\alpha ^k(\nabla f(\mathbf {w}^k)- \mathbf {e}^k). \end{aligned}$$

Assume \(\lim _{k \rightarrow \infty } \alpha ^k=\bar{\alpha }>0 \) with \(\alpha ^k \in (\theta , 2 / M-\theta )\) with \(\theta \in (0,1 / M],\) and

$$\begin{aligned} \Vert \mathbf {e}^k\Vert \le \alpha ^k \bar{B} \end{aligned}$$
(5)

with \(\bar{B}>0\).

Then there exist a constant \(C>0\) (independent of \(\bar{\alpha }\)) and an accumulation point \(\bar{\mathbf {w}}\) of the sequence \(\left\{ \mathbf {w}^k\right\} \) such that

$$\begin{aligned} \Vert \nabla f(\bar{\mathbf {w}})\Vert \le C \bar{\alpha } \end{aligned}$$
(6)

Furthermore, if the sequence \(\left\{ f\left( \mathbf {w}^k\right) \right\} \) converges then every accumulation point \(\bar{\mathbf {w}}\) of the sequence \(\left\{ \mathbf {w}^k\right\} \) satisfies (6).

Comparing the hypotheses in Propositions 1 and 2, we have that the former result considers a gradient-related direction, while the latter is stated only with respect to the antigradient. Moreover, the former does not require the sequence \(\{\mathbf {w}^k\}\) to stay within a bounded set, which is instead needed in the latter proposition. Finally, Proposition 2 makes a stronger assumption on the error term compared to Proposition 1, which, however, can be relaxed so as to consider the same bound as in (3) (see Solodov 1998 and the discussion in the following Sect. 3.2).

2.1 Incremental Gradient as Gradient method with error

The incremental gradient framework updates the point \(\mathbf {w}^k\) by moving along the antigradient direction of one or a few terms \(f_h\), which are used in a fixed order. Once all the H elements composing the function in problem (1) have been considered, the outer iteration counter k is increased and the stepsize \(\alpha ^k\) is updated. The inner iteration starts from the current iterate \(\mathbf {w}^k\) and loops over the indexes \(h=1,\ldots ,H\) using a fixed stepsize \(\alpha ^k\); that is

$$\begin{aligned} \begin{array}{l}\mathbf {y}^k_{0}=\mathbf {w}^k\\ \mathbf {y}^k_{h}=\mathbf {y}^k_{h-1}-\alpha ^k \nabla f_h(\mathbf {y}^k_{h-1}) \qquad h=1,\ldots ,H\\ \mathbf {w}^{k+1}=\mathbf {y}^k_{H} \end{array} \end{aligned}$$
(7)

Thus, an iteration of the IG method can be written as

$$\begin{aligned} \mathbf {w}^{k+1}=\mathbf {w}^k+\alpha ^k \mathbf {d}^k, \end{aligned}$$
(8)

with the direction \(\mathbf {d}^k\) defined through the intermediate updates \(\mathbf {y}^k_h\) in (7), i.e.

$$\begin{aligned} \mathbf {d}^k=-\sum _{h=1}^H \nabla f_h(\mathbf {y}^k_{h-1}) . \end{aligned}$$
(9)

For the sake of notation, in the following we do not report explicitly the number of terms \(f_h\) used to determine the direction, but without loss of generality we assume that the index h can represent either a single index or a set of indexes. In both cases, the same arguments apply directly. A general scheme of the IG method is reported in Algorithm 1.

Algorithm 1: General scheme of the IG method.
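
To make the scheme concrete, the following Python sketch (a minimal illustration of our own, not the authors' implementation) performs one outer IG iteration as in (7); the component gradients are assumed to be supplied as a list grad_f of callables, each returning \(\nabla f_h\) at a given point.

```python
import numpy as np

def ig_iteration(w, grad_f, alpha):
    """One outer iteration of IG, cf. (7): a single sweep over the terms f_h
    in a fixed order, with the fixed stepsize alpha."""
    y = np.array(w, dtype=float)   # y_0 = w^k
    for g_h in grad_f:             # h = 1, ..., H (fixed order)
        y -= alpha * g_h(y)        # y_h = y_{h-1} - alpha * grad f_h(y_{h-1})
    return y                       # w^{k+1} = y_H
```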

Convergence of IG has been proved both in the case of a diminishing stepsize and in the case of a stepsize bounded away from zero, in Bertsekas and Tsitsiklis (2000) and Solodov (1998) respectively, by showing that it satisfies the assumptions of Propositions 1 and 2. Since we follow a similar approach to prove convergence of BIG in the next section, below we report the main convergence result for the IG method when a diminishing stepsize is employed.

Proposition 3

(Proposition 2 in Bertsekas and Tsitsiklis 2000) Let \(\{\mathbf {w}^k\}\) be a sequence generated by (7), (8) and (9). Assume that for all \(h=1,\ldots , H\) there exist positive constants \(M, a, b\) such that

$$\begin{aligned}&\Vert \nabla f_h(\mathbf {u}) -\nabla f_h(\mathbf {w})\Vert \le M\Vert \mathbf {u}-\mathbf {w}\Vert \qquad \forall \ \mathbf {u},\mathbf {w} \in {\mathbf {R}}^n \end{aligned}$$
(10)
$$\begin{aligned}&\Vert \nabla f_h(\mathbf {w}) \Vert \le a+b \Vert \nabla f(\mathbf {w}) \Vert \qquad \forall \ \mathbf {w}\in {\mathbf {R}}^n. \end{aligned}$$
(11)

Then the direction defined in (9) can be written as

$$\begin{aligned} \mathbf {d}^k=-\nabla f(\mathbf {w}^k)+\mathbf {e}^k \end{aligned}$$

with \(\mathbf {e}^k\) satisfying for some positive constants \(p, q\)

$$\begin{aligned} \Vert \mathbf {e}^k\Vert \le \alpha ^k (p+q \Vert \nabla f(\mathbf {w}^k) \Vert ). \end{aligned}$$

Proposition 3 states that an IG iteration satisfies the hypotheses of Proposition 1, so that convergence follows. Namely, it shows that IG can be viewed as a batch gradient method where the gradient is perturbed by an error term proportional to the stepsize. Thus, roughly speaking, driving the stepsize to zero drives the error to zero as well, which allows convergence to be proved. The proof of Proposition 3 provided in Bertsekas and Tsitsiklis (2000) is shown only for the case \(H=2\), for reasons of simplicity. For the sake of completeness and to help the reader with the following convergence result, we provide in “Appendix A” the proof of Proposition 3 in the more general case of an arbitrary number of elements H. Finally, we note that condition (10) could be stated using a different Lipschitz constant for each h. However, for the sake of simplicity, we omit this detail.

We do not report the convergence result for IG in the case a bounded away from zero stepsize is applied, since it is not needed to prove convergence of BIG in that setting (its convergence can be proved by a reasoning similar to the one applied in the following Proposition 4). However, we remark that in Solodov (1998), to ensure that the error term in IG satisfies assumption (5), it is assumed that each \(\nabla f_h\) is Lipschitz continuous and that the norm of each \(\nabla f_h\) is bounded above by some positive constant [cf. Proposition 2.1 in Solodov (1998)].

3 The block-coordinate incremental gradient method

As already discussed in the introduction, incremental methods rely on reducing the complexity of a single iteration by exploiting the sum structure of the objective function. However, they still suffer when the dimension n of the space is large. On the other hand, BCD methods resort to simpler optimization problems by working only on a subset of the variables.

Given a partition \(\mathcal{L} = \{\ell _1,\ldots ,\ell _L\}\) of the indexes \(\{1,\ldots ,n\}\) with \(\mathbf {w}_{\ell _i}\in {\mathbf {R}}^{N_{i}}\) and \(\sum _{i=1}^L {N_{i}}=n\), a standard BCD method selects at a generic iteration one block \({\ell _i}\) (we omit the possible dependence on k) and updates only the block \(\mathbf {w}^{k}_{\ell _i}\), keeping all the other blocks fixed at their current values, i.e. \(\mathbf {w}^{k+1}_{\ell _j}=\mathbf {w}^k_{\ell _j}\) for \(j\ne i\). By fixing some variables, the resulting subproblem, besides being smaller, might have a special structure in the remaining variables that can be conveniently exploited. Further, these methods might allow a distributed/parallel implementation that can speed up the overall process (Bertsekas and Tsitsiklis 1989; Grippo and Sciandrone 1999; Wright 2015).

In order to leverage the structure of the objective function and mitigate the influence of both the number of variables n and the number of terms H, a solution is to embed the online framework into a block-coordinate decomposition scheme. Following this idea, the block-coordinate incremental gradient (BIG) method proposed here consists in updating each block of variables \(\mathbf {w}_{\ell _j}\) using only one or a few terms \(f_h\) of the objective function. As done for the presentation of the IG method, for the sake of notation we do not report explicitly the number of terms \(f_h\) used in the updating rule, and without loss of generality we assume that the index h represents either a single term or a batch of terms. All the following arguments apply with only slight changes in the notation.

More formally, given a partition \(\mathcal{L} = \{\ell _1,\ldots ,\ell _L\} \) of the indexes \(\{1,\ldots ,n\}\), the BIG method selects a term \(h\in \{1,\ldots , H\}\) and updates all the blocks \(\mathbf {w}_{\ell _j}\), \(j=1,\ldots ,L\), sequentially, by moving with a fixed stepsize along the gradient of \(f_h\) evaluated at successive points. Once all the H elements have been selected, the outer iteration counter k is increased and the stepsize \(\alpha ^k\) is updated. Similarly to the IG method, the BIG iteration from \(\mathbf {w}^{k}\) to \(\mathbf {w}^{k+1}\) can be described by using vectors \(\mathbf {y}^k_{h,j}\) obtained in the inner iterations by sequentially using, in a fixed order, both the terms h and the blocks j. For the sake of simplicity, we omit in the description below the dependence on k. For any fixed value of h, the inner iteration on the blocks \(\ell _j\) is defined as

$$\begin{aligned} \mathbf {y}_{h,0}&=\mathbf {y}_{h-1,L} \\ \mathbf {y}_{h,j}&=\mathbf {y}_{h,j-1}-\alpha [\nabla f_h(\mathbf {y}_{h,j-1})]_{\ell _j} \quad \text {for } j=1,\ldots , L \end{aligned}$$

where \(\mathbf {y}_{1,0}=\mathbf {w}^{k}\). Applying iteratively, we get for any h

$$\begin{aligned} \mathbf {y}_{h,j}= \mathbf {y}_{h-1,L}-\alpha \sum _{i=1}^{j} [\nabla f_h(\mathbf {y}_{h,i-1})]_{\ell _i} \quad \text {for } j=1,\ldots , L. \end{aligned}$$

Developing now iteratively on h, we get

$$\begin{aligned} \begin{array}{rl} \mathbf {y}_{h,j } =&{} \displaystyle \mathbf {y}_{h-2,L}-\alpha \sum _{i=1}^{L} [\nabla f_{h-1}(\mathbf {y}_{h-1,i-1})]_{\ell _i}\\ &{}\quad -\alpha \sum _{i=1}^{j} [\nabla f_h(\mathbf {y}_{h,i-1})]_{\ell _i}\\ =&{} \displaystyle \mathbf {w}^{k}-\alpha \sum _{t=1}^{h-1}\sum _{i=1}^{L} [\nabla f_t(\mathbf {y}_{t,i-1})]_{\ell _i}\\ &{}\qquad -\alpha \sum _{i=1}^{j} [\nabla f_h(\mathbf {y}_{h,i-1})]_{\ell _i} \end{array} \end{aligned}$$

and we finally set \(\mathbf {w}^{k+1}=\mathbf {y}_{H,L}\).

Hence, an iteration of the BIG method can be written as

$$\begin{aligned} \mathbf {w}^{k+1}=\mathbf {w}^k+\alpha ^k \mathbf {d}^k \end{aligned}$$
(12)

where the direction \(\mathbf {d}^k\) is defined through the intermediate updates \(\mathbf {y}^k_{h,j}\in {\mathbf {R}}^n\) as

$$\begin{aligned} \mathbf {d}^k=-\sum _{h=1}^H\sum _{j=1}^L [\nabla f_h(\mathbf {y}^k_{h,j-1})]_{\ell _j} \end{aligned}$$
(13)

with

$$\begin{aligned} \begin{array}{l} \mathbf {y}^k_{1,0}=\mathbf {w}^k, \quad \mathbf {y}^k_{h,0}=\mathbf {y}^k_{h-1,L}\\ \mathbf {y}^k_{h,j} = \displaystyle \mathbf {w}^{k}-\alpha ^k\left( \sum _{t=1}^{h-1}\sum _{i=1}^{L} [\nabla f_t(\mathbf {y}^k_{t,i-1})]_{\ell _i}+\sum _{i=1}^{j} [\nabla f_h(\mathbf {y}^k_{h,i-1})]_{\ell _i}\right) \\ \mathbf {w}^{k+1}=\mathbf {y}^k_{H,L}. \end{array} \end{aligned}$$
(14)

The scheme of BIG is reported in Algorithm 2.

Algorithm 2: General scheme of the BIG method.
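
To make the scheme concrete, a minimal Python sketch of one outer BIG iteration, following (14), is reported below (our own illustration, not the authors' implementation); grad_f is assumed to be a list of callables returning the full gradient \(\nabla f_h\in {\mathbf {R}}^n\), and partition is a list of index arrays representing \(\ell _1,\ldots ,\ell _L\). In an efficient implementation only the block component \(\nabla _{\ell _j} f_h\) would be computed at each inner step.

```python
import numpy as np

def big_iteration(w, grad_f, partition, alpha):
    """One outer iteration of BIG, cf. (14): for each term f_h (fixed order),
    sweep over the blocks l_1, ..., l_L, updating one block at a time along
    the partial gradient of f_h evaluated at the most recent point."""
    y = np.array(w, dtype=float)               # y_{1,0} = w^k
    for g_h in grad_f:                         # h = 1, ..., H
        for block in partition:                # j = 1, ..., L
            y[block] -= alpha * g_h(y)[block]  # y_{h,j}: only block l_j moves
    return y                                   # w^{k+1} = y_{H,L}
```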

3.1 Convergence of BIG with diminishing stepsize

Convergence of BIG can be proved under suitable assumptions by looking at the iteration generated by the algorithm as a gradient method with errors. Below we report the main convergence result in case a diminishing stepsize is employed, namely when \(\alpha ^k\) is updated according to (4).

Proposition 4

(Convergence of BIG - Diminishing stepsize) Let \(\{\mathbf {w}^k\}\) be a sequence generated by (12), (13) and (14). Assume that (10) and (11) hold for each \(h=1,\ldots ,H\), i.e. there exist positive constants \(M, a, b\) such that

$$\begin{aligned}&\Vert \nabla f_h(\mathbf {u}) -\nabla f_h(\mathbf {w})\Vert \le M\Vert \mathbf {u}-\mathbf {w}\Vert \qquad \forall \ \mathbf {u},\mathbf {w} \in {\mathbf {R}}^n \\&\Vert \nabla f_h(\mathbf {w}) \Vert \le a+b \Vert \nabla f(\mathbf {w}) \Vert \qquad \forall \ \mathbf {w}\in {\mathbf {R}}^n. \end{aligned}$$

Further assume that the stepsize \(\alpha ^k\) satisfies (4), i.e.

$$\begin{aligned} \sum _{k=0}^\infty \alpha ^k =\infty \qquad \sum _{k=0}^\infty (\alpha ^k)^2 <\infty . \end{aligned}$$

Then we have that either \(\lim _{k\rightarrow \infty }f(\mathbf {w}^k)= -\infty \) or \(\{f(\mathbf {w}^k)\}\) converges to a finite value and \(\lim _{k\rightarrow \infty } \nabla f(\mathbf {w}^k)=0\). Furthermore, every accumulation point of \(\{\mathbf {w}^k\}\) is a stationary point of f.

Proof

We show that the assumptions of Proposition 1 are satisfied.

First of all, note that by the definition of the norm we have \(\Vert [\mathbf {w}]_{\ell _j}\Vert \le \Vert \mathbf {w}\Vert \) for all \(\mathbf {w}\in {\mathbf {R}}^n\) and \(j=1,\ldots ,L\). In turn, this yields, for each \( h=1,\ldots ,H\) and \(j=1,\ldots ,L\),

$$\begin{aligned} \Vert [\nabla f_h(\mathbf {u})-\nabla f_h(\mathbf {v})]_{\ell _j}\Vert \le \Vert \nabla f_h(\mathbf {u})-\nabla f_h(\mathbf {v})\Vert \le M\Vert \mathbf {u}-\mathbf {v}\Vert \quad \forall \mathbf {u},\mathbf {v}\in {\mathbf {R}}^n \end{aligned}$$

and

$$\begin{aligned} \Vert [\nabla f_h(\mathbf {u})]_{\ell _j}\Vert \le \Vert \nabla f_h(\mathbf {u})\Vert \le a+b\Vert \nabla f(\mathbf {u})\Vert \quad \forall \mathbf {u}\in {\mathbf {R}}^n. \end{aligned}$$

Next, we remark that (10) implies (2) with Lipschitz constant \(\widetilde{M}=MH\). Indeed, \(\forall \mathbf {u},\mathbf {v}\in {\mathbf {R}}^n\) we have that

$$\begin{aligned} \Vert \nabla f(\mathbf {u})-\nabla f(\mathbf {v})\Vert&= \left\| \sum _{h=1}^H \Big ( \nabla f_h(\mathbf {u})-\nabla f_h(\mathbf {v})\Big )\right\| \\&\le \sum _{h=1}^H\left\| \nabla f_h(\mathbf {u})-\nabla f_h(\mathbf {v})\right\| \\&\le M \sum _{h=1}^H\left\| \mathbf {u}-\mathbf {v}\right\| = \widetilde{M}\Vert \mathbf {u}-\mathbf {v}\Vert . \end{aligned}$$

For the sake of simplicity, we report the proof for the case \(H=2, L=2\). The proof in the case of generic values \(H, L\) is reported in “Appendix B”. The BIG iteration can be written as

$$\begin{aligned} \mathbf {w}^{k+1}&= \mathbf {w}^k -\alpha ^k \left( \sum _{h=1}^2 \sum _{j=1}^2\left[ \nabla f_h(\mathbf {y}^k_{h,j-1}) \right] _{\ell _j} \right) \end{aligned}$$

which can be seen as

$$\begin{aligned} \mathbf {w}^{k+1}&= \mathbf {w}^k -\alpha ^k \left( \nabla f(\mathbf {w}^k) + \mathbf {e}^k \right) \end{aligned}$$

with the error

$$\begin{aligned} \mathbf {e}^k = \sum _{h=1}^2 \sum _{j=1}^2\left[ \nabla f_h(\mathbf {y}^k_{h,j-1}) - \nabla f_h(\mathbf {w}^k) \right] _{\ell _j}. \end{aligned}$$

Then we have

$$\begin{aligned} \Vert \mathbf {e}^k \Vert&= \left\| \sum _{h=1}^2 \sum _{j=1}^2\left[ \nabla f_h(\mathbf {y}^k_{h,j-1}) - \nabla f_h(\mathbf {w}^k) \right] _{\ell _j} \right\| \\&\le \sum _{h=1}^2 \sum _{j=1}^2 \left\| \nabla f_h(\mathbf {y}^k_{h,j-1}) - \nabla f_h(\mathbf {w}^k) \right\| \\&\le M \sum _{h=1}^2 \sum _{j=1}^2 \left\| \mathbf {y}^k_{h,j-1}-\mathbf {w}^k \right\| \\&= M\left( \left\| \mathbf {y}^k_{1,0}-\mathbf {w}^k \right\| +\left\| \mathbf {y}^k_{1,1}-\mathbf {w}^k \right\| \right. \\&\left. \quad + \left\| \mathbf {y}^k_{2,0}-\mathbf {w}^k \right\| + \left\| \mathbf {y}^k_{2,1}-\mathbf {w}^k \right\| \right) . \end{aligned}$$

Let us focus on the terms in the last inequality one by one, taking into account the inner iterations, which are written as

$$\begin{aligned} \mathbf {y}^k_{1,0}&= \mathbf {w}^k\\ \mathbf {y}^k_{1,1}&= \mathbf {y}^k_{1,0} -\alpha ^k \left[ \nabla f_1(\mathbf {y}^k_{1,0}) \right] _{\ell _1}\\ \mathbf {y}^k_{1,2}&= \mathbf {y}^k_{1,1} -\alpha ^k \left[ \nabla f_1(\mathbf {y}^k_{1,1}) \right] _{\ell _2}\\ \mathbf {y}^k_{2,0}&= \mathbf {y}^k_{1,2}\\ \mathbf {y}^k_{2,1}&= \mathbf {y}^k_{2,0} -\alpha ^k \left[ \nabla f_2(\mathbf {y}^k_{2,0}) \right] _{\ell _1}\\ \mathbf {y}^k_{2,2}&= \mathbf {y}^k_{2,1} -\alpha ^k \left[ \nabla f_2(\mathbf {y}^k_{2,1}) \right] _{\ell _2}\\ \mathbf {w}^{k+1}&= \mathbf {y}^k_{2,2}. \end{aligned}$$

Hence we have

$$\begin{aligned} \left\| \mathbf {y}^k_{1,0}-\mathbf {w}^k \right\|&= 0 \\ \left\| \mathbf {y}^k_{1,1}-\mathbf {w}^k \right\|&\le \alpha ^k \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| \le \alpha ^k \left( a+b\left\| \nabla f(\mathbf {w}^k) \right\| \right) \\ \left\| \mathbf {y}^k_{2,0}-\mathbf {w}^k \right\|&\le \left\| \mathbf {y}^k_{2,0}-\mathbf {y}^k_{1,1} \right\| +\left\| \mathbf {y}^k_{1,1}-\mathbf {w}^k \right\| \\&=\alpha ^k \left( \left\| \left[ \nabla f_1(\mathbf {y}^k_{1,1}) \right] _{\ell _2} \right\| + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| \right) \\&\le \alpha ^k \left( \left\| \left[ \nabla f_1(\mathbf {y}^k_{1,1}) -\nabla f_1(\mathbf {w}^k) \right] _{\ell _2} \right\| \right. \\&\left. \quad + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _2} \right\| + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| \right) \\&\le \alpha ^k \left( M \left\| \mathbf {y}^k_{1,1}-\mathbf {w}^k \right\| + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _2} \right\| \right. \\&\left. \quad + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| \right) \\&\le \alpha ^k \left( M \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _2} \right\| \right. \\&\left. \quad + \left\| \left[ \nabla f_1(\mathbf {w}^k) \right] _{\ell _1}\right\| \right) \\&\le \alpha ^k \left( M+2 \right) \left( a + b\left\| \nabla f(\mathbf {w}^k)\right\| \right) \\&\le \alpha ^k \left( a + b\left\| \nabla f(\mathbf {w}^k)\right\| \right) \end{aligned}$$

where without loss of generality we have redefined \(a(M+2)\) and \(b(M+2)\) as a and b.

$$\begin{aligned} \left\| \mathbf {y}^k_{2,1}-\mathbf {w}^k \right\|&\le \left\| \mathbf {y}^k_{2,1} -\mathbf {y}^k_{2,0} \right\| +\left\| \mathbf {y}^k_{2,0} -\mathbf {w}^k \right\| \\&= \alpha ^k\left\| \left[ \nabla f_2(\mathbf {y}^k_{2,0}) \right] _{\ell _1} \right\| +\left\| \mathbf {y}^k_{2,0} -\mathbf {w}^k \right\| \\&\le \alpha ^k\left\| \left[ \nabla f_2(\mathbf {y}^k_{2,0})-\nabla f_2(\mathbf {w}^k) \right] _{\ell _1} \right\| \\&\quad + \alpha ^k\left\| \left[ \nabla f_2(\mathbf {w}^k) \right] _{\ell _1} \right\| +\left\| \mathbf {y}^k_{2,0} -\mathbf {w}^k \right\| \\&\le \alpha ^k M \left\| \mathbf {y}^k_{2,0}-\mathbf {w}^k\right\| + \alpha ^k\left\| \left[ \nabla f_2(\mathbf {w}^k) \right] _{\ell _1} \right\| \\&\quad +\left\| \mathbf {y}^k_{2,0} -\mathbf {w}^k \right\| \\&\le \left( \alpha ^k M +1\right) \left\| \mathbf {y}^k_{2,0}-\mathbf {w}^k\right\| \\&\quad + \alpha ^k\left\| \left[ \nabla f_2(\mathbf {w}^k) \right] _{\ell _1} \right\| \\&\le \left( \alpha ^k M +1\right) \alpha ^k \left( a+ b\left\| \nabla f(\mathbf {w}^k)\right\| \right) \\&\quad + \alpha ^k\left( a+ b \left\| \nabla f(\mathbf {w}^k)\right\| \right) \\&\le \alpha ^k\left( \widehat{a} +\widehat{b} \left\| \nabla f(\mathbf {w}^k)\right\| \right) . \end{aligned}$$

This implies that there exist positive constants \(A, B\) such that

$$\begin{aligned} \Vert \mathbf {e}^k \Vert \le \alpha ^k\left( A+B \left\| \nabla f(\mathbf {w}^k)\right\| \right) . \end{aligned}$$

Then all the hypotheses of Proposition 1 hold and the thesis follows. \(\square \)

The assumptions made in Proposition 4 are the same as those made when proving convergence of IG in Proposition 3. Overall, the Lipschitz condition (10) is quite natural in the convergence analysis of finite-sum problems and directly implies that the whole objective function has a Lipschitz continuous gradient. On the other hand, condition (11) is a stronger and less usual assumption, requiring the gradient of each term to be linearly bounded by the full gradient. As observed in Bertsekas and Tsitsiklis (2000), this assumption is guaranteed to hold when the functions \(f_h\) are convex quadratics, as in the case of linear least squares.

3.2 Convergence of BIG with bounded away from zero stepsize

So far we have shown that a block-coordinate incremental gradient method converges to a stationary point as long as a diminishing stepsize is employed. However, the diminishing stepsize rule can be cumbersome to implement and may lead to slow convergence if it is not properly tuned. As a consequence, a more practical updating rule, commonly used when dealing with incremental gradient methods, is to keep the stepsize fixed for a certain number of iterations and then reduce it by a small factor. This updating rule is straightforward to implement and can be controlled more easily than the diminishing one.
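
A minimal sketch of such a rule is given below (our own example; the specific constants are arbitrary): the stepsize is kept constant for a fixed number of outer iterations, then shrunk by a multiplicative factor, but never below a floor \(\bar{\alpha }>0\), so that \(\lim _{k\rightarrow \infty }\alpha ^k=\bar{\alpha }>0\) as required by the analysis of this section.

```python
def stepsize(k, alpha0=0.1, alpha_bar=0.01, shrink=0.5, every=50):
    """Piecewise-constant stepsize: start from alpha0, multiply by `shrink`
    every `every` outer iterations, never going below the floor alpha_bar."""
    return max(alpha_bar, alpha0 * shrink ** (k // every))
```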

We have already seen that BIG can be written as a gradient method with error, i.e.

$$\begin{aligned} \mathbf {w}^{k+1}= \mathbf {w}^k-\alpha ^k(\nabla f(\mathbf {w}^k)-\mathbf {e}^k). \end{aligned}$$

Hence, in order to apply the results of Proposition 2 to the BIG method, we need to show that under some standard assumptions the error term in BIG satisfies condition (5). This is the aim of the following proposition.

Proposition 5

Let \(\{\mathbf {w}^k\}\) be a sequence generated by (12), (13) and (14). Assume that for each \(h\in \{1,\ldots , H\}\) condition (10) is satisfied, namely

$$\begin{aligned} \Vert \nabla f_h(\mathbf {u}) -\nabla f_h(\mathbf {w})\Vert \le M\Vert \mathbf {u}-\mathbf {w}\Vert \qquad \forall \ \mathbf {u},\mathbf {w} \in {\mathbf {R}}^n \end{aligned}$$

and there exists a positive constant \(\bar{B}\) such that

$$\begin{aligned} \Vert \nabla f_h(\mathbf {w}^k)\Vert \le \bar{B} \quad \forall \; h\in \{1,\ldots ,H\}. \end{aligned}$$
(15)

Further assume that the stepsize \(\alpha ^k\) satisfies

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha ^k=\bar{\alpha }>0. \end{aligned}$$

Then the error term \(\mathbf {e}^k\) satisfies condition (5).

Proof

Similarly to what is done in Proposition 4 (cf. the extended proof in “Appendix B”), the general BIG iteration can be written as

$$\begin{aligned} \mathbf {w}^{k+1} = \mathbf {w}^k-\alpha ^k(\nabla f(\mathbf {w}^k)-\mathbf {e}^k) \end{aligned}$$

where

$$\begin{aligned} \mathbf {e}^k = \sum _{h=1}^H\sum _{j=1}^L\left( [\nabla f_h(\mathbf {y}^k_{h,j-1})]_{\ell _j} - [\nabla f_h(\mathbf {w}^k)]_{\ell _j}\right) . \end{aligned}$$

As done in “Appendix B” to prove (19), thanks to the sample Lipschitz condition (10), we obtain the bound

$$\begin{aligned} \left\| \mathbf {e}^k\right\| \le M \sum _{h=1}^H\sum _{j=0}^{L-1} \left\| \mathbf {y}^k_{h,j} - \mathbf {w}^k \right\| . \end{aligned}$$

Now, reasoning in a similar way to what is done in Eq. (21) and thanks to hypothesis (15), we can obtain the following bound between two iterates

$$\begin{aligned} \left\| \mathbf {y}^k_{h,j} - \mathbf {w}^k \right\| \le (\alpha ^k M +1) \left\| \mathbf {y}^k_{p(h,j-1)}-\mathbf {w}^k \right\| +\alpha ^k \bar{B} \end{aligned}$$

where \(p(h,j-1)\) is used to denote the estimate before considering the term \(\mathbf {y}^k_{h,j}\), as described in (20). Thus, by iteratively applying this bound, we obtain

$$\begin{aligned} \Vert \mathbf {e}^k\Vert \le \alpha ^k \bar{B} \end{aligned}$$

for some positive constant which, with a slight abuse of notation, we still denote by \(\bar{B}\). \(\square \)

Then the following convergence result for BIG with a bounded away from zero stepsize directly applies by considering the results from Propositions 2 and 5.

Proposition 6

(Convergence of BIG - Bounded away from zero stepsize) Let \(\{\mathbf {w}^k\}\) be a sequence generated by (12), (13) and (14). Assume that all the iterates \(\mathbf {w}^k\) and \(\mathbf {y}_{h,j}^k\) belong to some bounded set D.

Assume that for each \(h\in \{1,\ldots , H\}\) conditions (10) and (15) hold and that the stepsize \(\alpha ^k\) satisfies

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha ^k=\bar{\alpha }>0, \end{aligned}$$

where \(\alpha ^k \in (\theta , 2/\widetilde{M}-\theta )\) with \(\theta \in (0,1/\widetilde{M}]\) and \(\widetilde{M}\) denotes the Lipschitz constant of \(\nabla f\) implied by (10). Then there exist a constant \(C>0\) (independent of \(\bar{\alpha }\)) and an accumulation point \(\bar{\mathbf {w}}\) of the sequence \(\{\mathbf {w}^k\}\) such that

$$\begin{aligned} \Vert \nabla f(\mathbf {\bar{\mathbf {w}}})\Vert \le C \bar{\alpha } \end{aligned}$$
(16)

Furthermore, if the sequence \(\{f(\mathbf {w}^k)\}\) converges, then every accumulation point \(\bar{\mathbf {w}}\) of the sequence \(\{\mathbf {w}^k\}\) satisfies (16).

Thus, Proposition 6 implies that BIG with a stepsize bounded away from zero can only reach a neighbourhood of a stationary point. This is, overall, a predictable result. Indeed, since the error term satisfies (5), if we consider the scalar product \(\nabla f(\mathbf {w}^k)^T\mathbf {d}^k\) and assume that \(\alpha ^k=\bar{\alpha }>0\), we obtain

$$\begin{aligned} \nabla f(\mathbf {w}^k)^T\mathbf {d}^k&= \nabla f(\mathbf {w}^k)^T\left( -\nabla f(\mathbf {w}^k)+\mathbf {e}^k\right) \\&\le -\Vert \nabla f(\mathbf {w}^k)\Vert ^2 +\Vert \nabla f(\mathbf {w}^k)\Vert \Vert \mathbf {e}^k\Vert \\&\le -\Vert \nabla f(\mathbf {w}^k)\Vert ^2 +\Vert \nabla f(\mathbf {w}^k)\Vert \bar{\alpha } \bar{B}. \end{aligned}$$

This shows how within the region

$$\begin{aligned} \left\{ \mathbf {w}\in {\mathbf {R}}^n\;:\;\Vert \nabla f(\mathbf {w})\Vert >\bar{\alpha } \bar{B}\right\} \end{aligned}$$

BIG computes directions which are actually descent directions, while in the complementary region the behaviour is unpredictable. Moreover, the size of this region depends linearly on the constant stepsize \(\bar{\alpha }\) employed.

It is interesting to note that, by letting \(\theta \rightarrow 0\), we recover the stepsize range \(\alpha \in (0,2/\widetilde{M})\), which is the same stepsize condition needed to prove convergence towards exact stationary points for the standard batch gradient method. That is, while BIG has a per-iteration cost much cheaper than batch gradient descent, up to H times cheaper, the price to pay is that it does not converge towards stationary points, but only reaches an \(\epsilon \)-accurate solution.

As underlined in Solodov (1998), the assumption on the norm of the error term in Proposition 2 (namely condition (5)) could be relaxed so as to consider the more general case \(\Vert \mathbf {e}^k\Vert \le \alpha ^k(a+b\Vert \nabla f(\mathbf {w}^k)\Vert )\) for some positive constants \(a, b\). However, this would lead to a third-degree inequality for determining the allowed interval of the stepsize \(\alpha ^k\), which is not trivial to solve.

Moreover, note that the boundedness assumption on the iterates is not very restrictive. Indeed, it is satisfied as long as the level set \(\{\mathbf {w}\;|\; f(\mathbf {w})\le \rho _1\}\) is bounded for some \(\rho _1>f(\mathbf {w}^0)\) and the iterates stay within that region, as is usually the case in the optimization problem behind training a deep neural network (Solodov 1998; Zhi-Quan and Paul 1994). Note also that the Lipschitz and boundedness conditions on the gradient of the objective function are satisfied whenever each term \(f_h\) is twice continuously differentiable and the iterates stay within a compact set.

Finally, we remark that, as a further example of an optimization problem where conditions (10) and (15) are satisfied (and consequently (11) holds as well), we can consider the LogitBoost algorithm (Collins et al. 2002). Indeed, given a classification problem, in the nonlinearly separable case and when the features are linearly independent on the training set, the objective function has a sample-wise Lipschitz gradient and each gradient is bounded above [see Remark 3 in Blatt et al. (2007) for a deeper discussion of the properties of the LogitBoost algorithm].

4 Discussion on numerical performance

As a block-coordinate descent method, BIG can lead to performance improvements in all those cases where the structure of the objective function can be leveraged to define easier subproblems (e.g. subproblems might be less computationally expensive to treat or might become separable). On the other hand, as an incremental method, BIG enjoys good properties when dealing with large-scale finite-sum problems, namely in all those cases where the function can be expressed as a large sum of similar terms, so that each gradient and objective function computation might require an excessive computational effort. Thus, BIG can be employed in all those cases where the objective function has both some block structure that can be exploited and a finite-sum structure.

With the aim of providing the reader with an application of BIG to a real problem, we can consider an estimation problem where, given data \(\{\mathbf {x}_h,y_h\}_{h=1}^H\), with \(\mathbf {x}_h\in {\mathbf {R}}^d\) representing the input features and \(y_h\in {\mathbf {R}}\) the output to be estimated, the relation between input and output is determined by means of a nonlinear least-squares problem of the form

$$\begin{aligned} \underset{\mathbf {w},\mathbf {v}}{\text { minimize }}\quad \sum _{h=1}^H\left( \phi (\mathbf {w};\mathbf {x}_h)^T\mathbf {v}-y_h\right) ^2 +\rho \Vert \mathbf {w}\Vert ^2+\rho \Vert \mathbf {v}\Vert ^2 \end{aligned}$$
(17)

where \(\phi (\mathbf {w};\mathbf {x}_h)\) represents a nonlinear transformation of the input \(\mathbf {x}_h\). This formulation is quite general and includes several applications, such as some kernel methods (Shawe-Taylor and Cristianini 2004) and neural networks (Goodfellow et al. 2016). Problem (17) presents a finite-sum, two-block structure that fits well with the advantages offered by BIG. Indeed, each term \(f_h\), regarded as a function of the block \(\mathbf {v}\) only, is strictly convex, while with respect to the block \(\mathbf {w}\) the problem is still nonlinear but has a smaller size, and further useful properties might emerge depending on the type of nonlinear transformation \(\phi \).
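
To illustrate this two-block structure, the short sketch below evaluates the data-fitting part of the h-th term of (17) and its gradients with respect to the two blocks; purely for illustration, \(\phi \) is taken to be a \(\tanh \) layer, \(\phi (\mathbf {w};\mathbf {x})=\tanh (W\mathbf {x})\) with \(\mathbf {w}=\mathrm {vec}(W)\), which is only one of the choices covered by (17). With these block gradients, an inner BIG step updates \(\mathbf {w}\) and \(\mathbf {v}\) in turn using a single sample h.

```python
import numpy as np

def term_and_block_grads(w, v, x, y, m):
    """Data-fitting part of the h-th term of (17):
    g_h(w, v) = (phi(w; x)^T v - y)^2, with phi(w; x) = tanh(W x)
    and W = reshape(w, (m, d)). Returns g_h and its block gradients."""
    W = w.reshape(m, -1)                  # block w as an m x d matrix
    phi = np.tanh(W @ x)                  # nonlinear features, in R^m
    r = phi @ v - y                       # scalar residual
    g = r ** 2
    grad_v = 2.0 * r * phi                                 # convex quadratic in v for fixed w
    grad_W = 2.0 * r * np.outer(v * (1.0 - phi ** 2), x)   # chain rule through tanh
    return g, grad_W.ravel(), grad_v
```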

As a particular instance of problem (17), extensive numerical results have been reported in Palagi and Seccia (2020) for the mean squared error optimization problem behind the training phase of a deep neural network. In this class of problems, indeed, a natural block decomposition with respect to the weights of each layer appears. The performance of BIG has then been analysed when the layered structure of the model is exploited. In particular, the standard IG method is compared with BIG when each layer of the model defines a block of variables \(\mathbf {w}_{\ell }\), and several numerical results are discussed. We do not report numerical results here; they can be found in Palagi and Seccia (2020). However, we remark that the numerical results in Palagi and Seccia (2020) suggest that BIG outperforms IG, especially on deeper and wider models, namely neural networks with a larger number of layers or neurons per layer. Moreover, from a machine learning perspective, it is interesting to underline that BIG seems to lead to better performance than IG also in terms of generalization error, namely the error on new samples never seen before by the estimation model.

5 Conclusion

In this paper, we have extended the convergence theory of incremental methods by providing convergence results of a block-coordinate incremental gradient method under two different stepsize updating rules. The analysis has shown how the BIG algorithm can be seen as a gradient method with errors; thus, its convergence can be proved by recalling known convergence results (Bertsekas and Tsitsiklis 2000; Solodov 1998).