1 Introduction

Stochastic gradient descent (SGD) is a general iterative algorithm for solving large-scale optimisation problems such as minimising a differentiable objective function \(f({\varvec{v}})\) parameterised by \({\varvec{v}} \in {\mathcal {V}}\):

$$\begin{aligned} \min _{{\varvec{v}}} f({\varvec{v}}) \end{aligned}$$
(1)

In several statistical models and machine learning problems, \(f({\varvec{v}})=\frac{1}{n}\sum _{i=1} ^{n}f_i({\varvec{v}})\) is the empirical average loss, where each \(f_i({\varvec{v}})\) denotes the loss evaluated at a data instance \(\varvec{x}_i\). SGD updates \({\varvec{v}}\) using gradients computed on one or a few data points. In this work, we are interested in problems involving non-convex objective functions such as variational inference (Blei et al., 2017; Hoffman et al., 2013; Wainwright et al., 2008; Jordan et al., 1999) and artificial neural networks (Chilimbi et al., 2014; Dean et al., 2012; Li et al., 2014a; Xing et al., 2015; Zhang et al., 2015), amongst many others. Non-convex problems abound in ML and are often characterised by a very large number of parameters (e.g., deep neural nets), which hinders their optimisation. This challenge is often compounded by the sheer size of the training datasets, which can be in the order of millions of data points. As the size of the available data increases, it becomes essential to boost the scalability of SGD by distributing and parallelising its sequential computation. The need for scalable optimisation algorithms is shared across application domains.
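For concreteness, the serial baseline that this paper sets out to scale up can be sketched in a few lines of Python (the quadratic loss and all constants below are purely illustrative):

```python
import numpy as np

def sgd(grad_fi, v0, n, eta=0.01, T=10_000, seed=0):
    """Minimal serial SGD for f(v) = (1/n) sum_i f_i(v): at each step,
    sample one index i and step along -grad f_i(v)."""
    rng = np.random.default_rng(seed)
    v = v0.copy()
    for _ in range(T):
        i = rng.integers(n)
        v -= eta * grad_fi(i, v)
    return v

# Toy instance: f_i(v) = 0.5 * (x_i^T v - y_i)^2, so grad f_i(v) = (x_i^T v - y_i) x_i
rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
v_hat = sgd(lambda i, v: (X[i] @ v - y[i]) * X[i], np.zeros(5), n=1000)
```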

Many studies have proposed to scale up SGD by distributing the computation over different computing units, taking advantage of advances in hardware technology. The main existing paradigms exploit either shared-memory or distributed-memory architectures. While shared memory is usually used to run an algorithm in parallel on a single multi-core machine (Recht et al., 2011; Zhao et al., 2017; Lian et al., 2015; Huo & Huang, 2016), distributed memory, on the other hand, is used to distribute the algorithm over multiple machines (Agarwal & Duchi, 2011; Lian et al., 2015; Zinkevich et al., 2010; Langford et al., 2009). Distributed SGD (DSGD) is appropriate for very large-scale problems where data can be distributed over a massive (theoretically unlimited) number of machines, each with its own computational resources and I/O bandwidth. However, DSGD's efficiency is bounded by the communication latency across machines. Parallel SGD (PSGD) takes advantage of the multiple fast processing units within a single machine, which communicate with higher bandwidth; however, the computational resources and I/O bandwidth are limited.

In this paper, we set out to explore the potential gains that can be achieved by leveraging the advantages of both the distributed and parallel paradigms in a unified approach. The proposed algorithm, DPSGD, curbs the communication cost by having each machine update a local copy of the parameter vector being optimised, \({\varvec{v}}\), multiple times between exchanges, performing parallel computation during these local updates. The distributed computation of DPSGD among multiple machines is carried out in an asynchronous fashion whereby workers compute their local updates independently (Lian et al., 2015). A master aggregates these updates to amend the global parameters. The parallel implementation of DPSGD on each local machine is lock-free, whereby multiple cores are allowed equal access to the shared memory to read and update the variables without locking (Zhao et al., 2017), i.e., they can read and write the shared memory simultaneously. We provide a theoretical analysis of the convergence rate of DPSGD for non-convex optimisation problems and prove that linear speed-up with respect to the number of cores and workers is achievable as long as these numbers are bounded by \(T^{1/4}\) and \(T^{1/2}\), respectively, where T is the total number of iterations. Furthermore, we empirically validate these results by developing two inferential algorithms relying on DPSGD. The first is an asynchronous lock-free stochastic variational inference algorithm (DPSVI) that can be deployed on a wide family of Bayesian models (see the Appendix); its potential is demonstrated here on a Latent Dirichlet Allocation problem. The second is DPSGD-based Deep Reinforcement Learning (DRL), which can be used to scale up the training of DRL networks for multiple tasks (see the Appendix).

The rest of the paper is organised as follows. Section 2 presents the related work. Section 3 presents the proposed algorithm and the theoretical study. In Sect. 4, we carry out experiments and discuss the empirical results. Finally, Sect. 5 draws some conclusions and suggests future work. The appendices present the proofs, the asynchronous distributed lock-free parallel SVI and the highly-scalable actor-critic algorithm.

2 Related work

We divide this section into two parts. In the first part, we discuss the literature relevant to distributed and parallel SGD, focusing on the theoretical aspects (Lian et al., 2015; Zhao et al., 2017; Fang & Lin, 2017; Huo & Huang, 2017; Bottou, 2010; Niu et al., 2011; Tsitsiklis et al., 1986; Elgabli et al., 2020; Wang et al., 2019; Recht et al., 2011; Leblond et al., 2017; Dean et al., 2012; Zhou & Cong, 2017; Yu et al., 2019; Lin et al., 2018; Stich, 2018). To keep the paper focused, and due to space limitations, only SGD-based methods for non-convex problems are covered. The second part covers related work focusing on the implementation/application of distributed and parallel algorithms (Hoffman et al., 2013; Mohamad et al., 2018; Neiswanger et al., 2015; Dean et al., 2012; Paine et al., 2013; Ruder, 2016; Li et al., 2014a; Abadi et al., 2016; Paszke et al., 2017; Babaeizadeh et al., 2016; Mnih et al., 2016; Clemente et al., 2017; Horgan et al., 2018; Espeholt et al., 2018; Nair et al., 2015; Adamski et al., 2018).

2.1 Theoretical aspects

A handful of SGD-based methods have been proposed recently for large-scale non-convex optimisation problems (De Sa et al., 2015; Lian et al., 2015; Zhao et al., 2017; Fang & Lin, 2017; Huo & Huang, 2017), which embrace either a distributed or a parallel paradigm. HOGWILD (Niu et al., 2011) comprises several asynchronous parallel SGD variants with locked and lock-free shared memory. The theoretical convergence analysis for convex objectives presented in that study was inspiring and has been adopted by most of the recent literature on asynchronous parallel optimisation algorithms. Similarly, Leblond et al. (2017) and De Sa et al. (2015) provided convergence analyses for asynchronous lock-free parallel SGD with shared memory for convex objectives, under relaxed assumptions on the sparsity of the problem. De Sa et al. (2015) also analysed the HOGWILD convergence for non-convex objectives. Asynchronous distributed and lock-free parallel SGD algorithms for non-convex objectives have also been studied in Lian et al. (2015), showing that linear speed-up with respect to the number of workers is achievable when that number is bounded by \(O(\sqrt{T})\). Improved versions using variance reduction techniques have recently been proposed in Huo and Huang (2017), Fang and Lin (2017) to accelerate convergence, achieving a linear rate instead of the sub-linear one of SGD. Although these and other algorithmic implementations are lock-free (Lian et al., 2015; Huo & Huang, 2017; Fang & Lin, 2017; De Sa et al., 2015), their convergence analyses were based on the assumption that no over-writing happens; hence, write-locks or atomic memory operations are needed to prove convergence. In contrast, Zhao et al. (2017) proposed a completely lock-free parallel implementation and analysis.

Different implementations exploiting both parallelism with shared memory and distributed computation across multiple machines have been proposed (Dean et al., 2012; Zhou & Cong, 2017; Yu et al., 2019; Lin et al., 2018; Stich, 2018). Except for Stich (2018) and Dean et al. (2012), these methods adopted synchronous SGD implementations; Dean et al. (2012) and Lin et al. (2018) focused on the implementation aspects, providing extensive empirical studies on deep learning models. While these implementation ideas are very similar to ours, we consider lock-free local parallelism with asynchronous distribution, and we provide a theoretical analysis. We also evaluate our approach on different ML problems, i.e., SVI and DRL (to be discussed in the next part). Instead of using a parameter server, the local learners in Zhou and Cong (2017) compute the average of their copies of the parameters at regular intervals through global reduction. Communication overhead is controlled by introducing a communication interval parameter into the algorithm. However, the theoretical analysis provided in Zhou and Cong (2017) does not establish a speed-up, and synchronisation is required for the global reduction. The authors of Yu et al. (2019) provide a theoretical study of the model averaging introduced in Zhou and Cong (2017), showing that linear speed-up of local SGD on non-convex objectives can be achieved as long as the averaging interval is carefully controlled. A similar study for convex problems with asynchronous worker communication by Stich (2018) shows linear speed-up in the number of workers and the mini-batch size with reduced communication. The multiple-step local SGD update of Stich (2018), aimed at reducing the communication overhead, is similar to our proposed algorithm. Nonetheless, we adopt local lock-free parallelism and asynchronous distribution with a parameter server scheme instead of model averaging. Finally, we point out that although we focus on the first-order SGD method (Bottou, 2010; Niu et al., 2011; Tsitsiklis et al., 1986; Elgabli et al., 2020; Wang et al., 2019), our study can be extended to second-order methods (Shamir et al., 2014; Jahani et al., 2020a; Ba et al., 2016; Crane & Roosta, 2019; Jahani et al., 2020b) and variance reduction methods (Huo & Huang, 2017; Fang & Lin, 2017), where the high noise of our multiple-step local update can be reduced, contributing to further speed-up. We leave this for future work.

2.2 Implementation aspects

The first effort to scale up variational inference is described in Hoffman et al. (2013), where gradient descent updates are replaced with SGD updates. Inspired by this work, Mohamad et al. (2018) replaced the SGD updates with asynchronous distributed SGD ones. This was achieved by computing the SVI stochastic gradient on each worker based on a few (mini-batched or single) data points acquired from distributed sources. The update steps are then aggregated to form the global update. In Neiswanger et al. (2015), the strategy consists of distributing the entire dataset across workers and letting each of them perform VI updates in parallel. This requires the workers to be synchronised at each iteration to combine their parameters. This synchronisation requirement limits scalability, since the maximum achievable speed is bounded by the slowest worker. Approaches for scaling up VI that rely on Bayesian filtering techniques have been reviewed in Mohamad et al. (2018).

Asynchronous SGD (ASYSG) (Lian et al., 2015), an implementation of SGD that distributes the training over multiple workers, has been adopted by DistBelief (Dean et al., 2012) (a parameter server-based algorithm for training neural networks) and Project Adam (Chilimbi et al., 2014) (another DL framework for training neural networks). Paine et al. (2013) showed that ASYSG can achieve noticeable speed-ups on small GPU clusters. Other similar works (Ruder, 2016; Li et al., 2014a) have also employed ASYSG to scale up deep neural networks. The two most popular recent DL frameworks, TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017), have embraced the Hogwild (Recht et al., 2011; Zhao et al., 2017) and ASYSG (Lian et al., 2015) implementations to scale up DL problems.

Distributed and parallel SGD have also been employed in deep reinforcement learning (DRL) (Babaeizadeh et al., 2016; Mnih et al., 2016; Clemente et al., 2017; Horgan et al., 2018; Espeholt et al., 2018; Nair et al., 2015; Adamski et al., 2018). In Babaeizadeh et al. (2016), a hybrid CPU/GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm (Mnih et al., 2016) was introduced. The study focused on mitigating the severe under-utilisation of GPU computational resources in DRL caused by the sequential nature of data generation. Unlike in Mnih et al. (2016), the agents do not compute the gradients themselves; instead, they send data to central learners that update the network on the GPU accordingly. However, as the number of cores increases, the central GPU learner becomes unable to cope with the data. Furthermore, the large amount of data requires a large storage capacity, and the internal communication overhead can affect the speed-up once the bandwidth reaches its ceiling. We note that a similar way of parallelising DRL is proposed by Clemente et al. (2017). Similarly, Horgan et al. (2018) propose to generate data in parallel using multi-core CPUs, where experiences are accumulated in a shared experience replay memory. Along the same lines, Espeholt et al. (2018) proposed to accumulate data via distributed actors and communicate it to the centralised learner where the computation is done. The architectures of these studies (Horgan et al., 2018; Espeholt et al., 2018) distribute the generation and selection of data instead of distributing locally computed gradients as in Nair et al. (2015). Hence, they require sending large amounts of information over the network when large batches of data are used, making the communication more demanding. Furthermore, the central learner has to perform most of the computation, limiting the scalability.

The work in Adamski et al. (2018) is the most similar to ours; it studies an SGD-based hybrid distributed-parallel actor-critic. The parallel algorithm of Mnih et al. (2016) is combined with the parameter server architecture of Nair et al. (2015) to allow a parallel distributed implementation of A3C on a computer cluster with multi-core nodes. Each node applies the algorithm in Babaeizadeh et al. (2016) to queue data in batches, which are used to compute local gradients. These gradients are then gathered from all workers, averaged and applied to update the global parameters. To reduce the communication overhead, the authors carried out a careful re-examination of the Adam optimiser's hyper-parameters, allowing large batches to be used. A detailed discussion of these methods and a comparison with our implementation are provided in the appendix.

3 The DPSGD algorithm and its properties

Before delving into the details of the proposed algorithm, we introduce the list of symbols in Table 1 that are used in the rest of the text.

Table 1 List of symbols

3.1 Overview of the algorithm

The proposed DPSGD algorithm assumes a star-shaped computer network architecture: a master maintains the global parameter \({\varvec{v}}\) (Algorithm 1) while the other machines act as workers which independently and simultaneously compute local parameters \({\varvec{u}}\) (Algorithm 2). The workers communicate only with the master, in order to read the state of the global parameter (line 3 in Algorithm 2) and to send the master their local updates, computed from the local parameters (line 10 in Algorithm 2). Each worker is assumed to be a multi-core machine, and the local parameters are obtained by running lock-free parallel SGD (see Algorithm 2), whereby all cores have equal access to the shared memory to read and update at any time without restriction (Zhao et al., 2017). The master aggregates a predefined number M of local updates coming from the workers (line 3 in Algorithm 1) and then computes its global parameter. The update step is performed as an atomic operation, such that the workers are locked out and cannot read the global parameter during this step (see Algorithm 1).

Algorithm 1 (DPSGD: master) and Algorithm 2 (DPSGD: worker): pseudocode listings
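Since the listings themselves are not reproduced here, the following self-contained Python sketch reconstructs the scheme from the textual description, under stated assumptions: the helper names are ours, asynchrony across workers is simulated sequentially, and the line numbers cited in the text refer to the original listings, not to this sketch.

```python
import threading
import numpy as np

def lock_free_local_sgd(grad_fi, u, n, B, p, eta, seed):
    """p threads each take B SGD steps on the shared vector u with no locks:
    reads and in-place writes to u may interleave arbitrarily."""
    def thread_loop(tid):
        rng = np.random.default_rng((seed, tid))
        for _ in range(B):
            i = rng.integers(n)
            u[:] -= eta * grad_fi(i, u)   # unsynchronised update of shared memory
    threads = [threading.Thread(target=thread_loop, args=(tid,)) for tid in range(p)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

def dpsgd(grad_fi, v0, n, T=50, M=2, B=10, p=4, eta=0.05, rho=0.5):
    """Toy single-process stand-in for DPSGD: the master aggregates M local
    updates per global step. Real workers run on separate machines, and their
    updates arrive asynchronously (with the delays tau of Sect. 3.2)."""
    v = v0.copy()
    for t in range(T):
        deltas = []
        for m in range(M):                # sequential stand-in for async workers
            v_old = v.copy()              # worker reads the (possibly stale) global v
            u = v_old.copy()
            lock_free_local_sgd(grad_fi, u, n, B, p, eta, seed=t * M + m)
            deltas.append(u - v_old)      # local update sent to the master
        v = v + rho * sum(deltas)         # atomic global update (workers locked out)
    return v

# Toy objective: f_i(v) = 0.5 * (x_i^T v - y_i)^2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.normal(size=500)
v_star = dpsgd(lambda i, u: (X[i] @ u - y[i]) * X[i], np.zeros(5), n=500)
```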

Note that the local distributed computations are carried out asynchronously, i.e., DPSGD does not lock the workers while the master updates the global parameter. Consequently, the workers might compute some of the stochastic gradients based on stale values of the global parameter. Similarly, lock-free parallelism implies that the local parameter can be updated by other cores between the time it is read and the time it is used to compute an update. Given this lack of synchronisation among workers and among cores, the parameter updates appear completely disordered, which makes the convergence analysis very difficult.

Following Zhao and Li (2016), we introduce a synthetic process that generates the final value of the local and global parameters after all threads and workers have completed their updates, as shown in Algorithms 1 and 2. That is, we generate an ordered sequence of synthetic values of \({\varvec{v}}\) and \({\varvec{u}}\) whose final element equals the final value of \({\varvec{v}}\). These synthetic values are used in the DPSGD convergence proof. The synthetic generation process is explained in the following section.

3.2 Synthetic process

Let t be the unique global iterate attached to the loop in Algorithm 1, b the unique local iterate attached to the inner loop in Algorithm 2, and m an index referring to the update vector computed by a worker \(n_m\in \{1,\ldots, nW\}\). If we omit the outer loop of Algorithm 2, its key steps are those that write (update) or read the local parameter.

3.2.1 Local synthetic write (update) sequence

As in Zhao and Li (2016), we assume all threads update the elements of \({{\textbf {u}}}\) in order from 1 to \({{\tilde{B}}}\), where \({{\tilde{B}}}= Bp\) and p is the number of threads. Thus, \(\{u_1,\ldots, u_{{{\tilde{B}}}-1}\}\) is a synthetic sequence that may never occur in the shared memory. However, it is employed to obtain the final value \(u_{{{\tilde{B}}}}\) after all threads have completed their updates in the inner loop of Algorithm 2. In other words, this ordered synthetic update sequence generates the same final value as the disordered lock-free update process. At iterate b, the synthetic update performed by a thread can be written as follows:

$$\begin{aligned} {\varvec{u}}_{b}={\varvec{u}}_{0}-\sum _{j=0}^{b-1}\eta S_j \nabla f_{i_{j}}(\varvec{{{\hat{u}}}}_{j}) \end{aligned}$$
(2)

where \(S_j\) is a diagonal matrix whose diagonal entries are 0 or 1, determining which dimensions of the parameter vector \(u_b\) have been successfully updated by the \(j^{th}\) gradient computed on the shared local parameter \({\hat{u}}_j\). That is, \(S_j(k,k)=0\) if dimension k is over-written by another thread and \(S_j(k,k)=1\) if dimension k is successfully updated by \(\nabla f_{i_{j}}(\varvec{{{\hat{u}}}}_{j})\) without over-writing. Equation 2 can be rearranged in iterative form as:

$$\begin{aligned} {\varvec{u}}_{b+1}={\varvec{u}}_{b}-\eta S_{b} \nabla f_{i_{b}}(\varvec{{{\hat{u}}}}_{b}) \end{aligned}$$
(3)

Including the outer loop and the global update in Algorithm 1, we define the synthetic sequence \(\{{\varvec{u}}_{t,m,b}\}\), corresponding to the \(b^{th}\) step of the per-worker loop for the \(m^{th}\) update vector associated with the \(t^{th}\) master loop:

$$\begin{aligned}&\text {Algorithm~2, line 3 refers to: } {\varvec{u}}_{t,m,0}={\varvec{v}}_{t-1}\nonumber \\&\text {Algorithm~2, line 8 refers to: } {\varvec{u}}_{t,m,b+1}={\varvec{u}}_{t,m,b}-\eta S_{t+\tau _{t,m},m,b}\nabla f_{i_{t+\tau _{t,m},m,b}}(\varvec{{{\hat{u}}}}_{t,m,b})\nonumber \\&\text {Algorithm~1, line 4 refers to: } \varvec{v}_{t}={\varvec{v}}_{t-1}+\rho _{t-1}(\sum _{m=1}^{M}\varvec{u}_{t-\tau _{t,m},m,{{\tilde{B}}}}-{\varvec{v}}_{t-1-\tau _{t,m}}) \end{aligned}$$
(4)

where \(\tau _{t,m}\) is the delay of the \(m^{th}\) global update for the \(t^{th}\) iteration caused by the asynchronous distribution. To compute \(\nabla f_{i_{t+\tau _{t,m},m,b}}(\varvec{{{\hat{u}}}}_{t,m,b})\), \(\varvec{{{\hat{u}}}}_{t,m,b}\) is read from the shared memory by a thread.

3.2.2 Local memory read

As noted earlier, \({\hat{u}}_b\) is the local parameter read from the shared memory, which is used by a thread to compute \(\nabla f_{i_{b}}(\varvec{{{\hat{u}}}}_{b})\). Using the synthetic sequence \(\{u_1,\ldots,u_{{{\tilde{B}}}-1}\}\), \({\hat{u}}_b\) can be written as:

$$\begin{aligned} {\hat{u}}_b =u_{a( b)}-\eta \sum _{j=a(b)}^{b-1}P_{b,j-a(b)}\nabla f_{i_j}({\hat{u}}_j) \end{aligned}$$
(5)

where \(a(b) <b\) is the last step of the inner loop whose updates have been completely written to the shared memory, and the \(P_{b,j-a(b)}\) are diagonal matrices whose diagonal entries are 0 or 1. The sum \(\sum _{j=a(b)}^{b-1}P_{b,j-a(b)}\nabla f_{i_j}({\hat{u}}_j)\) determines which dimensions of the new gradient updates, \(\nabla f_{i_j}({\hat{u}}_j)\), from step a(b) to \(b-1\) have been applied to \(u_{a(b)}\) to obtain \({\hat{u}}_b\). That is, \({\hat{u}}_b\) may contain some dimensions of the new gradients between steps a(b) and \(b-1\), including ones that might have been over-written by other threads. Including the outer loop and the global update in Algorithm 1, the local read becomes:

$$\begin{aligned} \varvec{{{\hat{u}}}}_{t,m,b}=\varvec{ u}_{t,m,a(b)}-\eta \sum _{j=a(b)}^{b-1}P^{t+\tau _{t,m},m}_{b,j-a(b)}\nabla f_{i_{t+\tau _{t,m},m,j}}(\varvec{{{\hat{u}}}}_{t,m,j}) \end{aligned}$$
(6)

The partial updates of the remaining steps between a(b) and \(b-1\) are thus defined by \(\{P^{t,m}_{b,j-a(b)}\}_{j=a(b)}^{b-1}\).
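To make the roles of the masking matrices concrete, the following toy snippet (illustrative values only) simulates one masked write in the sense of Eq. 3 and one partially-propagated read in the sense of Eq. 5:

```python
import numpy as np

# Toy illustration of the masking matrices: S_b marks which dimensions of the
# b-th gradient survived in shared memory (Eq. 3); the P matrices mark which
# pending gradients a reader has observed when forming the inconsistent read (Eq. 5).
d, eta = 4, 0.1
u_b  = np.ones(d)
grad = np.array([1.0, 2.0, 3.0, 4.0])

# Masked write: dimensions 1 and 3 were over-written by another thread
S_b  = np.diag([1.0, 0.0, 1.0, 0.0])
u_b1 = u_b - eta * S_b @ grad        # Eq. 3: only dims 0 and 2 advance

# Inconsistent read: starting from the last fully-written iterate u_{a(b)},
# the reader sees only some dimensions of the newer gradients
u_ab    = np.ones(d)
P       = np.diag([1.0, 1.0, 0.0, 0.0])
u_hat_b = u_ab - eta * P @ grad      # Eq. 5: a mix of old and new values
print(u_b1, u_hat_b)
```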

3.3 Convergence analysis

Using the synthetic sequence, we develop the theoretical results for DPSGD, showing that under some assumptions we can guarantee linear speed-up with respect to the number of cores (threads) and the number of nodes (workers). Before presenting the analysis, we introduce and explain the required assumptions:

Assumption 1

The function f(.) is smooth, that is to say, the gradient of f(.) is Lipschitzian: there exists a constant \(L>0\) such that, \(\forall {\varvec{x}}, {\varvec{y}}\),

$$\begin{aligned} ||\nabla f({\varvec{x}})-\nabla f({\varvec{y}})||\le L||{\varvec{x}}-{\varvec{y}}|| \end{aligned}$$

or equivalently,

$$\begin{aligned} f({\varvec{y}})\le f({\varvec{x}})+ \nabla f(\varvec{x})^T({\varvec{y}}-{\varvec{x}})+ \frac{L}{2}||\varvec{y}-{\varvec{x}}||^2. \end{aligned}$$

Assumption 2

The per-dimension over-writing defined by \(S_{t,m,b}\) is a random variable, independent of \(i_{t,m,j}\).

This assumption is reasonable since \(S_{t,m,b}\) is affected by the hardware, while \(i_{t,m,j}\) is independent thereof.

Assumption 3

The conditional expectation of the random matrix \(S_{t,m,b}\) on \({\varvec{u}}_{t,m,b}\) and \(\varvec{{{\hat{u}}}}_{t,m,b}\) is a strictly positive definite matrix, i.e., \({\mathbb {E}}[S_{t,m,b}| {\varvec{u}}_{t,m,b}, \varvec{{{\hat{u}}}}_{t,m,b}] =S\succ 0\) with the minimum eigenvalue \(\alpha >0\).

Assumption 4

The gradients are unbiased and bounded: \(\nabla f(\varvec{x})={\mathbb {E}}_i[\nabla f_i({\varvec{x}})]\) and \(||\nabla f_i({\varvec{x}})||\le V\), \(\forall i \in \{1,\ldots,n\}\).

It then follows that the variance of the stochastic gradient is bounded: \({\mathbb {E}}_i[||\nabla f_i({\varvec{x}})-\nabla f({\varvec{x}})||^2]\le \sigma ^2\), \(\forall {\varvec{x}}\), where \(\sigma ^2=V^2-||\nabla f({\varvec{x}})||^2\).

Assumption 5

The delays between old local stochastic gradients and the new ones in the shared memory are bounded: \(0\le b-a(b)\le D\). Likewise, the delays between stale distributed update vectors and the current ones are bounded: \(0\le \max _{t,m}\tau _{t,m}\le D'\).

Assumption 6

All random variables in \(\{i_{t,m,j}\}_{\forall t, \forall m, \forall j}\) are independent of each other.

Note that we are aware that this independence assumption is not fully accurate, owing to the potential dependency between data samples selected for computing gradients at the same shared parameters. For example, samples whose gradients are fast to compute for the same shared variable are selected more frequently, as they are likely to finish their gradient computation before the shared memory has been overwritten. Hence, the selected samples can be correlated. This can also affect the assumed independence between the over-writing matrix and the selected sample (Assumption 2). However, we follow existing studies (Zhao & Li, 2016; Zhao et al., 2017; Lian et al., 2015; Reddi et al., 2015; Duchi et al., 2015; De Sa et al., 2015; Lian et al., 2018; Hsieh et al., 2015), assuming DPSGD maintains the required independence conditions via Assumptions 2 and 6.

We are now ready to state the following convergence rate for any non-convex objective:

Theorem 1

If Assumptions 1 to 6 hold and the following inequalities are true:

$$\begin{aligned}&M^2{{\tilde{B}}}^2\eta ^2L^2\rho _{t-1}D'\sum _{n=1}^{D'}\rho _{t+n}\le 1 \end{aligned}$$
(7)
$$\begin{aligned}&\frac{1}{1-\eta -\frac{9\eta (D+1)L^2(\mu ^{D+1}-1)}{\mu -1}}\le \mu \end{aligned}$$
(8)

then, we can obtain the following results:

$$\begin{aligned}&\frac{1}{\sum _{t=1}^{T}\rho _{t-1}}\sum _{t=1}^{T}\rho _{t-1} \mathop {{\mathbb {E}}}[||\nabla f({\varvec{v}}_{t-1})||^2]\le \frac{2(f({\varvec{v}}_{0})-f({\varvec{v}}_{*}))}{ M{{\tilde{B}}}\eta \alpha \sum _{t=1}^{T}\rho _{t-1}} +\\&\frac{\eta ^2 L^2}{{{\tilde{B}}}\sum _{t=1}^{T}\rho _{t-1}} \sum _{t=1}^{T}\rho _{t-1}\bigg [V^2\bigg (\sum _{b=0}^{{{\tilde{B}}} -1}\frac{\mu (\mu ^b-1)}{\mu -1}+\tilde{B}\frac{\mu (\mu ^D-1)}{\mu -1}\bigg )+\\&M\tilde{B}^2\sigma ^2\sum _{j=t-1-D'}^{t-2}\rho _{j-1}^2\bigg ]+\frac{L\eta V^2}{\alpha \sum _{t=1}^{T}\rho _{t-1}} \sum _{t=1}^{T}\rho _{t-1}^2 \end{aligned}$$

where \({{\tilde{B}}}=pB\) and \({\varvec{v}}_*\) is the global optimum of the objective function in Eq. 1.

We denote by \(\mathop {{\mathbb {E}}}[.]\) the expectation over all random variables in Algorithm 2. Theorem 1 shows that the weighted average of the \(l_2\) norms of all gradients \(||\nabla f({\varvec{v}}_{t-1})||^2\) can be bounded, which indicates an ergodic convergence rate. It can be seen that speed-up can be achieved by increasing the number of cores and workers. Nevertheless, to reach such a speed-up, the learning rates \(\eta\) and \(\rho _t\) have to be set properly (see Corollary 1).

Corollary 1

By setting the learning rates to be equal and constant:

$$\begin{aligned} \rho ^2=\eta ^2=\frac{\sqrt{(f({\varvec{v}}_{0})-f(\varvec{v}_{*}))}}{A\alpha \sqrt{TM{{\tilde{B}}}}} \end{aligned}$$
(9)

such that \(A=L V^2\bigg (\frac{1}{\alpha }+\frac{1}{\alpha ^2}+ \frac{2\,L\mu }{(1-\mu )\alpha }\bigg )\), \(V>0\) and \(\mu\) is a constant with \(0<\mu <1\), then the conditions in Eqs. 7 and 8 lead to the following bound on T:

$$\begin{aligned} T&\ge \max \bigg \{\frac{M{{\tilde{B}}} L^2D'^2(f(\varvec{v}_{0})-f(\varvec{v}_{*}))}{A^2\alpha ^2},\nonumber \\&\frac{\big (f(\varvec{v}_{0})-f(\varvec{v}_{*})\big )\big (\mu (\mu -1)+9L^2\mu (D+1)(\mu ^{D+1}-1)\big )^4}{M\tilde{B}A^2\alpha ^2(\mu -1)^8}\bigg \} \end{aligned}$$
(10)

and Theorem 1 gives the following convergence rate:

$$\begin{aligned} \frac{1}{T}\sum _{t=1}^{T} \mathop {{\mathbb {E}}}[||\nabla f(\varvec{v}_{t-1})||^2] \le 3A\sqrt{\frac{f({\varvec{v}}_{0})-f(\varvec{v}_{*})}{TM{{\tilde{B}}}}} \end{aligned}$$
(11)

This corollary shows that, by setting the learning rates to certain values and setting the number of iterations T to be greater than a bound depending on the maximum delays allowed, a convergence rate of \(O(1/\sqrt{TMpB})\) can be achieved, and this rate is delay-independent. The negative effects of using old parameters (asynchronous distribution) and of over-writing the shared memory (lock-free parallelism) vanish asymptotically. Hence, to achieve speed-up, the number of iterations has to exceed a bound controlled by the maximum delay parameters, the number of local iterations B (line 5 in Algorithm 2), the number of global updates M (line 3 in Algorithm 1) and the number of parallel threads (cores) p.

3.4 Discussion

Using Corollary 1, we can recover the results of the lock-free parallel optimisation algorithm (Zhao et al., 2017) and the asynchronous distributed optimisation algorithm (Lian et al., 2015) as particular cases. By setting the number of threads \(p=1\) and the number of local updates \(B=1\), we obtain the distributed asynchronous algorithm presented in Lian et al. (2015); the convergence bound of Corollary 1 then becomes \(O(1/\sqrt{TM})\), which is equivalent to that of Corollary 2 in Lian et al. (2015). By synchronising the global learning (\(D'=0\)) and setting the master batch size \(M=1\) and the number of global iterations \(T=1\), we obtain the lock-free parallel algorithm presented in Zhao et al. (2017); the convergence bound of Corollary 1 then becomes \(O(1/\sqrt{pB})\), which is equivalent to that of Theorem 1 in Zhao et al. (2017). The experiments below empirically demonstrate these two parallel and distributed particular cases of DPSGD.

Since \(D'\) and D are related to the number of workers and cores (threads), respectively, bounding these two quantities allows speed-up with respect to the number of workers and cores with no loss of accuracy. The satisfaction of Eq. 10 is guaranteed if:

$$\begin{aligned} T\ge \frac{M{{\tilde{B}}} L^2D'^2(f(\varvec{v}_{0})-f({\varvec{v}}_{*}))}{A^2\alpha ^2} \end{aligned}$$

and

$$\begin{aligned} T\ge \frac{\big (f({\varvec{v}}_{0})-f(\varvec{v}_{*})\big )\big (\mu (\mu -1)+9L^2\mu (D+1)(\mu ^{D+1}-1)\big )^4}{M\tilde{B}A^2\alpha ^2(\mu -1)^8} \end{aligned}$$

The first inequality leads to \(O(T^{1/2})\ge D'\); thus, the upper bound on the number of workers is \(O(T^{1/2})\). Since \(0<\mu <1\), the second inequality can be written as \(O(T^{1/4})\ge \big (\mu (1-\mu )+9\,L^2\mu (D+1)(1-\mu ^{D+1})\big )\); hence \(O(T^{1/4})\ge D\), and the upper bound on the number of cores (threads) is \(O(T^{1/4})\). The convergence rate for serial and synchronous parallel stochastic gradient (SG) is \(O(1/\sqrt{T})\) (Ghadimi & Lan, 2013; Dekel et al., 2012; Nemirovski et al., 2009). While the workload of each worker running DPSGD is almost the same as that of the serial or synchronous parallel SG, the progress made by DPSGD is Mp times faster than that of serial SG.
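For a rough sense of scale, the following snippet evaluates these bounds with all constants absorbed into the O-notation:

```python
# Orders of magnitude implied by the bounds above (constants ignored):
# number of workers <= O(T^{1/2}), number of cores (threads) <= O(T^{1/4}).
for T in (10**4, 10**6, 10**8):
    print(f"T = {T:>9,}: workers = O({round(T**0.5):,}), cores = O({round(T**0.25):,})")
# T =    10,000: workers = O(100),    cores = O(10)
# T = 1,000,000: workers = O(1,000),  cores = O(32)
# T = 100,000,000: workers = O(10,000), cores = O(100)
```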

In addition to the speed-up, DPSGD allows one to steer the trade-off between multi-core local computation and multi-node communication within the cluster by controlling the parameter B. Traditional methods reduce the communication cost by increasing the batch size, which decreases the convergence rate, increases the local memory load and decreases the local input bandwidth. On the contrary, increasing B in DPSGD can increase the speed-up provided some assumptions are met (see Theorem 1 and Corollary 1). This ability makes DPSGD easily adaptable to a diverse spectrum of large-scale computing systems with no loss of speed-up.

Denote by Tc the communication time needed for each master-worker exchange. For simplicity, we assume that Tc is fixed and the same for all nodes. If the time needed to compute one update is \(Tu\le Tc\), then the total time needed by the distributed algorithm, \(DTT=T*(Tu+Tc)\), could be higher than that of sequential SGD, \(STT=M*T*Tu\). In such cases, existing distributed algorithms increase the local batch size so that Tu increases, resulting in lower stochastic gradient variance and allowing a higher learning rate, hence a better convergence rate. This introduces a trade-off between computational efficiency and sample efficiency: increasing the batch size by a factor of k increases the time needed for local computation by O(k) and reduces the variance proportionally to 1/k (Bottou et al., 2018), so a higher learning rate can be used. However, there is a limit on the size of the learning rate; in other words, maximising the learning speed with respect to the learning rate and the batch size has a global solution. This maximum learning speed can be improved using DPSGD, which performs B times fewer communication steps. For mini-batch SGD with mini-batch size G, the convergence rate can be written as \(O(1/\sqrt{GT})\). Since the total number of examples examined is GT while the improvement is only a factor of \(\sqrt{G}\), the convergence speed degrades as the mini-batch size increases. The convergence rate of DPSGD with mini-batch size G can be deduced from Theorem 1 as \(O(1/\sqrt{BMGT})\): a \(\sqrt{BM}\) better convergence rate than mini-batch SGD, and a \(\sqrt{BM}\) better convergence rate than standard asynchronous SGD with B times less communication. These improvements are studied in the following.
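A back-of-the-envelope model makes the amortisation effect of B concrete (Tu and Tc values below are purely illustrative of a communication-bound regime):

```python
# Tu = time per local update, Tc = time per master-worker exchange.
# DPSGD amortises one exchange over B local updates.
Tu, Tc = 1.0, 10.0                        # illustrative: communication-bound (Tu <= Tc)

for B in (1, 4, 16, 64):
    time_per_update = (B * Tu + Tc) / B   # wall-clock cost per local update
    print(f"B = {B:>2}: {time_per_update:.2f} time units per update")
# B =  1: 11.00, B =  4: 3.50, B = 16: 1.62, B = 64: 1.16 -> approaches Tu = 1.0
```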

4 Experiments

In this section, we empirically verify the speed-up gains expected from the theoretical analysis. First, we apply the distributed parallel stochastic variational inference (DPSVI) algorithm to a Latent Dirichlet Allocation (LDA) analysis problem. DPSVI is derived from DPSGD by replacing the stochastic gradient of SVI with DPSG so as to scale up the inference computation over a multi-core cluster (see the appendix for more details). For the LDA analysis problem, we use the SVI algorithm (Hoffman et al., 2013) as a benchmark. The evaluation is done on 300,000 news articles from the New York Times corpus.

Furthermore, we use DPSGD to scale up the training of a DRL algorithm, namely the Advantage Actor-Critic (A2C) algorithm, implementing a highly scalable A2C (HSA2C) (details in the appendix). We compare HSA2C against other distributed A2C implementations using a testbed of six Atari games and demonstrate an average training time of 21.95 minutes, compared to over 13.75 hours for the baseline A3C. In particular, HSA2C shows a significant speed-up on Space Invaders, with a learning time below 10 minutes compared to the 30 minutes achieved by the best competitor.

4.1 Variational inference

The development of the proposed DPSVI algorithm follows from DPSGD, but in the context of VI. In Appendix 7, we characterise the entire family of models to which DPSVI is applicable, which is shown to be equivalent to the family of models to which SVI applies. Next, DPSVI is derived from DPSGD. Finally, we derive an asynchronous distributed lock-free parallel inference algorithm for LDA as a case study of DPSVI.

Datasets We use the NYTimes corpus (Lichman, 2013) containing 300,000 news articles from the New York Times. The data is pre-processed by removing all words not found in a dictionary containing the 102,660 most frequent words; see Lichman (2013) for more information. We reserve 5,000 documents from the NYTimes data as a validation set and another 5,000 documents as a testing set.

Performance The performance of the LDA model is assessed using a model-fit measure, perplexity, defined as the geometric mean of the inverse marginal probability of each word in the held-out set of documents (Blei et al., 2003). We also compute the running time speed-up (TSP) (Lian et al., 2015), defined as

$$\begin{aligned} TSP&=\frac{ T( \text {SVI}) }{ T( \text {DPSVI}) } \end{aligned}$$
(12)

where \(T(\cdot )\) denotes the running time, measured when both models achieve the same final held-out perplexity on 5,000 documents.

Parameters In all experiments, the number of LDA topics is \(K = 50\). SVI LDA is run on the training set for \(\kappa \in \{0.5, 0.7, 0.9\}\), \(\tau _0\in \{1, 24, 256, 1024\}\), and \(batch\in \{16, 64, 256, 1024\}\). The best-performing parameters, \(batch=1024\), \(\kappa =0.5\) and \(\tau _0=1\), providing a perplexity of 5501, are used (Table 1 in Mohamad et al. (2018) summarises the best settings with the resulting perplexity on the test set). As for the DPSGD LDA version, the local batch size G (see Eq. 52) is set to 64 and M to 16. We evaluate a range of learning rates \(\eta =\rho \in \{0.2,0.1,0.05, 0.01\}\) with M, p and B set to 1. The best learning rate, 0.1, providing a held-out perplexity of 5501, is used. For different B, M and p, the learning rate is changed according to Corollary 1:

$$\begin{aligned} \rho '=\rho \bigg (\frac{pBM}{p'B'M'}\bigg )^{0.25} =\frac{0.1}{(p'B'M')^{0.25}} \end{aligned}$$
(13)
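In code, this rescaling amounts to a one-line helper (the configuration in the usage example is hypothetical):

```python
def rescale_learning_rate(p, B, M, base_rate=0.1):
    """Eq. 13: rescale the base rate (tuned at p = B = M = 1) for a new
    (p, B, M) configuration, following the scaling of Corollary 1."""
    return base_rate / (p * B * M) ** 0.25

# e.g. a hypothetical configuration with 10 threads, B = 15, M = 36:
rho = rescale_learning_rate(p=10, B=15, M=36)   # ~ 0.0117
```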

All DPSGD LDA experiments were performed in a high-performance computing (HPC) environment using the message passing interface (MPI) for Python (MPI4py). The cluster consists of 10 nodes, including the head node, each with a single-socket processor with 6 cores and 2 threads per core.

4.1.1 Node speed-up

Here, we study the speed-up of DPSVI with respect to the number of workers, with \(p=1\) and \(B=1\). DPSVI LDA is compared against serial SVI (\(B=1\), \(p=1\) and \(nW=1\)). We run DPSVI for various numbers of workers \(nW\in \{4, 9, 14, 19 \}\). The number of nodes equals nW as long as nW does not exceed 9. As nW grows beyond the available nodes, the processor cores of the nodes are employed as workers until all cores (threads) of each node are used, i.e., \(9\times 12=108\). The batch size M is fixed to 36. Figure 1 summarises the total speed-up (i.e., TSP measured at the end of the algorithm) with respect to the number of workers, with the achieved perplexity being almost the same. The result shows linear speed-up as long as the number of workers is less than 14; the speed-up then slowly becomes sub-linear and is expected to drop for higher numbers of workers as the maximum communication bandwidth is reached.

Fig. 1 LDA analysis: running time speed-up (TSP) with respect to the number of workers

4.1.2 Thread speed-up

In this section, we study the speed-up of DPSVI with respect to the number of threads, with \(nW=1\). We empirically set B to 15. Similarly to the node speed-up analysis, experiments are run for different \(p\in \{3,5,8,10\}\), and DPSVI is compared against serial SVI. The results are shown in Fig. 2: linear speed-up is observed as long as the number of threads is less than 8; the speed-up then slowly becomes sub-linear and is expected to worsen for higher numbers of threads. This drop is due to hardware communication and other factors affecting the CPU power.

Fig. 2 LDA analysis: running time speed-up (TSP) with respect to the number of threads

4.1.3 Node-thread speed-up

Finally, we study the speed-up of DPSVI with respect to the number of nodes and threads. To simplify the experiments, we take the number of cores to be equal to the number of nodes. Experiments are run for different \(p=nW\in \{2,4,6,8\}\). We also present results with different \(B\in \{5,10,15,20\}\) in order to show the effect of steering the trade-off between local computation and communication. DPSVI is compared against serial SVI, and the results are shown in Fig. 3. The speed-up slows down as the number of threads and nodes exceeds 6, owing to communication and other hardware factors. However, this slow-down is less pronounced for higher B, which illustrates the advantage of reducing the communication overhead as it approaches its ceiling. Note that for a very high number of workers, increasing B might not be very helpful, as our theoretical results show that a high B tightens the bound on the number of workers allowed for the speed-up to hold. Figure 4 reports the perplexity on the training set with respect to the running time in seconds (logarithmic scale) with \(B=15\). Five curves are drawn for different node-thread numbers, where DPSVI-n denotes DPSVI with n nodes and threads. The convergence and speed-up of DPSVI are clearly illustrated.

Fig. 3 LDA analysis: running time speed-up (TSP) with respect to the number of workers and threads

Fig. 4 LDA analysis using DPSVI: perplexity (model fit) with respect to running time in seconds (logarithmic scale)

4.2 Deep reinforcement learning

We use six different Atari games to study the performance gains achieved by the proposed HSA2C algorithm, using the Atari 2600 emulator (Bellemare et al., 2013) provided by the OpenAI Gym framework (Brockman et al., 2016). This emulator is one of the most commonly used benchmark environments for RL algorithms. Here, we use Pong, Boxing, Seaquest, Space Invaders, Amidar and Qbert, which have been included in related work (Mnih et al., 2016; Adamski et al., 2017, 2018; Babaeizadeh et al., 2016). These games are used to evaluate the effect of reducing the communication bottleneck when using an increasingly higher number of local steps, B, with different numbers of nodes. We also study the speed-up achieved by HSA2C with respect to the number of nodes. Finally, we compare against the performance reported by various state-of-the-art algorithms (Mnih et al., 2016; Adamski et al., 2017, 2018; Babaeizadeh et al., 2016).

4.2.1 Implementation details

HSA2C has been implemented and tested in a high-performance computing (HPC) environment using the message passing interface (MPI) for Python (MPI4py 3.0.0) and PyTorch 0.4.0. Our cluster consists of 60 nodes with 28 2.4 GHz CPUs per node. In our experiments, we used the same input pre-processing as Mnih et al. (2015). Each experiment was repeated 5 times (each with an action repeat of 4) and the average results are reported. The agents used the neural network architecture described in Mnih et al. (2013): a convolutional layer with 16 filters of size 8 x 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 x 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The network has two sets of outputs: a softmax output with one entry per action, representing the probability of selecting that action, and a single linear output representing the value function. The local learning rate, mini-batch size and optimiser setting match those reported in Adamski et al. (2018) in order to provide a fair comparison with their asynchronous mini-batch implementation. The global learning rate was set to 0.01 for the SGD optimiser with 0.5 momentum. The global batch size was set to the number of utilised nodes.
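For reference, a sketch of this architecture in (modern) PyTorch follows; the 84x84x4 input and the resulting 9x9 feature map are assumptions based on the standard Atari preprocessing of Mnih et al. (2015):

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Sketch of the architecture described above (Mnih et al., 2013):
    two conv layers and one fully connected layer (each followed by a ReLU),
    then a softmax policy head and a linear value head."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
        )
        self.policy = nn.Linear(256, n_actions)  # action probabilities (after softmax)
        self.value = nn.Linear(256, 1)           # state-value estimate

    def forward(self, x):
        h = self.features(x)
        return torch.softmax(self.policy(h), dim=-1), self.value(h)

net = A3CNet(n_actions=6)
probs, v = net(torch.zeros(1, 4, 84, 84))   # dummy batch of one stacked frame
```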

4.2.2 Speed-up analysis

In this section, we study the effect of B on various aspects of the scalability of HSA2C with respect to the number of nodes. Figure 5 shows the average speed of data generation by the distributed actors, measured in data points per second, on the Space Invaders game. Comparable figures can be obtained for the other games, but they are not reported here. Notice that at \(B=1\) the speed of data generation stays about the same as the number of nodes increases. This is due to the communication cost, which grows with the number of nodes used; that is, the expected waiting time of each node's exchange with the master increases. Increasing B reduces the number of exchanges while performing the same number of local updates. This is illustrated in Fig. 5, where the data generation speed increases with the number of nodes as B increases.

Fig. 5 The average data generation speed of HSA2C, measured in data points per second within a 30-minute run on the Space Invaders game

Figure 6 shows the time (in seconds) required to reach the highest scores on Pong, Boxing, Seaquest, Space Invaders, Amidar and Qbert achieved with \(B= 1\) in a 30-minute run, over different numbers of nodes and values of B. The aim of these figures is to demonstrate the potential performance gains achieved by HSA2C as B increases, in comparison to distributed deep reinforcement learning (DDRL) A3C, which is algorithmically equivalent to HSA2C with \(B=1\) (our baseline). To produce these figures, we initially carried out a search to empirically determine the highest score that can be achieved by HSA2C within a period of 30 minutes when \(B=1\). The search for the highest score was performed on four different cluster sizes: 20, 30, 40, and 60 nodes. The figures present the time required for various HSA2C parameter settings to reach that benchmark score. These experimental results clearly show the impact of B on the communication costs and confirm the findings of Fig. 5. With a larger number of nodes, more communication exchanges are required for each update, and performing more local updates (i.e., increasing B) reduces the communication exchanges needed to reach a given score without sacrificing much learning performance; thus, a better speed-up can be achieved. On the other hand, with a smaller number of nodes, increasing B does not make a significant difference in reducing the communication, whilst the negative effect of the increased variance becomes significant as the size of the learning batch shrinks (depending on the number of nodes). Overall, there is evidence of a variance-communication trade-off controlled by B.

Fig. 6 The time (in seconds) required to reach the reference solution (the highest score achieved with \(B=1\) in a 30-minute run) over a range of node counts and values of B

The multiple local updates B mitigate the speed-up limit caused by the higher communication cost of a higher number of nodes. Choosing the right B for different numbers of nodes allows HSA2C to scale better than with \(B=1\). For all six games, we have found that increasing the number of nodes beyond 40 does not lead to better performance. This is due to the higher variance entailed by a higher B: the performance improvement coming from the communication reduction is overtaken by the entailed variance when using more than 40 nodes. This limit could be overcome in two ways: either by decreasing the communication without further increasing the variance, or by directly mitigating the variance problem.

4.2.3 Comparison

We show the effectiveness of the proposed approach by comparing it against similar scalable actor-critic optimisation approaches. The most similar work on speeding up Atari game training is presented in Adamski et al. (2018), Mnih et al. (2016), Adamski et al. (2017), and Babaeizadeh et al. (2016). The algorithm in Adamski et al. (2018) (DDRL A3C) is a particular case of HSA2C in which the communication is synchronised and the number of iterations of the per-worker loop is set to one (\(B=1\)). GA3C (Babaeizadeh et al., 2016) is a hybrid GPU/CPU flavour of A3C focused on batching data points in order to better utilise the massively parallel nature of GPU computations; it is similar to the single-node algorithm BA3C (Adamski et al., 2017).

Table 2 presents the best score and time (in minutes) HSA2C achieves using the best B values found in Fig. 6, in comparison with the competitors. The reported scores are taken from the original papers. As GA3C, BA3C and A3C are parallel single-node algorithms, their experimental settings are not comparable to ours. This comparison shows that our approach achieves a better score than all competitors. In particular, we achieve an average score of 665 in an average time of 30.63 minutes using 560 total CPU cores, compared to the DDRL A3C score of 650 in an average time of 82.5 minutes with 778 total CPU cores. Most importantly, this comparison validates the effectiveness of our proposed approach in reducing communication while preserving performance. This is clearly shown in the comparison between our approach and our implementation of DDRL A3C (the top competitor in Table 2) using the same setting (see Sect. 4.2.2).

For this study, we decided not to include GPU-based implementations such as Stooke and Abbeel (2018), as our focus here is on CPU-enabled methods. However, HSA2C is generic and lends itself to GPU-based implementations whereby each node consists of multiple CPUs and a GPU. In such a case, local computation and simulation can be done on the CPU/GPU units, and our multiple-local-update approach can further speed up the standard communication-based DA3C (Stooke & Abbeel, 2018). The empirical work reported here provides an initial validation of the underlying idea.

Table 2 Best scores and the corresponding times in minutes achieved by HSA2C using the best B from Fig. 6 and considering 20, 30, 40 and 60 nodes, compared to the best results reported by competitors

5 Conclusion

We have proposed DPSGD, a novel asynchronous distributed and lock-free parallel optimisation algorithm, and implemented it on a computer cluster with multi-core nodes. Both theoretical and empirical results have shown that DPSGD leads to speed-up on non-convex problems. The paper has shown how DPSVI and HSA2C are derived from DPSGD: both are asynchronous distributed and lock-free parallel implementations of, respectively, stochastic variational inference (SVI) and the advantage actor-critic (A2C). The empirical results have validated the theoretical findings and enabled comparisons against similar state-of-the-art methods.

Going forward, further improvements and validations could be achieved by pursuing research along five directions: (1) employing variance reduction techniques to improve the convergence rate (from sub-linear to linear) while guaranteeing multi-node and multi-core speed-up; (2) proposing a framework enabling dynamic trade-offs between local computation and communication; (3) proposing techniques to improve the local optimum of the distributed parallel algorithms; (4) applying DPSVI to other members of the family of models stated in the appendix; (5) applying DPSGD to other large-scale deep learning problems.