Distributed computing plays a significant role in today's smart applications across various fields. In this chapter, we first give a bird's-eye view of this topic, introducing various programming paradigms. Next, we introduce Actor, an OCaml-based distributed computing engine, and explain how it works together with Owl. We then focus on one key element in distributed computing: synchronization. We introduce four different types of synchronization methods, or "barriers," that are commonly used in current systems, explain how these barriers are designed, and illustrate them from a theoretical perspective. Finally, we use evaluations to show the performance trade-offs of the different barriers.

10.1 Distributed Machine Learning

Machine learning has achieved phenomenal breakthroughs in various fields, such as image recognition, language processing, gaming, product management, and healthcare. The power of machine learning lies in utilizing the growing size of training data and models so as to achieve high accuracy. As a large amount of data is increasingly generated from mobile and edge devices (smart homes, mobile phones, wearable devices, etc.), it becomes essential for many applications to train machine learning models in parallel across many nodes. In distributed learning, a model is trained via the collaboration of multiple workers. One of the most commonly used training methods is stochastic gradient descent (SGD), which iteratively optimizes the given objective function until it converges by following the gradient direction of the objective. In each iteration of SGD, a gradient is typically calculated on a batch of training data, and the model parameters are then updated by moving along the gradient direction with a certain step size. There are mainly three paradigms for performing distributed machine learning: parameter servers (PS), All-Reduce, and decentralized (or peer-to-peer) approaches.
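To make this concrete, the following minimal sketch shows what one such SGD step looks like using Owl's dense ndarrays. The grad function, which returns the mini-batch gradient, is a hypothetical placeholder; in a real application it would come from the model being trained.

(* A minimal sketch of one SGD step on Owl ndarrays.  The grad argument,
   which returns the gradient of the objective on a mini-batch, is a
   hypothetical placeholder for the model's backward pass. *)
module Arr = Owl.Arr

let sgd_step ~lr ~grad params batch =
  (* move the parameters one step against the gradient direction *)
  Arr.sub params (Arr.mul_scalar (grad params batch) lr)

let train ~lr ~grad params batches =
  (* each iteration refines the parameters using one batch of data *)
  List.fold_left (fun p b -> sgd_step ~lr ~grad p b) params batches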

The parameter server (PS) [34] is a frequently used distributed training architecture. The server keeps the model parameters; the workers pull the parameters, compute the gradients, and push them back to the server for aggregation. It is commonly implemented as a key-value store. Another paradigm is All-Reduce. In this paradigm, the All-Reduce operation expects each participating process to provide an equally sized tensor, collectively applies a given arithmetic operation to the input tensors from all processes, and returns the same result tensor to each participant. A naive implementation could simply let every process broadcast its input tensor to all peers and then apply the arithmetic operation independently. The Ring All-Reduce architecture organizes workers in a ring structure to utilize the bandwidth more effectively. The Horovod framework provides a high-performance implementation of All-Reduce.
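The semantics of the naive All-Reduce just described can be sketched in a few lines of OCaml. This toy version only illustrates the reduce-and-redistribute behavior and does not model any actual communication.

(* A toy illustration of All-Reduce semantics: every process contributes
   an equally sized tensor, a reduction is applied across all of them,
   and every process receives the same reduced tensor back.  No actual
   communication takes place in this sketch. *)
module Arr = Owl.Arr

let all_reduce ?(op = Arr.add) inputs =
  match inputs with
  | [] -> invalid_arg "all_reduce: no participating process"
  | hd :: tl ->
    let reduced = List.fold_left op hd tl in
    (* a real implementation would send the reduced tensor back to every
       process; here we simply return one copy per participant *)
    List.map (fun _ -> Arr.copy reduced) inputs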

Besides these two, the decentralized architecture has drawn increasing attention. It allows point-to-point communication between nodes according to a communication graph. The peer-to-peer approach can effectively mitigate problems such as communication bottlenecks and the unfairness caused by information concentration, and it provides more opportunities for optimization. One of the most commonly used algorithms in the P2P training paradigm is decentralized parallel SGD (D-PSGD) [36], where each node has its own set of parameters and only synchronizes with its neighbors in the graph. Over time, local information gradually propagates across the whole graph during training. Many challenges remain to be addressed before a model with good performance can be trained in a decentralized system.

These training paradigms are supported, at least in part, by various popular learning frameworks, such as TensorFlow and PyTorch. Normally, these frameworks rely on high-performance computing backends to provide efficient communication. For example, NCCL is a stand-alone library of standard communication routines for GPUs; it implements various communication patterns and has been optimized to achieve high bandwidth. Another communication backend is the Intel MPI Library, a multifabric message-passing library that implements the open source MPICH specification. It aims to help developers create, maintain, and test advanced, complex applications that perform well on high-performance computing clusters.

One of the emerging distributed training paradigms is Federated Learning [6]. Federated Learning allows machine learning tasks to take place without requiring data to be centralized. There are a variety of motivations behind it, for example, maintaining privacy by preventing individuals from revealing their personal data to others, or reducing latency by allowing data processing to take place closer to where and when the data is generated. For these reasons, it has been gaining popularity in various research and application fields. Federated Learning emphasizes that the training data are not always IID; that is, a device's local data cannot simply be regarded as samples drawn from the overall distribution. The data distribution has an enormous impact on model training. Some research works provide theoretical analyses of distributed training with non-IID data, and others propose methods to address the imbalanced data problem. Besides data augmentation, these strategies include combining sequential updates with BSP, depending on how biased the data is.

10.2 The Actor Distributed Engine

Actor is an OCaml-based distributed data processing system. It is developed to support the aforementioned distributed computing paradigms in Owl. It implements core APIs in both the map-reduce and parameter server engines. Both engines need a (logically) centralized entity to coordinate the progress of all nodes. We also extended the parameter server engine to a peer-to-peer (p2p) engine. The p2p engine can be used to implement both data-parallel and model-parallel applications; both the data and the model parameters can be, although not necessarily, divided into multiple parts and distributed over different nodes. Orthogonal to these paradigms, Actor also implements all four types of synchronization barriers.

Each engine has its own set of APIs. For example, the map-reduce engine includes map, reduce, join, collect, etc., while the peer-to-peer engine provides four major APIs: push, pull, schedule, and barrier. It is worth noting that one function is shared by all the engines: the barrier function, which implements the various barrier control mechanisms. Next, we will introduce these three engines of Actor.

10.2.1 Map-Reduce Engine

Following the MapReduce programming model, nodes are divided by task: map or reduce. A map function processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function aggregates all the intermediate key/value pairs with the same key. Execution of this model can be parallelized automatically: mappers compute in parallel, while reducers receive the output from all mappers and combine it to produce the accumulated result, which can then be broadcast to all nodes. Details such as distributed scheduling, data partitioning, and communication in the cluster are mostly transparent to programmers, so they can focus on the logic of the mappers and reducers when solving a problem on a large distributed system. This simple functional style can be applied to a surprisingly wide range of applications. For example, the following code uses the map-reduce engine to implement the classic wordcount task:

    module Ctx = Actor.Mapre

    let print_result x = List.iter (fun (k,v) ->
      Printf.printf "%s : %i\n" k v) x

    let stop_words = ["a";"are";"is";"in";"it";"that";
      "this";"and";"to";"of";"so";"will";"can";"which";
      "for";"on";"in";"an";"with";"the";"-"]

    let wordcount () =
      Ctx.init Sys.argv.(1) "tcp://localhost:5555";
      Ctx.load "unix://data/wordcount.data"
      |> Ctx.flatmap Str.(split (regexp "[ \t\n]"))
      |> Ctx.map String.lowercase_ascii
      |> Ctx.filter (fun x -> (String.length x) > 0)
      |> Ctx.filter (fun x -> not (List.mem x stop_words))
      |> Ctx.map (fun k -> (k,1))
      |> Ctx.reduce_by_key (+)
      |> Ctx.collect
      |> List.flatten |> print_result;
      Ctx.terminate ()

    let _ = wordcount ()

10.2.2 Parameter Server Engine

The parameter server module is similar. Nodes are divided into servers, which hold the shared global view of the up-to-date model parameters, and workers, each of which holds its own view of the model and executes training. The workers and servers communicate in the form of key-value pairs. The engine mainly provides four APIs for users:

  • schedule: Decide what model parameters should be computed to update in this step. It can be either a local decision or a central decision.

  • pull: Retrieve the updates of model parameters and apply them to the local model. The local updates will then be computed based on the scheduled model parameters.

  • push: Send the updates to the model plane. The updates can be sent to either a central server or to individual nodes depending on which engine is used (e.g., map-reduce, parameter server, or peer-to-peer).

  • barrier: Decide whether to advance the local step. Various synchronization methods can be implemented. Besides the classic BSP, SSP, and ASP, we also implement the proposed PSP within this interface.

The following code shows the interfaces of the parameter server engine:

open Actor_types

type barrier =
  | ASP    (* Asynchronous Parallel *)
  | BSP    (* Bulk Synchronous Parallel *)
  | SSP    (* Stale Synchronous Parallel *)
  | PSP    (* Probabilistic Synchronous Parallel *)

(** core interfaces to parameter server *)
val register_barrier : ps_barrier_typ -> unit
val register_schedule : ('a, 'b, 'c) ps_schedule_typ -> unit
val register_pull : ('a, 'b, 'c) ps_pull_typ -> unit
val register_push : ('a, 'b, 'c) ps_push_typ -> unit
val register_stop : ps_stop_typ -> unit
val get : 'a -> 'b * int
val set : 'a -> 'b -> unit
val keys : unit -> 'a list
val start : ?barrier:barrier -> string -> string -> unit
val worker_num : unit -> int

The interface is intuitive. The key-value mechanism of the parameter server is implemented by the get, set, and keys functions. Using the register_* functions, the user can define a barrier function, a scheduler, a pull function executed at the master, and a push function executed at the workers. This design provides a flexible distributed computing framework.

Based on these interfaces, here is a simple example using the parameter server engine to assign random numbers as tasks to participating workers in the system:

    module PS = Actor_param

    let schedule workers =
      let tasks = List.map (fun x ->
        let k, v = Random.int 100, Random.int 1000 in (x, [(k,v)])
      ) workers in
      tasks

    let push id vars =
      let updates = List.map (fun (k,v) ->
        Owl_log.info "working on %i" v;
        (k,v) ) vars in
      updates

    let test_context () =
      PS.register_schedule schedule;
      PS.register_push push;
      PS.start Sys.argv.(1) Actor_config.manager_addr;
      Owl_log.info "do some work at master node"

    let _ = test_context ()

Figure 10-1: Combining the Owl and Actor frameworks

10.2.3 Compose Actor with Owl

One of the most notable advantages of Actor is that it composes with Owl. Parallel and distributed computing in Owl is achieved by composing the data structures in Owl's core library with specific engines in the Actor system.

As shown in Figure 10-1, all three distributed engines can be composed with the Ndarray module in Owl, and the composition is quite straightforward:

module M = Owl.Parallel.Make (Dense.Ndarray.S) (Actor.Mapre)

That's all it takes. Using a functor provided by Owl, this builds a distributed version of the n-dimensional array module. Here, we parameterize the functor with the single-precision dense Ndarray module and the MapReduce engine. Using the resulting distributed Ndarray module is also easy, as the following code shows. The composed module provides all the usual ndarray operations, including initialization, map, fold, sum, slicing, addition, etc., and these computations are performed on a distributed cluster.

module M1 = Owl_parallel.Make_Distributed (Owl.Dense.Ndarray.D) (Actor.Mapre)

let test_owl_distributed () =
  Actor.Mapre.init Sys.argv.(1) "tcp://localhost:5555";
  (* an ndarray of the same shape as y, used in the addition below *)
  let x = M1.ones [|2;3;4|] in
  let y = M1.init [|2;3;4|] float_of_int in
  let _ = M1.set y [|1;2;3|] 0. in
  let y = M1.add x y in
  Owl.Dense.Ndarray.D.print (M1.to_ndarray y); flush_all ();
  let x = M1.ones [|200;300;400|] in
  let x = M1.map (fun a -> a +. 1.) x in
  let a = M1.fold (+.) x 0. in
  let b = M1.sum x in
  Owl_log.info "fold vs. sum ===> %g, %g" a b;
  Owl_log.info "start retrieving big x";
  let x = M1.to_ndarray x in
  Owl_log.info "finish retrieving big x";
  Owl_log.info "sum x = %g" (Owl.Arr.sum' x)

Similarly, this composition also applies to more advanced and complicated data structures such as neural networks. Recall that Ndarray is the core data structure in Owl, on which the neural network module relies. Therefore, we can create a distributed version of the neural network module using the same functor:

module M = Owl.Parallel.Make (Owl.Neural.S.Graph) (Actor.Param)

Here, we parameterize the new module with the single-precision neural network graph module and the parameter server engine. This enables parallel training on a computer cluster. The following code shows an example. Most of the code stays unchanged; all that is required is to use the M2.train function instead of the original one to train the network.

module M2 = Owl_neural_parallel.Make (Owl.Neural.S.Graph) (Actor.Param)

let test_neural_parallel () =
  let open Owl.Neural.S in
  let open Graph in
  let nn =
    input [|32;32;3|]
    |> normalisation ~decay:0.9
    |> conv2d [|3;3;3;32|] [|1;1|] ~act_typ:Activation.Relu
    |> conv2d [|3;3;32;32|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
    |> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
    |> dropout 0.1
    |> conv2d [|3;3;32;64|] [|1;1|] ~act_typ:Activation.Relu
    |> conv2d [|3;3;64;64|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
    |> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
    |> dropout 0.1
    |> fully_connected 512 ~act_typ:Activation.Relu
    |> linear 10 ~act_typ:Activation.(Softmax 1)
    |> get_network
  in
  let x, _, y = Owl.Dataset.load_cifar_train_data 1 in
  let chkpt state =
    if Checkpoint.(state.current_batch mod 1 = 0) then (
      Checkpoint.(state.stop <- true)
    )
  in
  let params = Params.config
    ~batch:(Batch.Sample 100)
    ~learning_rate:(Learning_Rate.Adagrad 0.001)
    ~checkpoint:(Checkpoint.Custom chkpt)
    ~stopping:(Stopping.Const 1e-6) 10.
  in
  let url = Actor_config.manager_addr in
  let jid = Sys.argv.(1) in
  M2.train ~params nn x y jid url

10.3 Synchronization: Barrier Control Methods

One critical component of distributed and federated machine learning systems is barrier synchronization: the mechanism by which participating nodes coordinate in an iterative distributed computation. As noted earlier, the statistical and iterative nature of machine learning means that errors are incrementally removed from the system as training proceeds. Requiring perfect consistency, where every node proceeds to the next iteration together, risks reducing throughput; relaxing consistency can improve system performance without ultimately sacrificing accuracy. This trade-off is embodied in the barrier control mechanism. In the rest of this chapter, we focus on this aspect of distributed computing.

In parallel computing, a barrier is used for synchronization. If a barrier is applied to a group of threads or processes at some point in the source code, no thread or process can proceed past that point until all the others have finished their workload before the barrier. This guarantees that certain calculations are finished. For example, the following code shows the barrier pragma in OpenMP, an application programming interface that supports shared-memory multiprocessing. Here, the calculation is distributed among multiple threads and executed in parallel, but the computation of y cannot proceed until the other threads have computed their own values of x. Execution past the barrier point then continues in parallel.

#pragma omp parallel
{
  x = some_calculation();
  #pragma omp barrier
  y = x + 1;
}

The preceding example shows a strict version of the barrier. In distributed training, there exist multiple forms of barrier control methods for synchronization. Current mechanisms can be divided into four types, discussed in detail below. These barrier methods provide different trade-offs between system performance and model accuracy, and they can be applied in the different distributed machine learning systems discussed in the previous section, including parameter servers, peer-to-peer systems, etc. In the rest of this section, we introduce them.

Bulk Synchronous Parallel (BSP) is the most strict barrier: all workers proceed in lockstep, moving to the next iteration only when every worker is ready. It is a deterministic scheme in which workers perform a computation phase followed by a synchronization/communication phase to exchange updates, under the control of a central server [54]. BSP programs are often serializable, that is, equivalent to sequential computations, if the data and model of a distributed algorithm have been suitably scheduled, making BSP the strongest barrier control method [30]. Numerous variations of BSP exist, for example, allowing a worker to execute more than one iteration per cycle [14]. Federated Learning also uses BSP for its distributed computation [6]. Note that BSP requires centralized coordination.

Asynchronous Parallel (ASP) [41] is the least strict barrier control: it takes the opposite approach to BSP and lets computations execute as fast as possible by running all workers completely asynchronously, with each worker proceeding at its own pace without waiting for the others. ASP can result in fast convergence because it permits the highest possible rate of iteration [54]. However, the lack of any coordination means that updates are calculated based on old model state, resulting in reduced accuracy, and there are no general theoretical guarantees that algorithms converge. The Hogwild scheme proposed in [41] has many limitations, for example, it requires a convex objective and sparse updates. Many works have tried to relax these limits in applications and theoretical analysis [35]; these studies often rely on carefully tuned step sizes in training. [59] proposes a delay-compensated SGD that mitigates delayed updates in ASP by compensating the gradients received at the parameter server. [32] introduces another variant of ASP specifically for wide-area networks: since communication is the dominant factor, it advocates allowing insignificant updates to be delayed indefinitely in a WAN.

Table 10-1 Classification of Synchronization Methods Used by Different Systems

The third is Stale Synchronous Parallel (SSP) [30], which relaxes BSP by allowing a worker to proceed to the next iteration as long as all workers' iterations are within a certain limit of each other. SSP is a bounded asynchronous model that balances between BSP and ASP: rather than requiring all workers to proceed to the next iteration together, it requires only that the iterations of any two workers differ by at most s, a predefined staleness bound. The staleness parameter limits the error and allows SSP to provide deterministic convergence guarantees [30, 15, 54]. Built on SSP, [58] investigates n-softsync, a synchronization method in which the parameter server updates its weights after collecting a certain number of updates from any workers. [9] proposes to remove a small number of "longtail" workers or to add a small number of backup nodes to mitigate the straggler effect while avoiding asynchronous noise.
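To make the differences concrete, the following OCaml sketch expresses the three classic barrier conditions as simple predicates over the workers' iteration counts. These predicates are illustrative only and are not Actor's actual barrier implementations.

(* Illustrative barrier predicates: given the current worker's iteration
   and the iteration counts of all workers, decide whether the worker may
   advance.  These sketches are not Actor's actual API. *)
let bsp_pass my_step all_steps =
  (* advance only when no worker is behind the current step *)
  List.for_all (fun s -> s >= my_step) all_steps

let asp_pass _my_step _all_steps =
  (* never wait for anyone *)
  true

let ssp_pass ~staleness my_step all_steps =
  (* advance as long as the slowest worker is within the staleness bound *)
  List.for_all (fun s -> my_step - s <= staleness) all_steps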

The final one is called Probabilistic Synchronous Parallel (PSP). Its basic idea is to introduce a sampling primitive into the system and to use a sampled subset of the participating workers to estimate the progress of the entire system. PSP introduces a second dimension to the trade-off: from how many nodes must we receive updates before proceeding to the next iteration? By composing the sampling primitive with traditional barrier controls, we obtain a family of barrier controls better suited to supporting iterative learning in heterogeneous networks.

The core idea behind PSP is simple yet powerful: we require that only some proportion, not all, of the working nodes be synchronized in progress. By “progress” we mean the number of updates pushed to the server at the client’s side and the total number of updates collected at the server’s side. In a centralized training framework, the server builds this subset of the training nodes based on system information, such as their current local progress. This subset can be sampled by various approaches. One common and intuitive way is to sample randomly among all the workers.

The sampling size parameter in PSP therefore controls how precise this estimation is. Assuming the random subset is not biased with respect to the nodes' performance, the server can use the resulting distribution to estimate the percentage of nodes that have passed a given progress point. This estimation depends on the specific method used within the subset, as discussed in Section 10.4. Accordingly, the server decides whether to allow the trainers to pass the barrier and advance their local progress.

Figure 10-2 illustrates the difference among these four types of barriers. Here, the computing progress is measured by super steps, or iterations. Communication may happen at the barrier to ensure consistency through the global state. A central server may also be required in order to maintain the global state, denoted by the clock symbol.

Table 10-1 summarizes the barrier synchronization methods used by different machine learning systems. Regardless of whether a system is classic or new, barrier synchronization is an important component of it.

Figure 10-2: Illustration of the various barrier control methods (ASP, BSP, SSP, and PSP)

10.4 System Design Space and Parameters

In the previous section, we have introduced the existing barrier control methods. In this section, we will further explain how they are designed and their relationship with each other.

In a distributed training system that uses an iterative learning algorithm, the main goal is to achieve faster convergence, a state in which the loss stays within an error range around its final value. The convergence of training is positively correlated with two factors: consistency and iteration rate. Consistency is the agreement between the multiple nodes of a distributed training system, and it can be indicated by the difference between the nodes' training iterations. Weak consistency can be detrimental to the update quality of the model in each iteration. The iteration rate, on the other hand, is how fast the training proceeds. This relationship can be captured by Eq. 10.1.

$$ \textrm{Convergence} \propto \textrm{consistency} \times \textrm{iteration rate} $$
(10.1)

BSP and ASP are good examples to illustrate these two factors. In BSP, workers must wait for the others to finish in each training round, so all workers are at the same progress. Therefore, of all barrier methods BSP offers the best consistency and the highest accuracy per update; it is a deterministic algorithm. The price is that, if there are stragglers among the training nodes, the system's progress is bottlenecked by the slowest node. ASP, on the other hand, allows nodes to execute as fast as possible, without considering the progress of other nodes. As a result, ASP achieves the highest possible iteration rate. However, the lack of any coordination means that updates are calculated based on out-of-date model state, resulting in reduced consistency.

The design of SSP clearly shows a trade-off between these two extremes. As shown in Figure 10-3, SSP exploits this trade-off by bounding the difference in iterations between participating nodes. On one hand, it does not require all nodes to have exactly the same progress as in BSP and thus improves the iteration rate. On the other hand, its staleness bound enforces a stricter consistency than ASP. As a result, it strikes a balance between the two ends and hence achieves a higher rate of convergence. The staleness parameter covers the spectrum of this one-dimensional tuning space.

But is that all there is to the design space of barrier control methods? Let's look more closely at the current model. We start by visualizing the iterative update process, as shown in Figure 10-4.

The model is simple. A sequence of updates is applied to an initial global state x0. Here, u(p, t) denotes update(node id, timestamp); that is, updates are generated by every node at every clock tick. In this example, there are three nodes. Ideally, at clock tick ti we expect to receive three updates: u(0, ti), u(1, ti), and u(2, ti). However, due to the noisy environment, these updates are divided into two sets. The deterministic ones are those we expect if everything goes well, as stated earlier. The probabilistic ones are the out-of-order updates caused by packet loss, network delay, node failure, etc. Although simple, this model can represent most iterative learning algorithms.

Figure 10-3: The trade-off among ASP, SSP, and BSP (convergence rate versus weak/strong consistency)

Figure 10-4: Analytical model of the iterative update process in distributed learning

We then use this analytical model to express each barrier method, as in Figure 10-5. The left part deals with consistency: the += operator is the server logic for incorporating updates submitted to the central server into the global state. The right part deals with synchronization: computers either communicate with each other or contact the central server to coordinate their progress. As discussed earlier, the right side can be divided into two types of updates, deterministic and probabilistic.

This formulation reveals some very interesting structure from a system design perspective. For BSP and SSP, the central server couples the control logic of consistency and synchronization. That is to say, if you choose tight consistency, you must also choose global synchronization controlled by a logically central server. For both BSP and SSP, one logical server is assigned to update the model parameters and coordinate the progress of all nodes in an iterative learning algorithm. In a distributed system, this is often the bottleneck and a single point of failure. ASP avoids such coupling by giving up synchronization and consistency completely.

PSP, however, decouples consistency and synchronization. It achieves this by recognizing that there exists another dimension in the design space of barrier methods: completeness. Note that, for the previous three barriers, the degree of consistency is enforced upon the whole population; all participating nodes are equally consistent.

Figure 10-5: Decoupling consistency and synchronization

PSP exploits this dimension of the degree of completeness in a sample, which yields a distribution of degrees of consistency. In PSP, each computer synchronizes with a small group of others, and consistency is only enforced within the group. Completeness represents the level of coordination among the nodes. By changing the sample size, this dimension ranges from fully complete (all working nodes are considered in synchronization) to not complete (each node is considered separately).

The sampling strategy of PSP has a profound implication on the design of barrier control methods. By adding in this extra dimension, the convergence is now impacted by three factors, and Eq. 10.1 now becomes

$$ \textrm{Convergence} \propto \textrm{consistency of sample} \times \textrm{completeness of sample} \times \textrm{iteration rate} $$
(10.2)

In Eq. 10.2, the consistency is thus further decomposed into the consistency degree in a sample and the completeness of this sample.

Thus, PSP spans a tuning space that incorporates all the other barriers. As shown in Figures 10-6 and 10-7, in the refined design space, ASP sits at the bottom left, since it has the weakest consistency (no control over the progress of other nodes) and completeness (each node only considers itself). BSP and SSP, on the other hand, show full completeness, since they require a central server to synchronize the progress of all nodes; similar to Figure 10-3, they show different levels of consistency.

10.4.1 Compatibility

As a more general framework, one noteworthy advantage of PSP is that it is straightforwardly compatible with existing synchronization methods, which provide the tuning dimension of consistency. In classic BSP and SSP, the barrier control mechanism is invoked by a central server to test the synchronization condition on the given inputs. For BSP and SSP to use the sampling primitive, they simply need to evaluate the barrier control condition on the sampled states rather than the global states. Within each sampled subset, the traditional mechanisms can then be applied. Users can thus easily derive probabilistic versions of BSP and SSP, namely pBSP and pSSP. For example, Figure 10-8 shows how PSP can be applied to other synchronization methods as a higher-order function to derive their probabilistic versions.

Formally, at the barrier control point, a worker samples β out of P workers without replacement. If any sampled worker lags more than s updates behind the current worker, the worker waits. This process is pBSP (based on BSP) if the staleness parameter s = 0 and pSSP (based on SSP) if s > 0. If s = ∞, PSP reduces to ASP.
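A minimal sketch of this composition in OCaml is shown below. The helper names are illustrative and not part of Actor's API: the sampling step simply shuffles the observed progress values and keeps β of them, and the barrier test is then applied to the sample only.

(* A sketch of PSP as a higher-order barrier: sample beta workers'
   progress without replacement and apply an existing barrier test to the
   sample only.  All names here are illustrative, not Actor's API. *)
let sample_without_replacement beta lst =
  lst
  |> List.map (fun x -> (Random.bits (), x))   (* attach random keys ...   *)
  |> List.sort compare                         (* ... and shuffle by sorting *)
  |> List.map snd
  |> List.filteri (fun i _ -> i < beta)

(* the staleness check from the SSP sketch earlier, repeated here so that
   this example is self-contained *)
let ssp_pass ~staleness my_step steps =
  List.for_all (fun s -> my_step - s <= staleness) steps

let psp ~beta barrier_test my_step all_steps =
  barrier_test my_step (sample_without_replacement beta all_steps)

(* pBSP is the staleness = 0 case; pSSP uses staleness > 0 *)
let pbsp ~beta my_step all_steps =
  psp ~beta (ssp_pass ~staleness:0) my_step all_steps

let pssp ~beta ~staleness my_step all_steps =
  psp ~beta (ssp_pass ~staleness) my_step all_steps

Note that a sample size of zero makes the check vacuous and degenerates to ASP, while sampling the whole population recovers the original BSP or SSP behavior, matching the description above.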

Figure 10-6: Two dimensions in the design of barrier control methods (ASP at the bottom left: weak consistency, fast iteration rate, fully distributed; SSP and BSP at the top: strong consistency, slower iteration rate, fully centralized)

Figure 10-7: The new dimension allows us to explore a larger design space, making it possible to find a better trade-off and achieve a better convergence rate

Figure 10-8: Barrier algorithms for BSP, pBSP, SSP, and pSSP; PSP can be applied to other synchronization methods as a higher-order function to derive fully distributed versions

Figure 10-9: Composing PSP with Bulk Synchronous Parallel

As an illustration, Figure 10-9 depicts how to compose BSP with PSP, namely, a subset of the population of nodes is chosen, and then the BSP is applied within the subset (pBSP). The composition of PSP and SSP (pSSP) follows the same idea.

Besides being compatible with existing barrier control methods, PSP works with both centralized and decentralized training approaches. As described earlier, the extra completeness dimension decouples consistency from synchronization. The fully complete synchronization control methods require a centralized node to hold the global state; by using the sampling primitive, they can be transformed into fully distributed solutions. In a decentralized setting, based on the information it gathers from its neighboring nodes, a trainer may either pass the barrier control and advance its local progress or wait until the threshold is met.

The benefits of exploring this two-dimensional design space are thus manifold. First, it enables constructing fully distributed barrier control mechanisms that are more scalable: as illustrated in Figure 10-2d, each node depends only on several other nodes to decide its own barrier, not on all other nodes. Second, it allows exploring barriers that achieve better convergence. PSP can ignore the status of some workers with impunity because, in practice, many iterative learning algorithms tolerate a certain degree of error as they converge to their final answers [12]. By controlling the sampling method and size, PSP reduces the impact of lagging nodes while also limiting the error introduced by nodes returning updates based on stale information. Third, in an unreliable environment, the sampling primitive can minimize the impact of outliers and stragglers by probabilistically choosing a subset of the total workers for estimation. In summary, by carefully tuning the sampling size and staleness parameters, the resulting barrier control methods can be robust against the effect of stragglers while also ensuring a degree of consistency between iterations as the algorithm progresses. In Section 10.6, we investigate this performance in more detail.

Table 10-2 Notation Table
Figure 10-10: (a–b) Bounds on the mean and variance of the sampling distribution as a function of F(r)β, with staleness r set to 4 and T equal to 10000. (c) Sequence inconsistency observed in empirical training

10.5 Convergence Analysis

In this section, we present a theoretical analysis of PSP and show how it affects the convergence of ML algorithms (SGD is used in the analysis). The analysis mainly shows that (1) under PSP, the algorithm has only a small probability of not converging, and the upper limit of this probability decreases with the number of training iterations; and (2) a small sampling size is already sufficient to provide good performance, so there is no need to choose a large one. The notation used in the following analysis is presented in Table 10-2.

The analysis is based on the model shown in Figure 10-4. In a distributed machine learning process, the N workers keep generating updates, and a shared model is updated with them continuously. We count these updates by first looping over all workers within one iteration and then across all iterations; each update is incrementally indexed by an integer t, and the total length of this update sequence is T. We apply an analysis framework similar to that of [15]. At each barrier control point, every worker A samples β out of N workers without replacement. If any of these sampled workers lags more than s steps behind worker A, it waits. The probability of a node lagging r steps is drawn from a distribution with probability mass function f(r) and cumulative distribution function (CDF) F(r). Without loss of generality, we take both the staleness r and the sample size β to be constants.

Ideally, in a fully deterministic barrier control system such as BSP, the ordering of updates in this sequence should be deterministic; we call this the true sequence. However, in reality, what we get is often a noisy sequence, where updates are reordered irregularly due to sporadic and random network and system delays. These two sequences share the same length. We define the sequence inconsistency as the difference in indices between these two sequences and denote it by γt. It shows how much a series of updates deviates from the ideal case. If the sequence inconsistency is bounded, then whatever a true sequence achieves, a noisy sequence can, in time, also achieve, regardless of the order of updates. This metric is thus a key instrument in theoretically proving the convergence property of an asynchronous barrier method.

Let \( R[X]=\sum_{t=1}^{T} f_t(\tilde{\mathbf{x}}_t)-f_t(\mathbf{x}^{\star}) \). This is the accumulated difference between the value of the function at the noisy state and its optimal value. To put it plainly, it measures the gap between "the result we would get if all parameter updates arrived in the perfect ideal order" and "the result we actually get in the real world when using, for example, the PSP barrier." We now show that the noisy system state \( \tilde{\mathbf{x}}_t \) converges in expectation toward the optimum \( \mathbf{x}^{\star} \), in probability. Specifically, since R[X] is accumulated over time, to get a time-independent metric we need to show that \( \frac{R[X]}{T} \) is bounded.

Theorem (SGD under PSP, convergence in probability). Let \( f(\mathbf{x})=\sum_{t=1}^{T} f_t(\mathbf{x}) \) be a convex function where each \( f_t \) is also convex. Let \( \mathbf{x}^{\star} \in \mathbb{R}^d \) be the minimizer of this function. Assume that the \( f_t \) are L-Lipschitz and that the distance between two points \( \mathbf{x} \) and \( \mathbf{x}' \) is bounded: \( D(\mathbf{x}\,\|\,\mathbf{x}')=\frac{1}{2}\|\mathbf{x}-\mathbf{x}'\|_2^2\le F^2 \), where F is a constant. Let an update be given by \( \mathbf{u}_t=-\eta_t\nabla f_t(\tilde{\mathbf{x}}_t) \) and the learning rate by \( \eta_t=\frac{\sigma}{\sqrt{t}} \). We then have the bound:

$$ P\left(\frac{R[X]}{T}-\frac{1}{\sqrt{T}}\left(\sigma L^2-\frac{2F^2}{\sigma}\right)-q\ge \delta \right)\le e^{-\frac{T\delta^2}{c+\frac{b\delta}{3}}}, $$
(10.3)

where δ is a constant and b ≤ 4NTLσ. The term b is the upper bound on the random variables drawn from the lag distribution f(r). The terms q and c are related to the mean and variance of γt. If we assume that 0 < a < 1, then it can be proved that both q and c are bounded. Furthermore, if we assume with probability Φ that 4NLσγt < O(T) for all t, then b < O(T). This means that \( \frac{R[X]}{T} \) converges to \( O(T^{-1/2}) \) in probability Φ, with an exponential tail bound that decreases as time increases.

In other words, this theorem claims that, as long as the difference between the noisy update sequence and the ideal sequence is bounded and the nodes in the system do not lag too far behind, PSP guarantees (with a certain probability) that the difference between the result we get and the optimal result diminishes as more updates are generated and appended to the sequence. A formal proof of this theorem can be found in [52].

10.5.1 How Effective Is Sampling

One key step in proving the preceding theorem is to show that the sequence inconsistency γt is bounded. We have proved that the mean and variance of γt are both bounded. Specifically, the average inconsistency (normalized by the sequence length T) is bounded by

$$ \frac{1}{T}\sum_{t=0}^{T}\mathbf{E}(\gamma_t)\le S\left(\frac{r(r+1)}{2}+\frac{a(r+2)}{(1-a)^2}\right), $$
(10.4)

and the variance has a similar bound:

$$ \frac{1}{T}\sum_{t=0}^{T}\mathbf{E}(\gamma_t^2) < S\left(\frac{r(r+1)(2r+1)}{6}+\frac{a(r^2+4)}{(1-a)^3}\right), $$
(10.5)

where

$$ S=\frac{1-a}{F(r)(1-a)+a-a^{T-r+1}}. $$
(10.6)

As intimidating as these bounds may seem, they can be treated as constants for fixed a, T, r, and β. They provide a means to quantify the impact of the PSP sampling primitive and give stronger convergence guarantees than ASP, as shown in Figure 10-11, and they do not depend on the entire lag distribution.

Figure 10-11: The sampling primitive decomposes the original sequence into multiple sampling processes (assuming no replacement for simplicity), each with only a partial view of the original one. A smaller sample size results in more sampling processes, each with an even less complete view (i.e., less completeness), further reducing the synchronization level

The intuition behind Eq. 10.4 and Eq. 10.5 is that, when applying PSP, the update sequence we get is not too different from the true sequence, in terms of both the mean and the variance of the difference. To demonstrate the impact of the sampling primitive on the bounds quantitatively, Figures 10-10a and 10-10b show how increasing the sampling count β (from 1 to 128, marked with different line colors) yields tighter bounds. Notably, only a small number of nodes need to be sampled to yield bounds close to the optimum. This result has an important implication: it justifies using the sampling primitive in large distributed learning systems. This will be further verified in the evaluation section.

The discontinuities at a = 0 and a = 1 reflect edge cases of the barrier control behavior. Specifically, with a = 0, no probability mass lies in the initial r steps, so no progress can be made if the system requires β > 0 workers to be within r steps of the fastest worker. If a = 1 and β = 0, the system operates in ASP mode, so the bounds are expected to be large; however, these bounds are overly generous, and tighter bounds of O(T) for the mean and O(T^2) for the variance are given in our proof. When a = 1 and β ≠ 0, the system never waits, and workers can slip as far as they like as long as they return to within r steps before the next sampling point.

Besides the theoretical analysis, an intuitive visualization of the sequence inconsistency γt is shown in Figure 10-10c. We run a distributed training experiment with various barrier methods for 100 seconds and measure the difference between the true and noisy sequences at fixed intervals throughout. The results show that the sequence inconsistency under ASP keeps growing linearly, while under SSP it increases and decreases within a certain bound, decided by the staleness parameter. Applying sampling to SSP relaxes that bound, but unlike ASP, inconsistency under pSSP grows sublinearly with sequence length. BSP is omitted from the figure, since its true and noisy sequences are always identical. pBSP shows a tight bound (about 0.5) even with only 5% sampling.

10.5.2 Implementation Technique

As shown in Table 10-1, barrier control methods are widely used in existing systems, such as parameter servers, Hadoop, etc. PSP, however, is not yet widely available in these systems, which means the completeness dimension of synchronization design cannot be readily utilized. The good news is that adding this extra design dimension requires minimal effort: to implement PSP on top of current data analytics frameworks, developers only need to add one new primitive, sampling. As shown in Section 10.4, existing barrier methods can then be composed with it straightforwardly.

By default, we choose the trainers randomly. There are various ways to guarantee random sampling, for example, organizing the nodes into a structured overlay such as a Distributed Hash Table (DHT). Random sampling then relies on the fact that node identifiers are uniformly distributed in the namespace, and nodes can estimate the population size based on the density of allocated IDs in the namespace.
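As a rough illustration of the last point, the following sketch estimates the population size from the identifiers a node observes around itself in a circular ID space. The estimator, the constants, and the function names are illustrative assumptions only; they are not part of Actor or of any particular DHT implementation.

(* An illustrative population estimate from ID density: if identifiers
   are drawn uniformly from a circular space of size id_space, then the k
   identifiers nearest to my_id should span roughly k / population of the
   space.  This is a rough sketch, not part of Actor or a DHT library. *)
let estimate_population ~id_space ~my_id nearest_ids =
  let k = List.length nearest_ids in
  let dist a b =
    let d = abs (a - b) in
    min d (id_space - d)               (* circular distance *)
  in
  let span =
    List.fold_left (fun m n -> max m (dist my_id n)) 0 nearest_ids
  in
  (* the k nearest ids lie within radius span on both sides, so the local
     density is about k / (2 * span) nodes per unit of id space *)
  if span = 0 then k + 1 else k * id_space / (2 * span)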

The choice of samples has a great impact on the performance of PSP. The sampling in PSP provides an estimate of the overall progress distribution of all the workers. In the worst case, if the sampled subset happens to consist entirely of stragglers, it cannot provide an efficient estimate of all the workers. Different sampling strategies can therefore be used in different scenarios.

For example, we can change how frequently the sample is re-drawn during distributed computing. Or we can choose workers according to their previous computation time: at each round, all workers are categorized into two groups, one slow and one fast, according to their historical computing time per iteration, and equal numbers of workers are then chosen from both groups to form the target subset. Clustering algorithms such as K-Means can be used for the grouping, as sketched below.
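The following is a minimal sketch of that grouping strategy. For brevity it uses a median split on the historical per-iteration time instead of K-Means, and all names are illustrative rather than part of Actor.

(* A sketch of the grouping sampling strategy: split workers into a fast
   and a slow group by their historical time per iteration, then draw an
   equal number of samples from each group.  A median split stands in for
   K-Means here; all names are illustrative. *)
let take n lst = List.filteri (fun i _ -> i < n) lst

let shuffle lst =
  lst
  |> List.map (fun x -> (Random.bits (), x))
  |> List.sort compare
  |> List.map snd

let group_sample ~sample_size workers =
  (* workers : (id, average_time_per_iteration) list *)
  let sorted = List.sort (fun (_, a) (_, b) -> compare a b) workers in
  let half = List.length sorted / 2 in
  let fast = take half sorted in
  let slow = List.filteri (fun i _ -> i >= half) sorted in
  let pick k group = take k (shuffle group) in
  pick (sample_size / 2) fast @ pick (sample_size - sample_size / 2) slow
  |> List.map fst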

10.6 Evaluation

In this section, we investigate the performance of various barrier control methods experimentally and the trade-offs they make. We focus on two common metrics for evaluating barrier strategies: accuracy and system progress. Using these metrics, we explore the various barrier controls with regard to the impact of sample settings and stragglers in a Federated Learning system. We also use a new metric, progress inconsistency, which reflects training consistency without the influence of application-specific hyperparameters.

10.6.1 Experiment Setup

We perform extensive experiments on the real-world dataset FEMNIST, which is part of LEAF, a modular benchmarking framework for learning in federated settings that includes a suite of open source federated datasets [8]. Like MNIST, the FEMNIST dataset is for image classification, but it contains 62 different classes (10 digits, 26 lowercase letters, and 26 uppercase letters). Each image is 28 by 28 pixels. The dataset contains 805,263 samples in total, distributed evenly across the classes.

To better study the performance of the proposed method under a non-IID data distribution in Federated Learning, we follow the data partition setting in [7]: we first sort the data by class label, divide it into 2n shards, and assign 2 shards to each of the n workers, as sketched below. This pathological non-IID partition makes the training data on different workers overlap as little as possible. The validation set is 10% of the total data and is preprocessed so that it is roughly balanced across classes. As for training hyperparameters, we use a batch size of 128 and the Adam optimizer with a learning rate of 0.001 and coefficients (0.9, 0.999).
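A minimal sketch of this partitioning step is shown below. The list-based data representation and the fixed shard assignment are illustrative simplifications of the setup in [7].

(* A sketch of the pathological non-IID partition: sort samples by label,
   cut them into 2n contiguous shards, and give each of the n workers two
   shards.  The list representation and the fixed (rather than random)
   shard assignment are illustrative simplifications. *)
let partition_non_iid ~num_workers samples =
  (* samples : (label, features) list *)
  let sorted = List.sort (fun (a, _) (b, _) -> compare a b) samples in
  let num_shards = 2 * num_workers in
  let shard_size = max 1 (List.length sorted / num_shards) in
  let shard i = List.filteri (fun j _ -> j / shard_size = i) sorted in
  (* worker w receives shards 2w and 2w+1 *)
  List.init num_workers (fun w -> shard (2 * w) @ shard (2 * w + 1))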

We conduct our experiments on a server with 56 Intel Xeon E5-2680 v4 cores and 256 GB of memory. In the rest of this section, unless otherwise mentioned, we use 16 workers by default, plus one extra worker for model validation to compute accuracy. Our aim is to show the wide tuning space enabled by the sampling parameter and how existing barrier methods can be incorporated into PSP.

10.6.1.1 Accuracy

We execute the training process with each method on the non-IID FEMNIST dataset for about eight epochs. The results are shown in Figure 10-12. The upper subfigure uses time as the x axis and shows the change in trained model accuracy over about 10,000 seconds, comparing ASP, BSP, and pBSP (PSP composed with BSP) with a sampling size of 4.

The first thing to note is that, although the performance of ASP looks optimal at the beginning due to its quick accumulation of updates from different workers, it quickly deteriorates and fails to converge. Compared to the unstable performance of ASP, BSP converges steadily. pBSP clearly outperforms both with regard to model accuracy, especially in the later part of training. Due to its probabilistic nature, the pBSP line shows larger jitter than BSP, but it follows the general trend of BSP toward steady convergence.

The strength of PSP is that it combines the advantages of existing methods. In the lower subfigure of Figure 10-12, we use the accumulated total number of updates received by the parameter server as the x axis to compare the "efficiency" of updates in ASP, SSP, and pSSP. The staleness parameter of SSP and pSSP is set to 4. We can see that, as updates accumulate, the accuracy increase of pSSP is similar to that of SSP despite the use of sampling.

Meanwhile, pSSP is much faster than SSP with regard to update progress, that is, the rate at which updates accumulate at the parameter server. Figure 10-13 shows the number of updates at the server over time (only the beginning of the evaluation is shown). At any given time, both pBSP and pSSP progress faster than BSP and SSP, respectively. Of course, ASP progresses the fastest since it requires no synchronization among workers, but its nonconverging updates render this advantage moot.

The difference in the number of updates can be directly interpreted as communication cost, since each update means a transmission of weights and gradients between the server and a client. For example, at about 600s, pSSP incurs 35% more traffic than SSP, and pBSP even doubles the traffic of BSP. In our experiments, PSP can reduce communication overhead without sacrificing the final model accuracy.

Figure 10-12: Performance comparison between different synchronization methods: (a) accuracy over time for ASP, BSP, and pBSP; (b) accuracy against the number of updates for ASP, SSP, and pSSP

Figure 10-13: Number of updates accumulated at the parameter server for different barrier methods

PSP combines the best of two worlds. On one hand, it has similar update efficiency as SSP and BSP; on the other hand, it achieves faster update progress that is similar to ASP. As a result, it outperforms the existing barrier control methods.

Figure 10-14: (a) System progress distribution; (b) pBSP parameterized by different sample sizes, from 0 to 24. Increasing the sample size shifts the curves from right to left with decreasing spread, covering the whole spectrum from the most lenient ASP to the most strict BSP

10.6.2 System Progress

In this section, we use 32 workers and run the evaluation for 400 seconds. Figure 10-14a shows the distribution of all nodes’ progress when evaluation is finished.

As expected, the most strict BSP leads to a tightly bounded progress distribution, but at the same time it makes all the nodes progress slowly: at the end of the experiment, the nodes have only proceeded to about the 80th update. In comparison, ASP leads to much faster progress of around 200 updates, but at the cost of a much more loosely spread distribution, reflecting the absence of any synchronization among nodes. SSP allows a certain staleness (4 in our experiment) and sits between BSP and ASP.

PSP adds another dimension of performance tuning. We set the sample size β to 4, that is, a sampling ratio of only 12.5%. The results show that pBSP is almost as tightly bounded as BSP while progressing much faster than BSP itself. The same holds when comparing pSSP with SSP. In both cases, PSP improves iteration efficiency while limiting dispersion.

To further investigate the impact of the sample size, we focus on BSP and vary the sample size from 0 to 24, as shown in Figure 10-14b. As the sample size increases, the curves shift from right to left with tighter and tighter spread, indicating less variance in the nodes' progress. With sample size 0, pBSP behaves exactly like ASP; with increased sample size, pBSP becomes more similar to SSP and BSP, with tighter synchronization requirements.

Another observation worth mentioning is that, with a very small sample size of one or two (i.e., very little communication cost on each individual node), pBSP can already effectively synchronize most of the nodes compared to ASP. The tail caused by stragglers can be further trimmed by using a larger sample size. This observation confirms our theoretical analysis in Section 10.5, which shows that a small sample size can push the probabilistic convergence guarantee close to its optimum even for a large system, further indicating the scalability of the proposed solution.

Figure 10-15: Stragglers impact both system performance and the accuracy of model updates; probabilistic synchronization control via the sampling primitive mitigates these impacts

When composed with BSP, PSP increases the system progress of BSP by about 85% while retaining almost the same tight bound on the progress distribution. Moreover, the evaluation shows that a small sample size such as 2 or 4 in a system of 32 workers can already provide a tight convergence guarantee.

10.6.2.1 Robustness to Straggler

Stragglers are not uncommon in traditional distributed training and are pervasive among the workers of Federated Learning. In this section, we show the impact of stragglers on system performance and on the accuracy of model updates, and how probabilistic synchronization control via a sampling primitive can mitigate such impacts.

As explained before, we model stragglers by increasing the training time of each slow trainer n-fold; on average, a slow trainer spends n times as much time as a normal node to finish one iteration. The parameter n is the "slowness" of the system. In the experiment shown in Figure 10-15, we keep the proportion of slow nodes fixed, increase the slowness from 2 to 8, and measure the accuracy of each barrier control method at the end of training. To be more precise, we take a period of results before the end and use their mean and standard deviation for each observation point.

Figure 10-15 plots the decrease in model accuracy caused by stragglers as a function of straggler slowness. Both ASP and BSP are sensitive to stragglers, each dropping about 20% in accuracy as the slowness increases from 2x to 8x, while pBSP drops by less than 10%. For BSP, this is mainly because the stragglers severely reduce the training update progress; for ASP, it results from its asynchronous nature, where updates from slow workers are delayed. The problem is exacerbated by the non-IID data, where the data overlap between different workers is limited, if present at all. Once again, PSP takes the best of both worlds: as shown before, its probabilistic sampling mitigates the effect of the data distribution and is also less prone to the progress reduction caused by stragglers.

PSP is less prone to stragglers in the system. When the slowness increases from 2x to 8x, both ASP and BSP drop about 20% in accuracy, while pBSP decreases by less than 10%.

Figure 10-16: Varying sample sizes in pSSP

Figure 10-17: Comparison of different sampling strategies in PSP (pBSP with the basic strategy vs. pBSP with the dynamic strategy)

10.6.3 Sampling Settings

Earlier in this section, we investigated how the choice of sampling size affects progress in PSP. A natural question is then: How do we choose a suitable sample size? As pointed out in Section 10.5, one important observation derived from our proof is that a small sample size can achieve performance similar to that of a large one.

To demonstrate this point empirically, we vary the sampling size from 2 to 8 in a 16-worker training run and compare the results to SSP. The training lasts for a fixed time for all methods. In Figure 10-16, we can see that, even as the sample size changes, the performance of pSSP remains close to that of SSP. In this scenario, choosing a smaller sample size leads to better performance than the others, due to its fast update progress. However, it is not a rule of thumb to always use a small sample size: choosing suitable parameters in the large tuning space enabled by PSP is a nontrivial task, which we leave as a challenging problem for future work.

In Section 10.4, we discussed three different sampling strategies. The first is the basic strategy, which chooses a fixed subset of workers and uses it to estimate the training progress of all workers. The second, the dynamic sampling strategy, re-chooses this subset dynamically instead of keeping it fixed. The third is a grouping strategy that pre-clusters workers into two groups according to their execution speed and then draws samples equally from both groups.

Figure 10-18: Average and variance of normalized progress inconsistency in PSP with regard to sample size (100 nodes in total)

To compare these strategies, we use 24 workers and set 6 of them to be stragglers (1x slower in computing backpropagation). We use pBSP and increase its sampling size from 2 to 8. Both trainings use the same number of epochs. The results are shown in Figure 10-17. Each box shows the distribution of model accuracy near the end of each training; the last ten results are used.

The results show that, compared to the basic strategy, the dynamic one effectively increases the efficiency of PSP; the increase ranges from about 25% to twofold for different sampling sizes. The low accuracy of the basic strategy indicates that it tends to result in more asynchronous training, behaving more like ASP than BSP.

The grouping strategy achieves similar results to the dynamic one but shows a smaller spread in each box, which means a smoother training curve (the figure is omitted due to space limits). Moreover, with the dynamic strategy, the sampling size does not visibly affect the model accuracy, which means a smaller sample size can be used to increase system progress without sacrificing model accuracy. Also note that, in both cases, a larger sampling size leads to a smaller spread, which agrees with the design and analysis of PSP presented in the previous sections.

We learned two things in this section. First, by varying the sampling size from 2 to 8 in pSSP with 16 workers, we saw that a small sampling size can achieve model accuracy comparable to a large one. Second, the dynamic and grouping sampling strategies both effectively improve performance: compared to the basic strategy, they increase the efficiency of PSP by about 25% to twofold for different sampling sizes.

10.6.3.1 Progress Inconsistency

In the previous sections, we evaluated the impact of barrier control methods on training accuracy. However, training accuracy is affected not only by the barrier method, which controls training inconsistency, but also by hyperparameters such as the learning rate, and the tolerance of error during training varies greatly between applications. To better understand the impact of barriers on model consistency during training without these confounding factors, we use progress inconsistency as a metric to compare barriers.

In distributed training, between the time a worker pulls a model from the server and the time it pushes its own update, the server has likely already received several updates from other workers. These updates are the source of training inconsistency. We define progress inconsistency as the number of such updates between a worker's corresponding read and update operations. In this experiment, we collect the progress inconsistency value of each node at every step during training.
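The metric itself is easy to compute from a trace of server-side update counters. The following sketch shows one way to do it; the record type and function names are illustrative and not part of Actor.

(* A sketch of the progress inconsistency metric: for one worker step, it
   is the number of updates the server received from other workers between
   this worker's pull (read) and its push (update).  The record type is
   illustrative, not part of Actor. *)
type step_trace = {
  worker            : int;
  server_count_read : int;  (* global update count when the model was pulled  *)
  server_count_push : int;  (* global update count when the update was pushed *)
}

let progress_inconsistency (s : step_trace) =
  s.server_count_push - s.server_count_read

(* average inconsistency over a trace, normalized by the number of workers *)
let normalized_inconsistency ~num_workers trace =
  let total =
    List.fold_left (fun acc s -> acc + progress_inconsistency s) 0 trace
  in
  float_of_int total /. float_of_int (List.length trace * num_workers)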

We investigate the relationship between the number of nodes and the inconsistency of pBSP. All executions run for 100 seconds, and we increase the number of workers from 50 to 500. We measure the average and variance of progress inconsistency, both normalized by the number of workers, as shown in Figure 10-18. The average inconsistency of ASP is mostly unaffected by system size. With a smaller sample size, that of pBSP becomes closer to ASP, but note that only the initial increase of network size has a considerable impact. With the sample size fixed and the network growing, the average inconsistency grows sublinearly, which is an ideal property. The standard deviation of pBSP mostly remains stable regardless of network size.

From these observations we can see that, for PSP, both the average training inconsistency (the mean) and its noise (the variance) grow sublinearly toward a limit for different sample sizes, bounded between those of ASP and BSP/SSP.

10.7 Summary

In this chapter, we explored distributed computing in Owl, with a focus on synchronization barriers. We presented Actor, an OCaml-based distributed computing engine that implements three different computing paradigms: map-reduce, parameter server, and peer-to-peer. Orthogonal to these, it also implements four different types of barrier control methods.

We proposed Probabilistic Synchronous Parallel, which is suitable for data analytics applications deployed in large and unreliable distributed systems. It strikes a good trade-off between the efficiency and accuracy of iterative learning algorithms by probabilistically controlling how progress information is sampled from the distributed workers. In Actor, we implemented PSP with a core system primitive, sampling, and showed that this primitive can be combined with existing barrier control methods to derive fully distributed solutions. We then evaluated the performance of the various barrier control methods. The effectiveness of PSP in different application scenarios depends on a suitable choice of parameter, that is, the sample size. As with performance tuning in numerical computation, we suggest resorting to prior knowledge and empirical measurement for parameter tuning and regard this as a challenge for future exploration.