A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce

Large-scale distributed training mainly consists of sub-model parallel training and parameter synchronization. As the number of training workers grows, the efficiency of parameter synchronization degrades. To tackle this problem, we first propose 2D-TGA, a grouping AllReduce method based on the two-dimensional torus topology. This method synchronizes the model parameters by grouping and makes full use of bandwidth. Secondly, we propose a distributed algorithm, 2D-TGA-ADMM, which combines 2D-TGA with the alternating direction method of multipliers (ADMM). It focuses on sub-model training and reduces the wait time among workers in the synchronization process. Finally, experimental results on the Tianhe-2 supercomputing platform show that, compared with MPI_Allreduce, 2D-TGA shortens the synchronization wait time by 33%.


Introduction
In recent years, with the development of information technology, data volumes have exploded, ushering in a new era of "Big Data". Traditional machine learning focuses on the speed of data processing on a single machine, but a single machine cannot store and process such large amounts of data. In addition, the development of computing engines has lagged far behind the growth in model computing demand. Distributing the data or the model across multiple machines for computation is therefore necessary.
The primary purpose of this revolution is to use large amounts of data to enable knowledge discovery and better decision-making. The primitive idea of distributed machine learning (DML) is to parallelize the computation across multiple local devices (also called workers or nodes) to solve the following distributed optimization problem:
$$\min \sum_{i=1}^{N} L_i(\omega_i) \qquad (1)$$
where $\omega_i$ represents the model parameter vector and $L_i(\omega_i)$ is the local objective function of worker $i$ ($i \in \{1, 2, \cdots, N\}$). Distributed optimization algorithms are currently one of the most popular research directions, with a specific focus on approaches that optimize a performance criterion using the data stored at local devices [1]. ADMM combines the decomposability of the dual ascent method with the good convergence of the method of multipliers. It can be used to solve problem (1) and has a wide range of applications. Wang et al. [2] propose an ADMM-based DML architecture that preserves privacy. Raja et al. [3] design a Secure and Private-Collaborative Intrusion Detection System (SP-CIDS). Steck et al. [4] introduce the Sparse Linear Model, which can be employed in recommendation systems and is based on the ADMM algorithm architecture.
Distributed training uses multiple workers that train simultaneously and cooperate to accelerate the training process, so communication becomes an essential part of the cooperation. The communication mechanism in DML is challenging for three reasons. Firstly, training machine learning models usually relies on iterative optimization algorithms, and more iterations lead to a higher communication frequency. Secondly, DML often trains large models. To obtain the updated model information, the workers need to communicate with each other, which results in a large amount of communication data. Thirdly, DML [5] faces the important problem of synchronizing model parameters. Synchronization based on AllReduce suffers from the problem of "slow workers": workers need to wait for each other before the synchronization can complete.
Many methods have been proposed to deal with these challenges. Hasanov et al. [6] use a hierarchical idea to design an MPI reduction algorithm, and Wang et al. [7] design an asynchronous ADMM algorithm based on a hierarchical view. Xie et al. [8] design a parameter synchronization architecture for the ADMM algorithm that combines a hierarchical architecture with Ring AllReduce and adopts a stale synchronous parallel computing model. When the number of workers is large, a flat parameter synchronization architecture causes many problems; in the Ring AllReduce architecture, for example, each parameter segment becomes too small and the number of transmissions becomes too large. A hierarchical parameter synchronization architecture can effectively alleviate these problems. The computing workers are organized hierarchically, and all workers are grouped at each level. The same worker can belong to different groups at different levels, and different levels can independently use different parameter synchronization architectures.
We propose a grouping AllReduce algorithm based on the two-dimensional torus topology, named 2D-TGA, to alleviate these problems. The main contributions of this paper can be summarized as follows:
1) This paper proposes a synchronous algorithm, 2D-TGA, which uses a grouping mode. In the first phase of the algorithm, the workers within each group perform Ring AllReduce, and the groups run in parallel. In the second phase, each group selects its lowest-ranked worker as leader, and the leaders form the leader group. The workers in the leader group execute Ring AllReduce in a two-dimensional manner. This algorithm further utilizes the network bandwidth and reduces the synchronization time.
2) This paper proposes a distributed algorithm, 2D-TGA-ADMM, which is based on ADMM and uses 2D-TGA to synchronize model parameters. This algorithm has good scalability and can be used to solve large-scale DML problems.
3) We use the proposed distributed algorithm to solve the large-scale logistic regression problem on the Tianhe-2 supercomputing platform. The experimental results show that the proposed synchronous algorithm is more efficient than the baseline algorithm.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the 2D-TGA algorithm in detail. Section 4 describes the design of the ADMM-based distributed algorithm, which uses 2D-TGA to synchronize the local model parameters. In Sect. 5, we first analyze the synchronization overhead of four synchronous AllReduce algorithms theoretically. Then, we use the ADMM-based distributed algorithm to solve the logistic regression problem with 2-norm regularization on the Tianhe-2 supercomputing platform and verify the efficiency of the proposed 2D-TGA for parameter synchronization. Section 6 concludes the paper and discusses future work.

Related Work
Compared with single-machine training, distributed training introduces additional data communication between workers, so the speed of distributed training cannot improve linearly with the number of computing workers. When the communication requirements and capabilities are fixed, distributed machine learning training can be accelerated by improving communication efficiency. In a decentralized parameter synchronization architecture, each worker exchanges information only with its own neighbors. The main advantage of a decentralized architecture is that it effectively reduces traffic. The typical AllReduce algorithm performs the Reduction operation at a root process and then broadcasts the result from the root to every process, which makes the root a communication bottleneck. The optimized algorithms are based on a few principles: reduce broadcast, recursive halving and doubling, butterfly, binary blocks, and ring [9].

Ring AllReduce
Ring-based AllReduce is an algorithm whose per-worker communication volume is nearly constant in the number of workers. OpenMPI [10] adopted an implementation based on the Ring AllReduce algorithm in 2007. In 2009, Patarasuk et al. [11] proposed a ring-based AllReduce on the P2P architecture and proved that it is bandwidth-optimal among AllReduce algorithms. In 2017, Baidu Silicon Valley Lab [12] applied this strategy to deep learning to relieve the network bottleneck of synchronous updates.
The Ring AllReduce algorithm is depicted in Fig. 1a. The algorithm synchronizes in two phases, Reduce-Scatter and AllGather. When $N$ workers are arranged in a ring topology, $2(N-1)$ steps are required to accomplish the synchronization. Ring AllReduce can avoid contention on most network topologies [11]. However, as the number of workers increases, the Ring AllReduce algorithm encounters issues such as considerable delay and low fault tolerance, which hinders the scalability of distributed algorithms.
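The two phases can be made concrete with a small single-process simulation. The sketch below is illustrative only: the function name `ring_allreduce` and the use of NumPy arrays to stand in for workers are assumptions, not part of the cited implementations. It performs a sum reduction and counts the communication steps, which come out as $2(N-1)$.

```python
import numpy as np


def ring_allreduce(vectors):
    """Simulate a sum Ring AllReduce over N workers; returns (results, steps)."""
    n = len(vectors)
    # Every worker splits a private copy of its vector into N chunks.
    chunks = [np.array_split(np.array(v, dtype=float), n) for v in vectors]
    steps = 0

    # Phase 1: Reduce-Scatter, N - 1 steps.
    for s in range(n - 1):
        # All sends of one step happen at the same time, so snapshot them first.
        outgoing = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - s) % n] += outgoing[i]
        steps += 1

    # Worker i now owns the fully reduced chunk (i + 1) % n.
    # Phase 2: AllGather, N - 1 steps.
    for s in range(n - 1):
        outgoing = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - s) % n] = outgoing[i]
        steps += 1

    return [np.concatenate(c) for c in chunks], steps
```

For four workers whose vectors hold the constants 0, 1, 2, and 3, every worker ends with the constant 6 after 6 steps.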

2D-Torus AllReduce
Sony [13] proposed the 2D-Torus AllReduce algorithm in 2018 to reduce the communication overhead of model parameter synchronization. The algorithm uses a 2D-Torus topology and synchronizes both horizontally and vertically. As Fig. 1b shows, the communication process consists of three phases: Reduce-Scatter, AllReduce, and AllGather. After the first two phases, each GPU holds a part of the final model parameters. Although 2D-Torus AllReduce has one more phase than Ring AllReduce, its overall communication overhead is still smaller [13]. In a distributed system with $N$ workers, the algorithm requires $4(\sqrt{N}-1)$ communication steps to accomplish the synchronization.
Ying et al. [14] find that, on a TPU Pod, the 1D Ring AllReduce algorithm is limited by the latency of pushing packets in a Hamiltonian circuit across all the nodes in the pod. They develop a 2D-Mesh algorithm, which performs the Reduction operation in two phases, one phase for each dimension. The 2D-Mesh algorithm achieves twice the gradient-aggregation throughput of the 1D Ring AllReduce.
Fig. 1 Topology of AllReduce-based synchronization algorithms

Hierarchical AllReduce
Tencent proposed the hierarchical AllReduce algorithm [15], whose communication topology is shown in Fig. 1c. The algorithm uses a hierarchical ring topology and consists of three phases: intra-group AllReduce, inter-group AllReduce, and broadcast within the group. In a distributed system where $N$ workers are divided into $L$ groups, $3(N/L-1)+2$ communication steps are required to complete the synchronization operation. Compared with Ring AllReduce, this three-phase hierarchical AllReduce decreases the number of steps from $2(N-1)$ to $3(N/L-1)+2$. Facebook [16] applies the hierarchical AllReduce algorithm to large-scale model training, which greatly improves the training speed because the algorithm reduces latency costs.
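As a quick arithmetic check of the step counts quoted above, the following sketch evaluates both expressions. The helper names are hypothetical, and the formulas are taken directly from the text rather than derived here.

```python
def ring_steps(n_workers):
    """Communication steps of Ring AllReduce: 2(N - 1)."""
    return 2 * (n_workers - 1)


def hierarchical_steps(n_workers, n_groups):
    """Communication steps of the hierarchical AllReduce: 3(N/L - 1) + 2."""
    if n_workers % n_groups != 0:
        raise ValueError("the workers must divide evenly into the groups")
    return 3 * (n_workers // n_groups - 1) + 2
```

With $N = 64$ workers in $L = 4$ groups, the counts are 126 steps for the ring and 47 steps for the hierarchical scheme.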
Although this hierarchical method reduces the number of steps, the inter-group operation still suffers from the high latency of the 1D Ring AllReduce. Grouping rings into a few hierarchical collective operations seems to give better performance, but the optimal dimensions of this hierarchical communication depend on multiple factors. Ueno et al. [17] provide a strategy for choosing the optimal hierarchical communication for deep learning workloads.

Synchronous Algorithm 2D-TGA
The effectiveness of a synchronous algorithm is closely related to its communication topology. The 2D-TGA method divides workers into groups to reduce the communication latency and mitigate the low fault tolerance that arise as the number of workers in a ring topology increases.
The 2D-TGA algorithm mainly relies on the logical-ring Reduce-Scatter operation. As illustrated in Fig. 1d, the sixteen workers are divided into four groups, and a leader is chosen from each group. After executing the Ring AllReduce algorithm, all the workers within a group hold the same parameters. By establishing a Cartesian topology for the leaders, the leaders are arranged in a two-dimensional grid. Each leader exchanges parameters with the adjacent leaders in different directions (UP, DOWN, LEFT, RIGHT). The mechanism is described in the following paragraphs.
In a logical ring of $N$ workers $W_0, \ldots, W_{N-1}$, the parameter vector on each worker is divided into $N$ chunks, and we number the chunks $chk_0, \ldots, chk_{N-1}$. We use a generic operator ⊕ to denote the reduce operator. The Reduce-Scatter operation is performed as follows. In the first iteration, worker $W_i$ sends chunk $chk_{(i+N)\%N}$ to $W_{(i+1)\%N}$. After each worker receives the chunk, it combines the chunk with its local copy of the same chunk using ⊕. In the $j$-th iteration, $W_i$ sends the chunk it reduced in the previous iteration, $chk_{(i-j+1+N)\%N}$, to $W_{(i+1)\%N}$, so that after $N-1$ iterations each worker holds one fully reduced chunk.
We illustrate how the 2D-TGA algorithm operates using an example. Figure 2 shows how sixteen workers are divided into four groups of four workers each. Four distinct colors represent the four groups. In addition, we set the dimension of the parameter vector on each worker to 16, and every entry of a worker's parameter vector is set to the serial number of that worker. Running Ring AllReduce after grouping significantly reduces the size of each ring, which reduces the communication delay and improves the communication efficiency. According to the number of workers in the group, the sixteen parameters are divided into four chunks. The data in a red box indicate the data chunk to be sent.
The workers in the first group conduct Ring AllReduce, as shown in Fig. 2a. The leaders perform Reduce-Scatter horizontally in parallel, as seen in Fig. 2b. The orange arrow shows the transmission direction of the parameters: the LEFT worker transmits parameters to the RIGHT worker, and each worker receives parameters from its LEFT worker. Figure 2c depicts the leaders performing Segmented-Ring operations vertically in parallel. The green arrow shows the direction in which the parameters are transferred: the UP worker sends parameters to the DOWN worker, and the DOWN worker receives parameters from the UP worker. The leaders then conduct the AllGather operation horizontally in parallel, as shown in Fig. 2d. Finally, the first group leader executes the broadcast operation, as shown in Fig. 2e. Algorithm 1 depicts the 2D-TGA algorithm.
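The sixteen-worker example can be reproduced with a logical, single-process simulation of the phases. This is a sketch under stated assumptions, not Algorithm 1 itself: the function `two_d_tga` is hypothetical, groups are assumed to be contiguous rank ranges laid out row by row on the leader grid, workers are numbered from 0, the reduce operator ⊕ is taken to be addition, and each ring or Segmented-Ring exchange is replaced by the sum it is meant to produce.

```python
import numpy as np


def two_d_tga(vectors, groups, rows, cols):
    """Logical simulation of the 2D-TGA phases with a sum reduction."""
    n = len(vectors)
    if groups != rows * cols or n % groups != 0:
        raise ValueError("groups must equal rows * cols and divide the worker count")
    size = n // groups
    data = [np.array(v, dtype=float) for v in vectors]

    # Phase 1: AllReduce inside every group (the groups run in parallel).
    for g in range(groups):
        members = range(g * size, (g + 1) * size)
        total = sum(data[w] for w in members)
        for w in members:
            data[w] = total.copy()

    # The lowest-ranked worker of each group is its leader (rows x cols grid).
    leader = [[(r * cols + c) * size for c in range(cols)] for r in range(rows)]

    # Phase 2a: horizontal Reduce-Scatter; column c keeps chunk c of its row sum.
    owned = {}
    for r in range(rows):
        row_sum = sum(data[leader[r][c]] for c in range(cols))
        for c, chunk in enumerate(np.array_split(row_sum, cols)):
            owned[(r, c)] = chunk

    # Phase 2b: vertical reduction of the owned chunk inside every column.
    for c in range(cols):
        col_sum = sum(owned[(r, c)] for r in range(rows))
        for r in range(rows):
            owned[(r, c)] = col_sum.copy()

    # Phase 2c: horizontal AllGather rebuilds the full vector on every leader.
    for r in range(rows):
        full = np.concatenate([owned[(r, c)] for c in range(cols)])
        for c in range(cols):
            data[leader[r][c]] = full.copy()

    # Phase 3: every leader broadcasts the result inside its own group.
    for r in range(rows):
        for c in range(cols):
            lead = leader[r][c]
            for w in range(lead, lead + size):
                data[w] = data[lead].copy()
    return data
```

With sixteen workers numbered 0 to 15, four groups, and a 2 × 2 leader grid, every worker ends with a vector whose sixteen entries all equal 120.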
Here $x \in \mathbb{R}^d$ denotes the model, $d$ denotes the number of sample features, $l(x)$ denotes the loss function, and $r(x)$ denotes the regularization term. Distributed optimization typically turns problem (3) into a consensus problem, as illustrated in (4).
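For reference, one standard way to write problems (3) and (4) that matches these definitions is sketched below. The per-worker loss $l_i$ (the loss over the data held by worker $i$), the global variable $z$, and the placement of $r$ on $z$ are assumptions in the usual consensus form, not a quotation of the original equations.

```latex
\begin{align}
  & \min_{x \in \mathbb{R}^d} \; l(x) + r(x) \tag{3} \\
  & \min_{x_1, \ldots, x_N,\, z} \; \sum_{i=1}^{N} l_i(x_i) + r(z)
    \quad \text{s.t.} \quad x_i = z, \; i = 1, \ldots, N \tag{4}
\end{align}
```

In this form each worker keeps its own copy $x_i$ of the model, and the constraint $x_i = z$ forces all copies to agree with the global variable.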

Write problem (4) in the augmented Lagrangian form:
Here $\{\lambda_i\}$ denote the Lagrangian multipliers, $\rho > 0$ is the penalty parameter, and $\langle \cdot, \cdot \rangle$ denotes the inner product. The convergence speed is also influenced by the penalty parameter $\rho$ [4]. The value of $\rho$ has been studied experimentally in [18], where a lower value was found to make the algorithm converge more quickly. $L(\{x_i\}, z, \{\lambda_i\})$ is minimized by updating $x_i$ and $z$ iteratively (denoted $x_i^k$ and $z^k$ at the $k$-th iteration). To integrate the decomposability of dual ascent with the excellent convergence properties of the method of multipliers, ADMM updates the variables in an alternating fashion, which allows the problem to be decomposed easily. The ADMM update process is as follows:
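A textbook consensus ADMM iteration that fits this description is sketched below in scaled form. The symbol names $\lambda_i$ and $\rho$, the scaling of the multipliers by $1/\rho$, and the labels (5) and (6a) to (6c) are assumptions chosen to agree with the surrounding text; the original equations may differ in scaling or notation.

```latex
\begin{align}
  L_\rho(\{x_i\}, z, \{\lambda_i\})
    &= \sum_{i=1}^{N} l_i(x_i) + r(z)
     + \rho \sum_{i=1}^{N} \langle \lambda_i,\, x_i - z \rangle
     + \frac{\rho}{2} \sum_{i=1}^{N} \lVert x_i - z \rVert_2^2 \tag{5} \\
  x_i^{k+1} &= \arg\min_{x_i} \; l_i(x_i)
     + \frac{\rho}{2} \lVert x_i - z^{k} + \lambda_i^{k} \rVert_2^2 \tag{6a} \\
  z^{k+1} &= \arg\min_{z} \; r(z)
     + \frac{\rho}{2} \sum_{i=1}^{N} \lVert x_i^{k+1} - z + \lambda_i^{k} \rVert_2^2 \tag{6b} \\
  \lambda_i^{k+1} &= \lambda_i^{k} + x_i^{k+1} - z^{k+1} \tag{6c}
\end{align}
```

Because $\rho \langle \lambda_i, x_i - z \rangle + \frac{\rho}{2} \lVert x_i - z \rVert_2^2 = \frac{\rho}{2} \lVert x_i - z + \lambda_i \rVert_2^2 - \frac{\rho}{2} \lVert \lambda_i \rVert_2^2$, the two minimizations above minimize the augmented Lagrangian up to terms that do not depend on the variable being updated. In this form the $z$-update depends on the workers only through the sum of $x_i^{k+1} + \lambda_i^{k}$.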

Distributed algorithm 2D-TGA-ADMM
The iterative phase of the distributed ADMM algorithm is given in (6). This update scheme works well in distributed systems. The workers use Equations (6a) and (6c) in parallel to update the local variables $x_i$ and $\lambda_i$. The global variable $z$ is updated from the sum of $(x_i + \lambda_i)$ over all workers, as indicated in Equ. (6b). Thus, we take $(x_i + \lambda_i)$ as a whole and define it as $w_i$:
$$w_i = x_i + \lambda_i \qquad (7)$$
Communication topology is an important factor that affects the scalability of distributed optimization algorithms. In order to minimize the synchronization time of the model parameters, we adopt the communication-efficient grouping AllReduce based on the two-dimensional torus topology (2D-TGA) proposed in Sect. 3. All $w_i$ on the workers are reduced to $w$ by the synchronous algorithm 2D-TGA. The value of the global variable $z$ is obtained by averaging $w$. The dual variable $\lambda_i$ is then updated according to (6c).
This paper proposes a distributed algorithm, 2D-TGA-ADMM, to handle distributed optimization problems; it combines the ADMM algorithm with the communication-efficient 2D-TGA algorithm. Algorithm 2 depicts the algorithm flow. The 2D-TGA-ADMM algorithm can be implemented in a distributed computing environment such as MPI or Spark.
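The flow described above can be sketched in a few lines. This is a minimal illustration, not Algorithm 2: the local loss is assumed to be least squares so that the local update has a closed form (the experiments in this paper use TRON on logistic regression instead), the regularizer is assumed to be absent from the $z$-update so that $z$ is the plain average of $w$, the dual variable is assumed to be in scaled form, and `allreduce_sum` is a hypothetical stand-in for the 2D-TGA synchronization.

```python
import numpy as np


def allreduce_sum(vectors):
    """Stand-in for the 2D-TGA synchronization: every worker receives the sum."""
    total = np.sum(vectors, axis=0)
    return [total.copy() for _ in vectors]


def consensus_admm(A_list, b_list, rho=10.0, iters=500):
    """Consensus ADMM for min_x sum_i 0.5 * ||A_i x - b_i||^2 (assumed local loss)."""
    n = len(A_list)
    d = A_list[0].shape[1]
    x = [np.zeros(d) for _ in range(n)]
    lam = [np.zeros(d) for _ in range(n)]  # scaled dual variables (assumption)
    z = np.zeros(d)
    for _ in range(iters):
        # Local step: every worker solves its own sub-problem independently.
        for i in range(n):
            lhs = A_list[i].T @ A_list[i] + rho * np.eye(d)
            rhs = A_list[i].T @ b_list[i] + rho * (z - lam[i])
            x[i] = np.linalg.solve(lhs, rhs)
        # w_i = x_i + lam_i is the only quantity that is communicated.
        w = allreduce_sum([x[i] + lam[i] for i in range(n)])
        # Every worker averages the reduced w to obtain the global variable z.
        z = w[0] / n
        # Dual step: performed locally by every worker.
        for i in range(n):
            lam[i] = lam[i] + x[i] - z
    return z
```

Only the reduction of $w_i$ requires communication, which is why the choice of synchronous algorithm affects the wait time but not the iterates themselves.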

Evaluation
This paper provides a theoretical analysis that compares the performance of the AllReduce-based synchronous algorithms in terms of the theoretical time of a single synchronization action of one worker. The notation is provided in Table 1. Based on this notation, the global synchronization time (GST) of the different synchronization algorithms can be calculated.
The design of the 2D-TGA algorithm is based on Ring AllReduce, where the synchronization process consists of several Reduction phases (e.g., Reduce-Scatter and AllGather), each of which is composed of communication steps. The Ring AllReduce algorithm has two phases. The first phase is Reduce-Scatter, which passes through $N-1$ communication steps.
Figure 3a displays the communication time of the Ring-AllReduce, Hierarchical-AllReduce, 2D-Torus-AllReduce, and 2D-TGA algorithms for different #workers. The theoretical results in Fig. 3a are computed with the values in Table 1.
In this paper, the global communication time formula is used to calculate the communication time of the corresponding worker, as shown in the bold formula in Table 2. In Fig. 3a, we set #groups to 16. As #workers increases, the communication time of Ring-AllReduce grows linearly. The communication times of the 2D-TGA and Hierarchical-AllReduce algorithms are comparable and remain steady as #workers increases, which can enhance the scalability of the distributed algorithm. Figure 3a also shows that the 2D-TGA method is slightly better than the Hierarchical-AllReduce algorithm.
In addition, both the 2D-TGA and Hierarchical-AllReduce algorithms leverage the concept of grouping. To investigate the impact of grouping on both algorithms, we increase the number of workers to 1024 and examine how varying #groups affects the communication time of the two algorithms. Figure 3b shows that, as the number of groups increases, 2D-TGA performs much better than the Hierarchical-AllReduce algorithm. The main reason is that the algorithm uses the Segmented-Ring operation in the vertical AllReduce, which can greatly reduce the communication time. Compared with using the AllReduce algorithm directly, the synchronization time is reduced by 1/3, which is also an advantage of the method presented in this paper.
Since grouping impacts the 2D-TGA algorithm, this study estimates the influence of different numbers of groups on the algorithm as the number of workers grows. It can be seen from Fig. 4a that grouping is closely related to algorithm performance. As shown in Fig. 4b, the fewer the groups, the better the performance of the algorithm.

Experiment
Logistic regression is a machine learning method used to solve binary classification problems. To obtain strong generalization ability, one adds an $\ell_2$ regularization term. In this paper, we consider the regularized logistic regression problem, in which $x \in \mathbb{R}^d$ represents the model parameters, $n$ represents the number of samples, $D_i \in \mathbb{R}^d$ represents the $i$-th sample, and $b_i \in \{-1, 1\}$ represents the label of the $i$-th sample.
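A common way to write this objective, together with its gradient, is sketched below. The exact scaling of the two terms is an assumption: the sketch uses $f(x) = \sum_{i=1}^{n} \log(1 + \exp(-b_i D_i^{T} x)) + \frac{\mu}{2} \lVert x \rVert_2^2$ with a hypothetical regularization weight `mu`, and the function names are illustrative.

```python
import numpy as np


def logistic_objective(x, D, b, mu=1.0):
    """Sum of logistic losses plus (mu / 2) * ||x||_2^2; D is n x d, b is in {-1, +1}."""
    margins = b * (D @ x)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.
    return np.sum(np.logaddexp(0.0, -margins)) + 0.5 * mu * (x @ x)


def logistic_gradient(x, D, b, mu=1.0):
    """Gradient of logistic_objective with respect to x."""
    margins = b * (D @ x)
    # 1 / (1 + exp(m)), computed stably as exp(-log(1 + exp(m))).
    sigma = np.exp(-np.logaddexp(0.0, margins))
    return -D.T @ (b * sigma) + mu * x
```

A second-order solver such as TRON would additionally need Hessian-vector products, which are omitted here.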
Experimental Settings. In this section, the distributed ADMM algorithm is used to solve the logistic regression problem with the 2-norm. We combine the ADMM algorithm with different synchronous algorithms to compare their impact on the scalability of the distributed algorithm. To solve the sub-problems in the distributed ADMM method, we employ the Trust Region Newton method (TRON) [19]. The experiments use the public datasets url and webspam, as shown in Table 3. The Tianhe-2 supercomputing cluster serves as the experimental platform. Each node is equipped with two Xeon E5 12-core CPUs and 88 GiB of memory. Our experimental schemes are set as follows: 16 cores × 2 nodes, 16 cores × 4 nodes, 16 cores × 8 nodes, 16 cores × 32 nodes. Each node uses 16 cores, and each core runs one process. Each process represents a worker.
Convergence. This paper uses the relative error function $f_{rerr}$ to measure the convergence of the ADMM algorithm. The definition of $f_{rerr}$ is given in Equ. (9), where $f$ represents the value of the objective function in the current state and $f^*$ represents the minimum value of the objective function. Figures 5b and 6b show the convergence of the ADMM algorithm with three different AllReduce-based synchronous algorithms when solving the logistic regression problem on the url and webspam datasets, respectively, with 64 workers on 4 nodes. As can be seen from the two figures, the distributed ADMM algorithm has the same convergence rate with all three synchronous algorithms, which means that the synchronous algorithms do not affect the convergence of the ADMM algorithm. This setting eliminates interference and allows the performance of the three synchronous algorithms to be tested more accurately.
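For orientation, a commonly used form of such a relative error is shown below. Normalizing by $|f^*|$ is an assumption; the definition in Equ. (9) may differ in detail.

```latex
f_{rerr} = \frac{\lvert f - f^{*} \rvert}{\lvert f^{*} \rvert}
```

Under this form, $f_{rerr}$ approaches zero as the objective value approaches its minimum, which makes curves from different synchronous algorithms directly comparable.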
Synchronization Wait Time. Table 4 shows the running time of the experiments, including the updating time and the synchronization wait time. The updating time refers to the computation time of the TRON method. Because the workers compute at different speeds, in each iteration of the ADMM algorithm we record the longest computation time among the workers and accumulate it over the iterations. The synchronization wait time refers to the time spent communicating the model parameters $w_i$ among the workers and performing the Reduction operations.
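The accumulation rule for the updating time can be stated in one line of code. The function name and the nested-list input layout are hypothetical and serve only to illustrate the rule.

```python
def total_update_time(per_iteration_times):
    """per_iteration_times[k][i] is the compute time of worker i in ADMM iteration k."""
    # The slowest worker determines the cost of each iteration.
    return sum(max(times) for times in per_iteration_times)
```

For two iterations with worker times (1.0, 2.0) and (3.0, 0.5) seconds, the accumulated updating time is 2.0 + 3.0 = 5.0 seconds.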
We select different #workers to test the distributed ADMM algorithm with the three synchronous algorithms. Figure 5a shows the synchronization overhead of the distributed ADMM algorithm on the url public dataset. Compared with the MPI_Allreduce algorithm in the MPI library, the 2D-TGA synchronous algorithm reduces the synchronization wait time by up to 32.6% as #workers increases. However, on the url dataset 2D-TGA still lags behind the Ring-AllReduce synchronous algorithm. As shown in Fig. 5a, we also find that, as #workers increases, the synchronization wait time of 2D-TGA gradually approaches that of the Ring-AllReduce algorithm. This agrees with the theoretical analysis in Fig. 3a: as #workers increases, the communication time of the Ring-AllReduce algorithm becomes higher than that of the 2D-TGA synchronous algorithm proposed in this paper.
In distributed machine learning, the dimension of the model is the decisive factor of the communication volume. To evaluate the effectiveness of the 2D-TGA synchronous algorithm, we test it on the high-dimensional webspam dataset. As shown in Fig. 6a, compared with the collective communication algorithm MPI_Allreduce and the Ring-AllReduce algorithm, 2D-TGA has a clear advantage in synchronization wait time under different #workers, and its efficiency is improved by 33.8% compared with MPI_Allreduce. Why do Ring-AllReduce and 2D-TGA perform differently on datasets with different dimensions? The theoretical analysis in Fig. 3a shows that, as #workers increases, the Ring-AllReduce algorithm performs increasingly worse than the 2D-TGA algorithm. Figure 6a also shows that the 2D-TGA algorithm is better suited to high-dimensional datasets than the Ring-AllReduce algorithm.
The theoretical analysis in Fig. 4 shows that, as #workers increases, the number of groups also affects the performance of the 2D-TGA algorithm. As shown in Fig. 4b, the analysis indicates that, as #workers increases, the more groups there are, the worse the algorithm performs. In this paper, 512 workers are evaluated on the url and webspam datasets, as illustrated in Fig. 7. We test the 2D-TGA algorithm with 4, 16, and 64 groups. Four groups are preferable to sixteen, which agrees with the theoretical analysis. Furthermore, Fig. 4a demonstrates that the number of workers also determines the appropriate size of the grouping.

Conclusion
Synchronization of model parameters is critical in DML. As the number of model parameters and #workers increase, the parameter synchronization mechanism becomes an important factor that limits the scalability of DML. In this paper, a new synchronous algorithm, 2D-TGA, is proposed, which shortens the synchronization time by effectively utilizing the network bandwidth. Firstly, we introduce the topology of this algorithm. Then, we analyze the synchronization time of four AllReduce-based algorithms. The analysis shows that 2D-TGA compares favorably with the other three algorithms. To verify our analysis, we propose an ADMM-based distributed algorithm named 2D-TGA-ADMM, which combines the ADMM algorithm and 2D-TGA. We test it on the Tianhe-2 supercomputing platform to solve logistic regression problems. The synchronization cost of 2D-TGA is verified with different datasets and #workers. Experimental results show that, compared with MPI_Allreduce, this synchronous algorithm reduces the synchronization wait time by 33%. As the number of workers increases, the 2D-TGA algorithm neither increases the synchronization cost nor affects the convergence of the numerical algorithm. It is well suited to large-scale distributed machine learning.
This paper only studies the distributed ADMM algorithm to solve the logistic regression problem for sparse datasets. The next step is to test the algorithm on new unseen data, so as to improve the generalization ability. Since the distributed