1 Introduction

With the rapid development of artificial intelligence, Deep Neural Networks (DNNs) are being employed in a broad spectrum of application domains, from computer vision (Li et al. 2021) and natural language processing (Munir et al. 2019) to big data analysis (Derbeko et al. 2019; Minar and Nather 2018). In each of these domains, the superior accuracy of deep learning comes at the cost of strict computation and storage requirements during the training phase (Sun et al. 2019), because complex network models must iteratively optimize millions of parameters, which incurs high communication time and computational cost (Wang et al. 2020; Cao et al. 2020). To satisfy the computing requirements of deep learning, mobile cloud computing (Hoang et al. 2018) was initially the common choice: the remote cloud has powerful computing capabilities and sufficient resources to solve the above problems by moving data from the edge of the network to a centralized cloud server (Abimbola 2021; Raza et al. 2020). However, transmitting data to the cloud server in this way typically incurs huge communication overhead (Index 2017) because of the long distance between mobile devices and the remote cloud.

Edge computing (Patel et al. 2014) can largely meet the training requirements of DNNs and make up for the latency deficiencies of cloud computing. To process large amounts of input data and effectively improve training efficiency (Chen and Ran 2019), edge computing sinks server resources such as computing and storage down to edge devices, as shown in Fig. 1. However, in actual application scenarios, edge distributed training demands massive gradient exchanges between edge servers and terminal devices, resulting in a high communication cost during training (Tang et al. 2020). This huge communication overhead limits the speed of distributed training, causing many artificial-intelligence applications to fail to satisfy users' needs (Dean and Barroso 2013).

Fig. 1 Distributed DNN model training in edge computing

The main factor affecting the efficiency of edge distributed DNN training is the communication time of each iteration, which depends on two factors: the amount of data exchanged per communication round and the network bandwidth. In edge computing, the network bandwidth cannot be accurately estimated due to the dynamic nature of the network. Therefore, reducing the communication volume per iteration (i.e., compressing the gradients exchanged in each round) is the key to improving the efficiency of distributed DNN training.

To reduce the training time of edge computing systems, distributed algorithms with gradient compression are frequently employed. Gradient compression is mainly divided into two schemes: gradient quantization and gradient sparsification (Wang et al. 2018; Han et al. 2020). The former reduces communication by quantizing each gradient vector into a few bits, while the latter transmits only the important gradients. A variety of quantization-based schemes have been explored for efficient distributed training, including one-bit SGD (Seide et al. 2014), TernGrad (Wen et al. 2017), and QSGD (Alistarh et al. 2017). However, their limited compression ratio makes it difficult to further reduce gradient bits and improve training efficiency. Top-k (Stich et al. 2018; Alistarh et al. 2018) and DGC (Lin et al. 2018) are typical sparsification-based gradient compression schemes: only gradients that exceed a given threshold are propagated to the parameter server. Although Top-k and DGC can exploit very high compression ratios, they may reduce the accuracy of the trained model, especially when an unreasonable sparsification threshold is chosen. To obtain a suitable sparsification threshold, Kuang et al. (2019) proposed an approach named Entropy-based Gradient Compression (EGC). The entropy-based threshold in EGC adjusts itself according to the entropy of the gradients computed for each network layer in each epoch, so distributed training can be performed quickly with only a slight loss of training accuracy. However, this scheme evaluates gradient importance over all model layers, which increases the compression time and results in high communication cost as more edge devices and more complex network models are used.

The above schemes all consider a single gradient compression approach to accelerate distributed DNN training in edge computing, failing to reach an extreme compression ratio. Therefore, we adopt a mixed compression approach that combines gradient quantization and sparsification to achieve a more communication-efficient scheme in edge computing. However, compressing gradients at extremely high compression ratios causes a more severe accuracy loss of the trained model and, as in EGC, increases the computation time as more complex network models are used. To solve this problem, we propose a new gradient compression approach, the Adaptive Sparse Ternary Gradient Compression (ASTC) scheme, which comprehensively considers two indicators: the efficiency and the accuracy of the trained model. ASTC establishes a model compression selection criterion based on the number of gradients, so as to selectively compress specific model layers instead of all layers. An entropy-based gradient sparsification algorithm is also used in ASTC to adaptively determine the compression ratio. To further reduce the communication cost, ASTC combines ternary gradient quantization with a lossless coding scheme. Compared with other gradient compression approaches, ASTC can perform edge distributed DNN training quickly with essentially no loss in the trained model's accuracy. Our contributions in this work include:

(1) We established a model compression selection criterion based on the number of gradients in each model layer. Only the layers that satisfy the criterion are compressed, reducing the compression frequency and time.

(2) We proposed an entropy-based gradient sparsification algorithm that adaptively generates a reasonable threshold from the gradient entropy, thus automatically adjusting the compression ratio. To avoid the accuracy loss caused by gradient sparsification, ASTC employs gradient residual and momentum correction mechanisms. To further optimize the gradient communication cost, ternary quantization and a lossless coding of gradient distances are exploited.

(3) According to the experimental results, the proposed ASTC exhibits better performance on several networks (CNN, LeNet5, ResNet18) over public datasets (MNIST, CIFAR-10, Tiny ImageNet). Compared with Top-1, AdaComp and SBC, the training efficiency of ASTC is increased by about 1.6, 1.37 and 1.1 times, respectively, and its training accuracy is about \(1.9\%\) higher on average.

2 Related work

To reduce the communication cost in distributed DNN training, researchers have proposed many gradient compression schemes. Conventional gradient compression methods are mainly divided into gradient quantization, gradient sparsification, and gradient sparse quantization.

2.1 Gradient quantization

Gradient quantization reduces the communication cost by quantizing each gradient into a few bits; determining the quantization standard is the crucial factor for achieving high-efficiency communication. Seide et al. (2014) proposed 1-bit SGD, which quantized gradients to one bit while feeding back the quantization error. SIGNSGD (Bernstein et al. 2018) performed one-bit quantization by transmitting only the sign of each stochastic gradient component. Although 1-bit SGD and SIGNSGD communicate fewer bytes, they lower the accuracy of the trained model. Quantized SGD (QSGD) (Alistarh et al. 2017) allowed users to smoothly trade off communication bandwidth against convergence time, considering both communication efficiency and accuracy: working nodes can adjust the number of bits in each iteration, but at the cost of higher variance. Wen et al. (2017) developed TernGrad, which significantly reduced the communication time by quantizing gradients into three levels of the form {−1, 0, 1}. However, these schemes share some shortcomings. For example, the maximum compression ratio can only reach 1/32, and the approaches fail to learn the gradients accurately, making them unable to converge to the true optimum in batch mode. In addition, they are incompatible with non-smooth regularization functions, which slows down convergence.

2.2 Gradient sparsification

Gradient sparsification aims to transmit only significant gradients in each iteration, reducing the number of gradients sent to the parameter server. Different sparsification schemes are distinguished by their criterion for selecting significant gradients; random sparsification and deterministic sparsification are the common categories. Random sparsification refers to the random selection of some gradients for communication and update, called random-k, where k represents the number of selected gradients. Konečný et al. (2016) utilized a random mask to sparsify the gradient matrix H into \(\hat{H}\); the random mask is created by the server and then distributed to the working nodes in each iteration. Wangni et al. (2018) proposed a random-deletion compression that minimized the encoding length of gradients by randomly deleting gradient coordinates, thereby reducing the communication overhead. However, as the number of edge devices increases, the accuracy of models trained with these two schemes decreases markedly. Unlike random sparsification, deterministic sparsification (Strom 1997) requires a threshold, and only gradients larger than the threshold are sent. The threshold can be either predefined or adaptive. Strom (2015) discarded gradients whose absolute values were less than a predefined threshold. However, determining a correct threshold is challenging: if the threshold is chosen unreasonably, the accuracy of the trained model is seriously harmed. To avoid fixing the threshold value, Top-k sparsification (Stich et al. 2018; Alistarh et al. 2018) selects the k largest gradients (in terms of absolute value) in each iteration. Dryden et al. (2016) proposed an adaptive threshold scheme that used a fixed percentage \(\pi \) to indicate the proportion of both positive and negative gradients. Aji and Heafield (2017) proposed another adaptive threshold approach that chose a single absolute threshold instead of two. Kuang et al. (2019) proposed the Entropy-based Gradient Compression scheme (EGC); more precisely, EGC used an entropy-based threshold to select gradients in each neural network layer. Compared with the Top-k approach, EGC had less communication overhead and improved the accuracy of the trained model. However, this scheme compressed all layers of the model, which greatly increased the compression time; moreover, as more edge devices are added, the communication overhead rises further.

2.3 Gradient sparse quantization

To further reduce the communication time of distributed training, some researchers combine gradient quantization and gradient sparsification to achieve higher compression ratios, an approach called gradient sparse quantization. Considering that most gradient compression techniques are not suitable for convolutional neural networks, and that differences among network layers, mini-batch sizes, and other factors also influence the achievable compression rate, Chen et al. (2018) proposed AdaComp, which quantizes unsent gradients before adding them to the gradient residual for subsequent transmission; it automatically adjusts the compression rate depending on local activities. Sattler et al. (2019) proposed Sparse Binary Compression (SBC), which combines gradient sparsification with a binarization method to reach a new compression ratio. SBC allows a smooth balance between gradient sparsification and temporal sparsification to adapt to learning tasks, but it sacrifices some accuracy of the trained model. As an improvement, Sattler et al. (2019) proposed Sparse Ternary Compression (STC), which combines Top-k sparsification with ternary quantization to further raise the compression ratio. Unlike SBC, STC is well suited to federated learning.

3 Overall framework

The overall framework of ASTC proposed in this paper relies on a central parameter server and working nodes. Distributed DNN training in edge computing includes the stage of training the local model at the working nodes and the stage of distributing the global model from the central parameter server. Suppose there are m working nodes and one central parameter server in the edge computing system. Each working node first calculates partial gradients based on its local training data using ASTC. Specifically, ASTC first establishes a model compression selection criterion based on the number of gradients and compresses the layers that meet the criterion. Second, ASTC uses the entropy-based gradient sparsification algorithm to determine the compression ratio adaptively; to prevent excessive sparsification, gradient residual and momentum correction are employed. Third, ternary gradient quantization and a lossless coding of gradient distances are exploited to further reduce the communication cost before the compressed gradients are uploaded. On the server side, after all working nodes upload their ASTC-compressed gradients, the parameter server first decodes them with the Golomb-Rice decoding approach. It then aggregates the decoded local gradients into a global gradient to update the global model. Finally, the parameter server broadcasts the updated global model to all working nodes. At this point, one round of DNN training in edge computing finishes. This process repeats until the global model converges to a satisfactory accuracy. The overall framework of ASTC is shown in Fig. 2.

Fig. 2 General idea of ASTC

In summary, ASTC includes the following three steps:

(1) Model compression selection criterion based on the number of gradients: selectively compress the layers of the model that satisfy the criterion, reducing the frequency and time of compression.

(2) Entropy-based gradient sparsification algorithm: calculate the gradient entropy of the current layer and use the hyperparameter K to adaptively determine the threshold. After obtaining the threshold, sparsify the gradients and dynamically choose which ones to send. To prevent excessive sparsification, gradient residual and momentum correction are employed.

(3) Ternary gradient quantization algorithm: quantize the sparse gradients, converting them into a set of ternary tensors. To further reduce the communication cost, use Golomb-Rice coding on the distances between non-zero entries of the ternary tensor instead of uploading absolute gradient positions.
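To make the above division of labor concrete, the following minimal Python sketch simulates several such training rounds on a toy quadratic objective. The compress/decompress pair is a deliberately simplified stand-in for ASTC (magnitude-based sparsification plus one shared ternary-style magnitude), and all names, constants, and the toy loss are illustrative assumptions rather than our actual implementation.

```python
# Toy simulation of one parameter-server round: m workers upload compressed
# gradients, the server decodes, aggregates, updates, and broadcasts.
import numpy as np

def compress(grad, ratio=0.01):
    """Simplified stand-in for ASTC: keep the largest-magnitude gradients
    and quantize them to a single shared magnitude (ternary-style)."""
    k = max(1, int(ratio * grad.size))
    idx = np.argsort(np.abs(grad))[-k:]      # positions of the kept gradients
    eta = np.abs(grad[idx]).mean()           # one shared magnitude
    return idx, np.sign(grad[idx]), eta

def decompress(idx, signs, eta, size):
    """Server-side decoding back into a dense gradient vector."""
    g = np.zeros(size)
    g[idx] = signs * eta
    return g

rng = np.random.default_rng(0)
dim, m, lr = 1000, 4, 0.5
theta = rng.normal(size=dim)                 # global model parameters

for rnd in range(200):
    payloads = []
    for i in range(m):                       # each worker uploads compressed grads
        local_grad = theta + 0.01 * rng.normal(size=dim)  # grad of 0.5*||theta||^2
        payloads.append(compress(local_grad))
    agg = sum(decompress(*p, dim) for p in payloads) / m  # server aggregates
    theta = theta - lr * agg                 # update and (implicitly) broadcast

print("final loss:", 0.5 * float(theta @ theta))
```

In the real system, the compression step would comprise steps (1)–(3) above, and the server would additionally perform Golomb-Rice decoding before aggregation.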

4 Adaptive sparse ternary gradient compression

In this section, we provide a comprehensive introduction to ASTC. First, we employ a model compression selection criterion based on the number of gradients to determine the layers to be compressed. Then, the entropy-based gradient sparsification algorithm is exploited to adaptively select the important gradients to send. Finally, we propose a ternary gradient quantization algorithm to reduce the communication cost of distributed training in edge computing.

4.1 Model compression selection criterion by gradients’ amount

ASTC does not compress all gradients but selects gradients generated by the specific layers that satisfy the selection criterion, which is as follows,

$$\begin{aligned} \varepsilon = \frac{G_{all}}{Layer^{(Layer/2)}} + a \cdot Layer^{2}, \end{aligned}$$
(1)

where \({G_{all}}\) represents the sum of gradients in all layers of the model, Layer means the number of layers, and a refers to an adjustable constant parameter. Whether the current layer needs compression is determined by judging whether its number of gradients meets the selection criterion, as shown in Eq. (2),

$$\begin{aligned} \sigma _l = \begin{cases} 1 &amp; \text {if } G_l \ge \varepsilon \\ 0 &amp; \text {if } G_l < \varepsilon , \end{cases} \end{aligned}$$
(2)

where \(\sigma _l = 1\) means that the current layer is compressed, while \(\sigma _l = 0\) means that it is not. \({G_l}\) represents the number of gradients in the current layer.

The objective of the selection criterion is to leave layers with few gradients and little communication overhead uncompressed, and to compress the layers with more gradients.
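Read literally, Eqs. (1) and (2) reduce to a few lines of code. The sketch below is our own illustration, assuming the per-layer gradient counts are known in advance; the helper name layers_to_compress is hypothetical.

```python
def layers_to_compress(layer_sizes, a=2.0):
    """Eqs. (1)-(2): sigma_l = 1 iff the gradient count G_l of layer l
    reaches the selection threshold epsilon."""
    G_all = sum(layer_sizes)                      # total gradients over all layers
    L = len(layer_sizes)                          # number of layers
    eps = G_all / (L ** (L / 2)) + a * L ** 2     # Eq. (1)
    return [int(G_l >= eps) for G_l in layer_sizes]  # Eq. (2)

# A 4-layer toy model: only the two large layers are selected for compression.
print(layers_to_compress([500, 120_000, 80_000, 10]))  # -> [0, 1, 1, 0]
```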

4.2 Entropy-based Gradient Sparsification Algorithm

4.2.1 Entropy-based Threshold Selection Algorithm

If the current layer is determined to be compressed according to the above selection criterion, the sparsification process is performed first. The most critical point of sparsification is choosing an appropriate threshold. We propose the Entropy-based Threshold Selection Algorithm, which uses information entropy to select the threshold adaptively. Information entropy indicates the degree of disorder, or the uncertainty, of information: the larger its value, the more information is carried. In ASTC, we calculate the entropy of each layer to evaluate the importance of its gradients.

Algorithm 1 Entropy-based Threshold Selection Algorithm

To compute the entropy of gradients, firstly, all of the gradient matrices of each layer are used as the evaluation set, which is divided into N different containers. Secondly, the number of gradients h(i) is computed in each container. Thirdly, Equation (3) is used to calculate the gradient entropy of the current layer,

$$\begin{aligned} H_l = - \sum \limits _{i = 1}^{N} p(i) \log _2 p(i), \end{aligned}$$
(3)

where \(p(i) = \frac{h(i)}{\sum _{i = 1}^{N} h(i)}\) represents the probability that a gradient falls into container i. We utilize the classical information entropy formula to measure the importance of gradients in each network layer.

After obtaining the gradient entropy, we use the Entropy-based Threshold Selection Algorithm to adaptively choose the threshold in two stages. At the first stage, we employ a fixed ratio K to indicate the proportion of gradients to be sent. At the second stage, the algorithm computes the threshold \(\tau _l = \frac{H_l}{K}\), where \(\tau _l\) refers to the threshold of layer l.

After computing the threshold \(\tau _l\) with Algorithm 1, the critical gradients whose absolute values are larger than \(\tau _l\) are sent to the parameter server. This sparse processing makes the number of transmitted gradients in each layer much smaller than the original gradient matrices. If the unsent gradients were simply discarded, the convergence and training accuracy of the model would be significantly harmed. Therefore, gradient residual and momentum correction are introduced to improve the accuracy of the trained model.
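To illustrate the procedure described above, the following sketch bins a layer's gradients into N containers, computes \(H_l\) via Eq. (3), and sets \(\tau _l = H_l/K\). The bin count, the value of K, and the helper names are illustrative assumptions, not the exact Algorithm 1.

```python
import numpy as np

def entropy_threshold(grads, n_bins=64, K=10.0):
    """One plausible realization of Algorithm 1: bin the layer's gradients
    into N containers, compute H_l via Eq. (3), return tau_l = H_l / K."""
    h, _ = np.histogram(grads, bins=n_bins)   # h(i): gradient count per container
    p = h[h > 0] / h.sum()                    # p(i); empty bins contribute nothing
    H_l = -np.sum(p * np.log2(p))             # Eq. (3)
    return H_l / K                            # tau_l

def sparsify(grads, tau):
    """Keep only gradients whose absolute value exceeds tau_l."""
    mask = np.abs(grads) > tau
    return np.where(mask, grads, 0.0), mask

rng = np.random.default_rng(0)
g = rng.normal(size=100_000)                  # stand-in for one layer's gradients
tau = entropy_threshold(g)
sparse_g, mask = sparsify(g, tau)
print(f"tau_l = {tau:.3f}, sent {mask.mean():.1%} of gradients")
```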

4.2.2 Gradient residual and momentum correction

To a certain extent, the gradient residual follows a delayed update strategy: after each batch, gradients are accumulated locally at each node until they exceed the threshold. When the gradient residual approach is used to update the gradients, some gradients become stale, which harms the accuracy of the trained model. Inspired by momentum SGD (Qian 1999) and deep gradient compression (Lin et al. 2018), momentum correction is used to solve this problem.

In distributed momentum SGD, the update of the parameters in the central parameter server is shown in Eq. (4),

$$\begin{aligned} \begin{array}{l} \mu _t = mom \cdot \mu _{t-1} + \sum \limits _{i = 1}^{m} g_i(\theta _t) \\ \theta _{t+1} = \theta _t - \alpha \mu _t , \end{array} \end{aligned}$$
(4)

where mom refers to the momentum. If the momentum SGD is directly applied to the gradient residual, the rule of updating parameters is equivalent to Eq. (5),

$$\begin{aligned} \begin{array}{l} v_{i,t} = v_{i,t-1} + g_i(\theta _t) \\ \mu _t = mom \cdot \mu _{t-1} + \sum \limits _{i = 1}^{m} sparse(v_{i,t}) \\ \theta _{t+1} = \theta _t - \alpha \mu _t , \end{array} \end{aligned}$$
(5)

where \({v_{i,t}}\) refers to the gradient accumulation at each node and \(sparse(\cdot )\) represents the sparsification function: an accumulated value exceeding the threshold inside \(sparse(\cdot )\) is sent to the parameter server. However, directly applying momentum SGD to the gradient residual may cause a wrong optimization direction. Under high gradient sparsification, the interval between updates of outdated gradients increases sharply, causing a severe deterioration of the model's performance. To avoid this loss, momentum correction is applied on top of Eq. (5) to ensure that the sparse update is equivalent to the dense update of Eq. (4). The update scheme is as follows,

$$\begin{aligned} \begin{array}{l} \mu _{i,t} = mom \cdot \mu _{i,t-1} + g_i(\theta _t) \\ v_{i,t} = v_{i,t-1} + \mu _{i,t} \\ \theta _{t+1} = \theta _t - \alpha \cdot \sum \limits _{i = 1}^{m} sparse(v_{i,t}) , \end{array} \end{aligned}$$
(6)

where the first two lines of Eq. (6) represent the accumulation of gradients with momentum correction. The accumulated result \({v_{i,t}}\) is used for the subsequent sparse processing and the parameter update.
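The worker-side update of Eq. (6) can be sketched as follows. Clearing the sent entries from the residual buffer is an assumption borrowed from common residual-accumulation schemes such as DGC; the threshold \(\tau \) would come from Algorithm 1.

```python
import numpy as np

def local_step(grad, mu, v, tau, mom=0.9):
    """Worker-side update of Eq. (6): momentum-corrected accumulation,
    then threshold sparsification. mu is the local momentum buffer and
    v the gradient residual; sent entries are cleared from v (an
    assumption borrowed from residual-accumulation schemes)."""
    mu = mom * mu + grad                  # mu_{i,t} = mom*mu_{i,t-1} + g_i(theta_t)
    v = v + mu                            # v_{i,t}  = v_{i,t-1} + mu_{i,t}
    mask = np.abs(v) > tau                # sparse(.): entries above the threshold
    sent = np.where(mask, v, 0.0)         # uploaded to the parameter server
    v = np.where(mask, 0.0, v)            # keep the rest as residual
    return sent, mu, v

rng = np.random.default_rng(1)
mu, v = np.zeros(8), np.zeros(8)
for t in range(4):
    sent, mu, v = local_step(rng.normal(size=8), mu, v, tau=2.0)
    print(f"t={t}: sent {np.count_nonzero(sent)}/8 entries")
```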

4.3 Ternary Gradient Quantization Algorithm

4.3.1 Ternary quantization

When each node has obtained sparse gradients \({g_t} = sparse({g_i}({\theta _t}))\) using the above sparsification scheme, ASTC performs ternary quantization to further reduce the communication cost before sending the gradients to the central parameter server. Ternary quantization maps all gradients into \(\{ + \eta ,0, - \eta \} \), where \(\eta \) refers to the threshold. There are three cases:

(1) If the gradient is positive and larger than the threshold \(\eta \), compress it to \( + \eta \).

(2) If the gradient is negative and its absolute value is larger than the threshold \(\eta \), compress it to \( - \eta \).

(3) In all other cases, compress the gradient's value to 0.

In theory, ternary gradient quantization can reduce the traffic from a node to the central parameter server by a factor of up to \(\left\lfloor \frac{32}{\log _2 3} \right\rfloor = 20\). Even when each ternary gradient is stored in 2 bits, the reduction factor is still 16.
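A minimal sketch of this quantization step, mapping each sparse gradient to \(\{ + \eta , 0, - \eta \} \) exactly as in cases (1)–(3) above; the sample values are illustrative.

```python
import numpy as np

def ternarize(grads, eta):
    """Map each gradient to {+eta, 0, -eta} following cases (1)-(3)."""
    out = np.zeros_like(grads)
    out[grads > eta] = eta                # case (1): positive and above threshold
    out[grads < -eta] = -eta              # case (2): negative, |g| above threshold
    return out                            # case (3): all other entries stay 0

g = np.array([0.9, -0.2, 0.1, -1.4, 0.5])
print(ternarize(g, eta=0.4))              # -> [ 0.4  0.   0.  -0.4  0.4]
# Each entry now needs at most 2 bits instead of 32, a 16x reduction;
# the information-theoretic limit is floor(32 / log2(3)) = 20x.
```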

4.3.2 Lossless code compression

To transmit the set of sparse ternary tensors produced by the above compression scheme, only the positions and the quantized values of the non-zero gradients have to be sent. Compared with transmitting absolute gradient positions, transmitting the distances between them further reduces the communication time; encoding these distances with Golomb-Rice coding brings the average size per transmitted gradient down to about 10–11 bits (Strom 2015). The Golomb-Rice Location Coding Algorithm is as follows.

Algorithm 2 Golomb-Rice Location Coding Algorithm

The first line of Algorithm 2 selects the non-zero gradients, and lines 2–4 compute the distance d between each non-zero gradient and the previous one. Line 5 encodes d with Golomb-Rice location coding, where q represents the coding quotient and r the coding offset; unary coding is adopted for q and binary coding for r.
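The following sketch illustrates the position-coding idea of Algorithm 2 under the standard Rice-coding convention with parameter \(M = 2^b\): each distance d is split into a quotient \(q = \lfloor d/M \rfloor \), sent in unary, and an offset \(r = d \bmod M\), sent in b-bit binary. The choice of b and the gap convention for the first non-zero entry are our assumptions.

```python
import numpy as np

def rice_encode(d, b=4):
    """Rice code for one distance d with parameter M = 2**b: quotient
    q = d // M in unary (q ones, terminating zero), offset r = d % M
    in b-bit binary -- line 5 of Algorithm 2."""
    q, r = d >> b, d & ((1 << b) - 1)
    return "1" * q + "0" + format(r, f"0{b}b")

def encode_positions(sparse_grads, b=4):
    """Lines 1-4 of Algorithm 2: locate non-zero gradients and encode the
    distance from each one to its predecessor (the first gap is counted
    from position -1, an illustrative convention)."""
    pos = np.flatnonzero(sparse_grads)
    gaps = np.diff(pos, prepend=-1)
    return "".join(rice_encode(int(d), b) for d in gaps)

g = np.zeros(64)
g[[3, 10, 11, 40]] = 1.0
bits = encode_positions(g)
print(f"{len(bits)} bits vs {4 * 6} bits for four absolute 6-bit positions")
```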

5 Experiments

5.1 Experiment settings

The experiments are carried out with two DELL PowerEdge R740 servers and 20 CPU nodes. Each server is equipped with two 28-core Intel Xeon Platinum 8180M CPUs and one NVIDIA GeForce RTX 3090 GPU. The memory capacity of each server is 93 GB. Each CPU node is equipped with two 10-core Intel Xeon E5-2660-V3 CPUs. The memory capacity of each CPU node is 32 GB. The software platform of our experiments is PyTorch, a deep learning platform that provides a high degree of flexibility and efficiency. Our code is open-source and publicly available at https://github.com/wxf980218/edgeASTC.

We evaluate our proposed approach on three different learning tasks and compare its performance under 12 and 20 nodes with state-of-the-art algorithms in a range of edge computing environments. Specifically, convolutional neural networks of different sizes are trained on image datasets of varying complexity for the edge computing task of image classification, i.e., CNN on MNIST, CNN on CIFAR-10, LeNet5 on CIFAR-10 and ResNet18 on Tiny ImageNet. We split the training data among the nodes in a balanced way, so that the number of training samples and their distribution are homogeneous; the heterogeneity of nodes is not considered. The architectures of the models used in the evaluation are as follows.

CNN on MNIST: We utilize a CNN with two convolution layers (\(5\times 5\) filters and ReLU activation), two pooling layers, two fully connected layers, and a 10-way softmax from Chen et al. (2018) on MNIST (LeCun 1998). MNIST is a handwritten-digit gray image dataset; each image consists of \(28\times 28\) pixels with an 8-bit gray matrix. It includes 60,000 training images and 10,000 test images across 10 categories.

CNN on CIFAR-10: Another CNN with three convolution layers (\(5\times 5\) filters and ReLU activation), one pooling layer, one fully connected layer, and a 10-way softmax is trained on the CIFAR-10 (Krizhevsky et al. 2014) dataset to evaluate our proposed approach. CIFAR-10 consists of 50,000 color training images and 10,000 test images of \(32\times 32\) pixels in 10 categories, with 6,000 images per category.

LeNet5 on CIFAR-10: We run experiments with LeNet-5 (LeCun et al. 1998) on CIFAR-10 for further evaluation. LeNet-5 is a very efficient convolutional neural network for character recognition, with seven layers: two convolution layers, two pooling layers and three fully connected layers. Each layer contains trainable parameters and multiple feature maps that extract input features through a convolution filter.

ResNet18 on Tiny ImageNet: To evaluate the effectiveness of ASTC, it is applied to a more complicated CNN model, i.e., ResNet18 (He et al. 2016), and a larger dataset, i.e., Tiny ImageNet. ResNet, short for deep residual network, was originally proposed with up to 152 network layers. We use a simplified 18-layer version, ResNet18, to solve image classification problems: 8 ResNet blocks totaling 16 convolution layers with \(3\times 3\) filters, batch normalization, ReLU activation, and a final FC layer with a 1K softmax. Tiny ImageNet is similar to the ImageNet (Deng et al. 2009) challenge; it has 200 classes, each with 500 training images, 50 validation images, and 50 test images.

The models and datasets used in our experiments are sufficient for evaluating our compression scheme ASTC and demonstrate that it outperforms the benchmark approaches. The benchmark approaches for experimental comparison are as follows:

(1) Top-k (Stich et al. 2018; Alistarh et al. 2018): for Top-k sparse compression, k = 1 (in terms of absolute value) is selected in each iteration for the comparison experiments.

(2) AdaComp (Chen et al. 2018): AdaComp is based on the local selection of gradient residuals, adaptively adjusting the compression rate according to local activities.

(3) SBC (Sattler et al. 2019): SBC combines the Top-1 scheme of gradient sparsification with a novel binary quantization method.

ASTC is experimentally compared with the Top-1, AdaComp, and SBC approaches in terms of training efficiency and training accuracy. The experimental settings are shown in Table 1.

Table 1 Parameter settings

5.2 Analysis of results

5.2.1 Analysis of parameter values

Different values of a result in different selection criteria based on Eq. (1), which influences the overall effect of ASTC. In this section, three different a values are selected for comparison experiments. Table 2 shows the results of distributed training with different a values under 12 nodes, where each value is the average of 10 independent experiments.

The experimental results of training MNIST on the CNN model are shown in the second and third columns of Table 2. When a = 2, the average iteration time is the shortest, i.e., 0.64 s, and the accuracy is the highest, i.e., \(93.1\%\).

When the dataset is replaced with CIFAR-10, the experimental results are shown in the 4th and 5th columns of Table 2. The average time per iteration for the three a values is 0.59 s, 0.57 s, and 0.60 s, respectively. The minimum average time per iteration is obtained at a = 2, at which the accuracy also peaks at \(60.5\%\).

When the CIFAR-10 dataset is trained on the LeNet5 model, the experimental results are shown in the 6th and 7th columns of Table 2. When a = 2, ASTC achieves the best training efficiency and accuracy.

When scaling to the more complex model ResNet18 and the large-scale dataset Tiny ImageNet, as shown in the last two columns, ASTC exhibits a more obvious optimization effect on both the average time per iteration and the accuracy. The model compression selection criterion in ASTC shows its advantages better on ResNet18, which has more network layers than CNN and LeNet5: in CNN and LeNet5, small layers with fewer gradients are left uncompressed, so the benefit is limited, whereas for the more complex ResNet18, ASTC can more purposefully choose larger layers with more gradients for compression, which effectively reduces the communication overhead of transmitting large numbers of gradients. In this case, the accuracy can also be improved to a certain extent with a reasonable parameter a.

Table 2 Results of different a values of ASTC under 12 nodes

Increasing the number of nodes to 20 for further evaluation, the experimental results are shown in Table 3. It can be seen from Table 3 that the trained model achieves the highest accuracy and the shortest average time per iteration at a = 2. According to the selection criterion of Eq. (1), different values of a change the criterion value \(\varepsilon \), which decides how many layers meet the selection criterion and hence the model compression time. The following experiments are carried out with the optimal parameter \(a=2\).

Table 3 Results of different a values of ASTC under 20 nodes

5.2.2 Analysis of training efficiency

Figures 3, 4, 5 and 6 show the average iteration time of Top-1, AdaComp, SBC, and ASTC for training the MNIST and CIFAR-10 datasets on the CNN model, the CIFAR-10 dataset on the LeNet5 model, and Tiny ImageNet on the ResNet18 model under 12 nodes. The average iteration time refers to the average time per iteration until the trained model reaches convergence. It measures communication efficiency: the shorter the average iteration time required to reach the same target accuracy, the higher the communication efficiency of the algorithm. It can be seen from the figures that AdaComp, SBC and ASTC outperform the traditional Top-1 in terms of training efficiency.

The experimental results of training MNIST on the CNN model are shown in Fig. 3. The average iteration time of the four approaches is 0.95 s, 0.82 s, 0.65 s, and 0.60 s, respectively. The training efficiency of AdaComp, SBC and ASTC is 1.19, 1.46 and 1.58 times that of Top-1, respectively, with ASTC showing the highest training efficiency.

Fig. 3 Average iteration time of CNN on MNIST under 12 nodes

Fig. 4 Average iteration time of CNN on CIFAR-10 under 12 nodes

Fig. 5 Average iteration time of LeNet5 on CIFAR-10 under 12 nodes

Fig. 6 Average iteration time of ResNet18 on Tiny ImageNet under 12 nodes

The results of replacing the dataset with CIFAR-10 are shown in Fig. 4. Among the four approaches, the average time per iteration of ASTC is the shortest, i.e., 0.57s. Compared with Top-1, AdaComp and SBC, the average iteration time using ASTC is decreased by \(38\%\), \(27\%\), and \(9\%\), respectively.

When the model is replaced with the more complex LeNet5 on the same CIFAR-10 dataset, the results are shown in Fig. 5. The average iteration time of all four approaches is longer than on the CNN model due to the model's complexity. The training efficiency of ASTC remains the highest, about 1.6 times that of Top-1. Although the training efficiency of SBC is also high, ASTC still exceeds it by about \(16\%\).

Since the models and datasets used so far are still simple and small, we employ a more complex CNN model, ResNet18, and a large dataset, Tiny ImageNet, to further explore the effectiveness of ASTC. The results in Fig. 6 show that, compared with the baselines, the average iteration time of ASTC is the smallest, i.e., 3.60 s, decreased by \(43\%\), \(34\%\), and \(17\%\), respectively. ASTC outperforms all baselines by a wide margin. Therefore, as the model and dataset become more complex, ASTC provides an increasingly useful way to achieve communication-efficient distributed training.

Fig. 7 Average iteration time of CNN on MNIST under 20 nodes

Fig. 8 Average iteration time of CNN on CIFAR-10 under 20 nodes

Fig. 9 Average iteration time of LeNet5 on CIFAR-10 under 20 nodes

Fig. 10 Average iteration time of ResNet18 on Tiny ImageNet under 20 nodes

The number of nodes is increased to 20 for extended evaluation, and the results are shown in Figs. 7, 8, 9 and 10. They confirm that Top-1 has the longest training time, followed by AdaComp, SBC and ASTC in descending order.

AdaComp can adaptively adjust the degree of sparse compression according to local activities. Although its degree of sparse compression is not as high as that of Top-1, the quantization of the gradient residual further reduces the communication cost; in our experiments, the training efficiency of AdaComp is 1.17 times that of Top-1. SBC adds a binary quantization scheme on top of Top-1, thereby significantly increasing the training efficiency, to about 1.46 times that of Top-1. ASTC achieves the highest training efficiency, about 1.6 times that of Top-1, 1.37 times that of AdaComp, and 1.11 times that of SBC.

The reasons why the baselines show different communication compression effects are as follows. First, Top-1 has the longest average iteration time in edge distributed training because it relies solely on sparsifying the model's updated gradients, whereas the other three schemes adopt mixed gradient compression based on both sparsification and quantization. Second, to capture the correct gradient residual throughout a layer, AdaComp requires a sufficiently small sampling window, which increases the computation time of distributed training. At the same time, AdaComp's adaptive gradient compression algorithm must judge, for each network layer, whether the gradient residual of the last iteration plus the latest gradient multiplied by the scaling factor exceeds the threshold, thereby increasing the delay of edge distributed training. Third, the average iteration time of SBC is less than that of AdaComp because SBC further adds delayed communication and lossless coding to gradient sparsification and quantization, achieving a higher gradient compression ratio. Finally, the proposed ASTC improves communication efficiency most noticeably. Unlike AdaComp, ASTC does not compress the gradients of every network layer; the model compression selection criterion based on the number of gradients reduces the number of gradients submitted to the parameter server. In addition, its hybrid gradient compression, including sparsification and quantization, achieves superior compression results.

5.2.3 Analysis of training accuracy

Table 4 shows the accuracy of Top-1, AdaComp, SBC, and ASTC on the CNN model with the MNIST and CIFAR-10 datasets, on the LeNet5 model with the CIFAR-10 dataset, and on the ResNet18 model with the Tiny ImageNet dataset.

The accuracy rates of Top-1, AdaComp, SBC and ASTC, when using the CNN model to train MNIST on 12 nodes and 20 nodes, are \(91.1\%\), \(92.2\%\), \(90.9\%\), \(93.1\%\) and \(88.9\%\), \(89.7\%\), \(89.2\%\), \(91.3\%\), respectively. ASTC shows the highest accuracy, about \(2.2\%\) higher than Top-1, about \(1.2\%\) higher than AdaComp, and about \(2.5\%\) higher than SBC.

The results of training the CIFAR-10 dataset on the CNN model are shown in the 4th and 5th columns of Table 4. ASTC still shows the highest accuracy at 12 nodes and 20 nodes, that is, \(61.2\%\) and \(59.2\%\) respectively, which is followed by AdaComp with the accuracy rates of \(60.2\%\) and \(58.3\%\), and Top-1 with the accuracy rates of \(59.2\%\) and \(57.5\%\), respectively. SBC has the lowest accuracy rates of \(58.9\%\) and \(57.3\%\), respectively.

The results of the training of CIFAR-10 on the LeNet5 model are shown in the 6th and 7th columns of Table 4. The accuracy of ASTC remains the highest regardless of whether it has 12 nodes or 20 nodes.

Considering that the above models and datasets are simple and small, the complex CNN model ResNet18 and the large dataset Tiny ImageNet are employed for a further accuracy evaluation. The last two columns of Table 4 show that none of the communication compression approaches achieves a satisfactory accuracy. However, ASTC still presents the highest accuracy at both 12 and 20 nodes. Moreover, the accuracy improvement of ASTC is more obvious than in the above three groups of experiments: about \(8\%\) higher than Top-1, about \(5\%\) higher than AdaComp, and about \(10\%\) higher than SBC.

We summarize the reasons for the above experimental results. First, SBC achieves the highest compression ratio at the cost of sacrificing a certain amount of model accuracy, leading to the lowest accuracy among the four approaches. Second, since Top-1 only adopts sparse gradient compression, its compression ratio is not as high as that of SBC, so its model accuracy decreases less significantly. Third, the model accuracy of AdaComp is higher than that of SBC and Top-1 because AdaComp adaptively adjusts the compression ratio across batches, training epochs, network layers, and containers via scaling factors. Finally, the proposed ASTC enjoys the highest model accuracy because it does not sparsify the gradients of every layer of the training model, alleviating the negative impact of lossy compression on accuracy; at the same time, ASTC introduces the gradient residual and momentum correction mechanisms to further improve the convergence accuracy of the model.

Table 4 Results of accuracy rates

5.2.4 Analysis of convergence

Figures 11, 12, 13 and 14 show the training loss curves of Top-1, AdaComp, SBC, and ASTC on the CNN model with the MNIST and CIFAR-10 datasets, the LeNet5 model with the CIFAR-10 dataset, and the ResNet18 model with Tiny ImageNet under 20 nodes.

The loss curve of the MNIST dataset trained on the CNN model is shown in Fig. 11. It can be seen from the figure that all four approaches converge. Among them, the loss value of ASTC is the lowest, finally settling at about 0.45, while that of SBC is the highest. The convergence rates of the four approaches are almost the same after 300 iterations.

Figure 12 shows the results of training the CNN model on the CIFAR-10 dataset. The loss of ASTC is the lowest, and that of SBC is still the highest. The loss curves of Top-1 and AdaComp are similar.

The model is replaced with LeNet5 for extended evaluation. As shown in Fig. 13, ASTC still has the lowest training loss without decreasing its convergence rate.

Figure 14 presents the training loss of the ResNet18 model with Tiny ImageNet. Since ResNet18 is significantly deeper than the traditional CNN model and LeNet5, it can better process image datasets, and the training loss curves of all compression methods differ only slightly. Specifically, ASTC again enjoys the smallest training loss, at about 2.02; AdaComp and Top-1 share a similar training loss, while SBC still has the largest.

In general, the convergence of the four schemes tends to be stable after 300 rounds of model iteration, and the difference in convergence speed is not obvious. However, their training loss is strongly affected by the gradient compression ratio. Because SBC pursues an extreme compression ratio, its training loss is larger than that of the other three approaches. The adaptive gradient compression mechanism of AdaComp weakens the loss increase caused by hybrid compression, so its convergence loss curve differs little from that of Top-1, which only uses gradient sparsification. The proposed ASTC does not decrease the convergence accuracy; on the contrary, its sparse quantization and lossless coding further reduce the training loss and improve the convergence accuracy of the model.

Fig. 11 Convergence analysis of CNN on MNIST under 20 nodes

Fig. 12 Convergence analysis of CNN on CIFAR-10 under 20 nodes

Fig. 13 Convergence analysis of LeNet5 on CIFAR-10 under 20 nodes

Fig. 14 Convergence analysis of ResNet18 on Tiny ImageNet under 20 nodes

6 Conclusions

This paper proposes ASTC, a new gradient compression scheme that accelerates distributed DNN training in edge computing while mitigating the accuracy loss caused by high gradient compression ratios. The key idea of ASTC is to adaptively select the model layers whose gradients need compression by establishing a selection criterion based on the number of gradients, thereby reasonably shortening the compression time. ASTC also determines the threshold automatically through the entropy-based threshold selection algorithm for sparsification. To prevent excessive sparsification from reducing the accuracy of the trained model, gradient residual and momentum correction are employed. Furthermore, by combining ternary gradient quantization with lossless coding, ASTC further lowers the communication time of each iteration. Evaluating on public datasets (MNIST, CIFAR-10, Tiny ImageNet) and deep learning models (CNN, LeNet5, ResNet18), we show that the training efficiency of ASTC is about 1.6, 1.37, and 1.1 times that of Top-1, AdaComp, and SBC, respectively, and that ASTC improves the training accuracy by about \(1.9\%\) on average.

It remains an interesting direction for future research to adapt both the kind of sparsity and the sparsity rate during the gradient sparsification phase, thereby achieving higher compression rates in edge distributed training, while paying close attention to maintaining and even improving the accuracy of the trained model as gradients are compressed. It is worth noting that, due to the limitations of our experimental environment, we were unable to experiment with more GPU nodes in edge distributed training. In the future, the effectiveness of the proposed algorithm can be examined in large-scale clusters that include multiple heterogeneous nodes with various computing and network transmission capabilities.