Heterogeneous gradient computing optimization for scalable deep neural networks

Nowadays, data processing applications based on neural networks must cope with the growth in the amount of data to be processed and with the increase in both the depth and complexity of neural network architectures, and hence in the number of parameters to be learned. High-performance computing platforms provide fast computing resources, including multi-core processors and graphical processing units, to manage the computational burden of deep neural network applications. A common optimization technique is to distribute the workload between the processes deployed on the resources of the platform. This approach is known as data-parallelism. Each process, known as a replica, trains its own copy of the model on a disjoint data partition. Nevertheless, the heterogeneity of the computational resources composing the platform requires the workload to be distributed unevenly between the replicas according to their computational capabilities in order to optimize the overall execution performance. Since the amount of data to be processed differs between replicas, the influence of the gradients computed by each replica on the global parameter update should differ as well. This work proposes a modification of the gradient computation method that considers the different speeds of the replicas, and hence the amount of data assigned to each of them. Experiments have been conducted on heterogeneous high-performance computing platforms for a wide range of models and datasets, showing an improvement in the final accuracy with respect to current techniques, with comparable performance.


Introduction
Deep neural network (DNN) architectures [15,29] have driven important advances in research fields such as image processing [12] and classification [31], natural language processing [24], medicine [19] and satellite observational studies [11], among others. DNNs are universal function approximators composed of neurons arranged in stacked layers. DNNs have the ability to learn relationships between input data as they propagate through the layers of the network to the expected output results. Each layer computes intermediate features using a set of internal trainable parameters, which are learned with gradient-based optimization techniques such as stochastic gradient descent.
Taking the example of a supervised image classification algorithm, the optimization procedure works as follows. The objective of supervised approaches is to learn a mapping function f(⋅, Θ) from the images in the input dataset to the corresponding labels y_i. The input dataset is composed of examples x_i ∈ ℝ^(h×w×c), where h is the height, w the width and c the number of channels of the images. By applying the mapping function f, each input image x_i is assigned a predicted label ŷ_i. The distance between the predicted (ŷ) and actual (y) labels is measured by a loss function L(Θ), which is then used to adjust the model parameters Θ. The common method to update the network parameters so as to reduce that distance is gradient descent, which computes the new values of the parameters using the gradient of the loss function as Θ′ = Θ − η g, where η is the learning rate, Θ′ and Θ are the current and previous parameter values, respectively, and g is the gradient vector of the loss function, computed as g = ∇_Θ L(Θ).
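The update rule above can be illustrated with a minimal sketch. It assumes a two-parameter linear model with a mean squared error loss as a stand-in for a DNN; the function names, the toy data and the learning rate value are illustrative choices, not taken from this work.

```python
def gradient_descent_step(theta, grad, lr):
    """Return the updated parameters theta' = theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def mse_loss(theta, xs, ys):
    """Mean squared error of the toy model y_hat = theta[0] + theta[1] * x."""
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(theta, xs, ys):
    """Gradient of mse_loss with respect to theta."""
    n = len(xs)
    residuals = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
    g0 = 2.0 * sum(residuals) / n
    g1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    return [g0, g1]


# Toy data generated by y = 1 + 2x (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
lr = 0.05  # learning rate, chosen by hand for this toy problem
loss_before = mse_loss(theta, xs, ys)
for _ in range(2000):
    theta = gradient_descent_step(theta, mse_gradient(theta, xs, ys), lr)
loss_after = mse_loss(theta, xs, ys)
```

In a DNN the gradient is obtained by back-propagation over a batch rather than in closed form, but the parameter update has the same shape.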
Current large-scale DNN architectures, trained on large datasets and with a huge number of parameters, have increased the complexity of model training [6]. Thus, improving the performance and the accuracy of the models has become a major challenge in this area. High-performance computing (HPC) platforms composed of resources with different computational capabilities (e.g., multi-core CPUs or GPUs) have addressed these challenges. Two main schemes are used to train DNN models on HPC platforms: model parallelism and data parallelism. The model parallelism scheme splits the model between the available computational resources, and each part is trained using the entire set of data examples. Conversely, data parallelism distributes disjoint partitions of the dataset between the processes running on the computational resources, known as replicas, each of which trains a copy of the model. This work focuses on the data parallelism scheme. Each data partition X_p assigned to a replica p ∈ P is composed of subsets of data examples called batches. A batch passes through the model layers to perform a training iteration. Then, replicas compute the gradient vector and communicate with each other to update the global parameter values Θ′. Parameter values are stored in centralized parameter servers, or distributed using collective communications [25]. An epoch is a complete pass over the batches in a data partition.
The heterogeneity of HPC platforms is an issue that should be considered in order to efficiently train models under the data-parallelism scheme. The resources that compose the platforms have different computational capabilities, which lead to uneven replica training times. Consequently, when collective communication is used at the communication stage, faster processes wait for slower ones, which severely degrades performance. Parameter servers together with the asynchronous SGD (ASGD) [9] technique partially solve the performance problem of process synchronization at the communication stages. Nevertheless, as the training progresses, faster replicas will compute their gradients using an obsolete version of the global parameters. The distance between the local parameters employed to compute gradients in a replica and their global values is called staleness. While tolerating staleness reduces the impact of communication on the overall training performance, it also affects the final accuracy of the model.
Our approach is to unevenly distribute the data between the replicas according to their relative computational speeds [22]. This mechanism ensures similar training times in each replica, and hence it diminishes the waiting times at communication stages and, furthermore, avoids staleness. Each replica p is assigned a data partition X_p proportional to its computational speed. We establish a global batch size B as a hyperparameter of the training. The size of the batch (number of examples) feeding the model in each replica is computed as b_p = B × s_p, and is hence proportional to the relative speed s_p of the replica. The total size of the partition assigned to each replica is X_p = X × s_p. Note that the number of iterations per epoch is the same in every replica, and equal to X_p∕b_p = X∕B. A decisive step is to determine the relative speeds s_p of the replicas in the platform. As an example, the FuPerMod tool [8] represents the performance profile of P processes as a vector S = {s_1, s_2, …, s_P}, with s_p the inverse of the time spent in the execution of a benchmark provided by the user. The benchmark should reflect the computation performed in the actual training code, and hence it is usually a subset of the entire model.
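The following sketch shows one way the quantities b_p and X_p could be derived from benchmark times. It assumes that the relative speeds are normalised to sum to one and that fractional sizes are resolved with a largest-remainder rule so that the local sizes add up exactly to B and X; both choices, as well as the function names and the example timings, are assumptions made for illustration and do not describe the FuPerMod API.

```python
import math


def relative_speeds(benchmark_times):
    """Speeds as inverse benchmark times, normalised so that they sum to 1."""
    raw = [1.0 / t for t in benchmark_times]
    total = sum(raw)
    return [r / total for r in raw]


def proportional_split(total, shares):
    """Split `total` items proportionally to `shares` (which sum to 1).

    Sizes are rounded down first; the leftover items go to the entries
    with the largest fractional parts, so the sizes sum to `total`.
    """
    exact = [total * s for s in shares]
    sizes = [math.floor(e) for e in exact]
    leftover = max(total - sum(sizes), 0)
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Hypothetical benchmark times (seconds) for three replicas.
times = [1.0, 2.0, 4.0]
speeds = relative_speeds(times)

global_batch = 224     # B
dataset_size = 50_000  # X

local_batches = proportional_split(global_batch, speeds)  # b_p = B * s_p
partitions = proportional_split(dataset_size, speeds)     # X_p = X * s_p
iterations = [x / b for x, b in zip(partitions, local_batches)]
```

With these inputs every replica needs roughly X∕B ≈ 223 iterations per epoch, which is the property that keeps the replicas in step.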
While the former approach has been shown to improve the performance of training models on heterogeneous HPC platforms, the fact is that replicas use different amounts of data to train their local models, and compute the gradients based on different feature information. Thus, replicas should contribute unevenly to the parameter update. A main contribution of this work is to introduce a methodology to improve the accuracy of the models based on weighted gradient computations, according to the speeds of the replicas involved in the training process. Assuming a dedicated heterogeneous HPC platform, the overall methodology is as follows. Data are distributed between replicas according to their relative speeds. These speeds are determined in a stage prior to training, and the amount of data assigned to each replica is computed from them. Then, replicas train their copies of the model on the assigned data, avoiding waiting times at communication stages. Collective communication operations are used to share the gradients. Parameter updating is achieved using weighted gradient computations, ensuring that each replica has an impact on the weights proportional to its speed.
As a summary, the proposed work implements the following solutions to the described problems.
• To speed up the computation phase, the distribution of the workload has been measured based on the speeds of each resource using popular methods available in the literature.
• To improve the accuracy of the model, a methodology is proposed to ensure that the training information obtained in each replica has an impact relative to the assigned data workload. In this way, replicas with a low amount of data are prevented from interfering negatively in the training and weight updating steps.
• In order to evaluate the proposed method, two heterogeneous platforms composed of multi-core CPUs and GPUs have been tested. We compare the method with a baseline unweighted gradient computation using several scientific datasets, namely CIFAR-10, CIFAR-100 and Mini-ImageNet. The proposed method obtains a higher accuracy than the baseline approach with comparable performance for a variety of simple and deeper networks.
The organization of the paper is as follows. Section 2 explores related approaches proposed in the literature. Section 3 describes the proposed weighted gradient methodology. Section 4 evaluates the method on different platforms and datasets, and analyzes the obtained results, showing the viability of the proposal. Finally, Sect. 5 concludes.

Related work
The convergence of high-performance computing and deep learning model training, together with the development of parallelization techniques, has been studied in multiple works [16,32]. In [18], both model and data parallelism techniques are proposed as the main parallelization schemes for deep neural networks on HPC platforms. A thorough study and evaluation of such parallelization techniques is performed in [2]. Model parallelism enhancements have been proposed for particular platforms in the literature [21]. Meanwhile, data parallelism has gained interest in different scientific research fields due to its performance improvements and ease of use. As examples, the work [11] enumerates the possibilities of distributed data parallelism schemes for HSI image classification under different synchronous and asynchronous approaches, and the work [23] evaluates the performance of the data parallel approach in pattern recognition applications. HSI classification methods have been greatly enhanced by significant improvements in image processing and analysis techniques. Nevertheless, these techniques should be efficiently parallelized in order to optimize their performance. For instance, Danfeng et al. [13] proposed a multimodal deep neural architecture for feature representation learning using fully connected (FC) and convolutional (CNN) networks to extract relevant pixel-wise and spatial information, respectively. Meanwhile, the work proposed by Danfeng et al. [14] extracts detailed spectral representations using a sequential network to learn group-wise spectral information of the different HSI materials.
Particular aspects of the data parallelism scheme have been addressed in several contributions. The work [26] evaluates the impact of different hyperparameter configurations on distributed training. The work [17] proposes the Stale Synchronous Parallel method to address the problem of staleness on homogeneous data distributions using stochastic gradient descent (SGD) as optimizer. The staleness issue has also been addressed in other works. The work [20] proposes an asynchronous SGD (ASGD) approach to reduce the variance of the gradient values during the optimization process, while [10] proposes to modify the learning rate depending on the current staleness value for ASGD. In the former work, the feasibility of modifying different hyperparameters of the model and its impact on the training are discussed. The authors of [7] propose to avoid an excessive impact of staleness on the model accuracy by favoring the gradients from the fastest processes and discarding those from slower processes. Nevertheless, this approach causes a significant loss of information from slower processes.
Regarding workload distribution in the data-parallelism scheme, a dynamic workload distribution scheme is proposed in [5] to adapt the batch size assigned to each replica in every iteration. A recurrent neural network (RNN) is used to measure the speed of each replica. On the other hand, static load balancing approaches are used to avoid the cost of the speed calculation in every training iteration. In this sense, the work [22] performs a single workload distribution computation in a stage prior to training.
Although heterogeneous data partitioning improves training performance on heterogeneous platforms, its impact on the accuracy of the final model has not been addressed extensively in the literature. The work [30] proposes a method called TernGrad, based on scaling down the gradient norms in order to improve convergence and speed up training. The convergence and performance of the method have been evaluated using SGD under a parameter localization scheme, in which each worker holds its own copy of the parameters locally. Additionally, it proposes the use of ternary values to reduce communication. Gradients (32 bits) are quantized before being communicated into a ternary vector with values in {−1, 0, 1}, which reduces the communication volume by a factor of 32∕log_2(3) ≈ 20.2.
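As a rough illustration of ternary quantization, the sketch below applies a stochastic ternarization in which each gradient component keeps its sign with probability proportional to its magnitude, and a single scale factor (the largest magnitude) is sent alongside the ternary vector. This is a simplified reading of the idea, written in plain Python under that assumption; the function name and the example gradient are hypothetical and the code is not the reference TernGrad implementation.

```python
import math
import random


def ternarize(grad, rng):
    """Quantize a gradient into (scale, ternary vector with entries in {-1, 0, 1}).

    Each component keeps its sign with probability |g| / scale, where
    scale is the largest magnitude, so the expectation of
    scale * ternary equals the original gradient.
    """
    scale = max(abs(g) for g in grad)
    if scale == 0.0:
        return 0.0, [0] * len(grad)
    ternary = []
    for g in grad:
        keep = rng.random() < abs(g) / scale
        sign = (g > 0) - (g < 0)
        ternary.append(sign if keep else 0)
    return scale, ternary


# Ideal volume reduction: 32 bits per value versus log2(3) bits per ternary value.
compression_factor = 32 / math.log2(3)

# Hypothetical gradient; a fixed seed keeps the example reproducible.
scale, ternary = ternarize([0.5, -2.0, 0.0, 1.0], random.Random(0))
dequantized = [scale * t for t in ternary]
```

The component with the largest magnitude is always kept and exact zeros always stay zero; the remaining components are kept or dropped at random.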
The reduction in communications is studied in [28] using different gradient quantization techniques in the loss calculation step for the MNIST and CIFAR-10 datasets. The work [30] improves on the former quantization in each computation element for the more complex ImageNet dataset. The relationship between the quantization of the gradients and the accuracy obtained by the model is evaluated in [1].
In brief, most of the works refer to the management of the gradient focusing on performance. Since the performance problem has been solved using computational balancing techniques, weighting methods have gained importance to improve accuracy, by computing model parameters based on the directions of the replica gradients. The importance of weighting methods for DNNs has been studied in depth in the literature, demonstrating how these methods influence convergence and network predictions [3]. Additionally, weighted methods are also used to assign a different weight to each dataset class [4].
In this sense, our proposal addresses data-parallel model training on heterogeneous data partitions. Gradient contributions from each replica to the global parameter computations are weighted using the speed of each replica. Therefore, our objective in terms of performance, and especially final accuracy, is to outperform the current baseline method based on unweighted gradient computation.

HetGrad optimization methodology
This section details the methodology followed to address the challenges described above. The implemented method proposes a static mapping technique to distribute the data among the available resources prior to the execution of the deep neural model, seeking a static workload balance and introducing a new loss balancing to compensate for the heterogeneous partitioning previously performed. The introduced balanced (or weighted) gradient calculation considerably improves the convergence of the neural model during the training phase, which leads to more accurate parameter tuning and, in turn, to better accuracy.
Following previous research [22], the strategy for the performance optimization of deep networks is based primarily on data parallelization approaches, where a replica of the full model is loaded and run by each computational resource using non-overlapping data subsets. In clusters with homogeneous resources, data partitioning is straightforward, giving the same partition size to each resource, whereas in heterogeneous clusters a fair partitioning must be performed. Indeed, with equal partitions, differences in the computational capacities of the resources on current heterogeneous platforms can induce long waiting times when synchronously combining gradients, leading to a degradation of overall performance. In order to overcome this drawback, the data partitions have to balance the load between the different resources according to their computational capacities (particularly in proportion to the speed of the computing device), minimizing the computational time.
To divide the data into fair partitions that balance the workload of the heterogeneous cluster, a prior analysis of the computational features of the cluster should be conducted in order to determine the amount of data (i.e., the percentage of data) that each resource is capable of handling. Several techniques can be used, such as the widely used FuPerMod framework [8]. This general-purpose data partitioning framework performs accurate and efficient benchmarking to obtain the relative speed of the resources that constitute the cluster, providing the load measurements for each element that optimize execution time. Other criteria could also be taken into account to determine these measurements [5]. In this particular case, the resource speed is used to define the heterogeneity of the platform, as explained in Sect. 4.3 and implemented in [22]. As a result, a vector of measurements C = {c_1, c_2, …, c_D} is obtained, which is normalized with respect to the slowest device (the device with the most limited computational capability), providing a metric appropriate to the target problem. The normalized vector thus expresses each device in units of capability. Based on these measurements, the data partitioning is conducted, splitting the N training samples into non-overlapping partitions of different sizes P_1, P_2, …, P_D. To accomplish this, N is divided by the sum of the units of capability, obtaining the number of samples per unit of capability, Ñ = N∕∑_{i=1}^{D} c_i.
Then, the i-th partition is obtained as P_i = Ñ ⋅ c_i. If P_i is not an integer, it is rounded down, which may appear to disregard the remaining samples. Nevertheless, the data are shuffled and the partitions are refilled with each gradient update, so that all samples are eventually processed.
At this point, it is important to decide on the synchronization strategy. Deep neural networks optimize their parameters through the back-propagation of the gradient signal within an iterative process guided by the optimization algorithm. In particular, they conduct several training epochs, and in each epoch the optimization algorithm passes all the training data through the network. To avoid complications in model convergence, the data are passed iteratively, grouping the training data into non-overlapping batches of identical size B that are processed separately by the network. As a result, to complete an epoch, all batches must have been processed by the neural model, so each epoch is defined by a series of iterations determined by M = N∕B, where N indicates the number of training samples, B the batch size and M the number of iterations (batches). With several distributed replicas, the gradients obtained by each replica must be combined to compose a global gradient that can be transmitted to the replicas to update their parameters. Although asynchronous communication of gradients has been proposed, it is hampered by the so-called staleness problem, which also complicates the control mechanisms needed to obtain a suitable global gradient. For this reason, the proposed method considers synchronized communication between the working nodes. This introduces a slight constraint when organizing the data in each partition: the number of batches M required to complete an epoch must be the same on all devices to ensure synchronization. Thus, for each partition P_i, a batch size B_i = P_i∕M is set, ∀i ∈ [1, D]. Furthermore, the set of all batches in each partition must cover the total training data. This is done by introducing a global batch size G, which plays for the batches the role that the total number of training samples N plays for the partitions.
While the size of each partition is set as P_i = Ñ ⋅ c_i, batch sizes are obtained as B_i = G̃ ⋅ c_i, where G̃ = G∕∑_{i=1}^{D} c_i is the batch size per unit of capability. Once again, if B_i is not an integer, it is rounded to obtain the desired M iterations.
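The partition and batch size computation described above can be sketched as follows. The sketch assumes that the raw measurements are speeds (larger means faster), that rounding is done downwards for both P_i and B_i, and that M is taken as the integer part of N∕G; the function names and the example figures are hypothetical.

```python
import math


def normalise_by_slowest(measurements):
    """Express each device in units of capability relative to the slowest one."""
    slowest = min(measurements)
    return [m / slowest for m in measurements]


def plan_partitions(n_samples, global_batch, capabilities):
    """Return (partition sizes P_i, batch sizes B_i, iterations per epoch M)."""
    total_units = sum(capabilities)
    n_per_unit = n_samples / total_units     # samples per unit of capability
    g_per_unit = global_batch / total_units  # batch size per unit of capability
    partitions = [math.floor(n_per_unit * c) for c in capabilities]
    batches = [math.floor(g_per_unit * c) for c in capabilities]
    iterations = n_samples // global_batch
    return partitions, batches, iterations


# Hypothetical speed measurements for three devices (arbitrary units).
capabilities = normalise_by_slowest([2.0, 4.0, 10.0])  # -> [1.0, 2.0, 5.0]

partitions, batches, iterations = plan_partitions(
    n_samples=50_000, global_batch=256, capabilities=capabilities
)
```

In this example each device completes its partition in the same number of whole iterations (P_i // B_i is identical for all devices), which is the constraint required by synchronous gradient exchange.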
Once the parameters {P_1, P_2, …, P_D}, M and {B_1, B_2, …, B_D} have been determined, replicas are trained for M iterations, using completely different training samples. This prevents the same data from being involved several times in the same parameter update. Furthermore, data shuffling is implemented, ensuring that all replicas are eventually trained with the entire dataset. Focusing on the training stage, each replica calculates its gradient as a function of the loss (i.e., classification error) obtained by the model. In this regard, at the k-th iteration, the i-th replica obtains the loss of its k-th batch according to its current parameters θ_i^k. As a result, each replica obtains its gradient vector ∇L^k(θ)_i and communicates it through Open MPI Allreduce collectives so that the gradients are combined as shown in Eq. (1).
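Under the assumption that the equal-weight combination is a plain arithmetic mean over the D replicas (the usual result of a sum Allreduce followed by a division by the number of processes), the combined gradient can be written as:

```latex
\nabla L^{k}(\theta)_{\mathrm{global}}
  = \frac{1}{D} \sum_{i=1}^{D} \nabla L^{k}(\theta)_{i}
```

Here every replica enters the sum with the same factor 1/D, regardless of the size of its partition.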
Once the gradient combination is done, ∇L^k(θ)_global is sent to the replicas, which accordingly update their parameters using the defined learning rate. In this context, Eq. (1) gives the contributions of all replicas the same weight in the final calculation of the global gradient, as shown in Fig. 1a.
As a result, the global gradient has been obtained disregarding the amount of data involved in the calculation of the local gradients. Nevertheless, this is not fair as each replica employs a different partition size. Therefore, replicas with less data contribute the same as replicas with more data, potentially hampering the gradient calculation.
To properly avoid this problem, the proposed method introduces the weighted gradient calculation, which aligns the contribution of each replica with the partition size it receives. This adjustment is performed during the calculation of the global gradient, as Eq. (2) indicates. Therefore, the gradients of each partition are scaled according to the computational capability of the device, as shown in Fig. 1b. This gives more weight to those gradients calculated with a larger number of samples, which are more robust and accurate than those calculated on smaller partitions. Consequently, the global gradient takes a more appropriate direction, avoiding local minima and stagnation introduced by small partitions, resulting in a better parameter fit and an improved classification result. As a consequence, the floating-point operations are defined in terms of Dim, k and G, where Dim defines the (height × width × filters) data at the input of each layer l, k is the kernel size of the convolutions and G is the size of the gradient vector ∇L^k(θ)_i, which represents one multiplication for each gradient value.

Fig. 1 Gradient hyper-plane. The red arrow represents the direction of the total gradient ∇_T obtained after sharing the replica gradients ∇_p. The weighted gradient of each process p and the angle of each gradient direction are also shown
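The difference between the equal-weight and the weighted combination can be sketched in a few lines. The sketch assumes that each replica's weight is its capability divided by the sum of all capabilities, so that the weights add up to one; this normalisation, the function names and the example gradients are assumptions for illustration. The combination is simulated in a single process; in a distributed run each replica could scale its local gradient by its own weight before a sum Allreduce, which yields the same result.

```python
def capability_weights(capabilities):
    """Weights proportional to capability, normalised to sum to 1 (assumed form)."""
    total = sum(capabilities)
    return [c / total for c in capabilities]


def combine_gradients(local_grads, weights=None):
    """Combine per-replica gradient vectors into one global gradient.

    With weights=None every replica gets the same weight 1/D (plain mean);
    otherwise replica i contributes weights[i] * local_grads[i].
    """
    d = len(local_grads)
    if weights is None:
        weights = [1.0 / d] * d
    dim = len(local_grads[0])
    return [sum(w * g[j] for w, g in zip(weights, local_grads)) for j in range(dim)]


# Two hypothetical replicas: the second one is three times faster and
# therefore processes three times more samples per iteration.
capabilities = [1.0, 3.0]
local_grads = [[1.0, -2.0], [3.0, 2.0]]

unweighted = combine_gradients(local_grads)
weighted = combine_gradients(local_grads, capability_weights(capabilities))
```

In this example the plain mean is [2.0, 0.0], whereas the weighted combination is [2.5, 1.0], closer to the gradient of the replica that processed more data. With equal capabilities both calls return the same vector.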

Experimental results

Benchmark datasets
The experiments have been conducted on three different datasets, i.e., (1) CIFAR-10, (2) CIFAR-100, and (3) Mini-ImageNet. The first is composed of 60,000 RGB images of size 32 × 32 × 3, belonging to 10 different classes. The second is similar to CIFAR-10, but the number of classes rises to 100, and hence the complexity is higher. The last one is composed of 60,000 RGB images of size 84 × 84 × 3, resized to 256 × 256 × 3 and center cropped to 224 × 224 × 3 before the training step. Like CIFAR-100, Mini-ImageNet is composed of 100 different classes.

Platform description
The platform used for the experiments is the modular supercomputer architecture (MSA) developed by the European project Dynamical Exascale Entry Platform-Extreme Scale Technologies (DEEP-EST) [27]. The MSA is an innovative HPC architecture that can integrate an arbitrary number of modules with heterogeneous hardware components. In this work, two modules have been used. The first one is the Data Analytics Module (DP-DAM), which is composed of 16 nodes connected through an EXTOLL network of 100 Gb/s. Each DP-DAM node is composed of two Intel Xeon "Cascade Lake" Platinum 8260M CPUs running at 2.40 GHz with 24 physical cores per CPU and an NVIDIA Tesla V100 GPU with 32 GB of HBM2. The second module is the Extreme Scale Booster (DP-ESB), which is composed of 74 nodes. Each DP-ESB node is composed of an Intel Xeon "Cascade Lake" Silver 4215 CPU running at 2.50 GHz with 8 physical cores and an NVIDIA Tesla V100 GPU with 32 GB of HBM2. Nodes are connected through a 100 Gb/s EDR InfiniBand network.
Conducted experiments use four nodes from the DP-DAM module for Experiments 1 and 3. Meanwhile, for Experiment 2, eight nodes have been used from the DP-ESB module.

Experimental settings
A data-parallel scheme is assumed and a heterogeneous workload partitioning is performed. To perform the data partitioning step, each replica is characterized by a computing speed based on its computational capabilities. As a consequence, speeds are different for each replica, and the speed is used to assign a portion of the input data to each replica. This means that the speedups are determined by the differences between resource speeds. In the experimentation, five repetitions of three experiments have been conducted.

Experiment 1: Initial performance insights
The first experiment compares the accuracy obtained by the baseline unweighted gradient computation with that of the HetGrad proposal, using the DP-DAM module with CIFAR-10 and CIFAR-100. Experiments are conducted for the VGG16, ResNet18, ResNet50 and DenseNet121 models. The objective of this experiment is to obtain initial insights into the accuracy of HetGrad. Furthermore, the speedup obtained by the heterogeneous partitioning of the workload according to the speeds of the processes has been evaluated and compared with a homogeneous partitioning.
The results of the first experiment are detailed in Table 1 and depicted in Figs. 2 and 3 for CIFAR-10 and CIFAR-100, respectively. As can be seen, positive results have been obtained for both datasets. For CIFAR-10, the accuracy improvement is similar for all models, with DenseNet121 presenting the highest accuracy gain, from 92.25%±0.24 to 93.46%±0.16. For CIFAR-100, accuracy improvements are more significant since the complexity of the dataset is higher. The model with the highest accuracy gain is VGG16, with an accuracy improvement from 56.24%±0.16 to 67.02%±0.31. Meanwhile, DenseNet121 is the … HetGrad are provided by Table 2, where the heterogeneous workload balance provides a notable improvement in terms of performance for the training step of DNNs.

Experiment 2: Deep models and scalability
In this experiment, the baseline and HetGrad methods are compared using the DP-ESB module on the Mini-ImageNet dataset. The evaluated models are ResNet18, ResNet50 and ResNet101. Since this experiment increases the number of nodes to eight with respect to the former experiment, it includes a scaling factor to support deeper ResNet models. Table 3 details the obtained results. As shown, the accuracy improves for all evaluated ResNet models. Focusing on deep complex models such as ResNet101, the accuracy increases from 60.94±4.01 to 69.37±0.13. Note that, for a complex dataset such as Mini-ImageNet, deeper models do not improve the accuracy of the baseline, which stagnates at around 60%. Meanwhile, the proposed HetGrad obtains notable improvements in terms of scaling for complex datasets and models. Such complex datasets and models require more processes to improve the performance of the training, which introduces heterogeneity in the computational resources that are used. This fact highly benefits our proposal, since heterogeneity is a main factor impacting the parameter updating.

Experiment 3: Optimizers performance analysis
Finally, different optimizers have been evaluated to complete the evaluation of the proposal using CIFAR-100. In particular, these optimizers are SGDM, RMSprop, Adam, AdamW and AdaBelief. Two models have been selected to conduct this experiment, i.e., VGG16 and ResNet18.
The results in Fig. 4 show that every optimizer works better using the proposed HetGrad. Focusing on the ResNet18 model, the best accuracy is obtained using AdamW, with 68.52%±1.66 for the baseline and 71.69%±0.40 for the HetGrad proposal. For VGG16, AdaBelief obtains the best accuracy, with 68.01%±0.22 for the HetGrad proposal and 62.59%±0.24 for the baseline. Regarding optimizers, the overall conclusion is that optimizers with regularization methods, such as decoupled weight decay in AdamW, work better for complex models. In addition, the HetGrad proposal obtains significantly better accuracy results than the baseline. These differences narrow as the model converges, but a significant improvement in accuracy is maintained.

Table 3 Accuracy (in %) for Mini-ImageNet using ResNet models (ResNet18, ResNet50 and ResNet101). Bold text indicates the best result obtained between both methods

Conclusions
In this paper, a weighted gradient calculation for data-parallel DNNs on scalable heterogeneous HPC platforms is presented. The proposed approach significantly reduces the impact that replicas with a low amount of data have on the global gradient calculation. This is achieved by using the speed of each replica in the global gradient calculation step. Thus, in these calculations, each replica contributes proportionally to its assigned data and speed. The experiments conducted on benchmark datasets reveal that the proposed HetGrad approach achieves better accuracy in early epochs and provides a better final accuracy than the original gradient calculation, while accelerating training. Furthermore, complex datasets and models have been shown to obtain major accuracy improvements with the proposed method under a scaling factor. Additionally, the data-parallelism scheme is used with the amount of data distributed to the replicas obtained through heterogeneous workload partitioning, providing a significant time reduction. As future work, the implementation of more sophisticated optimization methods based on the heterogeneity of the resources for Federated Learning (FL), or the adaptation of the proposed method to HSI classification, could obtain beneficial results.