Artificial neural networks training acceleration through network science strategies

The development of deep learning has led to a dramatic increase in the number of applications of artificial intelligence. However, the training of deeper neural networks for stable and accurate models translates into artificial neural networks (ANNs) that become unmanageable as the number of features increases. This work extends our earlier study where we explored the acceleration effects obtained by enforcing, in turn, scale freeness, small worldness, and sparsity during the ANN training process. The efficiency of that approach was confirmed by recent studies (conducted independently) where a million-node ANN was trained on non-specialized laptops. Encouraged by those results, our study is now focused on some tunable parameters, to pursue a further acceleration effect. We show that, although optimal parameter tuning is unfeasible, due to the high non-linearity of ANN problems, we can actually come up with a set of useful guidelines that lead to speed-ups in practical cases. We find that significant reductions in execution time can generally be achieved by setting the revised fraction parameter (ζ) to relatively low values.

training set and to keep the number of connections constant. We should note that the revision step is not only rewiring the links but also re-computing the actual weight of the new links.
The efficiency of this approach has also been recently confirmed by independent researchers, who managed to train a million-node ANN on non-specialized laptops (Liu et al. 2019).
Encouraged by those results, our research has now moved into looking at algorithm tuning parameters to pursue a further acceleration effect, at a negligible accuracy loss. The focus is on the revision stage (determined by the ζ parameter) and on its impact on the training time over epochs. Noteworthy results have been achieved by conducting an in-depth investigation into the optimal tuning of ζ and by providing general guidelines on how to achieve better trade-offs between time and accuracy, as described in Sect. 5.2.
The rest of the paper is organized as follows: Sect. 2 provides the background theories employed in this work. To better position our contribution, Sect. 3 captures the state of the art. Next, Sect. 4 addresses the methodology followed and Sect. 5 shows the results obtained. Finally, Sect. 6 draws the conclusions.

Background
This section briefly introduces the main concepts required for understanding this work.
Note that, for the sake of simplicity, the words 'weight' and 'link' are used interchangeably, and only weighted links have been considered. The goal is to demonstrate the effectiveness of the SET approach, aiming at lower revised fraction values, in the context of the multilayer perceptron (MLP) supervised model. An MLP is a feed-forward ANN composed of several hidden layers, forming a deep network, as shown in Fig. 1. Because links flow between consecutive layers, an MLP can be seen as a fully connected directed graph between the input and output layers.
Supervised learning involves observing several samples of a given dataset, which are divided into 'training' and 'test' samples. While the former are used to train the neural network, the latter work as a litmus test, as they are compared with the ANN predictions. One can find further details on deep learning in LeCun et al. (2015) and Goodfellow et al. (2016).
The construction of a fully connected graph inevitably leads to higher computational costs as the network grows. To overcome this issue, the SET framework (Mocanu et al. 2018) drew inspiration from human brain models and modelled an ANN topology as a weighted sparse Erdős-Rényi graph in which edges are randomly placed between nodes, according to a fixed probability (Erdős and Rényi 1959; Barabási and Pósfai 2016; Latora et al. 2017).
As in Mocanu et al. (2018), the edge probability is defined as follows: $$p(W^{k}_{ij}) = \frac{\varepsilon \,(n_{k} + n_{k-1})}{n_{k}\, n_{k-1}}$$ where $W^{k} \in \mathbb{R}^{n_{k-1} \times n_{k}}$ is a sparse weight matrix between the k-th layer and the previous one, $\varepsilon \in \mathbb{R}^{+}$ is the sparsity parameter, and i, j are a pair of neurons; moreover, $n_{k}$ is the number of neurons in the k-th layer. As outlined in the previous section, this process forces network sparsity. This stratagem is balanced by introducing the tunable revise fraction parameter ζ, which defines the fraction of weights that needs to be rewired (with a new weight assignment) during the training process.
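The sparse initialization described above can be illustrated with a minimal sketch. It assumes the SET edge probability ε(n_k + n_{k−1})/(n_k n_{k−1}) from Mocanu et al. (2018); the function names, the dictionary-based storage, the Gaussian initial weights, and the layer sizes are illustrative assumptions, not the authors' implementation.

```python
import random

def edge_probability(n_prev, n_curr, epsilon):
    """Erdős-Rényi edge probability between two layers, capped at 1."""
    return min(1.0, epsilon * (n_prev + n_curr) / (n_prev * n_curr))

def init_sparse_layer(n_prev, n_curr, epsilon, rng):
    """Sample a sparse layer; only the links that exist are stored."""
    p = edge_probability(n_prev, n_curr, epsilon)
    weights = {}
    for i in range(n_prev):
        for j in range(n_curr):
            if rng.random() < p:
                # Assumed initial weight distribution (not stated in the text).
                weights[(i, j)] = rng.gauss(0.0, 0.1)
    return weights

rng = random.Random(0)
layer = init_sparse_layer(200, 100, epsilon=10, rng=rng)  # hypothetical sizes
density = len(layer) / (200 * 100)  # fraction of possible links actually present
```

With these hypothetical sizes the edge probability is 0.15, so roughly 15% of the 20,000 possible links are created.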
Indeed, at the end of each epoch, there is a weight adjustment phase. It consists of removing the closest-to-zero links between layers, over a revising range whose width is set by ζ. This parameter verifies the correctness of the forced-to-be-zero weights. Subsequently, the framework adds new weights randomly to compensate exactly for the removed ones. Thanks to this procedure, the number of links between layers remains constant across different epochs, without isolated neurons (Mocanu et al. 2018).
Herein, we analyse the role of ζ and show how to find a good range of ζ values. Our aim is to strike a good balance between learning speed and accuracy.
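The weight adjustment phase can be sketched as follows. This is a simplified illustration under stated assumptions: links are ranked by absolute value, new links are drawn uniformly at random with Gaussian weights, and the check against isolated neurons is omitted. The function name `revise_links` and the dictionary representation are hypothetical.

```python
import random

def revise_links(weights, zeta, n_prev, n_curr, rng):
    """Remove the zeta fraction of links closest to zero, then add as many new random links."""
    n_remove = int(zeta * len(weights))
    by_magnitude = sorted(weights, key=lambda link: abs(weights[link]))
    for link in by_magnitude[:n_remove]:
        del weights[link]
    added = 0
    while added < n_remove:
        link = (rng.randrange(n_prev), rng.randrange(n_curr))
        if link not in weights:
            # Assumed distribution for the newly assigned weight.
            weights[link] = rng.gauss(0.0, 0.1)
            added += 1
    return weights

rng = random.Random(1)
links = {(i, j): rng.gauss(0.0, 1.0)
         for i in range(10) for j in range(10) if rng.random() < 0.3}
n_before = len(links)
revise_links(links, zeta=0.3, n_prev=10, n_curr=10, rng=rng)
```

The number of links is the same before and after the call, and ζ = 0 leaves the topology untouched, which is the limiting case studied in this work.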

Related literature
In recent years, ANNs have been widely applied in a broad range of domains such as image classification (He et al. 2016), machine translation (Vaswani et al. 2017), and text to speech (Kalchbrenner et al. 2018).
Previous work shows that the accuracy of an ANN (also known as model quality) crucially depends on both the model size (defined as the number of layers and neurons per layer) and the amount of training data (Hestness et al. 2017). For these reasons, the amount of resources required to train large ANNs is often prohibitive for real-life applications.
An approach promising to achieve high accuracy even with modest hardware resources is sparsity (Gale et al. 2019). An ANN is referred to as sparse when only a subset (hopefully of small size) of the model parameters has a value different from zero. The advantages of sparse networks are obvious. On the one hand, sparse data structures can be used to store matrices associated with the representation of an ANN. On the other hand, most of the matrix multiplications (which constitute the most time-expensive stage of neural network computation) can be avoided. Furthermore, previous works (Ullrich et al. 2017; Mocanu et al. 2018) suggested that high levels of sparsity do not severely affect the accuracy of an ANN.
Fig. 1 Example of a generic multilayer perceptron network with more than two hidden layers. Circles represent neurons, and arrows describe the links between layers
This section provides a brief overview of methods used to induce sparse ANNs, by classifying existing methods in two main categories, namely: 1. Methods derived from network science to induce sparse ANNs, 2. Methods derived from ANN regularization to induce sparse ANNs.

Methods derived from network science to induce sparse ANNs
Some previous papers focus on the interplay between network science and artificial networks (Stier and Granitzer 2019; Mocanu et al. 2018; Bourely et al. 2017). More specifically, they draw inspiration from biological phenomena such as the organization of the human brain (Latora et al. 2017; Barabási and Pósfai 2016). Early studies in network science, in fact, pointed out that real graphs (e.g. social networks describing social ties among members of a community) display important features such as power-law distribution in node degree (Barabási and Pósfai 2016) and the small-world property (Watts and Strogatz 1998). Many authors agree that these properties are likely to exist in many large networked systems one can observe in nature. For instance, in the case of biological and neuronal networks, Hilgetag and Goulas (2016) suggested that the neuronal network describing the human brain can be depicted as a globally sparse network with a modular structure.
As a consequence, approaches based on network science consider ANNs as sparse networks whose topological features resemble those of many biological systems, and they take advantage of their sparseness to speed up the training stage.
A special mention goes to recent research in Liu et al. (2019), where the authors managed to train a million-node ANN on non-specialized laptops, based on the SET framework that was initially introduced in Mocanu et al. (2018). SET is a training procedure in which connections are pruned on the basis of their magnitude, while other connections are randomly added. The SET algorithm is actually capable of generating ANNs that have sparsely connected layers and, yet, achieve excellent predictive accuracy on real datasets.
Inspired by studies on rewiring in the human brain, Bellec et al. (2018) formulated the DEEPR algorithm for training ANNs under connectivity constraints. This algorithm automatically rewires an ANN during the training stage and, to perform such a task, it combines a stochastic gradient descent algorithm with a random walk in the space of parameters to learn. Bourely et al. (2017) studied to what extent the accuracy of an ANN depends on the density of connections between two consecutive layers. In their approach, they proposed sparse neural network architectures, which derive from random or structured bipartite graphs. Experimental results show that, with a properly chosen topology, sparse neural networks can match or exceed the accuracy of a fully connected ANN with the same number of nodes and layers, with the clear advantage of handling a much smaller parameter space.
Stier and Granitzer (2019) illustrated a procedure to generate ANNs, which derive from artificial graphs. The proposed approach generates a random directed and acyclic graph G according to the Watts-Strogatz (1998) or the Barabási-Albert (2016) models. Nodes in G are then mapped onto layers in an ANN, and some classifiers (such as support vector machines and random forest) are trained to decide if a Watts-Strogatz topology yields a better accuracy than a Barabási-Albert one (or vice versa).

Methods derived from ANN regularization to induce sparse ANNs
Methods such as L1 or L0 regularization, which gained popularity in supervised learning, have been extensively applied to generate compact yet accurate ANNs. For instance, Srinivas et al. (2017) introduced additional gate variables to efficiently perform model selection. Furthermore, Louizos et al. (2017) described an L0-norm regularization method, which forces connection weights to become zero. Zero-weight connections are thus pruned, which is equivalent to inducing sparse networks.
The methods above are successful in producing sparse but accurate ANNs; however, they lack explainability. Thus, it is hard to understand why certain architectures are more competitive than others.
It is also interesting to point out that regularization techniques can be viewed as procedures compressing an ANN by deleting unnecessary connections (or, equivalently, by selecting only a few parameters). According to Frankle and Carbin (2018), techniques to prune an ANN are effective in uncovering sub-networks within an ANN whose initialization made the training process more effective. On these premises, Frankle and Carbin suggested what they called the lottery ticket hypothesis: dense and randomly initialized ANNs contain sub-networks (called winning tickets) that, when trained in isolation, are able to reach the same (or a comparable) test accuracy as the original network, and within a similar number of iterations.

Method
Herein we illustrate our research questions and strategy.
To speed up the training process, we investigate the effects of varying ζ during the weight evolution phase that takes place at each epoch. The analysis involves a gradual reduction of ζ, with the goal of providing a guideline on how to find the best range of ζ values to trade off speed-up against accuracy loss in different application domains.
In Mocanu et al. (2018), the default revise fraction was set to ζ = 0.3 (i.e. a revise fraction of 30%) and no further investigations on the sensitivity to ζ were carried out. Unlike in Mocanu et al. (2018), an in-depth analysis of the revise fraction is conducted herein to understand these effects, particularly how the revise step affects the training when ζ is substantially reduced. In this paper, the notations ζ ∈ [0, 1] and ζ ∈ [0%, 100%] are used interchangeably.
For smaller values of ζ, a shorter execution time and a certain percentage of accuracy loss are to be expected. Nonetheless, this relationship is bound to be nonlinear; thus, it is crucial to obtain quantitative results.

Dataset and ANN descriptions
The experiments were conducted using well-known datasets, publicly available online 1:
- Lung Cancer is a biological dataset composed of features on lung cancer, used to train the ANN to detect it.
- CLL_SUB_111 is composed of B-cell chronic lymphocytic leukaemia data. This dataset was created to profile the five most frequent genomic aberrations (i.e., deletions affecting chromosome bands 13q14, 11q22-q23, 17p13 and 6q21, and gains of genomic material affecting chromosome band 12q13) (Haslinger et al. 2004).
- COIL20 is an image dataset used to train ANNs to detect 20 different objects. The images of each object were taken five degrees apart as the object is rotated on a turntable, and each object has 72 images. The size of each image is 32 × 32 pixels, with 256 grey levels per pixel. Thus, each one is represented by a 1024-dimensional vector (Cai et al. 2011, PAMI), (Cai et al. 2011, VLDB).
Both Lung Cancer and CLL_SUB_111 are biological datasets, widely used for their importance in medicine, whereas the COIL20 dataset is a popular image dataset. Further quantitative details are provided in Table 1.
The ANN used is composed of three hidden layers with 3,000 neurons per layer. The activation functions used by default are ReLU for the hidden layers and sigmoid for the output (Table 2).
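To give a sense of scale, the following sketch counts the links of this architecture for COIL20 (1024 input features and 20 classes, as stated above), comparing a fully connected MLP with the expected number of links under an Erdős-Rényi initialization. The value ε = 20 is a hypothetical choice made only for illustration, and the edge probability ε(n_k + n_{k−1})/(n_k n_{k−1}) is assumed from Mocanu et al. (2018).

```python
def dense_links(layer_sizes):
    """Number of links in a fully connected MLP."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def expected_sparse_links(layer_sizes, epsilon):
    """Expected number of links when each link exists with probability
    min(1, epsilon * (a + b) / (a * b)) between layers of sizes a and b."""
    return sum(min(a * b, epsilon * (a + b))
               for a, b in zip(layer_sizes, layer_sizes[1:]))

coil20_mlp = [1024, 3000, 3000, 3000, 20]  # input, three hidden layers, output
dense = dense_links(coil20_mlp)
sparse = expected_sparse_links(coil20_mlp, epsilon=20)  # epsilon is hypothetical
ratio = sparse / dense
```

Under these assumptions, the sparse topology keeps fewer than 2% of the links of the dense network, which is where the savings in memory and multiplications come from.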

Comparison with our previous work
In Mocanu et al. (2018), the goal was to implement the SET algorithm and test it with numerous datasets, on several ANN types (MLPs, CNN, RBMs), and on different types of tasks (supervised and unsupervised learning). The current study investigates the role of the revise fraction parameter ζ, rather than the algorithm itself. The aim is to provide a general guideline on finding the best range of ζ values to reduce execution time, at a negligible loss of accuracy. In Cavallaro et al. (2020), a preliminary study on the role of ζ suggested a negligible accuracy loss, lower fluctuations, and a valuable gain in overall execution time with ζ < 0.02 with the Lung Cancer dataset. In the present paper, this intuition is analysed on a wider range of datasets to provide stronger justifications for the findings. The most important contribution of our study has been to confirm the effectiveness of the SET framework. Indeed, the random sparseness in ANNs introduced by the SET algorithm is powerful enough even without further fine-tuning of weights (i.e., revise fraction) during the training process.
1 http://featureselection.asu.edu/
Table caption (fragment): It provides information about: the loss function, the batch sizes, the learning rate, the momentum, and the weight decay

Results
This section compares the results obtained by varying the parameter ζ , evaluating the training goodness in terms of the balance between high accuracy reached and short execution time. These topics are treated in Sects. 5.1 and 5.2, respectively. Section 5.3 provides a brief comment on the preferable ζ value, following up from the previous subsections.
For brevity, only the most important outcomes are reported hereafter. The number of epochs was increased from the default value of 100 up to 150 with the aim of finding the ending point of the transient phase. By combining these two tuning parameters ( i.e., number of epochs and ζ ), we have discovered that, with the datasets herein analysed, the meaningful revise range is 0 ≤ ζ ≤ 0.02.
In particular, Sect. 5.2 shows further investigations in terms of execution time gains, conducted by replicated experiments over ten runs and averaging the obtained results.

Accuracy investigation
This section shows the results obtained from the comparative analysis in terms of accuracy improvements over 150 epochs, on the three datasets.
In the Lung Cancer dataset (Fig. 2a), substantial accuracy fluctuations are present, but there is no well-defined transient phase for ζ > 0.02. The benchmark value ζ = 0.3 shows an accuracy variation of more than 10% (e.g. accuracy increasing from 82% to 97% at the 60th epoch and an accuracy from 85% to 95% at the 140th epoch). Note that, since the first 10 epochs are within the settling phase, the significant observations concern the simulation from the 11th epoch. Due to this uncertainty and due to the absence of a transient phase, it is impossible to identify an optimal stopping condition for the algorithm. For instance, at the 60th epoch an accuracy collapse from 97% to 82% was found, followed by an accuracy of 94% at the next epoch.
For a lower revise fraction, i.e., ζ ≤ 0.02, an improvement in terms of both stability ( i.e., lower fluctuations) and accuracy loss emerges, as expected. In this scenario, defining an exit condition according to the accuracy trend over time is easier. Indeed, despite a higher accuracy loss, the curve stability allows the identification of a gradual accuracy growth over the epochs, with no unexpected sharp drops.
To quantify the amount of accuracy loss, refer to Table 3, which reports both the revise fraction and the highest accuracy reached during the whole simulation, as a percentage. Moreover, mean and confidence interval bounds are provided. From Table 3, it is possible to assert that, on average, using a higher revise fraction (such as the default one) yields an accuracy gain of just under 3% (e.g. mean at ζ = 0% vs mean at ζ = 30%), which is a negligible improvement in most application domains. This depends on the tolerance level required. For example, if the goal is to achieve an accuracy of at least 90%, then a lower ζ is sufficiently effective. The confidence interval is rather narrow, given that the gap between the lower and the upper bounds lies between 0.8 and 0.9.
Figure caption (partially recovered): [...] plus ζ = 30%, which is the benchmark value. In particular, ζ = 0% has circular markers, ζ = 1% has triangular markers, ζ = 2% has square markers, and ζ = 30% has cross-shaped markers
In the COIL20 dataset (Fig. 2b), a short transient phase with no evident improvements among the simulations with different values of ζ emerges. Indeed, there are just small accuracy fluctuations of ±3%. These results are not surprising, since improvements achieved through ζ variations also depend on the goodness of the dataset itself, both in terms of its size and in the choice of its features. Table 3 shows that accuracy is always above 98%; thus, even with ζ = 0 the accuracy loss is negligible. The confidence interval is also lower than 0.3. As the accuracy is continuously increasing over the training epochs, defining a dynamic exit condition is easier in this application domain.
Figure 2c shows the results obtained in the CLL_SUB_111 dataset. It is evident that the worst and most unstable settings among those considered are both the default one (i.e., ζ = 30%) and ζ = 2%.
From Table 3, it is interesting to notice how the accuracy levels are even more stable when using a lower revise fraction (i.e., going from a mean equal to 62.23% with ζ = 30% up to 67.14% with ζ = 0%). The fluctuations are more evident than in the other two datasets, even when looking at the confidence interval; indeed, it varies from 1.06 (with ζ = 0%) up to 2.18 (with ζ = 30%), which is larger than in the previously analysed datasets. Because of significant accuracy fluctuations, a possible early exit condition should be considered only with ζ = 0, even at the cost of a slightly higher accuracy loss.
The results obtained so far suggest that there is no need to fine-tune ζ, because the sparsity introduced by the SET algorithm is sufficiently powerful, and only a few links need to be rewired (i.e., ζ ≤ 0.02). Apart from the goodness of the datasets themselves (as in COIL20), opting for a lower revise fraction has shown that, on the one hand, the accuracy loss is sometimes negligible. On the other hand, as in the CLL_SUB_111 dataset, the performance is even higher than that obtained with the benchmark value. This confirms the hypothesis made in Sect. 5.1 of the goodness of using a randomly sparse ANN topology.
Table 3 (caption): From left: the revise fraction in percentage; the highest accuracy reached during the simulation expressed in percentage; the accuracy mean during the simulation, and the confidence interval bounds. Note that these last three parameters are computed after the first 10 epochs to avoid noise
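The summary statistics reported in Table 3 (highest accuracy, mean, and confidence interval bounds, all computed after the first 10 epochs) can be reproduced with a sketch like the one below. The text does not state the confidence level or the estimation method, so the 95% normal approximation used here is an assumption, as are the function name and the sample accuracy values.

```python
import math
import statistics

def summarize_accuracy(acc_per_epoch, skip=10, z=1.96):
    """Highest accuracy, mean and confidence bounds after the settling epochs.

    skip: number of initial epochs discarded as the settling phase.
    z:    normal quantile (1.96 corresponds to an assumed 95% level).
    """
    tail = acc_per_epoch[skip:]
    mean = statistics.mean(tail)
    half_width = z * statistics.stdev(tail) / math.sqrt(len(tail))
    return {
        "max": max(tail),
        "mean": mean,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }

# Hypothetical accuracy curve: 10 settling epochs followed by 4 recorded epochs.
accuracy = [50.0] * 10 + [90.0, 92.0, 94.0, 96.0]
summary = summarize_accuracy(accuracy)
```

Discarding the settling epochs keeps the early, noisy part of the curve from inflating the interval width.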

Execution time investigation
This section shows the comparative analysis conducted among the datasets used, in terms of execution time, over replicated simulations. Ten runs have been averaged, using the default value ζ = 0.3 as benchmark (i.e., ζ_default). Note that only the most significant and competitive ζ value has been considered (i.e., ζ_0 = 0). Figure 3 shows the execution time (in seconds) of the same averaged simulations computed on the three datasets.
In both the Lung Cancer and CLL_SUB_111 datasets, ζ = 0 is faster than the benchmark value. In particular, in CLL_SUB_111, execution is almost 40% faster than with the default value, and with higher accuracy too, as previously asserted in Sect. 5.1. It becomes less competitive in COIL20. The reason is the same as for the results that emerged in the accuracy analysis. Indeed, the goodness of the dataset is such as to make the improvements obtained by varying the revise parameter insignificant. Furthermore, the execution time gain between ζ = 0 and ζ_default has been computed among the datasets over ten runs as follows: The execution time gain was equal to 0.1370 in Lung Cancer, −0.0052 in COIL20, and 0.3664 in CLL_SUB_111. This means that, except for COIL20, there is an improvement in terms of algorithm performance. Thus, the algorithm becomes faster using a lower revise fraction. This is even more evident in CLL_SUB_111, as already noticed from Fig. 3. On the other hand, the slowdown that emerged in COIL20 is almost negligible; thus, it may be concluded that, for specific types of datasets, there is neither gain nor loss in choosing a lower ζ.
These results confirm the previous hypothesis that fine-tuning ζ is unnecessary, also because, on particular datasets (e.g. COIL20), an in-depth analysis of ζ is profitless. Thus, a relatively low revise fraction has been demonstrated to be good practice in most cases.
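The expression used for the execution time gain is not reproduced in the text above. One plausible reading, consistent with the signs of the reported values (positive when ζ = 0 is faster, negative for the COIL20 slowdown), is the relative reduction of the mean execution time with respect to the benchmark. The sketch below implements that assumed definition; the function name and the timings are hypothetical.

```python
import statistics

def execution_time_gain(times_default, times_zeta0):
    """Assumed definition: relative reduction of the mean execution time
    obtained with zeta = 0 with respect to the default zeta."""
    t_default = statistics.mean(times_default)
    t_zeta0 = statistics.mean(times_zeta0)
    return (t_default - t_zeta0) / t_default

# Hypothetical timings in seconds (the paper averages ten runs per setting).
runs_default = [100.0, 102.0, 98.0]
runs_zeta0 = [62.0, 64.0, 63.0]
gain = execution_time_gain(runs_default, runs_zeta0)
```

Under this reading, a gain of 0.37 means that training with ζ = 0 takes 37% less time than training with the default revise fraction.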

Considerations on the tuning process
In Sects. 5.1 and 5.2, we have described the effects of ζ in terms of accuracy loss and execution time, respectively. This section provides a brief summary of what emerged from those experiments. As largely discussed in the literature, it is unrealistic to try and find a perfect value that works well a priori in all possible deep learning scenarios. The same consideration applies to the tuning of the revise fraction. This is why those tests are not aimed at finding the optimal value, which depends on too many variables. Instead, it may be asserted that, from the experiments herein conducted, a relatively low ζ is always a good choice. Indeed, in the datasets analysed the best results have been obtained with 0 ≤ ζ ≤ 0.02. It is also important to highlight that, because of the high non-linearity of the problem itself, more than one ζ value could work effectively, and the process of fine-tuning ζ is an operation that may require more time than the training process itself. This is why this study aims to provide a good enough range of possible ζ values. Thus, the tests have been conducted on very different datasets to assert that, empirically speaking, in different scenarios 0 ≤ ζ ≤ 0.02 is sufficient to offer a high accuracy with low fluctuations and, at the same time, a faster execution time.

Conclusions
In this paper, we moved a step forward from earlier work (Mocanu et al. 2018). Not only did our experiments confirm the efficiency arising from training sparse neural networks, but they also managed to further exploit sparsity through a better tuned algorithm, featuring increased speed at a negligible accuracy loss.
The goodness of the revise fraction is independent of the application domain; thus, a relatively low ζ is always good practice. Of course, according to the specific scenario considered, the performance may be higher than (or at least equal to) that of the benchmark value. Yet, it is evident that network science algorithms, by keeping sparsity in ANNs, are a promising direction for accelerating their training processes.
On the one hand, acting on the revise parameter ζ positively affects both accuracy and execution time. On the other hand, it is unrealistic to try and define a priori an optimal ζ value, without considering the specific application domain, because of the high non-linearity of the problem. However, through this analysis it is possible to assert that a relatively low ζ is generally sufficient to balance both accuracy loss and execution time. Another strategy could be to sample the dataset in order to manage a smaller amount of data and train only on that portion of information when conducting tests on ζ.
This study paves the way for other works, such as the implementation of dynamic exit conditions to further speed up the algorithm itself, the development of adaptive algorithms that dynamically tune the parameters, and the study of different distributions for the initial weight assignments.