A brain-inspired algorithm for training highly sparse neural networks

Sparse neural networks attract increasing interest as they exhibit comparable performance to their dense counterparts while being computationally efficient. Pruning a dense neural network is among the most widely used methods to obtain a sparse one. Driven by the high training cost of such methods, which can be unaffordable for a low-resource device, training sparse neural networks sparsely from scratch has recently gained attention. However, existing sparse training algorithms suffer from various issues, including poor performance in high-sparsity scenarios, computing dense gradient information during training, or pure random topology search. In this paper, inspired by the evolution of the biological brain and the Hebbian learning theory, we present a new sparse training approach that evolves sparse neural networks according to the behavior of neurons in the network. Concretely, by exploiting the cosine similarity metric to measure the importance of connections, our proposed method, "Cosine similarity-based and Random Topology Exploration (CTRE)", evolves the topology of sparse neural networks by adding the most important connections to the network without computing dense gradients in the backward pass. We carried out experiments on eight datasets, including tabular, image, and text datasets, and demonstrate that our proposed method outperforms several state-of-the-art sparse training algorithms on extremely sparse neural networks by a large gap. The implementation code is available on GitHub.


Introduction
Dense artificial neural networks are a commonly used machine learning technique with a wide range of application domains, such as speech recognition [18], image processing [41,50], and natural language processing (NLP) [6]. It has been shown in [26] that the performance of deep neural networks scales with model size and dataset size, and generalization benefits from over-parameterization [58]. However, the ever-increasing size of deep neural networks has given rise to major challenges, including high computational costs during both training and inference and high memory requirements [76]. Such an increase in the number of computations can lead to a critical rise in the energy consumption of data centers and, consequently, a deteriorative effect on the environment [75]. At the same time, a trustworthy AI system should function in the most environmentally friendly way possible during development and deployment [19]. In addition, such gigantic computational costs lead to a situation where on-device training and inference of neural network models on low-resource devices, e.g., an edge device with limited computational resources and battery life, might not be economically viable [76].
Sparse neural networks have been considered as an effective solution to address these challenges [27,53]. By using sparsely connected layers instead of fully-connected ones, sparse neural networks have reached a competitive performance to their dense equivalent networks in various applications [12,3], while having much fewer parameters. It has been shown that biological brains, especially the human brain, enjoy sparse connections among neurons [13]. Most existing solutions to obtain sparse neural networks focus on inference efficiency in order to reduce the storage requirement of deploying the network and prediction time of test instances. This class of methods, named dense-to-sparse training, starts by training a dense neural network followed by a pruning phase that aims to remove unimportant weight from the network. As categorized in [53], in dense-to-sparse training, the pruning phase can be done after training [37, 12,23], simultaneous to training [48], or one-shot prior to training [38]. However, starting from a dense network leads to a memory requirement of fitting a dense network on the device and the computational resources for at least a few iterations of training the dense model. Therefore, training sparse neural networks using dense-to-sparse methods might be infeasible on low-resource devices due to the energy and computational resource constraints.
With the emergence of the sparse training concept in [51], there has been a growing interest in training neural networks that are sparse from scratch. This sparse connectivity might be fixed during training (known as static sparse connectivity [51,53,31]), or might change dynamically, by removing and re-adding weights (known as dynamic sparse connectivity [52,5]). By optimizing the topology along with the weights during training, dynamic sparse training algorithms outperform static ones [52]. As discussed in [52], weight removal in dynamic sparse training algorithms is similar to synapse shrinkage in the human brain during sleep, where weak synapses shrink and strong ones remain unchanged. While most dynamic sparse training methods use magnitude as the pruning criterion, weight regrowing approaches are of different types, including random [52,57] and gradient-based regrowth [10,28]. As shown in [47], random addition of weights might lead to a low training speed, and the performance of sparse training is highly correlated with the total number of parameters explored during training. To speed up convergence, gradient information of non-existing connections can be used to add the most important connections to the network [9]. However, computing the gradient of all non-existing connections in a sparse neural network can be computationally demanding. Furthermore, increasing the network size might escalate this high computational cost into a bottleneck for sparse training on low-resource devices. Besides, in Section 4.2, we demonstrate that some gradient-based sparse training algorithms might fail in highly sparse neural networks.

Figure 1 (caption): Using the cosine similarity matrices, we find the most important connections to add to the network; the similarity of existing connections is not considered (empty entries). The weights corresponding to the highest similarity values in the similarity matrices (underlined values) that have not been dropped in the weight-removal step are added to the network (underlined green values), in the same amount as removed previously. If a connection with high similarity has been dropped in the weight-removal step (underlined red value), a random connection is inserted instead (pink connection).
In this paper, to address some of these challenges, we introduce a more biologically plausible algorithm for obtaining a sparse neural network. Taking inspiration from the Hebbian learning theory, which states that "neurons that fire together, wire together" [25], we introduce a new weight-addition policy in the context of sparse training algorithms. Our proposed method, "Cosine similarity-based and Random Topology Exploration (CTRE)", exploits both the similarity of neurons as an importance measure for connections and random search, either simultaneously (CTRE_sim, Figure 1) or sequentially (CTRE_seq), to find a performant sub-network. In short, our contributions are as follows:
• We propose a novel and biologically plausible algorithm for training sparse neural networks that has a limited number of parameters during training. Our proposed algorithm, CTRE, exploits both the similarity of neurons and random search to find a performant sparse topology.
• We introduce the Hebbian learning theory into the training of sparse neural networks. Using the cosine similarity of each pair of neurons in two consecutive layers, we determine the most important connections at each epoch during sparse training; we discuss in detail why this approach is an extension of the Hebbian learning theory in Section 3.2.
• Our proposed algorithms outperform state-of-the-art sparse training algorithms on highly sparse neural networks.
While deep learning models have shown great success in vision and NLP tasks, they have not been fully explored in the domain of tabular data [61]. However, designing deep models capable of processing tabular data is of great interest to researchers, as it paves the way to building multi-modal pipelines [17]. This paper mainly focuses on Multi-Layer Perceptrons (MLPs), which are commonly used for tabular and biological data. Despite the simple structure of MLPs and their having only a few hyperparameters to tune, they have shown good performance in classification tasks [15,69]. In addition, the authors of [29] found that, despite the massive attention on CNN architectures, CNNs account for only 5% of the neural network workload of TPUs in Google data centers, while MLPs constitute 61% of the total workload. Therefore, it is crucial to develop efficient algorithms that can accelerate MLPs and are resource-efficient during training and inference. To pursue this goal, in this research, we aim to design sparse MLPs with a limited number of parameters during both training and inference. To demonstrate the validity of our proposed algorithm, in addition to evaluating the methods on tabular and text datasets, we also compare them on image datasets, such as MNIST, Fashion-MNIST, and CIFAR-10/100, which are commonly used as benchmarks in previous studies.

Sparse Neural Networks
Methods to obtain and train sparse neural networks can be stratified into two major categories: dense-to-sparse and sparse-to-sparse. In the following, we shed light on each of these two approaches.
Dense-to-sparse. Dense-to-sparse methods to obtain sparse neural networks start training from a dense model and then prune the unimportant connections. They can be divided into three major subcategories: (1) Pruning after training: Most existing dense-to-sparse methods start with a trained dense network and iteratively (in one or several iterations) prune and retrain the network to reach the desired sparsity level. Seminal works were performed in the 1990s in [37,24], where the authors use Hessian information to prune a trained dense network. More recently, in [23,12], the authors use magnitude to remove unimportant connections. Other criteria, such as the gradient [42], Taylor expansion [55,56], and low-rank decomposition [70,40], have also been employed to prune the network. While effective in terms of the performance of the obtained sparse network, these methods suffer from high computational costs during training. (2) Pruning during training: To decrease the computational cost, this group of methods performs pruning during training [14,30,34]. Various criteria can be used for pruning, such as magnitude [20,77], L0 regularization [48,63], group Lasso regularization [72], and variational dropout [54]. (3) Pruning before training: The first study to apply pruning prior to training was done by Lee et al. in [38], who used connection sensitivity to remove weights. Later works have followed the same approach of pruning the network before training using different criteria, such as the gradient norm after pruning [71], connection sensitivity after pruning [8], and Synaptic Flow [68].
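As a concrete illustration of the magnitude criterion used by several of the methods above, the following sketch zeroes out the smallest-magnitude fraction of a weight matrix. The function name, the toy weights, and the tie-breaking behavior at the threshold are our own illustrative choices, not taken from any cited implementation:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    Minimal sketch of magnitude-based pruning; ties at the threshold may
    remove slightly more than the requested fraction.
    """
    k = int(sparsity * w.size)                 # number of weights to remove
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.array([[0.5, -0.01], [0.002, -1.2]])
pruned = magnitude_prune(w, 0.5)   # the two smallest-magnitude weights are zeroed
```

Dense-to-sparse methods apply such a step to a trained (or training) dense network, which is why they require the full dense model in memory first.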
Sparse-to-sparse. To lower the computational cost of dense-to-sparse methods, sparse-to-sparse training algorithms (also known as sparse training) use a network that is sparse from scratch, whose sparse connectivity might be static (static sparse training [51,31]) or dynamic (dynamic sparse training (DST) [52,5]). By allowing the topology to be optimized along with the weights, sparse neural networks trained with DST have reached a performance comparable to that of their equivalent dense networks, or even outperform them.
DST methods can be divided into two main categories based on the weight-addition policy: (1) Random regrowth: Sparse Evolutionary Training (SET) [52] is one of the earliest works; it starts with a sparse neural network and performs magnitude pruning and random weight regrowth at each epoch to update the topology. In [57], the authors proposed automatically reallocating parameters across layers during sparse training in CNNs. Many works have further studied the sparse training concept recently [16,3,45,44,46,47]. (2) Gradient information: A group of works has tried to exploit gradient information to speed up the training process in DST [62]. Dettmers and Zettlemoyer [9] used the momentum of the non-existing connections as a criterion to grow weights instead of the random addition in the SET algorithm; while being effective in terms of accuracy, this method requires computing gradients and updating the momentum for all non-existing parameters. The Rigged Lottery (RigL) [10] addressed this high computational cost by using infrequent gradient information; however, it still requires computing periodic dense gradients.
The authors of [28] tried to further improve RigL by using the gradient for only a subset of non-existing weights. In [7], the authors exploit gradient information in the search for a performant sub-network and argue that gradient-based weight addition is biologically plausible.

Hebbian Learning Theory
The Hebbian learning rule was proposed in 1949 by Hebb as a learning rule for neurons [25], inspired by biological systems. It describes how neurons' activations influence the connections among them. The classical Hebb's rule states that "neurons that fire together, wire together". This can be formulated as Δw_ij = η p_i q_j, where Δw_ij is the change in the synaptic weight w_ij between two neurons in consecutive layers with presynaptic activation p_i and postsynaptic activation q_j, and η is the learning rate. While some previous works have adapted Hebb's rule to machine learning tasks [64,43], it has not been widely investigated in many others, particularly in sparse neural networks. By adapting Hebb's rule to artificial neural networks, we can obtain powerful models that might be close to the function of structures found in the neural systems of various species [33]. In [2], the authors incorporated the Hebbian learning theory to train a newly introduced neural network. In [67], the Hebbian learning concept was used to sparsify neural networks for face recognition by dropping the connections between weakly correlated neurons. In [7], the authors proposed a gradient-based algorithm for obtaining a sparse neural network; they argue that their gradient-based connection growth policy is mathematically close to the Hebbian learning theory. In this work, taking inspiration from the Hebbian learning theory, we introduce a new sparse training algorithm for obtaining sparse neural networks.
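The update Δw_ij = η p_i q_j can be sketched in a few lines; the activation vectors and learning rate below are illustrative only:

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.1):
    """Classical Hebb's rule: strengthen w_ij in proportion to
    coincident presynaptic (pre) and postsynaptic (post) activity."""
    return w + lr * np.outer(pre, post)

pre = np.array([1.0, 0.0])    # presynaptic activations p
post = np.array([0.0, 1.0])   # postsynaptic activations q
w = np.zeros((2, 2))
w = hebbian_update(w, pre, post)
# only the connection between the two co-active neurons grows: w[0, 1] == 0.1
```

Note that this rule only updates the strength of connections; the method proposed in this paper instead uses coincident activity to decide which connections should exist at all.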

Cosine Similarity
In most machine learning problems, the Euclidean distance is a common tool to measure distance due to its simplicity. However, the Euclidean distance is highly sensitive to the vectors' magnitudes [73]. Cosine similarity is another metric that addresses this issue; it measures the similarity of the shapes of two vectors as the cosine of the angle between them. In other words, it determines whether the two vectors are pointing in the same direction or not [22]. Due to its simplicity and efficiency, cosine similarity is a widely used metric in the machine learning and pattern recognition fields [73]. It is often used to measure document similarity in natural language processing tasks [66,39]. Cosine similarity has also proven to be an effective tool in neural networks. In [49], to bound the pre-activations in a multi-layer neural network, which might disturb generalization, the authors proposed using cosine similarity instead of the dot product and showed that it reaches better performance than the simple dot product. In [59], the authors used this metric to improve face verification using deep learning.
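A small sketch of the point above: cosine similarity depends only on direction, while Euclidean distance grows with magnitude (the vectors are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; insensitive to magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction, 10x the magnitude
cos = cosine_similarity(a, b)     # ~1.0: direction is identical
dist = np.linalg.norm(a - b)      # ~33.7: Euclidean distance is large anyway
```

This magnitude-invariance is the normalization property referred to later when comparing Equation 2 with Hebb's rule.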

Proposed Method
In this section, we first formulate the problem. Secondly, we demonstrate the cosine similarity as a tool for determining the importance of weights in neural networks and how it relates to the Hebbian learning theory. Finally, we present two new sparse training algorithms using cosine similarity-based connection importance.

Problem Definition
Given a set of training samples X and target outputs y, a dense neural network is trained to minimize

    min_θ (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),    (1)

where m is the number of training samples, L is the loss function, f is a neural network parameterized by θ, f(x^(i); θ) is the predicted output for input x^(i), and y^(i) is the true label. θ ∈ R^N consists of the parameters θ_l ∈ R^{N_l} of each layer l ∈ {1, 2, ..., H} of the network, where N_l = n_{l-1} × n_l is the number of parameters of layer l, n_l is the number of neurons at layer l, and N is the total number of parameters of the dense network. A sparse neural network, however, uses only a subset of θ_l and discards a fraction s_l of the parameters of each layer θ_l (their weight values are set to zero); s_l is referred to as the sparsity of layer l. The overall sparsity of the network is S = 1 − D, where D is the overall density of the network. We aim to obtain a sparse neural network with sparsity level S and parameters θ, trained to minimize the loss on the training set:

    min_θ (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))   s.t.   ||θ||_0 ≤ (1 − S) N,

where ||θ||_0, the total number of non-zero connections of the network, is determined by the density level.
Network Structure. The architecture we consider is a Multi-Layer Perceptron (MLP) with H layers. Initially, the sparse connections between two consecutive layers are initialized with an Erdős-Rényi random graph; each connection in this graph exists with probability P(θ_i^l) = ε(n_{l-1} + n_l)/(n_{l-1} n_l), i ∈ {1, 2, ..., N_l}, where ε ∈ R^+ is a hyperparameter that controls the sparsity level. The lower the value of ε, the sparser the network. In other words, increasing ε increases P(θ_i^l), which results in more connections and a denser network. Each existing connection is initialized with a small value drawn from a normal distribution.
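A sketch of this Erdős-Rényi initialization under our reading of the formula; the value of ε, the random seed, and the standard deviation of the normal initializer are illustrative choices, not the paper's settings:

```python
import numpy as np

def erdos_renyi_mask(n_in, n_out, eps, rng):
    """Sample a sparse connectivity mask for a layer of shape (n_in, n_out):
    each connection exists with probability eps * (n_in + n_out) / (n_in * n_out)."""
    p = min(1.0, eps * (n_in + n_out) / (n_in * n_out))
    return rng.random((n_in, n_out)) < p

rng = np.random.default_rng(0)
mask = erdos_renyi_mask(784, 1000, eps=13, rng=rng)
# existing connections get small normal-distributed initial values
weights = np.where(mask, rng.normal(0.0, 0.01, mask.shape), 0.0)
density = mask.mean()   # roughly eps * (n_in + n_out) / (n_in * n_out)
```

Scaling the probability by (n_in + n_out)/(n_in · n_out) keeps the expected number of connections per neuron roughly constant as the layer widths grow, rather than the total density.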

Cosine Similarity to Determine Connections Importance
In this paper, we use cosine similarity as a metric to derive the importance of non-existing connections and evolve the topology of a sparse neural network. We first demonstrate how we measure the cosine similarity of two neurons. Then, we argue why this choice has been made and how it relates to the Hebbian learning theory. We measure the similarity of two neurons p and q as:

    Sim^l_{p,q} = (A^{l-1}_{:,p} · A^l_{:,q}) / (||A^{l-1}_{:,p}||_2 ||A^l_{:,q}||_2),    (2)

where Sim^l is the similarity matrix between neurons in two successive layers l − 1 and l, and A^{l-1}_{:,p}, A^l_{:,q} ∈ R^m are the activation vectors corresponding to neurons p and q in layers l − 1 and l, respectively. If Sim^l_{p,q} is high for two unconnected neurons (close to 1), they have a high similarity among their activations; therefore, we prefer to add a connection between them, as it suggests that this path contains important information about the data. However, if Sim^l_{p,q} is low (close to 0), the activations of neurons p and q are not similar, and a connection between them might not be beneficial for the network.
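Equation 2 can be computed for all neuron pairs at once as a normalized matrix product. The batch size, layer widths, and the small epsilon guarding against zero-norm activations below are our own illustrative choices:

```python
import numpy as np

def neuron_similarity(A_prev, A_next):
    """Cosine similarity between every neuron pair in consecutive layers.

    A_prev: (m, n_prev) activations of layer l-1 over a batch of m samples
    A_next: (m, n_next) activations of layer l
    Returns Sim^l of shape (n_prev, n_next)."""
    P = A_prev / (np.linalg.norm(A_prev, axis=0, keepdims=True) + 1e-12)
    Q = A_next / (np.linalg.norm(A_next, axis=0, keepdims=True) + 1e-12)
    return P.T @ Q   # entry (p, q) is the cosine of the two activation vectors

rng = np.random.default_rng(1)
A_prev = rng.random((32, 5))   # toy batch activations, not real data
A_next = rng.random((32, 3))
sim = neuron_similarity(A_prev, A_next)
```

Note that with ReLU activations all entries of A are non-negative, so the similarity values naturally fall in [0, 1], matching the "close to 1" / "close to 0" reading above.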
We now argue why cosine similarity can be used to measure the importance of a nonexisting connection in sparse neural networks and how it connects to the Hebbian learning theory. Basically, by taking inspiration from the Hebbian learning theory, we aim to rewire the neurons that fire together in the context of sparse training algorithms, instead of only strengthening the existing connections among neurons that fire together [65]. It has been discussed in [65] that connecting a pair of neurons with strong coincident activations can be viewed as a natural extension of the Hebbian learning; it is necessary to wire the neurons that usually fire together in order to understand better the relationship among the higher-order representation of those neurons. If a causal connection between their higher-order representation does exist, growing a connection among them will enable an effective inference about the relationship between them. Therefore, we need to discover which pairs of neurons usually fire together and then rewire them.
We employ cosine similarity to measure the relation between the activation values of two neurons. As in Hebb's rule (Section 2.2), the importance of a connection in our method is also determined by multiplying the activations of its corresponding neurons, albeit normalized; in Equation 2, A^{l-1}_{:,p} is the presynaptic activation and A^l_{:,q} is the postsynaptic activation. If the activations of two connected neurons agree, then by computing the dot product of the activations, both Equation 2 and Hebb's rule assign a higher importance to the corresponding connection. This results in an increased weight and a better chance of adding this connection. Thus, both methods reward connections between neurons that exhibit similar behavior. As mentioned earlier, the main difference between Hebb's rule and Equation 2 is normalization. We discuss in Section 5.3 why the normalization step is necessary for evolving the topology of a sparse neural network.
In summary, if the cosine similarity of the activation vectors of two neurons is high, it indicates that the connection between them is important for the network's performance. Therefore, we use the cosine similarity information to decide whether the link between a pair of neurons should be rewired. Based on this knowledge, we propose two new algorithms to evolve the sparse neural network in the following sections.

Sequential Cosine Similarity-based and Random Topology Exploration (CTRE seq )
Our first proposed algorithm, Sequential Cosine Similarity-based and Random Topology Exploration (CTRE_seq), evolves the network topology using both the cosine similarity between neurons of each pair of consecutive layers and random search. In the beginning, at each training epoch, it removes unimportant connections based on their magnitude and adds new connections based on their cosine similarity. When the network's performance stops improving, the algorithm switches to random topology search. In the following, we explain the algorithm in more detail.
After initializing the sparse network with the sparsity level determined by ε, the training begins. The training procedure consists of two consecutive phases:
1. Cosine Similarity-based Exploration: (a) Firstly, a standard feed-forward and back-propagation pass is performed. (b) Then, a proportion ζ of the connections with the lowest magnitude in each layer is removed; in Section 5.2, we further discuss why this choice has been made. (c) Subsequently, we add new connections to the network based on the neurons' similarity. Taking advantage of the cosine similarity metric, we measure the similarity of two neurons as formulated in Equation 2. In each layer, we add the connections with the highest similarity between the corresponding neurons (as many connections as were removed in this layer); the new connections are initialized with a small value drawn from a uniform distribution.
2. Random Exploration: The second phase begins when the performance of the network on a validation set has not improved for e_earlystop epochs (e_earlystop is a hyperparameter of CTRE_seq). This is due to the fact that the activation values, and consequently the similarity of neurons, might not change significantly after some epochs. As a result, the topology search using cosine similarity might stall. To prevent this, we begin a random search when the classification accuracy on the validation set stops increasing. This phase is similar to phase 1, differing only in the weight-regrowing policy: instead of using cosine similarity information, we add connections to the network randomly. In this way, we prevent early stopping of the topology search. Algorithm 1 summarizes this method.
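A compact sketch of one phase-1 topology update (steps b and c) under our reading of the algorithm; the shapes, the ζ value, and the uniform re-initialization range are illustrative, and the real method interleaves these updates with weight training:

```python
import numpy as np

def ctre_seq_update(w, sim, zeta, rng):
    """One illustrative topology update: magnitude pruning (step b)
    followed by cosine-similarity-based regrowth (step c)."""
    alive = w != 0
    k = int(zeta * alive.sum())               # number of connections to replace
    if k == 0:
        return w
    # (b) drop the k alive connections with the smallest magnitude
    mags = np.where(alive, np.abs(w), np.inf)
    drop = np.argsort(mags, axis=None)[:k]
    w = w.copy()
    w.flat[drop] = 0.0
    # (c) regrow k currently-absent connections with the highest similarity;
    # note that connections dropped in (b) immediately become candidates again,
    # the limitation that motivates CTRE_sim later in the paper
    cand = np.where(w == 0, sim, -np.inf)
    grow = np.argsort(cand, axis=None)[-k:]
    w.flat[grow] = rng.uniform(-0.01, 0.01, size=k)
    return w

w = np.array([[1.0, 0.0], [0.1, 2.0]])
sim = np.array([[0.9, 0.99], [0.1, 0.2]])
rng = np.random.default_rng(0)
new_w = ctre_seq_update(w, sim, zeta=0.34, rng=rng)
```

In the toy example, the weakest connection (0.1) is dropped and the absent connection with the highest similarity (0.99) is grown in its place.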

Simultaneous Cosine Similarity-based and Random Topology Exploration (CTRE sim )
To constantly exploit the cosine similarity information during training while avoiding early stopping of the topology exploration, we propose another method for obtaining a sparse neural network, named Simultaneous Cosine Similarity-based and Random Topology Exploration (CTRE_sim).
Prior to training, we initialize a sparse neural network. After that, the training procedure in each epoch consists of three steps. The first two steps are similar to those of CTRE_seq: (a) standard feed-forward and back-propagation, and (b) magnitude-based weight removal. However, in step (c), instead of relying solely on cosine similarity information or random addition, we combine both strategies. There are two reasons behind this choice: (1) As discussed in Section 3.3, as training proceeds, the activation values become stable and might not change significantly after a while; consequently, neither do the similarity values. In CTRE_seq, we addressed this issue by switching completely to random search; however, the training speed might slow down if we rely only on random search. (2) If we rely only on cosine similarity information, it is possible to re-add, based on the similarity of the neurons, connections that were just removed based on their magnitude in the weight-removal step. In these cases, the path between the pair of similar neurons does not contribute to the performance of the network, and we should not add such connections back. These are the potential limitations of CTRE_seq.
To address these limitations, CTRE_sim takes a different approach to prevent re-adding removed connections that have a high cosine similarity, as follows. In step (c), we add the connections with the highest similarities to the network; however, for each connection with high cosine similarity that was removed based on its magnitude in step (b), we add a random connection instead. In other words, we split our budget between similarity-based and random exploration. More importantly, we let the network dynamically decide how much budget to allocate to each type of exploration at each epoch. The benefits of this approach are twofold: we prevent early stopping of the topology search, and we prevent re-adding connections that have been shown to be unhelpful for the network's performance. Algorithm 2 summarizes this method.
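The regrowth step of CTRE_sim can be sketched as follows; `just_dropped` marks connections removed in step (b) of the same epoch, and all names, shapes, and values are illustrative rather than taken from the authors' code:

```python
import numpy as np

def ctre_sim_regrow(w, sim, just_dropped, k, rng):
    """Add k connections: the highest-similarity absent ones, with a random
    fallback for any candidate that was just pruned by magnitude."""
    cand = np.where(w == 0, sim, -np.inf)        # only absent connections compete
    order = np.argsort(cand, axis=None)[::-1][:k]
    w = w.copy()
    for idx in order:
        if just_dropped.flat[idx]:
            # this high-similarity connection was just pruned by magnitude:
            # spend the slot on a random absent, not-just-dropped connection
            absent = np.flatnonzero((w.ravel() == 0) & ~just_dropped.ravel())
            idx = rng.choice(absent)
        w.flat[idx] = rng.uniform(-0.01, 0.01)
    return w

w = np.array([[0.0, 1.0], [0.0, 0.5]])               # (0, 0) was just pruned
just_dropped = np.array([[True, False], [False, False]])
sim = np.array([[0.99, 0.0], [0.3, 0.0]])
rng = np.random.default_rng(0)
new_w = ctre_sim_regrow(w, sim, just_dropped, k=1, rng=rng)
```

In the toy example, the highest-similarity candidate (0, 0) is skipped because it was just pruned, and a random connection is inserted instead; this is how the budget split between the two exploration modes emerges dynamically from the data.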

Experiments and Results
In this section, we evaluate our proposed algorithms and compare them with several state-of-the-art algorithms for obtaining a sparse neural network. First, we describe the settings of the conducted experiments, including the hyperparameter values, implementation details, and datasets. Then, we compare the methods in terms of classification accuracy on several datasets and on networks with different sizes and sparsity levels.

Settings
This section gives a brief overview of the experiment settings, including hyperparameter values, implementation details, and datasets used for the evaluation of the methods.

Hyperparameters.
The network that we use to perform the experiments is a 3-layer MLP, as described in Section 3.1. The activation functions used for the hidden and output layers are ReLU and Softmax, respectively, and the loss function is cross-entropy. The values for most hyperparameters have been selected using a grid search over a limited number of values. The hyperparameter ζ has been set to 0.2. In Algorithm 1, e_earlystop has been set to 40. We train the network with Stochastic Gradient Descent (SGD) with momentum and an L2 regularizer. The momentum coefficient, the regularization coefficient, and the learning rate are 0.9, 0.0001, and 0.01, respectively. All experiments are performed using 500 training epochs. The datasets have been preprocessed using min-max scaling so that each feature is normalized between 0 and 1, except for Madelon, where we use a standard scaler (each feature has zero mean and unit variance). For the image datasets, data augmentation has not been performed unless explicitly stated.

Table 1: Characteristics of the datasets used in the experiments.
Dataset       | #Features | Type       | #Samples | #Train | #Test | #Classes
Isolet        | 617       | Speech     | 7797     | 6237   | 1560  | 26
Madelon       | 500       | Artificial | 2600     | 2000   | 600   | 2
MNIST         | 784       | Image      | 70000    | 60000  | 10000 | 10
Fashion-MNIST | 784       | Image      | 70000    | 60000  | 10000 | 10
CIFAR10       | 3072      | Image      | 60000    | 50000  | 10000 | 10
CIFAR100      | 3072      | Image      | 60000    | 50000  | 10000 | 100
PCMAC         | 3289      | Text       | 1943     | 1554   | 389   | 2
BASEHOCK      | 4862      | Text       | 1993     | 1594   | 399   | 2

Comparison
We compare our results with three state-of-the-art methods for obtaining sparse neural networks: SNIP, RigL, and SET.
• SNIP [38]. Single-shot network pruning (SNIP) is a dense-to-sparse sparsification algorithm that prunes the network prior to training based on connection sensitivity. It calculates this metric after a few iterations of dense training. After pruning, SNIP trains the resulting sparse neural network.
• RigL [10]. The Rigged Lottery (RigL) is a sparse-to-sparse algorithm for obtaining a sparse neural network that uses gradient information as the weight-addition criterion.
• SET [52]. Sparse Evolutionary Training (SET) is a sparse-to-sparse training algorithm that uses random weight addition to update the topology.
Besides, we measure the classification performance of a fully-connected MLP as the baseline method.

Implementation
We evaluate our proposed methods and the considered baselines on eight datasets. We implemented our proposed method using TensorFlow [1]. The basis of this implementation is the RigL code from GitHub; it also includes implementations of SNIP, SET, and the fully-connected MLP. This code uses a binary mask over the weights to implement sparsity. In addition, we provide a purely sparse implementation that uses sparse matrices from the SciPy library; this code is developed from the sparse implementation of SET, which is also available on GitHub. For all experiments, we use the TensorFlow implementation to ensure a fair comparison among the methods; however, we provide results using the sparse implementation in Appendix C. Most experiments were run on a CPU (Dell R730); for the image datasets, we used a Tesla P100 GPU. All experiments were repeated with three random seeds, the only exception being the experiments in Section 4.2, where we used 15 random seeds to analyze the statistical significance of the obtained results with respect to the considered algorithms (Section 4.2.1). To ensure a fair comparison, for the sparse training methods (SET, RigL, and CTRE), the sparsity mask is updated at the end of each epoch, and the drop fraction (ζ) and learning rate are kept constant during training.

Datasets
We conducted our experiments on eight benchmark datasets, as follows:
• Madelon [21] is an artificial dataset with 20 informative features and 480 noise features.
• Isolet [11] has been created with the spoken name of each letter of the English alphabet.
• MNIST [36] is a database of 28 × 28 images of handwritten digits. More details about the datasets are presented in Table 1.

Performance Evaluation
In this experiment, we compare the methods in terms of classification accuracy on networks with varying sizes and sparsity levels. We consider three MLPs, each having three hidden layers with 100, 500, and 1000 hidden neurons, respectively. By changing the value of ε for each MLP, we study the effect of the sparsity level on the performance of the methods. Table 2 summarizes the results of these experiments, which are carried out on five datasets, including tabular and image datasets with different characteristics. We have also included the density (as a percentage) and the number of connections (divided by 10^3) for each network in this table. For training on each dataset, we allocate 10% of the training set as a validation set. During training, each MLP is trained on the remaining training set, and at each epoch, we measure the performance on the validation set. Finally, Table 2 presents the results of each algorithm on an unseen test set, using the model that achieves the highest validation accuracy during training. The learning curves for each case are presented in Appendix A; we also present some interesting cases in Figure 2.
First, we analyze the performance of the methods on the two tabular datasets. As can be seen in Table 2, on the Madelon dataset, CTRE sim is the best performer in most cases. Interestingly, the accuracy increases as the network becomes sparser. This can be explained intuitively: since the Madelon dataset contains many noise features (> 95%), the higher the number of connections, the higher the risk of over-fitting the noise features. CTRE sim can find the most important information paths in the network, which most likely start from the input neurons corresponding to the informative features. As a result, it reaches an accuracy of 78.8% with only 0.3% of the total connections of the equivalent dense network (n l = 1000), while the maximum accuracy achieved by the other considered methods is 61.9% (SET). On the second tabular dataset, Isolet, CTRE sim is the best performer on the two very sparse models, with densities of 0.4% (n l = 500) and 0.3% (n l = 1000). In addition, in all the other cases, CTRE sim and CTRE seq are the second- and third-best performers. In terms of learning speed, we can observe in Figure 2 that CTRE sim finds a good topology much faster than the other methods, which results in an increase in accuracy shortly after training starts. From Figure 2, it can also be seen that RigL fails to find an informative sub-network in these cases (D < 0.3%). This indicates that gradient information might not be informative in highly sparse networks.
On the image datasets, CTRE sim and CTRE seq are the best and second-best performers in most of the cases considered. When the network size is small (n l = 100), SET is the major competitor of CTRE. However, when the model size increases, CTRE outperforms SET. This indicates that the pure random weight addition policy in SET can perform well in networks with a higher density, while it is hard to find such a sub-network randomly in high sparsity scenarios due to the very large search space. RigL also has comparable performance to SET, except for very sparse models; as discussed in the previous paragraph, on a highly sparse network (D < 0.3%), RigL performs poorly. Besides, as shown in Figure 2, SNIP starts with a steep increase in accuracy due to the few iterations of training a dense network and, thus, starting with a good topology. However, as the training proceeds, this topology cannot achieve the same performance as the other methods, which indicates that dynamic weight updating is an essential factor in the sparse training of neural networks. These observations confirm that cosine similarity is an informative criterion for adding weights to the network, compared to random (SET) and gradient-based (RigL) addition, in very sparse neural networks. CTRE reaches better performance than state-of-the-art sparse training algorithms in terms of learning speed and accuracy when the network is highly sparse.

Table 2: Classification accuracy (%) comparison among methods on networks with various sizes and sparsity levels. The density (%) and number of connections for each case are indicated in the table. Please note that N (the total number of parameters of the network) is scaled by ×10^3.
Furthermore, comparing the results with the dense network shows that it is possible to reach comparable performance even with 100 times fewer connections, which makes the method an excellent choice for low-resource edge devices. We further compare the computational cost of the algorithms in Appendix B and their learning speed in Appendix A.

Statistical Significance Analysis
In this section, we analyze the statistical significance of the results obtained by CTRE compared to the other algorithms. To measure this, we perform the Kolmogorov-Smirnov test (KS-test). The null hypothesis is that the two independent sets of results are drawn from the same continuous distribution. If the p-value is small (p-value < 0.05), the difference between the two sets of results is significant and the null hypothesis is rejected. Otherwise, the obtained results are close together and we fail to reject the null hypothesis.
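For illustration, the two-sample KS statistic is simply the maximum distance between the empirical CDFs of the two result samples; a minimal pure-Python sketch follows (in practice one would call scipy.stats.ks_2samp, which also returns the p-value; the toy accuracy values below are ours).

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum is attained at a sample point
        ecdf_a = sum(v <= x for v in a) / len(a)
        ecdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d

# Toy accuracy samples over random seeds for two methods.
acc_ctre = [78.1, 78.8, 77.9, 78.5]
acc_set  = [61.2, 61.9, 60.8, 61.5]
print(ks_statistic(acc_ctre, acc_set))  # 1.0: the two samples do not overlap
```

A statistic near 1 with small samples like these yields a small p-value, i.e., the null hypothesis of a common distribution is rejected.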
We perform the KS-test between the results obtained by CTRE (for simplicity, we consider the maximum results of CTRE seq and CTRE sim) and the other considered algorithms for the experiments in Table 2. The results of the KS-test are summarized in Table 3. In this table, Reject indicates that the results are sufficiently distinct, and True means that the obtained results are close together. The * sign in Table 3 shows that an algorithm has achieved the maximum accuracy in the corresponding experiment. Finally, the entries colored red indicate experiments where a compared method obtains a result close to that of CTRE while having a lower mean accuracy.
From Table 3, we can observe that in the majority of the experiments, CTRE obtains higher mean accuracy than the other methods while being statistically different from them. The only dataset where the results are close in most cases is Fashion-MNIST, where SET has comparable results to CTRE. In addition, in the high sparsity regime with a large network size (n l = 1000, ε = 1), CTRE achieves the highest accuracy among the methods while being significantly distinct from them. Overall, Table 3 indicates that CTRE significantly outperforms the considered methods in most of the experiments.

Sparsity-Performance Trade-off Analysis in Highly Sparse MLPs
We carry out another experiment to study the trade-off between sparsity and accuracy in very high sparsity cases. We perform this experiment on two difficult classification tasks: image classification on CIFAR100, which is considered more difficult than the previously considered image datasets, and text classification on PCMAC and BASESHOCK, which are subsets of the 20-newsgroup dataset; they have a high number of features and a low number of samples. This experiment uses a 3-layer MLP with 1000 hidden neurons for the text datasets and 3000 hidden neurons for the CIFAR100 dataset. We change the density value between 0 and 1 and compare our proposed approaches to SNIP, RigL, and SET (due to the close performance of CTRE sim and CTRE seq on the previously considered image datasets, we perform the experiments on CIFAR100 with CTRE sim only). We use data augmentation for CIFAR100. Also, as the network is considerably large on this dataset, we set the learning rate to 0.05 to speed up training. The results are presented in Figure 3. As shown in Figure 3, in highly sparse networks (D < 0.5%), CTRE sim outperforms the other methods by a large gap. As discussed in Section 4.2, RigL performs poorly in these scenarios. SNIP outperforms SET and RigL at very low densities, while still achieving lower results than CTRE sim in all cases. While SET outperforms the other methods at larger density values on CIFAR100 and BASESHOCK, it performs poorly on very sparse networks. On the text datasets, CTRE seq has comparable performance to CTRE sim and SET at higher densities, and it achieves the highest accuracy on PCMAC. Overall, we can observe that CTRE sim has decent performance on these three datasets at density values between 0.3% and 0.5%.

Discussion
In this section, we perform an in-depth analysis to better understand the behavior of CTRE. First, in Section 5.1, we perform two ablation studies to examine the effectiveness of both the random topology search and the similarity importance metric in the performance of CTRE. In Section 5.2, we discuss why we have chosen magnitude over cosine similarity for the weight removal step. In Section 5.3, we discuss why the insensitivity of cosine similarity to the vectors' magnitude is important for the performance of CTRE. Finally, we discuss the convergence of CTRE in Section 5.4.

Ablation Study: Analysis of Topology Search Policies
This section presents and discusses the results of two ablation studies designed to understand better the effect of different topology search policies in CTRE. In the following, we describe each ablation experiment separately.

Ablation Study 1: Random Topology Search
The first ablation study aims to analyze the effect of random connection addition on the behavior of CTRE. Therefore, instead of using the similarity information and random search (simultaneously in CTRE sim and sequentially in CTRE seq ), we only use the cosine similarity information at each epoch. We call this approach CTRE w/oRandom and repeat the experiments from Section 4.2. The detailed results are available in Table 4.
As can be seen in Table 4, in most cases considered, CTRE w/oRandom is outperformed by CTRE sim and CTRE seq. On the other hand, we can observe that on the image datasets, CTRE w/oRandom has comparable performance to the other two methods; this indicates the effectiveness of the similarity information on the image datasets. However, on the tabular datasets, it performs poorly in the high sparsity cases (ε = 1). Therefore, using only cosine information in these scenarios can cause the topology search to get stuck in a local minimum. This might originate from an early stagnation of the activation values, which leads to an early stop in the topology search. CTRE seq solves this by changing the weight update policy to random search. However, there is a risk of switching to random search too early, before the cosine information has been fully exploited. Finally, by considering both random and cosine information in each epoch, the CTRE sim algorithm minimizes the risk of both staying in a local minimum and switching to a completely random search, either of which might slow the training process. In the context of network topology search, these components can also be characterized as exploitation (local information based on the similarity between neurons) and exploration (random search). As a result, CTRE sim mitigates the limitations of CTRE seq and finds a performant sub-network by leveraging these two components, thereby outperforming the state-of-the-art algorithms.

Table 4: Classification accuracy (%) comparison among cosine similarity-based methods.

Ablation Study 2: Cosine similarity-based Topology Search
To study the effectiveness of cosine similarity-based addition in the performance of CTRE, we design an experiment in which we add connections to the network in the reverse order of importance. We expect that adding weights in this order would result in poor performance. We perform this experiment on CTRE sim. Concretely, at each step, we add a number of weights with the lowest similarity among the corresponding neurons; if a weight with very low similarity has been removed in the last weight removal step, we add a random connection instead. We call this method CTRE sim/LTH (LTH refers to low-to-high importance).
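The two addition orders can be sketched as follows (a hypothetical helper over a toy similarity matrix, not the paper's code): given a similarity score for every non-existing connection, CTRE sim would add the highest-scoring ones, while the CTRE sim/LTH ablation adds the lowest-scoring ones instead.

```python
import numpy as np

def pick_connections(similarity, mask, k, reverse=False):
    """Return flat indices of k non-existing connections, ranked by similarity."""
    scores = similarity.copy()
    # Exclude existing connections from the ranking.
    scores[mask == 1] = np.inf if reverse else -np.inf
    order = np.argsort(scores, axis=None)          # ascending flat order
    return order[:k] if reverse else order[::-1][:k]

rng = np.random.default_rng(1)
sim = rng.random((3, 3))
mask = np.zeros((3, 3))                            # no connections exist yet
top = pick_connections(sim, mask, k=2)             # most similar pairs
low = pick_connections(sim, mask, k=2, reverse=True)  # LTH variant
```

The only change between the method and its ablation is the sort direction, which isolates the contribution of the similarity ranking itself.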
As can be seen in Table 4, CTRE sim/LTH is outperformed by CTRE sim and CTRE seq in most of the cases considered. This shows that cosine similarity is a useful metric for detecting the most important weights in the network. By comparing CTRE sim/LTH with SET (Table 2), it is clear that in most cases CTRE sim/LTH achieves accuracy close to or slightly worse than SET. Therefore, it can be inferred that CTRE sim/LTH selects non-informative weights, which makes it similar to or worse than a random search. As a result, this indicates the effectiveness of the introduced similarity metric (Equation 2) in finding a well-performing sparse neural network. It is worth noting that on the Isolet dataset, CTRE sim/LTH outperforms CTRE sim and CTRE seq in some cases, particularly in the networks with higher density; this mirrors the results of SET. Therefore, we can conclude that random search outperforms the other methods on the Isolet dataset at low sparsity levels; however, it is not easy to find a highly sparse network using a random search policy.

Analysis of Weight Removal Policy
In this section, we analyze the weight removal policy and further explain the reasons behind choosing magnitude-based pruning over cosine similarity (discussed in Section 3.2). In many previous studies, magnitude-based pruning has been commonly used as a criterion to remove unimportant weights from a neural network. We design an experiment to compare the performance of magnitude-based and cosine similarity-based pruning in neural networks.
In this experiment, we start with a trained network and gradually remove weights based on either the magnitude or the cosine similarity value (using Equation 2) of the corresponding connection. We also consider random pruning as a baseline.
Settings. We perform this experiment using two networks: (1) a 3-layer dense MLP with 1000 neurons in each layer, and (2) a 3-layer sparse MLP with 1000 neurons in each layer, trained using the SET approach [52] (3.2% density). We chose SET instead of CTRE to avoid any bias toward the cosine similarity-based weight removal, as CTRE uses cosine information to add weights. Both networks are trained on the MNIST dataset.
Weight Removal. We remove weights in two orders on each of the sparse and dense networks: from least to most important and vice versa. We remove weights gradually; at each step, we remove 1% of the connections and measure the accuracy of the pruned network, until no connections remain in the network.
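One step of this gradual removal protocol can be sketched as follows (toy weight matrix, hypothetical function name); here the criterion is smallest magnitude first, and the cosine similarity variant would simply rank by a different score.

```python
import numpy as np

def prune_step(weights, fraction=0.01):
    """Zero out the smallest-magnitude `fraction` of the nonzero weights."""
    flat = np.abs(weights).flatten()
    nonzero = np.flatnonzero(flat)
    k = max(1, int(len(nonzero) * fraction))
    drop = nonzero[np.argsort(flat[nonzero])[:k]]  # k least important connections
    pruned = weights.flatten()
    pruned[drop] = 0.0
    return pruned.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((10, 10))
w1 = prune_step(w, fraction=0.05)
print(np.count_nonzero(w), np.count_nonzero(w1))  # 100 95
```

Repeating `prune_step` and measuring accuracy after each call traces out the pruning curves shown in Figure 4.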
Results. The results when the two networks are trained for 10, 30, 50, and 100 epochs are available in Figure 4. In this figure, the lines with higher transparency correspond to the weight removal of the SET-MLP, and the lines with lower transparency correspond to the dense-MLP. This experiment has been repeated with three seeds for each case.
As shown in Figure 4, when weights are removed from least to most important, magnitude-based pruning orders the weights better than cosine similarity-based pruning. When the networks are trained for 100 epochs, dropping the unimportant weights by magnitude causes the major accuracy drop to start only after removing almost 70% of the connections, whereas it happens after removing 30% for cosine similarity. This behavior is present in both the dense and the sparse networks. As expected, the drop for random removal happens from the beginning of the pruning procedure. In earlier epochs (10, 30, and 50), the drop in accuracy happens earlier for both magnitude and cosine similarity.
As can be seen in Figure 4, when weights are removed in the opposite order (from most to least important), the accuracy-drop behavior is almost identical for cosine similarity-based and magnitude-based pruning in the SET-MLP, particularly in the earlier epochs. Therefore, both magnitude and cosine similarity can identify the most important connections in a good order. However, this behavior differs in the dense network, where magnitude-based pruning detects the most important weights better: the drop in accuracy for magnitude-based pruning happens earlier than for cosine similarity-based pruning.
Conclusions. These observations lead us to conclude, firstly, that magnitude is a good metric for weight removal in sparse training. Secondly, cosine similarity can be a good metric for adding the most important connections in the weight addition phase of sparse neural networks, in the absence of magnitude. As discussed earlier, the cosine similarity of each connection is an informative criterion for detecting the most important weights in a sparse neural network and behaves similarly to magnitude-based pruning in these scenarios. Therefore, since non-existing connections have no magnitude during weight addition, cosine similarity is a useful criterion for detecting the most important weights without requiring the computation of dense gradient information.

Magnitude Insensitivity: The Favorable Feature of Cosine Similarity in Noisy Environments
This section further discusses why cosine similarity has been chosen as the metric for determining the importance of non-existing connections. Specifically, we focus on analyzing the importance of the normalization in Equation 2 for the performance of the algorithm. While, based on the Hebbian learning rule, the connection between a pair of neurons with high activations should be strengthened, we argue that in the search for a performant sparse neural network, the magnitude of the activations should be ignored.
Based on Hebb's rule (Section 2.2), connections among neurons with high activations receive higher synaptic updates. Therefore, if we evolve the topology using this rule (without any normalization), the importance of a non-existing connection between neurons p and q should be determined by the inner product A^{l-1}_{:,p} · A^{l}_{:,q}. We evaluate the performance of this metric by substituting it for Equation 2 in CTRE sim and CTRE seq; we name these algorithms CTRE sim-Hebb and CTRE seq-Hebb, respectively.
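The difference between the two metrics can be shown on toy activation columns (vectors and names are ours): the raw Hebbian score is the inner product of the activation vectors, while the cosine score normalizes that product by the vectors' magnitudes, making it scale-invariant.

```python
import numpy as np

def hebbian_importance(a_p, a_q):
    """Raw Hebbian score: inner product of the two activation vectors."""
    return float(a_p @ a_q)

def cosine_importance(a_p, a_q):
    """Cosine similarity: the same inner product, normalized by magnitude."""
    return float(a_p @ a_q / (np.linalg.norm(a_p) * np.linalg.norm(a_q)))

informative = np.array([1.0, -1.0, 1.0, -1.0])
aligned     = np.array([0.5, -0.5, 0.5, -0.5])
loud        = 10.0 * informative   # same direction, 10x the magnitude

print(hebbian_importance(informative, aligned))  # 2.0
print(hebbian_importance(loud, aligned))         # 20.0: magnitude dominates
print(cosine_importance(informative, aligned))   # 1.0
print(cosine_importance(loud, aligned))          # 1.0: scale-invariant
```

A loud but uninformative neuron can thus dominate the Hebbian ranking, while the cosine ranking depends only on how well the activation patterns align.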
We evaluate these methods on the Madelon dataset. We chose this dataset because of its interesting properties: it contains 480 noisy features (out of 500 features), so finding informative paths of information through the network is a challenging task. The settings of this experiment are similar to Section 4.2; we measure the performance on networks with different sizes and sparsity levels. The results are presented in Table 5, and the accuracy during training is plotted in Figure 5. CTRE sim-Hebb and CTRE seq-Hebb are outperformed by CTRE sim and CTRE seq in all cases considered. In particular, we can observe that as the network becomes sparser, the gap between the performance of the pure Hebbian-based methods and the cosine similarity-based methods increases.

Table 5: Classification accuracy (%) comparison of cosine similarity-based methods and pure Hebbian-based evolution on the Madelon dataset. Please note that N (the total number of parameters of the network) is scaled by ×10^3.
The poor performance of CTRE sim-Hebb and CTRE seq-Hebb on the Madelon dataset results from their sensitivity to the magnitude of the activation values. As Madelon contains many noisy features, some uninformative neurons are likely to receive high activation values. Therefore, if we use only the activation magnitude to find the informative paths of information, the algorithm will be biased toward neurons with very high activations, which might not be informative, and is thus likely to assign new connections to noisy features with high activations. This causes the algorithm to get stuck in a local minimum that might be difficult to escape, as these neurons continue to receive more and more connections at each epoch. Furthermore, as the networks become sparser, the informative features have a lower chance of receiving more connections (there are more noisy features than informative ones). Therefore, in sparse networks, the gap between the performance of these methods is much larger than in denser networks. Based on these observations, it can be concluded that the insensitivity of cosine similarity to the vectors' magnitude helps CTRE to be more robust in noisy environments.

Convergence Analysis
This section discusses the convergence of the proposed algorithm for training sparse neural networks from scratch, CTRE. In short, we first discuss the effect of the weight evolution process on the algorithm's convergence; second, we explore whether cosine similarity causes CTRE to converge to a local minimum. First, we analyze whether the weight evolution process in the CTRE algorithm interferes with the convergence of the back-propagation algorithm. In the CTRE algorithm, a number of connections are removed at each training epoch, and the same number of connections are added based on the cosine or random search policies. The weight evolution process is performed at each epoch after the standard feed-forward and back-propagation steps. The removed connections have a small magnitude compared to the other connections, and the newly activated connections also receive a small value; therefore, they do not change the loss value significantly. The new weights are updated in the next feed-forward and back-propagation step, and they will grow or shrink accordingly. Therefore, the weight evolution process does not disrupt the convergence of the model.
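One weight-evolution step as described above can be sketched with a binary mask (a hedged toy sketch: growth here is random, standing in for the cosine/random policies, and all names are ours): drop a fraction ζ of the smallest-magnitude connections, then activate the same number of new ones at a small initial value so the loss is barely perturbed.

```python
import numpy as np

def evolve_mask(weights, mask, zeta=0.3, init_scale=1e-3, rng=None):
    """Drop the smallest-magnitude fraction zeta of connections and regrow as many."""
    if rng is None:
        rng = np.random.default_rng(0)
    active = np.flatnonzero(mask)
    k = int(len(active) * zeta)

    # Remove the k active connections with the smallest magnitude.
    drop = active[np.argsort(np.abs(weights.flatten()[active]))[:k]]
    mask.flat[drop] = 0
    weights.flat[drop] = 0.0

    # Grow k new connections among the currently inactive positions.
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=k, replace=False)
    mask.flat[grow] = 1
    weights.flat[grow] = init_scale * rng.standard_normal(k)  # small init values
    return weights, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 6))
m = (rng.random((6, 6)) < 0.5).astype(int)
before = int(m.sum())
w, m = evolve_mask(w, m, zeta=0.3, rng=rng)
print(before, int(m.sum()))  # the number of active connections is unchanged
```

Because the removed weights were small and the regrown ones start small, each such step perturbs the loss only slightly, which is why the evolution does not disrupt convergence.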
To validate this, we depict the test loss during training in Figure 6 for the high sparsity regime and a large network (ε = 1, n l = 1000). It can be observed that the loss function converges for the CTRE algorithm on all the datasets. In addition, in most cases, its convergence speed is much faster than for the other algorithms.
Secondly, we analyze whether CTRE is prone to converging to a local optimum. As discussed in Section 5.2, cosine similarity is very successful at determining the most and least important connections in the network. However, in the mid-importance range, it might not rank connections as well as the magnitude criterion; therefore, it might add some connections that do not contribute to decreasing the network loss. In such cases, the cosine similarity metric might prevent topology exploration and get stuck in local minima. CTRE explores other weights and exits such local minima by using random search. To validate this, in Figure 7, we present the loss during training for CTRE seq, CTRE sim, and CTRE w/oRandom on three highly sparse neural networks trained on the Isolet dataset. The fast decrease in the loss in these plots indicates that all three methods quickly find a well-performing sub-network. However, the loss value of CTRE w/oRandom does not improve significantly after 200 epochs, and it converges to a higher value than the other two methods. Therefore, it is important to use random exploration to keep improving the topology and avoid local minima, as is done in CTRE.

Conclusion and Broader Impacts
In this research, we introduced a new biologically plausible sparse training algorithm named CTRE. CTRE exploits both the similarity of neurons as an importance measure of the connections and random search, sequentially (CTRE seq ) or simultaneously (CTRE sim ), to explore a performant sparse topology. The findings of this study indicate that the cosine similarity between neurons' activations can help to evolve a sparse network in a purely sparse manner even in highly sparse scenarios, while most state-of-the-art methods may fail in these cases. In our view, by using the neurons' similarity to evolve the topology, our proposed approach can be an excellent initial step toward explainable sparse neural networks. Overall, due to the ability of CTRE to extract highly sparse neural networks, it can be a viable alternative for saving energy in both low-resource devices and data centers and pave the way to achieving environmentally friendly AI systems. Nevertheless, the trade-off between accuracy and sparsity, with CTRE deployed on real-world applications, should be considered carefully; particularly, if any loss of accuracy may pose safety risks to the user, the sparsity level of the network needs to be analyzed with greater care. An interesting future direction of this research is to extend CTRE to CNNs; driven by the decent performance of CTRE on image datasets, we believe that it has the potential to be extended to CNN architectures. However, in-depth theoretical analysis and systematic experiments are required to adapt this similarity metric to CNN architectures. This is due to the fact that CNNs require weight sharing, which does not exist in real neurons, and consequently, it is not straightforward to apply Hebbian learning directly [60]. There have been some efforts to make CNNs more biologically plausible [60,4]. Therefore, applying CTRE to CNNs should be done with great care and theoretical analysis that we believe is in the scope of future works.

A Performance Evaluation
In this appendix, we compare the performance of the algorithms in terms of accuracy, learning speed, and computational complexity. In particular, we analyze the results obtained in Section 4.2 of the manuscript. We first introduce a new metric for comparing the learning speed of the methods. We also include the learning curves for the experiments of Section 4.2 in Figures 8, 9, 10, 11, 12, and 13, which correspond to Madelon, Isolet, MNIST, Fashion-MNIST, CIFAR10, and CIFAR100, respectively. The characteristics of these datasets are presented in Table 1.
To compare the training speed, we define a metric that computes the fraction of the total training time required to reach a certain level of accuracy. We call this metric Training Delay (TD) and compute it as follows:

TD = min{ i | acc_i ≥ th × acc_max } / #epochs,

where acc_i is the test accuracy at epoch i, th is a threshold hyperparameter between 0 and 1, acc_max is the maximum accuracy achieved by the training methods for the model with n_l hidden neurons and sparsity level ε, and #epochs is the total number of training epochs. In other words, TD captures the trade-off between accuracy and learning speed. The lower the TD of a method, the faster it can be trained to reach a certain desired level of accuracy (determined by th), and therefore the better its trade-off between accuracy and learning speed. We believe that minimizing this metric is crucial for low-resource devices, where accuracy is not the only important aspect of performance; rather, achieving a decent level of accuracy within a minimum number of training epochs is the primary concern.
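The TD metric can be computed with a few lines of Python (toy accuracy curve, hypothetical function name): scan the per-epoch accuracies for the first epoch reaching th × acc_max and divide by the total number of epochs.

```python
def training_delay(accuracies, acc_max, th=0.9):
    """Fraction of training needed to reach th * acc_max (None if never reached)."""
    for i, acc in enumerate(accuracies, start=1):
        if acc >= th * acc_max:
            return i / len(accuracies)
    return None  # corresponds to an empty entry in Table 6

curve = [0.40, 0.55, 0.68, 0.71, 0.72]  # toy test-accuracy curve over 5 epochs
print(training_delay(curve, acc_max=0.72, th=0.9))  # 0.6: threshold reached at epoch 3 of 5
```

A return value of None models the empty table entries: the method never reached the required accuracy within the epoch budget.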
For each network with different sizes and each training method, we measure TD on all datasets. We consider only the high sparsity case (when ε = 1) since when the network is dense, all the methods have very low TD, and the difference between them is negligible. However, when we are looking for a highly sparse sub-network, it takes longer for each method to find the well-performing sub-network, and the difference among the methods is more apparent. We set the threshold th to 0.9; therefore, we compute the training delay for reaching 0.9 of the maximum accuracy achieved on this model. The results are presented in Table 6. If a method cannot reach the th × acc max within the total number of epochs (500 in these experiments), we keep the corresponding entry empty.
As can be seen in Table 6, CTRE (including CTRE sim and CTRE seq) has the lowest training delay (TD) in 13 out of the 18 cases considered. On the Isolet and Madelon datasets, some methods cannot reach the required level of accuracy (0.9 of the maximum accuracy) within the 500 training epochs. SNIP has the worst performance among these methods and cannot reach the required level of accuracy on Madelon, Isolet, and CIFAR10. RigL behaves similarly: while it performs decently on Fashion-MNIST and MNIST, it performs poorly on the other datasets. Finally, SET has comparable performance to the other methods on Fashion-MNIST and MNIST; however, when the network is highly sparse and large (ε = 1 and n l > 100) on the Isolet dataset, it does not perform well.

Table 6: Training delay (TD) (%) comparison among methods. Empty fields indicate that the method cannot reach the considered level of accuracy within 500 training epochs.


B Computational Complexity
In this appendix, we compare the algorithms in terms of the computational complexity. While the computational cost during inference is equal for all methods (in the case of having the same sparsity level), the computational complexity during training is different.
We compare the computational complexity with the two sparse training algorithms closest to CTRE: SET and RigL. Compared to SET, our proposed methods require the extra cost of computing the cosine similarity matrix for the connections. For each layer in each epoch, CTRE requires computing three dot products of size m (the number of samples) for each connection in that layer to compute the similarity matrix in Equation 2. Therefore, for each layer l, CTRE requires on the order of O(mN_l) extra computations at each epoch, where N_l is the number of parameters of layer l. However, this additional cost considerably improves the accuracy and the learning speed (discussed in Appendix A), particularly on tabular datasets and highly sparse neural networks. Therefore, depending on the application, practitioners should weigh the trade-off between accuracy and computational cost when searching for highly sparse neural networks. Compared to RigL, which requires computing occasional dense gradients, CTRE has the same order of complexity, because the cost of computing the gradients for back-propagation is also O(mN_l). However, CTRE outperforms RigL, especially in the high sparsity region.
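One way to realize this cost in practice (a sketch under our own naming; the paper's Equation 2 may differ in details) is to normalize each activation column to unit length and obtain all pairwise similarities with a single matrix product, which makes the O(m · n_{l-1} · n_l) = O(mN_l) cost per layer and epoch explicit.

```python
import numpy as np

def similarity_matrix(act_prev, act_next, eps=1e-12):
    """Cosine similarity between every column pair of two activation matrices."""
    a = act_prev / (np.linalg.norm(act_prev, axis=0, keepdims=True) + eps)
    b = act_next / (np.linalg.norm(act_next, axis=0, keepdims=True) + eps)
    return a.T @ b  # shape: (n_{l-1}, n_l), one score per candidate connection

rng = np.random.default_rng(0)
m, n_prev, n_next = 32, 5, 4
S = similarity_matrix(rng.standard_normal((m, n_prev)),
                      rng.standard_normal((m, n_next)))
print(S.shape)  # (5, 4)
```

The dominant cost is the `a.T @ b` product over m samples, which also makes clear why subsampling the rows (as explored below with CTRE sample2 and CTRE sample4) reduces the cost proportionally.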
To further decrease the computational cost of CTRE, we have tried to reduce the cost of the cosine similarity computation by using only a proportion of the samples to compute the similarity matrix. We run CTRE sim with half of the samples (CTRE sample2) and a quarter of the samples (CTRE sample4) to compute the cosine similarity matrix. The results can be observed in Table 7. Even with half of the samples, CTRE still achieves performance close to the original method on all datasets. Based on these observations, it can be concluded that only a fraction of the samples is needed to compute the similarity matrix; in this way, we can decrease the computational cost without affecting the performance.
Further studies can be performed to decrease the computational cost of deriving the similarity matrix. In short, CTRE is a first step toward finding highly sparse neural networks using neuron characteristics, and it can be further explored in future work.

Table 7: Classification accuracy (%) of CTRE sim with different numbers of training samples for computing the similarity matrices.

Table 8: Classification accuracy (%) comparison using the purely sparse implementation.

Table 9: Classification accuracy (%) comparison between cosine and Euclidean-based similarity metrics in the CTRE algorithm.

As shown in Table 9, CTRE sim outperforms CTRE sim-euclidean in most cases considered. In particular, in the high sparsity regime (ε = 1), there is a considerable gap between the performance of these two methods. It can be concluded that the Euclidean-based similarity metric is not very informative for evolving the topology of sparse neural networks, which might stem from the sensitivity of this metric to the vectors' magnitude. The cosine similarity metric, a magnitude-insensitive metric that also presents a biologically plausible approach for measuring the importance of connections, is a good choice for obtaining sparse neural networks in the CTRE algorithm.

E Ablation Study: Cosine and Random-based Weight Addition Order in CTRE seq
In this section, we perform an ablation study on the CTRE seq algorithm to measure the effect of the order of the cosine-based and random search phases on the performance of the algorithm. To this end, we measure the performance of two variants of CTRE seq (Algorithm 1) that differ from CTRE seq in the order of cosine-based and random weight addition:
• CTRE abl1:seq starts training with random weight addition and switches to cosine-based addition when there is no improvement in the validation accuracy for e early-stop epochs.
• CTRE abl2:seq splits the training process into two equal phases in terms of the number of epochs. In the first phase, it adds weights randomly, and in the second phase, it uses the cosine similarity information for weight addition.
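The early-stop switch used by the sequential variants can be sketched as a patience counter on the validation accuracy (a hedged toy sketch with our own names; the actual algorithm's bookkeeping may differ): once the accuracy has not improved for `patience` consecutive epochs, the weight-addition policy changes.

```python
def switch_epoch(val_accuracies, patience):
    """Return the first epoch index after `patience` epochs without improvement."""
    best, stale = float("-inf"), 0
    for i, acc in enumerate(val_accuracies):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return i  # switch the addition policy here
    return None  # never switched within this run

curve = [0.50, 0.62, 0.70, 0.70, 0.69, 0.70]
print(switch_epoch(curve, patience=3))  # 5: three epochs without improvement
```

In CTRE seq the switch goes from cosine-based to random addition; CTRE abl1:seq applies the same trigger in the reverse order, and CTRE abl2:seq replaces the trigger with a fixed halfway split.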
We compare the performance of these two algorithms with CTRE seq in Table 10. CTRE seq outperforms CTRE abl1:seq and CTRE abl2:seq in the majority of the experiments. In the high sparsity regime with a large network size (ε = 1, n l = 1000), there is a considerable gap between their performances. The reason for this difference is that by using cosine similarity-based weight addition at the beginning, the algorithm finds a well-performing sub-network very quickly; then, during the rest of training, it improves this topology using cosine information and random search. By starting with random weight addition instead, it might take longer for the algorithm to reach a reasonable level of performance, which results in lower accuracy than CTRE seq at the end of training.
To summarize, while starting with cosine similarity-based weight addition and then switching to random search might seem counter-intuitive, we show that this strategy is beneficial for the CTRE algorithm.

Table 10: Classification accuracy (%) comparison among variants of the CTRE seq algorithm.