1 Introduction

Dense artificial neural networks are a widely used machine-learning technique with a broad range of application domains, such as speech recognition (Graves et al., 2013), image processing (Liang & Hu, 2015; Masi et al., 2018), and natural language processing (NLP) (Brown et al., 2020). It has been shown in Hestness et al. (2017) that the performance of deep neural networks scales with model size and dataset size, and generalization benefits from over-parameterization (Neyshabur et al., 2019). However, the ever-increasing size of deep neural networks has given rise to major challenges, including high computational costs during both training and inference and high memory requirements (Zhang et al., 2020). Such an increase in the number of computations can lead to a critical rise in the energy consumption of data centers and, consequently, a detrimental effect on the environment (Yang et al., 2018). At the same time, a trustworthy AI system should function in the most environmentally friendly way possible during both development and deployment (Group, 2020). In addition, such enormous computational costs mean that on-device training and inference of neural network models on low-resource devices, e.g., an edge device with limited computational resources and battery life, might not be economically viable (Zhang et al., 2020).

Sparse neural networks have been considered as an effective solution to address these challenges (Hoefler et al., 2021; Mocanu et al., 2021). By using sparsely connected layers instead of fully-connected ones, sparse neural networks have reached performance competitive with their dense counterparts in various applications (Frankle & Carbin, 2018; Atashgahi et al., 2022), while having far fewer parameters. It has been shown that biological brains, especially the human brain, enjoy sparse connections among neurons (Friston, 2008). Most existing solutions for obtaining sparse neural networks focus on inference efficiency, in order to reduce the storage required to deploy the network and the prediction time on test instances. This class of methods, named dense-to-sparse training, starts by training a dense neural network, followed by a pruning phase that aims to remove unimportant weights from the network. As categorized in Mocanu et al. (2021), in dense-to-sparse training, the pruning phase can be done after training (Frankle & Carbin, 2018; Han et al., 2015; LeCun et al., 1990), simultaneously with training (Louizos et al., 2018), or one-shot prior to training (Lee et al., 2019). However, starting from a dense network requires enough memory to fit the dense network on the device, as well as the computational resources for at least a few iterations of training the dense model. Therefore, training sparse neural networks using dense-to-sparse methods might be infeasible on low-resource devices due to energy and computational resource constraints.

Fig. 1

Schematic of the proposed approach (CTREsim). At each epoch, after feed-forward and back-propagation, a fraction \(\zeta \) of the weights with the smallest magnitude is dropped (red connections). Then, similarity matrices \({{\varvec{Sim}}}^{1}\) and \({{\varvec{Sim}}}^{2}\) are computed using Eq. 2 to find the most important connections to add to the network; however, we do not consider the similarity of the existing connections (empty entries). Finally, the weights corresponding to the highest similarity values in the similarity matrices (underlined values) that have not been dropped in the weight removal step are added to the network (underlined green values), the same amount as removed previously. If a connection with high similarity has been dropped in the weight removal step (underlined red value), a random connection will be inserted instead (pink connection)

With the emergence of the sparse training concept in Mocanu et al. (2016), there has been a growing interest in training neural networks that are sparse from scratch. This sparse connectivity might be fixed during training (known as static sparse connectivity (Kepner & Robinett, 2019; Mocanu et al., 2016, 2021)), or might change dynamically by removing and re-adding weights (known as dynamic sparse connectivity (Mocanu et al., 2018; Bellec et al., 2018)). By optimizing the topology along with the weights during training, dynamic sparse training algorithms outperform the static ones (Mocanu et al., 2018). As discussed in Mocanu et al. (2018), weight removal in dynamic sparse training algorithms is similar to synapse shrinkage in the human brain during sleep, where weak synapses shrink and strong ones remain unchanged. While most dynamic sparse training methods use magnitude as the pruning criterion, weight regrowth approaches are of different types, including random (Mocanu et al., 2018; Mostafa & Wang, 2019) and gradient-based regrowth (Evci et al., 2020; Jayakumar et al., 2020). As shown in Liu et al. (2021c), random addition of weights might lead to slow training, and the performance of sparse training is highly correlated with the total number of parameters explored during training. To speed up convergence, gradient information of non-existing connections can be used to add the most important connections to the network (Dettmers & Zettlemoyer, 2019). However, computing the gradient of all non-existing connections in a sparse neural network can be computationally demanding. Furthermore, increasing the network size might turn this computational cost into a bottleneck for sparse training on low-resource devices. Besides, in Sect. 4.2, we demonstrate that some gradient-based sparse training algorithms might fail in highly sparse neural networks.

In this paper, to address some of these challenges, we introduce a more biologically plausible algorithm for obtaining a sparse neural network. By taking inspiration from the Hebbian learning theory, which states “neurons that fire together, wire together” (Hebb, 2005), we introduce a new weight addition policy in the context of sparse training algorithms. Our proposed method, “Cosine similarity-based and Random Topology Exploration (CTRE)”, exploits both the similarity of neurons as an importance measure of the connections and random search simultaneously (CTREsim, Fig. 1) or sequentially (CTREseq) to find a performant sub-network. In short, our contributions are as follows:

  • We propose a novel and biologically plausible algorithm for training sparse neural networks, which has a limited number of parameters during training. Our proposed algorithm, CTRE, exploits both similarity of neurons and random search to find a performant sparse topology.

  • We introduce the Hebbian learning theory into the training of sparse neural networks. Using the cosine similarity of each pair of neurons in two consecutive layers, we determine the most important connections at each epoch during sparse training of the network; we discuss in detail why this approach is an extension of the Hebbian learning theory in Sect. 3.2.

  • Our proposed algorithms outperform state-of-the-art sparse training algorithms in highly sparse neural networks.

While deep learning models have shown great success in vision and NLP tasks, they have not been fully explored in the domain of tabular data (Popov et al., 2019). However, designing deep models capable of processing tabular data is of great interest to researchers, as it paves the way to building multi-modal pipelines (Gorishniy et al., 2021). This paper mainly focuses on Multi-Layer Perceptrons (MLPs), which are commonly used for tabular and biological data. Despite the simple structure of MLPs and having only a few hyperparameters to tune, they have shown good performance in classification tasks (Galke & Scherp, 2021; Tolstikhin et al., 2021). In addition, in Jouppi et al. (2017), the authors report that despite the massive attention on CNN architectures, CNNs account for only \(5\%\) of the neural network workload on TPUs in Google data centers, while MLPs constitute \(61\%\) of the total workload. Therefore, it is crucial to develop an efficient algorithm that can accelerate MLPs and is resource-efficient during training and inference. To pursue this goal, in this research, we aim to design sparse MLPs with a limited number of parameters during training and inference. To demonstrate the validity of our proposed algorithm, in addition to evaluating the methods on tabular and text datasets, we also compare the methods on image datasets such as MNIST, Fashion-MNIST, and CIFAR10/100, which are commonly used as benchmarks in previous studies.

2 Background

2.1 Sparse neural networks

Methods to obtain and train sparse neural networks can be stratified into two major categories: dense-to-sparse and sparse-to-sparse. In the following, we shed light on each of these two approaches.

Dense-to-sparse Dense-to-sparse methods for obtaining sparse neural networks start training from a dense model and then prune the unimportant connections. They can be divided into three major subcategories: (1) Pruning after training: Most existing dense-to-sparse methods start with a trained dense network and iteratively (one or several iterations) prune and retrain the network to reach the desired sparsity level. Seminal works were performed in the 1990s (LeCun et al., 1990; Hassibi & Stork, 1993), where the authors use Hessian information to prune a trained dense network. More recently, in Han et al. (2015) and Frankle and Carbin (2018), the authors use weight magnitude to remove unimportant connections. Other metrics, such as gradient (Liu & Wu, 2019), Taylor expansion (Molchanov et al., 2016, 2019), and low-rank decomposition (Li et al., 2020; Wang et al., 2019a), have also been employed to prune the network. While effective in terms of the performance of the obtained sparse network, these methods suffer from high computational costs during training. (2) Pruning during training: To decrease the computational cost, this group of methods performs pruning during training (Gale et al., 2019; Junjie et al., 2019; Kusupati et al., 2020). Various criteria can be used for pruning, such as magnitude (Guo et al., 2016; Zhu & Gupta, 2017), L\(_0\) regularization (Louizos et al., 2018; Savarese et al., 2020), group Lasso regularization (Wen et al., 2016), and variational dropout (Molchanov et al., 2017). (3) Pruning before training: The first study to apply pruning prior to training was done by Lee et al. (2019), who used connection sensitivity to remove weights. Later works have followed the same approach by pruning the network before training using different criteria, such as the gradient norm after pruning (Wang et al., 2019b), connection sensitivity after pruning (de Jorge et al., 2020), and Synaptic Flow (Tanaka et al., 2020).

Sparse-to-sparse To lower the computational cost of dense-to-sparse methods, sparse-to-sparse training algorithms (also known as sparse training) start from a sparse network whose connectivity might be static (static sparse training (Kepner & Robinett, 2019; Mocanu et al., 2016)) or dynamic (dynamic sparse training (DST) (Bellec et al., 2018; Mocanu et al., 2018)). By allowing the topology to be optimized along with the weights, sparse neural networks trained with DST have reached performance comparable to that of equivalent dense networks, or have even outperformed them.

DST methods can be divided into two main categories based on the weight addition policy: (1) Random regrowth: Sparse Evolutionary Training (SET) (Mocanu et al., 2018) is one of the earliest works that starts with a sparse neural network and performs magnitude pruning and random weight regrowth at each epoch to update the topology. In Mostafa and Wang (2019), the authors proposed automatic parameter reallocation across layers during sparse training of CNNs. Many works have further studied the sparse training concept recently (Atashgahi et al., 2022; Gordon et al., 2018; Liu et al., 2020, 2021a, b, c). (2) Gradient information: A group of works has tried to exploit gradient information to speed up the training process in DST (Raihan & Aamodt, 2020). Dettmers and Zettlemoyer (2019) used the momentum of the non-existing connections as the criterion to grow weights instead of the random addition of the SET algorithm; while effective in terms of accuracy, this method requires computing gradients and updating the momentum for all non-existing parameters. The Rigged Lottery (RigL) (Evci et al., 2020) addressed this high computational cost by using infrequent gradient information; however, it still incurs the cost of periodically computing dense gradients. Jayakumar et al. (2020) tried to further improve RigL by using the gradient of only a subset of non-existing weights. In Dai et al. (2019), the authors exploit gradient information in the search for a performant sub-network and argue that gradient-based weight addition is biologically plausible.

2.2 Hebbian learning theory

The Hebbian learning rule was proposed in 1949 by Hebb as a learning rule for neurons (Hebb, 2005), inspired by biological systems. It describes how the neurons' activations influence the connections among them. The classical Hebb's rule states that “neurons that fire together, wire together”. This can be formulated as \(\Delta w_{ij}=\eta p_iq_j\), where \(\Delta w_{ij}\) is the change in the synaptic weight \(w_{ij}\) between a presynaptic neuron with activation \(p_i\) and a postsynaptic neuron with activation \(q_j\) in two consecutive layers, and \(\eta\) is the learning rate. While some previous works have adapted Hebb's rule to machine learning tasks (Liu et al., 2017; Scellier & Bengio, 2016), it has not been widely investigated in many others, particularly in sparse neural networks. By adapting Hebb's rule to artificial neural networks, we can obtain powerful models that might be close to the function of structures found in the neural systems of various species (Kuriscak et al., 2015). In Arora et al. (2014), the authors incorporated the Hebbian learning theory to train a newly introduced neural network. In Sun et al. (2016), the Hebbian learning concept was used to sparsify neural networks for face recognition; they drop the connections between weakly correlated neurons. In Dai et al. (2019), the authors proposed a gradient-based algorithm for obtaining a sparse neural network; they argue that the gradient-based connection growth policy is mathematically close to the Hebbian learning theory. In this work, by taking inspiration from the Hebbian learning theory, we introduce a new sparse training algorithm for obtaining sparse neural networks.
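To make the update rule concrete, the following is a minimal sketch of a classical Hebbian weight update for a single pair of layers; the variable names and toy activation values are purely illustrative and not taken from any cited work.

```python
import numpy as np

def hebbian_update(W, p, q, eta=0.01):
    """Classical Hebb's rule: Delta w_ij = eta * p_i * q_j, i.e., the weight
    between presynaptic neuron i and postsynaptic neuron j grows with their
    coincident activations."""
    return W + eta * np.outer(p, q)

# Toy usage: 3 presynaptic and 2 postsynaptic neurons.
W = np.zeros((3, 2))
p = np.array([0.9, 0.1, 0.5])   # presynaptic activations
q = np.array([0.8, 0.2])        # postsynaptic activations
W = hebbian_update(W, p, q)     # strongly co-active pairs get the largest update
```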

2.3 Cosine similarity

In most machine learning problems, the Euclidean distance is a common tool to measure distance due to its simplicity. However, the Euclidean distance is highly sensitive to the vectors' magnitude (Xia et al., 2015). Cosine similarity is another metric that addresses this issue; it measures the similarity of the shapes of two vectors as the cosine of the angle between them. In other words, it determines whether the two vectors are pointing in the same direction or not (Han et al., 2012). Due to its simplicity and efficiency, cosine similarity is a widely used metric in the machine learning and pattern recognition fields (Xia et al., 2015). It is often used to measure document similarity in natural language processing tasks (Li & Han, 2013; Sidorov et al., 2014). Cosine similarity has also proven to be an effective tool in neural networks. In Luo et al. (2018), to bound the pre-activations in a multi-layer neural network, which might otherwise harm generalization, the authors proposed using cosine similarity instead of the dot product and showed that it reaches better performance than the simple dot product. In Nguyen and Bai (2010), the authors used this metric to improve face verification using deep learning.
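As a small illustration of this property, the snippet below (with arbitrary toy vectors) shows that scaling a vector changes its Euclidean distance to another vector but leaves the cosine similarity untouched.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                       # same direction, ten times the magnitude

print(np.linalg.norm(a - b))       # Euclidean distance is large (~33.7)
print(cosine_similarity(a, b))     # cosine similarity is 1.0: identical direction
```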

3 Proposed method

In this section, we first formulate the problem. Second, we demonstrate how cosine similarity can be used to determine the importance of weights in neural networks and how it relates to the Hebbian learning theory. Finally, we present two new sparse training algorithms that use cosine similarity-based connection importance.

3.1 Problem definition

Given a set of training samples \({\mathbb {X}}\) and target output \({{\varvec{y}}}\), a dense neural network is trained to minimize \(J({{\varvec{\theta }}}) = \frac{1}{m} \sum _{i=1}^{m} L( f({{\varvec{x}}}^{(i)} ; {{\varvec{\theta }}}), {{\varvec{y}}}^{(i)}),\) where m is the number of training samples, L is the loss function, f is a neural network parametrized by \({{\varvec{\theta }}}\), \(f({{\varvec{x}}}^{(i)} ; {{\varvec{\theta }}})\) is the predicted output for input \({{\varvec{x}}}^{(i)}\), and \({{\varvec{y}}}^{(i)}\) is the true label. \({{\varvec{\theta }}}\in \mathbb {R}^{N}\) consists of the parameters of each layer \(l \in \{1,2, ..., H\}\) of the network, \({{\varvec{\theta }}}^l \in \mathbb {R}^{N^l}\), where \(N^l = n^{l-1}\times n^l\) is the number of parameters of layer l, \(n^l\) is the number of neurons at layer l, and the total number of parameters of the dense network is N. A sparse neural network, however, uses only a subset of \({{\varvec{\theta }}}^l\) and discards a fraction \(s^{l}\) of the parameters of each layer \({{\varvec{\theta }}}^l\) (their weight values are set to zero); \(s^{l}\) is referred to as the sparsity of layer l. The overall sparsity of the network is \(S = 1- D\), where \(D = \frac{\sum _{l = 1}^H{(1-s^l)N^l}}{N}\) is the overall density of the network. We aim to obtain a sparse neural network with sparsity level S and parameters \({{\varvec{\theta }}}\), trained to minimize the loss on the training set as follows:

$$\begin{aligned} \mathbb {{{\varvec{\theta }}}}^{*} = \mathop {\mathrm {arg\,min}}\limits _{ {{\varvec{\theta }}}\in \mathbb {R}^{N},\; \left| \left| {{\varvec{\theta }}}\right| \right| _0=D\times N} \frac{1}{m} \sum _{i=1}^{m} L( f({{\varvec{x}}}^{(i)} ; {{\varvec{\theta }}}), {{\varvec{y}}}^{(i)}), \end{aligned}$$
(1)

where \(\left| \left| {{\varvec{\theta }}}\right| \right| _0\) is the total number of non-zero connections of the network, which is determined by the density level.

Network structure The architecture we consider is a Multi-layer Perceptron (MLP) with H layers. Initially, the sparse connections between two consecutive layers are initialized with an Erdös-Rényi random graph; each connection in this graph exists with probability \(P( \theta ^{l}_{i}) = \frac{\varepsilon ( n^{l-1}+ n^l)}{ n^{l-1} n^l},\; i \in \{1, 2, ..., N^l\},\) where \(\varepsilon \in {\mathbb {R}}^+\) is the hyperparameter that controls the sparsity level. The lower the value of \(\varepsilon\), the sparser the network. In other words, increasing \(\varepsilon\) raises \(P( \theta ^{l}_{i})\), which results in more connections and a denser network. Each existing connection is initialized with a small value drawn from a normal distribution.
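A minimal sketch of this Erdös-Rényi initialization for one layer is given below; the layer sizes, random seed, and the standard deviation of the weight initialization are illustrative assumptions.

```python
import numpy as np

def init_sparse_layer(n_prev, n_curr, epsilon, init_std=0.05, seed=0):
    """Erdos-Renyi sparse initialization: each of the n_prev * n_curr possible
    connections exists with probability epsilon * (n_prev + n_curr) / (n_prev * n_curr);
    existing connections get a small value drawn from a normal distribution."""
    rng = np.random.default_rng(seed)
    p = epsilon * (n_prev + n_curr) / (n_prev * n_curr)
    mask = rng.random((n_prev, n_curr)) < p                 # boolean connectivity mask
    weights = rng.normal(0.0, init_std, (n_prev, n_curr)) * mask
    return mask, weights

mask, W = init_sparse_layer(n_prev=784, n_curr=1000, epsilon=1)
layer_density = mask.mean()    # fraction of existing connections, i.e., 1 - s^l
```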

3.2 Cosine similarity to determine connections importance

In this paper, we use cosine similarity as a metric to derive the importance of non-existing connections and evolve the topology of a sparse neural network. We first demonstrate how we measure the cosine similarity of two neurons. Then, we argue why this choice has been made and how it relates to the Hebbian learning theory. We measure the similarity of two neurons p and q as:

$$\begin{aligned} {Sim}_{p,q}^{l} = \left| \frac{ {{\varvec{A}}}_{:, p}^{l-1} \cdot {{\varvec{A}}}_{:, q}^{l}}{\left| \left| {{\varvec{A}}}_{:, p}^{l-1}\right| \right| \left| \left| {{\varvec{A}}}_{:, q}^{l}\right| \right| }\right| , \end{aligned}$$
(2)

where \({{\varvec{Sim}}}^{l}\) is the similarity matrix between neurons in two successive layers \(l-1\) and l. \({{\varvec{A}}}_{:, p}^{l-1}\) and \({{\varvec{A}}}_{:, q}^{l} \in \mathbb {R}^{m}\) are the activation vectors corresponding to neurons p and q in layers \(l-1\) and l, respectively. If \({Sim}_{p,q}^{l}\) is high for two unconnected neurons (close to 1), their activations are highly similar; therefore, we prefer to add a connection between them, as it suggests that this path contains important information about the data. However, if \({Sim}_{p,q}^{l}\) is low (close to 0), the activations of neurons p and q are not similar, and a connection between them might not be beneficial for the network.
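A vectorized sketch of Eq. 2 over a mini-batch of activations could look as follows; the small epsilon guarding against division by zero is an implementation assumption, not part of the equation.

```python
import numpy as np

def similarity_matrix(A_prev, A_curr, eps=1e-8):
    """Eq. 2: |cosine similarity| between every neuron of layer l-1 and every
    neuron of layer l. A_prev has shape (m, n_prev) and A_curr shape (m, n_curr),
    where m is the number of samples; columns are per-neuron activation vectors."""
    dots = A_prev.T @ A_curr                               # pairwise dot products
    norms = np.outer(np.linalg.norm(A_prev, axis=0),
                     np.linalg.norm(A_curr, axis=0))       # products of activation norms
    return np.abs(dots / (norms + eps))                    # shape (n_prev, n_curr)
```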

We now argue why cosine similarity can be used to measure the importance of a non-existing connection in sparse neural networks and how it connects to the Hebbian learning theory. Essentially, by taking inspiration from the Hebbian learning theory, we aim to rewire the neurons that fire together in the context of sparse training algorithms, instead of only strengthening the existing connections among neurons that fire together (Schumacher, 2021). It has been discussed in Schumacher (2021) that connecting a pair of neurons with strongly coincident activations can be viewed as a natural extension of Hebbian learning; wiring the neurons that usually fire together is necessary to better understand the relationship among the higher-order representations of those neurons. If a causal connection between their higher-order representations does exist, growing a connection between them will enable effective inference about their relationship. Therefore, we need to discover which pairs of neurons usually fire together and then rewire them.

We employ cosine similarity to measure the relation between the activation values of two neurons. As in Hebb's rule (Sect. 2.2), the importance of a connection in our method is also determined by multiplying the activations of its corresponding neurons, albeit normalized; in Eq. 2, \({{\varvec{A}}}_{:, p}^{l-1}\) is the presynaptic activation and \({{\varvec{A}}}_{:, q}^{l}\) is the postsynaptic activation. If the activations of two neurons agree, both Eq. 2 and Hebb's rule, by computing the dot product of the activations, assign higher importance to the corresponding connection. This results in an increased weight and a better chance of adding this connection. Thus, both methods reward connections between neurons that exhibit similar behavior. As mentioned earlier, the main difference between Hebb's rule and Eq. 2 is the normalization. We discuss in Sect. 5.3 why the normalization step is necessary for evolving the topology of a sparse neural network.

In summary, a high cosine similarity between the activation vectors of two neurons indicates that a connection between them is important for the network's performance. Therefore, we use the cosine similarity information to decide whether the link between a pair of neurons should be rewired. Based on this knowledge, we propose two new algorithms for evolving sparse neural networks in the following sections.

3.3 Sequential cosine similarity-based and random topology exploration (CTREseq)

Our first proposed algorithm, Sequential Cosine Similarity-based and Random Topology Exploration (CTREseq), evolves the network topology using both the cosine similarity between neurons of each pair of consecutive layers and random search. In the beginning, at each training epoch, it removes unimportant connections based on their magnitude and adds new connections based on the cosine similarity of the corresponding neurons. When the network performance stops improving, the algorithm switches to random topology search. In the following, we explain the algorithm in more detail.

After initializing the sparse network with the sparsity level determined by \(\varepsilon\), the training begins. The training procedure consists of two consecutive phases: 1. Cosine Similarity-based Exploration: The training starts with this phase, in which each epoch includes three steps: (a) Firstly, a standard feed-forward and back-propagation pass is performed. (b) Then, a proportion \(\displaystyle \zeta \) of connections with the lowest magnitude in each layer is removed. In Sect. 5.2, we further discuss why this choice has been made. (c) Subsequently, we add new connections to the network based on the neurons' similarity. Taking advantage of the cosine similarity metric, we measure the similarity of two neurons as formulated in Eq. 2. In each layer, we add as many connections as were removed in that layer, choosing the ones with the highest similarity between the corresponding neurons; the new connections are initialized with a small value from a uniform distribution. 2. Random Exploration: The second phase begins when the performance of the network on a validation set does not improve for \(e_{early\;stop}\) epochs (\(e_{early\;stop}\) is a hyperparameter of CTREseq). This is due to the fact that the activation values might not change significantly after some epochs and, consequently, neither does the similarity of neurons. As a result, the topology search using cosine similarity might stall as well. To prevent this, we begin a random search when the classification accuracy on the validation set stops increasing. This phase is similar to phase 1, differing only in the weight regrowth policy: instead of using cosine similarity information, we add connections randomly to the network. In this way, we prevent early stopping of the topology search. Algorithm 1 summarizes this method; a simplified sketch of the per-layer topology update is shown below.
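The following is a simplified per-layer sketch of one CTREseq topology update (steps b and c); the uniform initialization bound and the flattened-index bookkeeping are implementation assumptions rather than the authors' exact code.

```python
import numpy as np

def ctre_seq_update(W, mask, sim, zeta=0.2, random_phase=False,
                    init_bound=0.01, rng=None):
    """One CTRE_seq topology update for a single layer: (b) drop the zeta fraction
    of existing connections with the smallest magnitude, then (c) regrow the same
    number, either by highest cosine similarity (phase 1) or at random (phase 2)."""
    rng = rng or np.random.default_rng()
    existing = np.flatnonzero(mask)
    n_update = int(zeta * existing.size)

    # (b) magnitude-based removal
    weakest = existing[np.argsort(np.abs(W.flat[existing]))[:n_update]]
    mask.flat[weakest] = False
    W.flat[weakest] = 0.0

    # (c) regrowth among currently non-existing connections
    candidates = np.flatnonzero(~mask)
    if random_phase:
        grown = rng.choice(candidates, size=n_update, replace=False)
    else:
        grown = candidates[np.argsort(sim.flat[candidates])[-n_update:]]
    mask.flat[grown] = True
    W.flat[grown] = rng.uniform(-init_bound, init_bound, size=n_update)
    return W, mask
```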

Algorithm 1: CTREseq
Algorithm 2: CTREsim

3.4 Simultaneous cosine similarity-based and random topology exploration (CTREsim)

To constantly exploit the cosine similarity information during training and avoid early stopping of topology exploration, we propose another method for obtaining a sparse neural network, named Simultaneous Cosine Similarity-based and Random Topology Exploration (CTREsim).

Prior to the training, we initialize a sparse neural network. After that, the training procedure starts with three steps in each epoch. The first two steps are the same as in CTREseq: (a) standard feed-forward and back-propagation, and (b) magnitude-based weight removal. However, in step (c), instead of relying solely on cosine similarity information or random addition, we combine both strategies. There are two reasons behind this choice: (1) As discussed in Sect. 3.3, as training proceeds, the activation values become stable and might not change significantly after a while and, consequently, neither do the similarity values. In CTREseq, we addressed this issue by switching completely to random search. However, the training speed might slow down if we rely only on random search. (2) If we rely only on cosine similarity information, there is a possibility of re-adding, based on neuron similarity, connections that were just removed based on their magnitude in the weight removal step. In such cases, the path between these pairs of similar neurons does not contribute to the performance of the network, so we should not add such connections back. These are the potential limitations of CTREseq.

To address these limitations, CTREsim takes another approach to avoid re-adding removed connections that have a high cosine similarity, as follows. In step (c), we add the connections with the highest similarities to the network; however, if a connection with high cosine similarity has just been removed based on its magnitude in step (b), we add a random connection instead. In other words, we split our budget between similarity-based and random exploration. More importantly, we let the network dynamically decide how much budget should be allocated to each type of exploration at each epoch. The benefits of this approach are twofold: we prevent early stopping of the topology search, and we avoid re-adding connections that have shown to be unhelpful for the network's performance. Algorithm 2 summarizes this method; a sketch of the regrowth step is shown below.
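A sketch of this regrowth step might look as follows, assuming `dropped_idx` holds the flat indices of the connections just pruned in step (b); the uniform initialization bound is again an assumption.

```python
import numpy as np

def ctre_sim_regrow(mask, sim, dropped_idx, n_grow, init_bound=0.01, rng=None):
    """CTRE_sim step (c): pick the n_grow non-existing connections with the
    highest similarity, but replace any of them that were just pruned by
    magnitude (dropped_idx) with randomly chosen connections instead."""
    rng = rng or np.random.default_rng()
    candidates = np.flatnonzero(~mask)                       # non-existing connections
    top = candidates[np.argsort(sim.flat[candidates])[-n_grow:]]

    just_dropped = np.isin(top, dropped_idx)                 # high similarity but just removed
    keep = top[~just_dropped]                                # grown based on similarity
    pool = np.setdiff1d(candidates, np.concatenate([top, dropped_idx]))
    rand = rng.choice(pool, size=int(just_dropped.sum()), replace=False)

    grown = np.concatenate([keep, rand]).astype(np.intp)
    mask.flat[grown] = True
    return grown, rng.uniform(-init_bound, init_bound, size=grown.size)
```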

4 Experiments and results

In this section, we evaluate our proposed algorithms and compare them with several state-of-the-art algorithms for obtaining a sparse neural network. First, we describe the settings of the conducted experiments, including the hyperparameter values, implementation details, and datasets. Then, we compare them in terms of the classification accuracy on several datasets and networks with different sizes and sparsity levels.

4.1 Settings

This section gives a brief overview of the experiment settings, including hyperparameter values, implementation details, and datasets used for the evaluation of the methods.

4.1.1 Hyperparameters

The network that we use to perform the experiments is a 3-layer MLP as described in Sect. 3.1. The activation functions used for the hidden and output layers are “Relu” and “Softmax”, respectively, and the loss function is “CrossEntropy”. The values for most hyperparameters have been selected using a grid search over a limited number of values. The hyperparameter \(\zeta \) has been set to 0.2. In Algorithm 1, \(e_{early\;stop}\) has been set to 40. We train the network with Stochastic Gradient Descent (SGD) with momentum and an L\(_2\) regularizer. The momentum coefficient, the regularization coefficient, and the learning rate are 0.9, 0.0001, and 0.01, respectively. All the experiments are performed using 500 training epochs. The datasets have been preprocessed using the Min-Max Scaler so that each feature is normalized between 0 and 1, except for Madelon, where we use a standard scaler (each feature has zero mean and unit variance). For the image datasets, data augmentation is not performed unless explicitly stated.
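For reference, a minimal Keras sketch of the dense MLP backbone with these hyperparameters is shown below; one-hot labels are assumed, and the sparse masking and topology updates of Sect. 3 are omitted here.

```python
import tensorflow as tf

def build_mlp(n_inputs, n_hidden, n_classes, l2_coef=1e-4):
    """3-hidden-layer MLP with ReLU/Softmax, trained with SGD + momentum and
    L2 regularization (sparsity is applied via binary masks in the actual code)."""
    reg = tf.keras.regularizers.l2(l2_coef)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(n_hidden, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(n_hidden, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(n_hidden, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(n_classes, activation="softmax", kernel_regularizer=reg),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        loss="categorical_crossentropy",   # one-hot labels assumed
        metrics=["accuracy"],
    )
    return model
```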

4.1.2 Comparison

We compare the results with three state-of-the-art methods for obtaining sparse neural networks: SNIP, RigL, and SET.

  • SNIP Lee et al. (2019). Single-shot network pruning (SNIP) is a dense-to-sparse sparsification algorithm that prunes the network prior to training based on connection sensitivity. It calculates this metric after a few iterations of dense training. After pruning, SNIP trains the resulting sparse neural network.

  • RigL Evci et al. (2020). The Rigged Lottery (RigL) is a sparse-to-sparse algorithm for obtaining a sparse neural network that uses gradient information as the weight addition criterion.

  • SET Mocanu et al. (2018). Sparse evolutionary training (SET) is a sparse-to-sparse training algorithm that uses random weight addition for updating the topology.

Besides, we measure the classification performance of a fully-connected MLP as the baseline method.

4.1.3 Implementation

We evaluate our proposed methods and the considered baselines on eight datasets. We implemented our proposed method using Tensorflow (Abadi et al., 2015). Our implementation is based on the RigL code available on GitHub, which also includes implementations of SNIP, SET, and the fully-connected MLP. This code uses a binary mask over the weights to implement sparsity. In addition, we provide a purely sparse implementation that uses SciPy sparse matrices; it is developed from the sparse implementation of SET, which is also available on GitHub. For all the experiments, we use the Tensorflow implementation to have a fair comparison among methods. However, we provide the results obtained with the sparse implementation in Appendix C. Most experiments were run on a CPU (Dell R730). For image datasets, we used a Tesla-P100 GPU. All the experiments were repeated with three random seeds. The only exception is the experiments from Sect. 4.2, which we run with 15 random seeds to analyze the statistical significance of the obtained results with respect to the considered algorithms (Sect. 4.2.1). To ensure a fair comparison, for the sparse training methods (SET, RigL, and CTRE), the sparsity mask is updated at the end of each epoch, and the drop fraction (\(\zeta \)) and learning rate are kept constant during training.

Table 1 Datasets characteristics

4.1.4 Datasets

We conducted our experiments on eight benchmark datasets as follows:

  • Madelon Guyon et al. (2008) is an artificial dataset with 20 informative features and 480 noise features.

  • Isolet Fanty and Cole (1991) has been created with the spoken name of each letter of the English alphabet.

  • MNIST LeCun (1998) is a database of \(28\times 28\) images of handwritten digits.

  • Fashion_MNIST Xiao et al. (2017) is a database of \(28\times 28\) images of Zalando’s articles.

  • CIFAR10/100 Krizhevsky et al. (2009) are two datasets of 32\(\times\)32 colour images categorized into 10 and 100 classes, respectively.

  • PCMAC & BASEHOCK Lang (1995) are two subsets of the 20 Newsgroups data.

More details about the datasets are presented in Table 1.

Table 2 Classification accuracy (%) comparison among methods on networks with various sizes and sparsity levels

4.2 Performance evaluation

In this experiment, we compare the methods in terms of classification accuracy on networks with varying sizes and sparsity levels. We consider three MLPs, each with three hidden layers, containing 100, 500, and 1000 neurons per layer, respectively. By changing the value of \(\varepsilon\) for each MLP, we study the effect of the sparsity level on the performance of the methods. Table 2 summarizes the results of these experiments, which are carried out on five datasets, including tabular and image datasets with different characteristics. We have also included the density (as a percentage) and the number of connections (divided by \(10^3\)) for each network in this table. For each dataset, we set aside \(10\%\) of the training set as a validation set and train each MLP on the remaining samples. At each epoch, we measure the performance on the validation set. Finally, Table 2 presents the results of each algorithm on an unseen test set, using the model that achieves the highest validation accuracy during training. The learning curves for each case are presented in Appendix A; however, we present some interesting cases in Fig. 2.

First, we analyze the performance of the methods on the two tabular datasets. As can be seen in Table 2, on the Madelon dataset, CTREsim is the best performer in most cases. Interestingly, the accuracy increases as the network becomes sparser. This can be explained intuitively: since the Madelon dataset contains many noise features (\(> 95\%\)), the higher the number of connections, the higher the risk of over-fitting the noise features. CTREsim can find the most important information paths in the network, which most likely start from the input neurons corresponding to the informative features. As a result, it reaches an accuracy of \(78.8\%\) with only \(0.3\%\) of the total connections of the equivalent dense network (\(n^l=1000\)), while the maximum accuracy achieved by the other considered methods is \(61.9\%\) (SET). On the second tabular dataset, Isolet, CTREsim is the best performer on two very sparse models, with \(0.4\%\) \((n^l=500)\) and \(0.3\%\) \((n^l=1000)\) density. In addition, in all the other cases, CTREsim and CTREseq are the second- and third-best performers. In terms of learning speed, we can observe in Fig. 2 that CTREsim finds a good topology much faster than the other methods, which results in an increase in accuracy within a short period after training starts. From Fig. 2, it can also be seen that RigL fails to find an informative sub-network in these cases (\(D<0.3\%\)). This indicates that gradient information might not be informative in highly sparse networks.

Fig. 2

Classification accuracy (%) comparison among methods on a very large and highly sparse 3-layer MLP with a density lower than \(0.3\%\) (\(n^l=1000\), \(\varepsilon =1\))

On the image datasets, CTREsim and CTREseq are the best and second-best performers in most of the cases considered. When the network size is small (\(n^l=100\)), SET is the major competitor of CTRE. However, when the model size increases, CTRE outperforms SET. This indicates that the pure random weight addition policy of SET can perform well in networks with higher density, while it is hard to find such a sub-network randomly in high sparsity scenarios due to the very large search space. RigL also has a performance comparable to SET, except for very sparse models. As discussed in the previous paragraph, on highly sparse networks (\(D<0.3\%\)), RigL performs poorly. Besides, as shown in Fig. 2, SNIP starts with a steep increase in accuracy due to the few iterations of dense training and, thus, a good starting topology. However, as training proceeds, this topology cannot achieve the same performance as the other methods. This indicates that dynamically updating the sparse connectivity is an essential factor in the sparse training of neural networks.

These observations confirm that cosine similarity is an informative criterion for adding weights to the network, compared to random (SET) and gradient-based addition (RigL), in very sparse neural networks. CTRE can reach better performance than state-of-the-art sparse training algorithms in terms of learning speed and accuracy when the network is highly sparse. Besides, by comparing the results with the dense network, it is clear that a comparable performance can be reached even with a network with 100 times fewer connections, which makes it an excellent choice for low-resource edge devices. We further compare the computational cost of the algorithms in Appendix B and their learning speed in Appendix A.

Table 3 Statistical significance of the results

4.2.1 Statistical significance analysis

In this section, we analyze the statistical significance of the results obtained by CTRE compared to the other algorithms. To measure this, we perform the Kolmogorov-Smirnov test (KS-test). The null hypothesis is that the two independent sets of results/samples are drawn from the same continuous distribution. If the p-value is very small (p-value \(< 0.05\)), the difference between the two sets of results is significant and the hypothesis is rejected. Otherwise, the obtained results are close together and the hypothesis cannot be rejected.
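The test itself can be run with SciPy's two-sample KS test; the accuracy values below are placeholders for per-seed results, not numbers from the paper.

```python
from scipy.stats import ks_2samp

# Per-seed test accuracies of two methods (placeholder numbers, 5 of 15 seeds shown).
acc_ctre = [78.8, 78.1, 79.0, 78.5, 78.9]
acc_set  = [61.9, 60.5, 62.3, 61.0, 61.7]

statistic, p_value = ks_2samp(acc_ctre, acc_set)
reject_null = p_value < 0.05   # True -> the two sets of results are significantly distinct
```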

We perform the KS-test between the results obtained by CTRE (for simplicity, we consider the maximum results of \(CTRE_{seq}\) and \(CTRE_{sim}\)) and the other considered algorithms for the experiments in Table 2. The results of the KS-test are summarized in Table 3. In this table, Reject shows that the results are sufficiently distinct, and True means that the obtained results are close together. The * sign in Table 3 shows that an algorithm has achieved the maximum accuracy in the corresponding experiment. Finally, the entries colored red show experiments where a compared method obtains results close to CTRE while having a lower mean accuracy.

From Table 3, we can observe that in the majority of the experiments, CTRE obtains higher mean accuracy than the other methods while being statistically different from them. The only dataset where the results are close in most cases is Fashion-MNIST, on which SET has results comparable to CTRE. In addition, in the high sparsity regime and with a large network size (\(n^l=1000\), \(\varepsilon = 1\)), CTRE achieves the highest accuracy among the methods while being significantly distinct from them. Overall, Table 3 indicates that CTRE is a well-performing algorithm in terms of classification accuracy, achieving results that are significantly different from those of the other methods.

Fig. 3

Sparsity-accuracy trade-off on highly sparse neural networks on three datasets

4.3 Sparsity-performance trade-off analysis in highly sparse MLPs

We carry out another experiment to study the trade-off between sparsity and accuracy in very high sparsity cases. We perform this experiment for two difficult classification tasks: image classification on CIFAR100, which is considered a more difficult dataset than the previously considered image datasets, and text classification on PCMAC and BASEHOCK, two subsets of the 20 Newsgroups dataset with a high number of features and a low number of samples. This experiment uses a 3-layer MLP with 1000 and 3000 hidden neurons for the text datasets and the CIFAR100 dataset, respectively. We vary the density between 0 and 1 and compare our proposed approaches to SNIP, RigL, and SET (due to the close performance of CTREsim and CTREseq on the previously considered image datasets, we perform the CIFAR100 experiments with CTREsim only). We use data augmentation for CIFAR100. Also, as the network is considerably large on this dataset, we set the learning rate to 0.05 to speed up the training. The results are presented in Fig. 3.

As shown in Fig. 3, in highly sparse networks (\(D<0.5\%\)), CTREsim outperforms the other methods by a large gap. As discussed in Sect. 4.2, RigL performs poorly in these scenarios. SNIP outperforms SET and RigL at very low densities while still achieving lower accuracy than CTREsim in all cases. While SET outperforms the other methods at larger density values on CIFAR100 and BASEHOCK, it performs poorly on very sparse networks. On the text datasets, CTREseq has performance comparable to CTREsim and SET at higher densities, and it achieves the highest accuracy on PCMAC. Overall, we can observe that CTREsim has decent performance on these three datasets with a density between \(0.3\%\) and \(0.5\%\).

5 Discussion

In this section, we perform an in-depth analysis to better understand the behavior of CTRE. First, in Sect. 5.1, we perform two ablation studies to examine the effectiveness of both the random topology search and the similarity-based importance metric in the performance of CTRE. In Sect. 5.2, we discuss why we have chosen magnitude over cosine similarity for the weight removal step. In Sect. 5.3, we discuss why the insensitivity of cosine similarity to the vectors' magnitude is important for the performance of CTRE. Finally, we discuss the convergence of CTRE in Sect. 5.4.

5.1 Ablation study: analysis of topology search policies

This section presents and discusses the results of two ablation studies designed to better understand the effect of the different topology search policies in CTRE. In the following, we describe each ablation experiment separately.

5.1.1 Ablation Study 1: random topology search

The first ablation study aims to analyze the effect of random connection addition on the behavior of CTRE. Therefore, instead of using the similarity information and random search (simultaneously in CTREsim and sequentially in CTREseq), we only use the cosine similarity information at each epoch. We call this approach CTRE\(_{w/oRandom}\) and repeat the experiments from Sect. 4.2. The detailed results are available in Table 4.

Table 4 Classification accuracy (%) comparison among Cosine similarity-based methods

As can be seen in Table 4, in most of the cases considered, CTRE\(_{w/oRandom}\) is outperformed by CTREsim and CTREseq. On the other hand, we can observe that on the image datasets, CTRE\(_{w/oRandom}\) has performance comparable to the other two methods; this indicates the effectiveness of similarity information on the image datasets. However, on the tabular datasets, it performs poorly in high sparsity cases (\(\varepsilon = 1\)). Therefore, using only cosine information in these scenarios can cause the topology search to get stuck in a local minimum. This might originate from the activation values ceasing to change significantly, which leads to an early stop in the topology search. CTREseq solves this by changing the weight update policy to random search. However, there is a risk of switching to random search too early, before the cosine information has been fully exploited. Finally, by considering both random and cosine information in each epoch, the CTREsim algorithm minimizes the risk of staying in a local minimum or switching to a completely random search, both of which might slow down the training process. In the context of network topology search, these components can also be characterized as exploitation (local information based on the similarity between neurons) and exploration (random search). As a result, CTREsim can mitigate the limitations of CTREseq and find a performant sub-network by leveraging these two components, thereby outperforming state-of-the-art algorithms.

5.1.2 Ablation Study 2: cosine similarity-based topology search

To study the effectiveness of cosine similarity-based addition in the performance of CTRE, we design an experiment in which we add connections to the network in reverse order of importance. We expect that adding weights in this order would result in poor performance. We perform this experiment on CTREsim. Concretely, at each step, we add the weights with the lowest similarity between the corresponding neurons; if a weight with a very low similarity has been removed in the last weight removal step, we add a random connection instead. We call this method CTREsim/LTH (LTH refers to low-to-high importance).

As can be seen in Table 4, CTREsim/LTH is outperformed by CTREsim and CTREseq in most of the cases considered. This shows that cosine similarity is a useful metric for detecting the most important weights in the network. By comparing CTREsim/LTH with SET (Table 2), it is clear that in most cases CTREsim/LTH has accuracy close to or slightly worse than SET. Therefore, it can be inferred that CTREsim/LTH selects non-informative weights, which can be similar to or worse than a random search. As a result, this indicates the effectiveness of the introduced similarity metric (Eq. 2) in finding a well-performing sparse neural network. It is worth noting that on the Isolet dataset, CTREsim/LTH outperforms CTREsim and CTREseq in some cases, particularly in the networks with higher density. This mirrors the results of SET. Therefore, we can conclude that random search outperforms the other methods on the Isolet dataset at low sparsity levels. However, it is not easy to find a well-performing highly sparse network using the random search policy.

5.2 Analysis of weight removal policy

In this section, we analyze the weight removal policy and further explain the reason behind choosing magnitude-based pruning over cosine similarity (discussed in Sect. 3.2). In many previous studies, magnitude-based pruning has been commonly used as a criterion to remove unimportant weights from a neural network. We design an experiment to compare the performance of magnitude-based and cosine similarity-based pruning in neural networks.

Fig. 4

Effect of weight removal using three criteria, including magnitude, cosine similarity, and random, on the classification accuracy (%) at different epochs. The lines with higher transparency correspond to the weight removal of the SET-MLP and the lines with lower transparency correspond to the dense MLP

In this experiment, we start with a trained network and gradually remove weights based on the magnitude and the cosine similarity value (using Eq. 2) of the corresponding connection. We also consider random pruning as the baseline.

Settings We perform this experiment using two networks: (1) A 3-layer dense MLP with 1000 neurons in each layer, and (2) A 3-layer sparse MLP with 1000 neurons in each layer that is trained using the SET approach (Mocanu et al., 2018) (3.2% density). The choice of SET instead of CTRE was made to avoid any bias in the cosine similarity-based weight removal, as CTRE uses cosine information to add weights. Both networks are trained on the MNIST dataset.

Weight removal We remove weights in two orders on each of the sparse and dense networks: from least to most important and vice versa. We gradually remove weights; at each step, we remove 1% of the connections and measure the accuracy of the pruned network, until no connection remains in the network. A sketch of this sweep is shown below.
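In the sketch below, `evaluate` is an assumed callback that plugs the pruned weights into the trained network and returns its test accuracy, and `scores` stands in for the magnitude, cosine similarity, or random criterion; it is an illustration under those assumptions, not the exact experimental code.

```python
import numpy as np

def pruning_sweep(weights, scores, evaluate, step=0.01, ascending=True):
    """Remove connections in chunks of `step` (1% by default), ordered by `scores`
    (ascending = least important first), recording accuracy after every chunk."""
    order = np.argsort(scores.ravel())
    if not ascending:                                  # most important first
        order = order[::-1]
    w = weights.copy()
    accuracies = []
    chunk = max(1, int(step * order.size))
    for start in range(0, order.size, chunk):
        w.ravel()[order[start:start + chunk]] = 0.0    # prune the next chunk
        accuracies.append(evaluate(w))                 # assumed evaluation callback
    return accuracies
```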

Results The results when the two networks are trained for 10, 30, 50, and 100 epochs are available in Fig. 4. In this figure, the lines with higher transparency correspond to the weight removal of the SET-MLP, and the lines with lower transparency correspond to the dense-MLP. This experiment has been repeated with three seeds for each case.

As shown in Fig. 4, when weights are removed from least to most important, magnitude-based pruning can order weights better than cosine similarity-based pruning. When the networks are trained for 100 epochs, dropping the unimportant weights by magnitude causes the major accuracy drop to start only after removing roughly 70% of the connections, while it happens after removing about 30% for cosine similarity. This behavior exists in both the dense and the sparse networks. As expected, the drop for random removal happens from the beginning of the pruning procedure. At earlier epochs (10, 30, and 50), the drop in accuracy happens earlier for both magnitude and cosine similarity.

It can be seen in Fig. 4 that, when removing weights in the opposite order (from most to least important), the accuracy drop behaves similarly for cosine similarity-based and magnitude-based pruning in the SET-MLP, particularly at earlier epochs. Therefore, both magnitude and cosine similarity can identify the most important connections in good order. However, this behavior is different in the dense network, where magnitude-based pruning can better detect the most important weights: the drop in accuracy for magnitude-based pruning happens earlier than for cosine similarity-based pruning.

Conclusions These observations lead us to two conclusions. First, magnitude is a good metric for weight removal in sparse training. Second, cosine similarity is a good metric for adding the most important connections during the weight addition phase in sparse neural networks, in the absence of magnitude information. As discussed earlier, the cosine similarity of each connection is an informative criterion for detecting the most important weights in a sparse neural network and behaves similarly to magnitude-based pruning in these scenarios. Therefore, in the absence of magnitude for non-existing connections in a sparse neural network (during weight addition), cosine similarity can be a useful criterion to detect the most important weights without requiring the computation of dense gradient information.

5.3 Magnitude insensitivity: the favorable feature of cosine similarity in noisy environments

This section further discusses why cosine similarity has been chosen as the metric to determine the importance of non-existing connections. Specifically, we focus on analyzing the importance of the normalization in Eq. 2 for the performance of the algorithm. While, based on the Hebbian learning rule, the connection between a pair of neurons with high activations should be strengthened, we argue that in the search for a performant sparse neural network, the magnitude of the activations should be ignored.

Based on Hebb's rule (Sect. 2.2), the connection between neurons with high activations receives larger synaptic updates. Therefore, if we evolve the topology using this rule (without any normalization), the importance of a non-existing connection should be determined by \(\left| { {{\varvec{A}}}_{:, p}^{l-1} \cdot {{\varvec{A}}}_{:, q}^{l}}\right|\). We evaluate the performance of this metric by using it in place of Eq. 2 in CTREsim and CTREseq; we name these algorithms CTREsim-Hebb and CTREseq-Hebb, respectively.
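The difference between the two importance measures can be seen in the toy example below: a weakly related neuron with very large activations dominates the unnormalized Hebbian score, while the cosine-based score (Eq. 2) still prefers the small but perfectly aligned neuron. The activation vectors are arbitrary illustrations.

```python
import numpy as np

def hebb_score(p, q):        # unnormalized importance |A_p . A_q| (CTRE-Hebb)
    return abs(np.dot(p, q))

def cos_score(p, q):         # normalized importance, Eq. 2
    return abs(np.dot(p, q)) / (np.linalg.norm(p) * np.linalg.norm(q))

q = np.array([2.0, 4.0, 6.0])                # postsynaptic activations over 3 samples
p_informative = np.array([1.0, 2.0, 3.0])    # small but perfectly aligned with q
p_noisy = np.array([100.0, 0.0, 0.0])        # huge activations, weakly related to q

print(hebb_score(p_informative, q), hebb_score(p_noisy, q))   # 28.0 vs 200.0
print(cos_score(p_informative, q), cos_score(p_noisy, q))     # 1.0  vs ~0.27
```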

We evaluate these methods on the Madelon dataset. We choose this dataset because of its interesting properties: it contains 480 noisy features (out of 500 features). Therefore, finding informative paths through the network is a challenging task. The settings of this experiment are similar to Sect. 4.2; we measure the performance on networks with different sizes and sparsity levels. The results are presented in Table 5, and the accuracy during training is plotted in Fig. 5. CTREsim-Hebb and CTREseq-Hebb are outperformed by CTREsim and CTREseq in all cases considered. In particular, we can observe that as the network becomes sparser, the gap between the performance of the pure Hebbian-based methods and the cosine similarity-based methods increases.

Table 5 Classification accuracy (%) comparison of Cosine similarity-based methods and pure Hebbian-based evolution, on the Madelon dataset
Fig. 5

Classification accuracy (%) comparison on Madelon for CTRE and pure Hebbian-based updates

The poor performance of CTREsim-Hebb and CTREseq-Hebb on the Madelon dataset results from their sensitivity to the magnitude of the activation values. As Madelon contains many noisy features, some uninformative neurons likely receive high activation values. Therefore, if we use only the activation magnitude to find the informative paths, the algorithm will be biased toward neurons with very high activations, which might not be informative, and is likely to assign new connections to noisy features with high activations. This would cause the algorithm to get stuck in a local minimum that might be difficult to escape, as these neurons continue to receive more and more connections at each epoch. Furthermore, as the networks become sparser, the informative features have a lower chance of receiving more connections (there are many more noisy features than informative ones). Therefore, in sparse networks, the gap between the performance of these methods is much larger than in denser networks. Based on these observations, it can be concluded that the insensitivity of cosine similarity to the vectors' magnitude helps CTRE to be more robust in noisy environments.

5.4 Convergence analysis

This section discusses the convergence of the proposed algorithm for training sparse neural networks from scratch, CTRE. In short, we first discuss the effect of the weight evolution process on the algorithm's convergence. Second, we explore whether cosine similarity causes CTRE to converge to a local minimum.

First, we analyze whether the weight evolution process in the CTRE algorithm interferes with the convergence of the back-propagation algorithm. In the CTRE algorithm, a fraction of the connections is removed at each training epoch, and the same number of connections is added based on the cosine similarity or random search policies. The weight evolution process is performed at each epoch after the standard feed-forward and back-propagation steps. The removed connections have a small magnitude compared to the other connections, and the newly activated connections also get small values; therefore, they do not change the loss value significantly. The new weights will be updated in the next feed-forward and back-propagation step, where they will grow or shrink. Therefore, the weight evolution process does not disrupt the convergence of the model.

To validate this, we depict the test loss during training in Fig. 6 for the high sparsity regime and a large network (\(\varepsilon =1\), \(n^l=1000\)). It can be observed that the loss function converges for the CTRE algorithm on all the datasets. In addition, in most cases, its convergence speed is much faster than for the other algorithms.

Fig. 6

Test loss comparison during training for the high sparsity regime and a large network (\(\varepsilon =1\), \(n^l=1000\))

Secondly, we analyze whether CTRE is prone to converging to a local optimum. As discussed in Sect. 5.2, cosine similarity is very successful at determining the most and least important connections in the network. However, in the mid-importance range, it might not be able to rank connections as well as the magnitude criterion; therefore, it might add some connections that do not contribute to decreasing the network loss. In such cases, the cosine similarity metric might hinder topology exploration and get stuck in a local minimum. CTRE explores other weights and escapes such local minima by using random search. To validate this, in Fig. 7, we present the loss during training for CTREseq, CTREsim, and CTRE\(_{w/oRandom}\) on three highly sparse neural networks trained on the Isolet dataset. The fast decrease in the loss in these plots indicates that all three methods quickly find a well-performing sub-network. However, the loss value of CTRE\(_{w/oRandom}\) does not improve significantly after 200 epochs, and it converges to a higher value than the other two methods. Therefore, it is important to use random exploration to keep improving the topology and avoid local minima, as is done in CTRE.

Fig. 7

Test loss comparison during training for the high sparsity regime (\(\varepsilon =1\)) on the Isolet dataset

6 Conclusion and broader impacts

In this research, we introduced a new biologically plausible sparse training algorithm named CTRE. CTRE exploits both the similarity of neurons as an importance measure of the connections and random search, sequentially (CTREseq) or simultaneously (CTREsim), to explore a performant sparse topology. The findings of this study indicate that the cosine similarity between neurons' activations can help to evolve a sparse network in a purely sparse manner even in highly sparse scenarios, where most state-of-the-art methods may fail. In our view, by using the neurons' similarity to evolve the topology, our proposed approach can be an excellent initial step toward explainable sparse neural networks. Overall, due to the ability of CTRE to extract highly sparse neural networks, it can be a viable alternative for saving energy in both low-resource devices and data centers and pave the way toward environmentally friendly AI systems. Nevertheless, when CTRE is deployed in real-world applications, the trade-off between accuracy and sparsity should be considered carefully; in particular, if any loss of accuracy may pose safety risks to the user, the sparsity level of the network needs to be analyzed with greater care.

An interesting future direction of this research is to extend CTRE to CNNs; driven by the decent performance of CTRE on image datasets, we believe that it has the potential to be adapted to CNN architectures. However, in-depth theoretical analysis and systematic experiments are required to adapt the similarity metric to CNN architectures. This is due to the fact that CNNs require weight sharing, which does not exist in real neurons, and consequently, it is not straightforward to apply Hebbian learning directly (Pogodin et al., 2021). There have been some efforts to make CNNs more biologically plausible (Pogodin et al., 2021; Bartunov et al., 2018). Therefore, applying CTRE to CNNs should be done with great care and theoretical analysis, which we believe is within the scope of future work.