Abstract
Convolutional neural networks (CNNs) have shown good performance in many practical applications. However, their high computational and storage requirements make them difficult to deploy on resource-constrained devices. To address this issue, in this paper we propose a novel iterative structured pruning algorithm for CNNs based on recursive least squares (RLS) optimization. Our algorithm combines inverse input autocorrelation matrices with weight matrices to evaluate and prune unimportant input channels or nodes in each CNN layer, and it performs the next pruning operation when the testing loss has been tuned back down to its last pre-pruning level. Our algorithm can also be used to prune feedforward neural networks (FNNs). The fast convergence speed of the RLS optimization allows our algorithm to prune CNNs and FNNs multiple times in a small number of epochs. We validate its effectiveness in pruning VGG-16 and ResNet-50 on CIFAR-10 and CIFAR-100 and in pruning a three-layer FNN on MNIST. Compared with four popular pruning algorithms, our algorithm can adaptively prune CNNs according to the learning task difficulty and can effectively prune CNNs and FNNs with a small or even no reduction in accuracy. In addition, our algorithm can prune the original sample features in the input layer.
1 Introduction
Convolutional neural networks (CNNs) are the most widely used class of deep neural networks (DNNs) [1,2,3]. CNNs are well suited for handling computer vision tasks [4,5,6,7] since they can extract sample features from images at different abstract levels through convolutional and pooling mechanisms [8]. However, CNNs generally have high computational and storage costs, which hinder their widespread applications to some extent [9, 10]. In particular, in the past decade, mobile devices, such as smartphones, wearable devices and drones, have been increasingly used. There remains a growing demand to deploy CNNs with these devices, but their computational and storage capacities are much lower than those of conventional computers [11]. Therefore, how to compress CNNs has become an important research focus in deep learning.
At present, five categories of model compression algorithms have been proposed for CNNs [12]. The first category is network pruning, which prunes redundant channels or nodes [13, 14]. The second category is parameter quantization, which reduces the bit width of the parameters to lower the computational and storage costs [15, 16]. The third category is low-rank factorization, which decomposes three-dimensional filters into two-dimensional filters [17]. The fourth category is filter compacting, which uses compact filters to replace loose and overparameterized filters [18]. The last category is knowledge distillation, in which knowledge is acquired from the original network and used to generate a smaller network [19, 20]. Among these categories, network pruning has received the most research attention [21]. Thus, we focus on this type of model compression algorithm in this paper.
Network pruning methods can be further classified into unstructured and structured pruning methods [22]. In theory, unstructured pruning methods can prune arbitrary redundant nodes in convolutional layers and achieve high compression ratios. However, unstructured pruning methods are difficult to implement since they destroy the form of weight matrices. To address this issue, almost all existing unstructured pruning algorithms, such as Optimal Brain Damage [23], Soft Channel Pruning [24] and \(\ell _0\) Minimization [25], zero out the unimportant weights in simulation experiments instead of actually removing the redundant nodes. In contrast, structured pruning methods aim to remove unimportant channels. These methods preserve the structure of weight matrices and are thus more practical and popular, although their pruning granularity is coarse.
Structured pruning methods usually include three stages: training, pruning and fine-tuning (also called retraining in some papers) [26, 27]. According to the number of pruning operations, these methods can be divided into one-shot structured pruning and iterative structured pruning [28]. The former performs pruning and fine-tuning only once and thus requires fewer epochs to obtain the compressed model. However, its compression ratio and accuracy rely heavily on the given pruning ratio. In other words, it is often difficult to obtain the optimal compressed model with one-shot pruning. In contrast, the latter performs multiple pruning and fine-tuning operations, which may lead to better results; however, multiple operations are very time-consuming, especially for large-scale neural networks. There is still much debate about what kind of structured pruning approach is best for different scenarios.
In recent years, researchers have proposed many structured pruning algorithms. For example, Li et al. proposed a one-shot pruning algorithm called the \(\ell _1\)-norm [29], which evaluates and prunes unimportant output channels by using \(\ell _1\) regularization for the weights of the convolutional layers. Liu et al. proposed another one-shot pruning algorithm called Network Slimming [30], which prunes channels by using \(\ell _1\) regularization for the scaling factors in the batch normalization layers. Molchanov et al. proposed two iterative pruning algorithms called Taylor FO [31] and Taylor SO [32], which use the first- and second-order Taylor expansions to estimate the contribution of each channel to the final loss, respectively, and remove the channels with scores smaller than a given threshold. Chen et al. proposed another iterative pruning algorithm called Collaborative Channel Pruning [33], which evaluates and removes unimportant channels by combining the convolution layer weights and batch normalization layer scaling factors.
Although researchers claim that these algorithms can effectively compress CNNs, they still have three common shortcomings. The first shortcoming is that they use stochastic gradient descent (SGD) to optimize CNNs during the training and fine-tuning stages. It is well known that SGD converges slowly and can be difficult to tune [34], which results in these algorithms requiring more training epochs. The second shortcoming is that they mainly use the weight magnitude to prune unimportant output channels. However, the trained weights depend on the dataset, and the real reason for pruning a channel is the redundancy of the input features in each layer; the weight magnitude alone therefore cannot evaluate this redundancy accurately. The third shortcoming is that the pruning ratio is manually and empirically set to a fixed value by users, which may cause underpruning or overpruning. In addition to these three shortcomings, existing iterative structured pruning algorithms have another shortcoming in that the number of pruning operations and the repruning timing are manually set by users.
To overcome the above shortcomings, we propose a novel iterative structured pruning algorithm in this paper. In our previous work [35], we proposed a recursive least squares (RLS) optimization algorithm, which can be viewed as a special SGD algorithm with the inverse input autocorrelation matrix as the learning rate. Compared with SGD and Adam optimization, the RLS optimization has better convergence speed and quality. Our proposed algorithm is based on this optimization algorithm. In addition to using the RLS optimization to train and fine-tune CNNs, it combines inverse input autocorrelation matrices with weight matrices to evaluate and prune unimportant input channels or nodes, and it automatically performs the next pruning operation when the testing loss has been tuned back down to its last pre-pruning level. Our algorithm can also be used for pruning feedforward neural networks (FNNs). We validate its effectiveness in pruning VGG-16, ResNet-50 and an FNN on the CIFAR-10, CIFAR-100 and MNIST datasets. Compared with existing iterative pruning algorithms, our algorithm can prune CNNs and FNNs multiple times in fewer epochs and prune them more effectively with a smaller accuracy loss. In addition, it can adaptively prune CNNs according to the difficulty of the learning task.
The key contributions of this paper can be summarized as follows:
1) We use the RLS optimization rather than SGD to accelerate our algorithm and all comparative pruning algorithms used in the experiments.
2) We present a novel iterative structured pruning algorithm that combines inverse input autocorrelation matrices and weight matrices to evaluate and prune unimportant input channels or nodes.
3) We suggest the testing loss as the repruning criterion for our algorithm and all comparative iterative pruning algorithms. Each repruning operation is performed when the testing loss has been tuned back down to its last pre-pruning level.
4) We conduct extensive experiments to verify the effectiveness of our algorithm. It effectively reduces both the number of floating-point operations (FLOPs) and the number of parameters of CNNs and FNNs with a small accuracy loss. In particular, the experiments on FNNs show that our algorithm can prune the original sample features.
The remainder of this paper is organized as follows: In Section 2, we review the RLS algorithm and the RLS optimization for CNNs. In Section 3, we introduce our algorithm in detail. In Section 4, we present the experimental settings and results. Finally, we summarize this paper in Section 5.
2 Background
In this section, we introduce the background knowledge and some notations used in this paper. We first review the derivation of the RLS algorithm and then review the learning mechanism of CNNs with the RLS optimization.
2.1 Recursive least squares
RLS is a popular adaptive filtering algorithm with fast convergence speed. This algorithm recursively determines the weights that minimize the weighted linear least squares loss function based on the input signal [36]. Compared with the linear least squares algorithm, it is more suitable for online learning.
Let \(\mathbb {X}_t=\{{{\textbf {x}}}_1,\cdots ,{{\textbf {x}}}_t\}\) denote all sample inputs from the starting step to the current step, and let \(\mathbb {Y}_t^* = \{y_1^{*},\cdots ,y_t^{*}\}\) denote the corresponding target outputs. On this basis, the quadratic minimization problem solved by the RLS algorithm over time t is defined as
\(\min _{{\textbf {w}}}\ \frac{1}{2}\sum _{i=1}^t\lambda ^{t-i}{(y_i^{*}-{{\textbf {w}}}^\text {T}{{\textbf {x}}}_i)}^2\)   (1)
where \({{\textbf {w}}}\) is the weight vector, and \(\lambda \in (0,1]\) is the forgetting factor which enhances the importance of recent data over older data [37]. Let \(\nabla _{\varvec{w}} \frac{1}{2}\sum _{i=1}^t\lambda ^{t-i}{(y_i^{*}-{\varvec{w}}^\text {T}{\varvec{x}}_i)}^2 =\varvec{0}\). Then, we easily obtain
\(\left( \sum _{i=1}^t\lambda ^{t-i}{{\textbf {x}}}_i{{\textbf {x}}}_i^\text {T}\right) {{\textbf {w}}}=\sum _{i=1}^t\lambda ^{t-i}y_i^{*}{{\textbf {x}}}_i\)   (2)
We define \({{\textbf {A}}}_t\) and \({{\textbf {b}}}_t\) as follows:
\({{\textbf {A}}}_t=\sum _{i=1}^t\lambda ^{t-i}{{\textbf {x}}}_i{{\textbf {x}}}_i^\text {T}\)   (3)
\({{\textbf {b}}}_t=\sum _{i=1}^t\lambda ^{t-i}y_i^{*}{{\textbf {x}}}_i\)   (4)
Based on (2), the solution \({{\textbf {w}}}_t\) to (1) can be derived as
\({{\textbf {w}}}_t=({{\textbf {A}}}_t)^{-1}{{\textbf {b}}}_t\)   (5)
To avoid calculating the inverse of \({{\textbf {A}}}_t\) in (5), we define the inverse input autocorrelation matrix \({{\textbf {P}}}_t = ({{\textbf {A}}}_t)^{-1}\). Equations (3) and (4) show that \({{\textbf {A}}}_t\) and \({{\textbf {b}}}_t\) can be computed recursively as follows:
\({{\textbf {A}}}_t=\lambda {{\textbf {A}}}_{t-1}+{{\textbf {x}}}_t{{\textbf {x}}}_t^\text {T}\)   (6)
\({{\textbf {b}}}_t=\lambda {{\textbf {b}}}_{t-1}+y_t^{*}{{\textbf {x}}}_t\)   (7)
Using the Sherman-Morrison matrix inversion formula [38] with (6), we obtain
\({{\textbf {P}}}_t=\frac{1}{\lambda }\left( {{\textbf {P}}}_{t-1}-h_t{{\textbf {u}}}_t{{\textbf {u}}}_t^\text {T}\right) \)   (8)
where \({\textbf {u}}_t\) and \(h_t\) are defined as follows:
\({{\textbf {u}}}_t={{\textbf {P}}}_{t-1}{{\textbf {x}}}_t\)   (9)
\(h_t=\frac{1}{\lambda +{{\textbf {x}}}_t^\text {T}{{\textbf {u}}}_t}\)   (10)
Substituting (7) and (8) into (5), we obtain
\({{\textbf {w}}}_t={{\textbf {w}}}_{t-1}+h_te_t{{\textbf {u}}}_t\)   (11)
where \(e_t\) is defined as
\(e_t=y_t^{*}-{{\textbf {w}}}_{t-1}^\text {T}{{\textbf {x}}}_t\)   (12)
Finally, we obtain the RLS algorithm, which is defined by (8) to (12).
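To make the recursion concrete, (8) to (12) can be sketched in a few lines of NumPy. This is a generic illustration of the standard RLS filter, not the authors' released implementation; the noiseless linear target and the initialization \({\textbf {P}}_0={\textbf {I}}\), \({\textbf {w}}_0=\varvec{0}\) are assumptions made for the sketch.

```python
import numpy as np

def rls_step(P, w, x, y_star, lam=0.99):
    """One RLS update following (8)-(12); returns updated (P, w) and the a-priori error."""
    u = P @ x                            # (9)  u_t = P_{t-1} x_t
    h = 1.0 / (lam + x @ u)              # (10) gain denominator
    e = y_star - w @ x                   # (12) a-priori error with w_{t-1}
    w = w + h * e * u                    # (11) weight update
    P = (P - h * np.outer(u, u)) / lam   # (8)  inverse-autocorrelation update
    return P, w, e

# Fit a noiseless linear target y = w_true^T x online.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
P = np.eye(3)       # P_0 = I
w = np.zeros(3)     # w_0 = 0
for _ in range(200):
    x = rng.normal(size=3)
    P, w, e = rls_step(P, w, x, w_true @ x)
```

After 200 steps, `w` is close to `w_true` (not exactly equal, since the \({\textbf {P}}_0={\textbf {I}}\) initialization acts as a decaying ridge term), which illustrates the fast convergence the paper relies on.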
2.2 CNNs with RLS optimization
The RLS optimization is a special type of SGD algorithm with the inverse input autocorrelation matrix as the learning rate [35]. Due to the fast convergence speed of the RLS algorithm, it can efficiently optimize CNNs. A CNN generally consists of an input layer followed by some number of convolutional layers, pooling layers and fully-connected layers [39]. Since the pooling layers have no learnable weights, we need to review the RLS optimization only for the convolutional and fully-connected layers.
Let \({\textbf {Y}}_{t}^{0}\) and \({\textbf {Y}}_{t}^{*}\) denote the input and the target output of the current training minibatch, respectively, and let L denote the total number of convolutional and fully-connected layers. The forward-propagation learning of the \(m^{th}\) sample in the current minibatch in the \(l^{th}\) CNN layer is illustrated in Fig. 1. For brevity, we omit the bias term in each layer. Based on these notations, we briefly introduce the RLS optimization rules for CNNs. According to [35], the recursive update rule of the inverse input autocorrelation matrix \({\textbf {P}}_t^l\) in the \(l^{th}\) layer is defined as
where \(k>0\) is the average scaling factor and \({\textbf {u}}_t^l\) and \(h_t^l\) are defined as follows:
where \({{{\textbf {x}}}}_{t}^l\) is the average vector. If the \(l^{th}\) layer is a convolutional layer, \({\textbf {{x}}}_{t}^l \in \mathbb {R}^{C_{l-1} H_l W_l}\) is defined as
where \(H_l\) and \(W_l\) denote the height and width of the filters, \(M_t\) denotes the current minibatch size, \(U_l\) and \(V_l\) denote the height and width of the output channels, and \(flatten(\cdot )\) denotes reshaping the given matrix or tensor into a column vector. If the \(l^{th}\) layer is a fully-connected layer, \({{{\textbf {x}}}}_{t}^l\) is defined as
Note that if the preceding layer of this layer is a convolutional or pooling layer, \(\textbf{Y}_{t(m,:)}^{l-1}\) will denote the flattened vector of the preceding layer’s output. In the RLS optimization algorithm, the four-dimensional weight tensor of a convolutional layer is reshaped into the matrix \(\textbf{W}_{t-1}^{l}\) by defining \(\mathbf {{W}}_{t-1(:,j)}^{l}=flatten(\textbf{W}_{t-1(:,j,:,:)}^{l})\). In addition, the algorithm uses momentum to accelerate convergence. Thus, regardless of whether the \(l^{th}\) layer is a convolutional layer or a fully-connected layer, the recursive update rule of \(\textbf{W}_{t-1}^{l}\) is defined as follows:
where \(\varvec{\Psi }_t^l\) is the velocity matrix of the \(l^{th}\) layer at step t, \(\alpha \) is the momentum factor, \(\eta ^l > 0\) is the gradient scaling factor, and \(\mathbf {\nabla }_{\textbf{W}_{t-1}^{l}}\) is the equivalent gradient of the linear output loss function \(\mathcal {L}_t\) with respect to \({\textbf {W}}_{t-1}^{l}\). \(\mathcal {L}_t\) is defined as
where \(\textbf{Z}_{t}^{L} =f_L^{-1}({\textbf {Y}}_{t}^L)\) is the linear output matrix and \(\textbf{Z}_{t}^{*}=f_L^{-1}({\textbf {Y}}_{t}^{*})\) is the desired linear output matrix corresponding to \(\textbf{Z}_{t}^{L}\). Note that the RLS optimization assumes that \(f_L(\cdot )\) is monotonic in the output layer [35]. In addition, the RLS optimization can be used for FNNs, since the above equations except for (16) can also be used for fully-connected layers and the last part of a CNN can generally be viewed as an FNN.
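For the convolutional case, the average input vector \({{{\textbf {x}}}}_{t}^l\) described above is built from flattened receptive-field patches. Below is a minimal NumPy sketch of this im2col-style extraction; it is our own illustration, and the stride-1, no-padding setting and the plain averaging over the minibatch and output positions are assumptions, since the exact form of (16) follows [35].

```python
import numpy as np

def average_input_vector(Y, H, W):
    """Average flattened receptive-field patch for a conv layer.

    Y: (M, C, Hin, Win) minibatch of input feature maps; H, W: filter size.
    Returns a vector of length C*H*W, averaged over the M samples and the
    U*V output positions (stride 1, no padding assumed in this sketch).
    """
    M, C, Hin, Win = Y.shape
    U, V = Hin - H + 1, Win - W + 1
    patches = np.empty((M, U, V, C * H * W))
    for u in range(U):
        for v in range(V):
            # flatten the (C, H, W) patch at output position (u, v)
            patches[:, u, v, :] = Y[:, :, u:u + H, v:v + W].reshape(M, -1)
    return patches.reshape(-1, C * H * W).mean(axis=0)

# Toy check: a constant feature map gives a constant average patch vector.
x_bar = average_input_vector(np.ones((2, 3, 5, 5)), 3, 3)
```

The resulting vector has one entry per filter weight (here \(3\times 3\times 3=27\)), matching the \(C_{l-1}H_lW_l\) dimension of \({{{\textbf {x}}}}_{t}^l\).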
3 The proposed algorithm
In this section, we present our iterative structured pruning algorithm. We first introduce the theoretical foundation of our algorithm. Then, we describe our pruning strategy in detail and show the overall pseudocode of our algorithm.
3.1 Theoretical foundation
As discussed in Section 2.1, \({\textbf {A}}_t\) is the input autocorrelation matrix. Suppose that \({{\textbf {x}}}_t\) has n features, namely, \({{\textbf {x}}}_t=[{\textbf {x}}_{t(1)},{\textbf {x}}_{t(2)},\cdots ,{\textbf {x}}_{t(n)}]^\text {T}\). Then, \({{\textbf {x}}}_t {{\textbf {x}}}_t^\text {T}\) can be expressed as
\({{\textbf {x}}}_t{{\textbf {x}}}_t^\text {T}=\begin{bmatrix}{\textbf {x}}_{t(1)}{\textbf {x}}_{t(1)}&\cdots &{\textbf {x}}_{t(1)}{\textbf {x}}_{t(n)}\\ \vdots &\ddots &\vdots \\ {\textbf {x}}_{t(n)}{\textbf {x}}_{t(1)}&\cdots &{\textbf {x}}_{t(n)}{\textbf {x}}_{t(n)}\end{bmatrix}\)   (21)
Let \(s_{{\textbf {A}}_{t(i)}}\) denote the sum of the \(i^{th}\) row (or column) in \({\textbf {A}}_t\). According to (3), (6) and (21), \(s_{{\textbf {A}}_{t(i)}}\) can be defined as
\(s_{{\textbf {A}}_{t(i)}}=\sum _{r=1}^n{{\textbf {A}}}_{t(i,r)}=\sum _{j=1}^t\lambda ^{t-j}{{\textbf {x}}}_{j(i)}s_{{{\textbf {x}}}_j}\)   (22)
where \(s_{{{\textbf {x}}}_t}\) is the sum of all features in \({{\textbf {x}}}_t\). In deep learning, we typically assume that all samples are independently and identically distributed [39]. CNNs are usually used for computer vision tasks, and they often use the ReLU activation function in the hidden layers; thus, we can suppose that \({{\textbf {x}}}_{t(i)}\ge 0\). In addition, the forgetting factor \(\lambda \) is generally close to 1. Therefore, \(s_{{{\textbf {x}}}_1},\cdots ,s_{{{\textbf {x}}}_t}\) are approximately equal, and \(s_{{{\textbf {A}}}_{t(i)}}\) is approximately proportional to the sum of \({{\textbf {x}}}_{1(i)},\cdots , {{\textbf {x}}}_{t(i)}\). In other words, if \(s_{{{\textbf {A}}}_{t(i)}}\) is small, the \(i^{th}\) features in \({{\textbf {x}}}_1,\cdots , {{\textbf {x}}}_t\) are probably small, and their influence on the outputs is probably small. Since \({\textbf {P}}_t\) is the inverse of \({\textbf {A}}_t\), we can easily draw the following conclusion: If the sum of the \(i^{th}\) row (or column) in \({\textbf {P}}_t\) is large, the importance of the \(i^{th}\) features in \({{\textbf {x}}}_1,\cdots , {{\textbf {x}}}_t\) will probably be small.
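This conclusion is easy to check numerically. In the following toy example (our own illustration, with \(\lambda =1\) and \({\textbf {A}}_0={\textbf {I}}\), i.e. \({\textbf {P}}_0={\textbf {I}}\)), one nonnegative feature is kept close to zero, and the corresponding row sum of \({\textbf {P}}_t\) indeed comes out largest:

```python
import numpy as np

rng = np.random.default_rng(42)
n, T, weak = 5, 200, 2            # 5 features, 200 steps; feature 2 is nearly inactive
A = np.eye(n)                     # A_0 = I, so P_0 = I
for _ in range(T):
    x = rng.uniform(0.5, 1.5, size=n)
    x[weak] = rng.uniform(0.0, 0.01)   # the unimportant feature stays near zero
    A += np.outer(x, x)                # accumulate x x^T as in (6) with lambda = 1
P = np.linalg.inv(A)
row_sums = P.sum(axis=1)
# The weak feature yields the largest row sum of P, so it is flagged as unimportant.
```

Here `np.argmax(row_sums)` recovers the index of the weak feature, matching the argument above.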
Thus, we can use \({\textbf {P}}_t^l\) to evaluate the importance of the input nodes in the \(l^{th}\) layer since \({\textbf {P}}_t^l\) has the same meaning as \({\textbf {P}}_t\). For fully-connected layers in CNNs, we can directly use this conclusion. However, for convolutional layers in CNNs, structured pruning methods aim to prune unimportant channels rather than nodes. Fortunately, according to (16) and Fig. 1(a), we know that \({{\textbf {x}}}_{t((i-1)H_l W_l+1)},{{\textbf {x}}}_{t((i-1)H_l W_l+2)},\cdots ,{{\textbf {x}}}_{t(iH_l W_l)}\) in \({{{\textbf {x}}}}_{t}^l\) comes from the \(i^{th}\) input channel. Thus, we can use the sum of the \((i-1)H_l W_l+1\) to \(iH_l W_l\) rows (columns) in \({\textbf {P}}_t^l\) to evaluate the importance of the \(i^{th}\) input channel.
Based on the above analysis, we define a vector \({\textbf {s}}_{{\textbf {P}}_t^l}\) to evaluate the importance of the input channels or nodes in the \(l^{th}\) layer. The \(i^{th}\) element in \({{\textbf {s}}}_{{{\textbf {P}}}_{t}^l}\) is defined as
\({{\textbf {s}}}_{{{\textbf {P}}}_{t}^{l}(i)}={\left\{ \begin{array}{ll}\sum _{j=N_{l(i)}^{cb}}^{N_{l(i)}^{ce}}\sum _{r=1}^{N_{l}^{re}}{{\textbf {P}}}_{t(j,r)}^{l},&{}l\le L_c\\ \sum _{r=1}^{C_{l-1}}{{\textbf {P}}}_{t(i,r)}^{l},&{}l>L_c\end{array}\right. }\)   (23)
where \(N_{l(i)}^{cb}=(i-1)H_l W_l+1\), \(N_{l(i)}^{ce}=iH_l W_l\), \(N_{l}^{re}=C_{l-1}H_l W_l\), and \(L_c\) denotes the total number of convolutional layers. Equation (23) can be explained as follows: If the \(l^{th}\) layer is a convolutional layer (i.e. \(l\le L_c\)), \({\textbf {s}}_{{{{\textbf {P}}}_t^l}(i)}\) denotes the importance of the \(i^{th}\) input channel. If the \(l^{th}\) layer is a fully-connected layer (i.e. \(l>L_c\)), \({{\textbf {s}}}_{{{{\textbf {P}}}_t^l}(i)}\) denotes the importance of the \(i^{th}\) input node. Furthermore, the larger the value of \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}(i)}\) is, the more likely it is that the \(i^{th}\) input channel or node is unimportant.
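In code, this channel-level score is simply a grouped row sum of \({\textbf {P}}_t^l\): the rows belonging to the same input channel are summed together. A short NumPy sketch of the grouping (our own illustration; the identity \({\textbf {P}}\) and the shapes are toy values):

```python
import numpy as np

def channel_scores(P, C_in, H, W):
    """s_P for a conv layer: sum the rows of P in blocks of H*W per input channel.

    P has shape (C_in*H*W, C_in*H*W); returns one score per input channel.
    A larger score suggests the channel's features are less important.
    """
    row_sums = P.sum(axis=1)                          # sum of each row of P_t^l
    return row_sums.reshape(C_in, H * W).sum(axis=1)  # group rows by input channel

# Toy P for C_in = 2 input channels with 3x3 filters (18x18 matrix).
P = np.eye(2 * 3 * 3)
s = channel_scores(P, 2, 3, 3)
```

With the identity matrix as `P`, every channel receives the same score (here \(H_lW_l=9\)), as expected for an uninformative starting point.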
3.2 RLS-based pruning
As mentioned in Section 1, existing structured pruning algorithms have some shortcomings. In this subsection, we consider how to overcome these shortcomings and present our pruning algorithm.
One shortcoming of existing algorithms is that they cannot prune the unimportant channels accurately since most algorithms use only the weight magnitude to evaluate the channel importance. To address this problem, we combine inverse input autocorrelation matrices with weight matrices to evaluate and prune the unimportant input channels in the convolutional layers and unimportant nodes in the fully-connected layers.
In Section 3.1, we defined \({{\textbf {s}}}_{{{{\textbf {P}}}_t^l}}\) by using the inverse input autocorrelation matrix \({{\textbf {P}}}_{t}^l\). Next, we define another vector \({{\textbf {s}}}_{{{{\textbf {W}}}_t^l}}\) by using the weight matrix \({{\textbf {W}}}_{t}^l\). Li et al. proposed the \(\ell _1\)-norm algorithm and demonstrated that the sum of the absolute filter weights can be used to evaluate the importance of the output channels [29]. It is well known that the output of the \((l-1)^{th}\) layer is the input of the \(l^{th}\) layer in CNNs. Thus, we can modify this approach to evaluate the importance of the input channels or nodes in the \(l^{th}\) layer. By using this approach, the \(i^{th}\) element in \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) is defined as
where \(N_{l-1}^{re}=C_{l-2}H_{l-1} W_{l-1}\), \(c_{(i)}= \lfloor i/(U_{l-1}V_{l-1})\rfloor +1\), and \(|\cdot |\) denotes the absolute value of a real number. According to [29], the \(i^{th}\) input channel or node is more likely to be unimportant when \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}(i)}\) is smaller. Note that we only consider the convolutional and fully-connected layers in this paper since a pooling layer has no learnable weights and the same number of output channels as its preceding convolutional layer.
Next, we use \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}}\) and \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) to select unimportant input channels and nodes. Let \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) and \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\) denote the index sets of the top \(\xi \) unimportant input channels and nodes, which are determined by \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}}\) and \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\), respectively. Then, the index set of the pruned input channels or nodes in the \(l^{th}\) layer can be defined as
\(\mathbb {I}_t^l(\xi )={\left\{ \begin{array}{ll}\mathbb {A}_{t}^{1}({{\textbf {s}}}_{{\textbf {P}}_t^{1}},0.5\xi ),&{}l=1\\ \mathbb {A}_{t}^{l}({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\cap \mathbb {B}_{t}^{l}({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi ),&{}l\ge 2\end{array}\right. }\)   (25)
where \(\xi \) is the preset pruning ratio. Note that we only use \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},0.5\xi )\) to prune the input channels in the first layer. This is because (24) shows that \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) cannot be used to evaluate the importance of these channels. In addition, in the experiments presented in Section 4.2, we find that the size of \(\mathbb {I}_t^l(\xi )\) is approximately one half of the size of \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) or \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\) when \(l\ge 2\). Some readers may argue that our preset pruning ratio \(\xi \) is also set to a fixed value. However, according to (25), \(\mathbb {I}_t^l(\xi )\) is the intersection of \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) and \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\), except in the first layer. Hence, our actual pruning ratio is lower than \(\xi \) during each pruning operation. In fact, in our experiments, we find that the actual pruning ratio gradually decreases as the iterative pruning process continues, and the final compression ratio of the CNNs is adaptively adjusted according to the difficulty of the learning task.
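The selection rule itself reduces to intersecting two top-\(\xi \) index sets. A minimal sketch with made-up scores (the helper name `prune_set` is ours, not a function from Algorithm 1):

```python
import numpy as np

def prune_set(s_P, s_W, xi, first_layer=False):
    """Indices to prune: intersection of the top-xi sets from s_P and s_W.

    s_P: larger value => less important; s_W: smaller value => less important.
    For the first layer, only the P-based set with ratio 0.5*xi is used.
    """
    n = len(s_P)
    ratio = 0.5 * xi if first_layer else xi
    top_P = set(np.argsort(-np.asarray(s_P))[: int(ratio * n)])  # largest s_P first
    if first_layer:
        return top_P
    top_W = set(np.argsort(np.asarray(s_W))[: int(xi * n)])      # smallest s_W first
    return top_P & top_W

# Toy example with 4 channels: channels 0 and 2 look unimportant under both scores.
idx = prune_set(s_P=[9.0, 1.0, 8.0, 1.0], s_W=[0.1, 5.0, 0.2, 4.0], xi=0.5)
```

Because only channels flagged by both criteria are pruned, the actual pruning ratio per operation is at most \(\xi \), consistent with the discussion above.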
The remaining shortcomings of existing algorithms are addressed with the following two approaches: 1) We use the RLS optimization to improve the convergence rate of the training and fine-tuning stages. This is also convenient for the implementation of our pruning strategy. 2) We use the testing loss to determine the repruning timing at the end of a training epoch. Each repruning operation is performed only when the testing loss has been tuned back down to its last pre-pruning level.
Based on the above information, our iterative structured pruning algorithm is summarized in Algorithm 1, where \(epoch_{prune}\) denotes the first pruning epoch, and loss denotes the testing loss defined by the mean squared error (MSE). Our algorithm can also be used to prune FNNs since it includes the training and pruning of the fully-connected layers. In addition, the RLS optimization can be used in existing pruning algorithms to replace the SGD optimization. To clarify how our algorithm prunes CNNs, the detailed pruning process of each layer in a five-layer CNN is illustrated in Fig. 2. It clearly shows that our algorithm can prune the input layer in theory. To the best of our knowledge, there is no existing pruning algorithm that can do this.
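The repruning schedule in Algorithm 1 can be sketched as the following loop skeleton. This is a simplified stand-in: the list `losses` replaces the testing loss observed at the end of each epoch, and the actual RLS training and pruning steps are elided.

```python
def iterative_prune(losses, epoch_prune):
    """Skeleton of the repruning schedule: prune first at epoch_prune, then
    reprune each time the testing loss returns to (or below) the level it had
    just before the previous pruning operation.

    Returns the epochs at which pruning operations fire.
    """
    prune_epochs, baseline = [], None
    for epoch, loss in enumerate(losses):
        if epoch == epoch_prune or (baseline is not None and loss <= baseline):
            baseline = loss          # record the current pre-pruning loss level
            prune_epochs.append(epoch)
            # the pruning operation on all layers would be performed here
    return prune_epochs

# Toy loss trace: pruning at epoch 2 raises the loss, which then recovers twice.
epochs = iterative_prune([1.0, 0.8, 0.5, 0.9, 0.7, 0.5, 0.8, 0.6, 0.4], epoch_prune=2)
```

In this trace, pruning fires at epochs 2, 5 and 8: each repruning waits until fine-tuning has brought the loss back to the previous pre-pruning level.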
4 Experiments
In this section, we validate the effectiveness of our algorithm. We first introduce our experimental settings. Then, we evaluate the influence of the preset pruning ratio \(\xi \) on the performance of our algorithm. Finally, we present the comparison results of our algorithm versus four popular pruning algorithms.
4.1 Experimental settings
In our experiments, we use three benchmark datasets: CIFAR-10, CIFAR-100 [40] and MNIST [41]. CIFAR-10 and CIFAR-100 both include 60,000 \(32\times 32\) three-channel colour images, with 50,000 images used for training and 10,000 images used for testing. The CIFAR-100 classification task is more difficult than the CIFAR-10 classification task since CIFAR-100 has 100 image classes, while CIFAR-10 has only 10 image classes. MNIST consists of 10 classes of 70,000 \(28\times 28\) greyscale images, with 60,000 images used for training and 10,000 images used for testing.
For the CIFAR-10 and CIFAR-100 datasets, we use VGG-16 and ResNet-50 models, which have the same architectures as the models presented in [42] and [43], respectively. The VGG-16 model has \(3.4\times 10^7\) parameters and requires \(3.3\times 10^8\) FLOPs, and the ResNet-50 model has \(2.3\times 10^7\) parameters and requires \(1.3\times 10^9\) FLOPs. For using (20) as the training loss function, we delete their Softmax activation function. In addition, we prune only the second layer in each residual block of the ResNet-50 model, since the first and last layers use \(1\times 1\) filters to adjust the number of channels and have a small number of parameters and FLOPs. For the MNIST dataset, we use a three-layer FNN model, which has 1024 ReLU nodes, 512 ReLU nodes and 10 Identity nodes. The FNN model has \(1.3\times 10^6\) parameters and requires \(1.3\times 10^6\) FLOPs.
We compare the performance of our algorithm with that of four popular pruning algorithms: \(\ell _1\)-norm, Network Slimming, Taylor FO and Taylor SO. The first two algorithms are one-shot pruning algorithms, and the last two are iterative pruning algorithms. These algorithms originally used SGD for training and fine-tuning and were designed to prune only convolutional layers. In some CNNs, such as VGG-16, fully-connected layers have more parameters than convolutional layers, and they are more likely to cause overfitting than convolutional layers. To ensure a fair comparison, we modify these algorithms to prune convolutional and fully-connected layers and replace the SGD optimization with the RLS optimization. In addition, for the Taylor FO and Taylor SO algorithms, we also use the testing loss as the repruning criterion.
All experiments are conducted by using PyTorch on an NVIDIA GeForce 1080Ti GPU with a minibatch size of 64. All algorithms use the RLS optimization, in which \({\textbf {W}}_0^l\) is initialized to the default value by PyTorch, and \(\lambda \), k, \(\alpha \), \(\eta ^l\), \(\varvec{\Psi }_0^l\) and \({\textbf {P}}_0^l\) are set or initialized to 1, 0.1, 0.5, 1, the zero matrix and the identity matrix, respectively. Their pruning ratios are determined and presented in Section 4.2. For the VGG-16, ResNet-50 and three-layer FNN models, all algorithms are run for 200, 300 and 200 epochs, all the first or one-shot pruning operations are performed at the end of the \(30^{th}\), \(60^{th}\) and \(30^{th}\) epoch, and all the repruning operations are performed before the \(160^{th}\), \(250^{th}\) and \(170^{th}\) epoch, respectively. This is because all iterative pruning algorithms require some epochs to fine-tune the models after the last repruning operation, and the ResNet-50 model converges more slowly than the VGG-16 and FNN models.
4.2 Influence of the preset pruning ratio
The pruning ratio is perhaps the most important hyperparameter in pruning algorithms. In this subsection, we evaluate the influence of the preset pruning ratio \(\xi \) on the performance of our algorithm and determine the pruning ratios of the other four comparative pruning algorithms.
The comparison results of the reductions in FLOPs, parameters and testing accuracy for the VGG-16, ResNet-50 and three-layer FNN models pruned by our algorithm with different \(\xi \) values on CIFAR-10, CIFAR-100 and MNIST datasets are summarized in Table 1. As \(\xi \) increases, the FLOPs\(\downarrow \) and the Parameters\(\downarrow \) of all models increase significantly, but their pruned accuracies have small or even no reductions. These results prove that our algorithm can effectively prune the unimportant input channels and nodes in CNNs and FNNs. Moreover, the pruned VGG-16 and pruned ResNet-50 models have higher FLOPs\(\downarrow \) and Parameters\(\downarrow \) on CIFAR-10 than on CIFAR-100. In other words, by using our algorithm, the models retain more parameters for CIFAR-100 than for CIFAR-10. This proves that our algorithm can adaptively prune CNNs according to the learning task difficulty.
In addition, the comparison results of retained parameters and testing loss versus the number of epochs for all models pruned by our algorithm with different \(\xi \) values are shown in Figs. 3, 4 and 5, respectively. Our algorithm can prune all models multiple times in 30 to 160 or 60 to 250 epochs. After each pruning operation, the testing loss is reduced within a few epochs. This proves that our algorithm with different \(\xi \) values retains the fast convergence speed of the RLS optimization. As \(\xi \) increases, the number of pruning operations gradually decreases in most cases. In addition, as the iterative pruning process continues, the actual pruning ratio gradually decreases. For example, for the VGG-16 model on CIFAR-10, our algorithm performs four pruning operations when \(\xi \) is 0.4. The actual pruning ratios of these pruning operations are 0.220, 0.220, 0.219 and 0.202, respectively. These results prove that our algorithm can prevent overpruning to some extent when \(\xi \) is too large.
Based on Table 1 and Figs. 3 to 5, considering the trade-off between the accuracy loss and the compression ratio, we set \(\xi \) to be 0.4 and 0.5 in the following comparative experiments. According to the above statistical data from the VGG-16 model on CIFAR-10, our actual pruning ratio is approximately one half of \(\xi \). Thus, to ensure a fair comparison, we set the pruning ratios of Taylor FO and Taylor SO to be 0.2. However, in contrast to our algorithm, the \(\ell _1\)-norm and Network Slimming methods are one-shot pruning algorithms, meaning that they prune the CNNs and FNNs only once. Thus, we experimentally determine their pruning ratios to be 0.4 for the VGG-16 and ResNet-50 models and to be 0.6 for the three-layer FNN model.
4.3 Comparison with other pruning algorithms
4.3.1 Pruning VGG-16 on CIFAR-10 and CIFAR-100
The comparison results of the reductions in FLOPs, parameters and testing accuracy for the VGG-16 model pruned by our algorithm and the other four algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 2, with the best results shown in bold. Our algorithm effectively reduces the FLOPs and the number of parameters for the VGG-16 model with a small accuracy loss. Under similar pruned accuracy conditions, the FLOPs\(\downarrow \) and Parameters\(\downarrow \) of our algorithm are significantly higher than those of the \(\ell _1\)-norm and Network Slimming algorithms. Taylor FO and Taylor SO also effectively reduce the number of parameters for the VGG-16 model. However, they cannot reduce the FLOPs as effectively as they reduce the number of parameters, and the FLOPs\(\downarrow \) of Taylor SO is even smaller than those of the \(\ell _1\)-norm and Network Slimming algorithms. In addition, in Section 4.2, we discuss that our algorithm can adaptively prune CNNs according to the learning task difficulty. Comparing the Parameters\(\downarrow \) values of the four comparative algorithms in Table 2, we find that only Taylor FO makes the VGG-16 model retain more parameters for CIFAR-100 than for CIFAR-10; the \(\ell _1\)-norm, Network Slimming and Taylor SO algorithms do not have this property.
In addition, the comparison results of the reductions in channels or nodes for each layer in the VGG-16 model pruned by all algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 3. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each VGG-16 layer more uniformly than Taylor FO and Taylor SO. This is the reason why our algorithm achieves both high FLOP reduction and high parameter reduction.
4.3.2 Pruning ResNet-50 on CIFAR-10 and CIFAR-100
The comparison results of the reductions in FLOPs, parameters and testing accuracy for the ResNet-50 model pruned by our algorithm and the other four algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 4, with the best results shown in bold. Our algorithm with \(\xi =0.5\) effectively reduces the FLOPs and the number of parameters for the ResNet-50 model with a small accuracy loss. Under similar pruned accuracy conditions, the FLOPs\(\downarrow \) and Parameters\(\downarrow \) of the \(\ell _1\)-norm and Network Slimming algorithms are significantly lower than those of our algorithm. Moreover, Taylor FO and Taylor SO cannot reduce the FLOPs as effectively as they reduce the number of parameters. Although Taylor SO has the best FLOPs\(\downarrow \) and Parameters\(\downarrow \) on CIFAR-100, its pruned accuracy drops by 4.22%. Among the considered algorithms, our algorithm with \(\xi =0.4\) has the best pruned accuracies on CIFAR-10 and CIFAR-100. In contrast, the four comparative algorithms have considerably larger accuracy losses for pruning the ResNet-50 model than for pruning the VGG-16 model. This indicates that our algorithm has better adaptability for pruning different CNNs. In addition, our algorithm and Taylor FO can make the ResNet-50 model retain more parameters for CIFAR-100 than for CIFAR-10, but the FLOPs\(\downarrow \) of Taylor FO on CIFAR-100 is only 23.58%.
In addition, the comparison results of the reductions in channels or nodes for each layer in the ResNet-50 model pruned by all algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 5. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each ResNet-50 layer more uniformly than Taylor FO and Taylor SO.
4.3.3 Pruning FNNs on MNIST
The comparison results of the reductions in FLOPs, parameters and testing accuracy for the three-layer FNN model pruned by the different pruning algorithms on MNIST are summarized in Table 6, with the best results shown in bold. Our algorithm achieves the best pruned accuracy, FLOPs\(\downarrow \) and Parameters\(\downarrow \). In particular, the FLOPs\(\downarrow \) and Parameters\(\downarrow \) of our algorithm are significantly higher than those of the four comparative algorithms. These results show that our algorithm can effectively prune FNNs with a small accuracy loss.
In addition, the comparison results of the reductions in nodes for each layer in the three-layer FNN model pruned by all algorithms on MNIST are summarized in Table 7. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each FNN layer more uniformly than Taylor FO and Taylor SO. More importantly, the results show that our algorithm can prune not only the nodes in fully-connected hidden layers but also the original sample features in the input layer. In fact, according to (25), our algorithm can also prune the original channels of multichannel samples if the number of channels is at least 2/\(\xi \). In Sections 4.3.1 and 4.3.2, our algorithm does not prune the input layers of the VGG-16 and ResNet-50 models because all images in CIFAR-10 and CIFAR-100 have only three channels.
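The exact importance score is given by (25) in the paper. As a loose illustration only (this is not the paper's formula), an RLS-style score for input features can combine the diagonal of the maintained inverse input autocorrelation matrix with the norms of the outgoing weight rows; the lowest-scoring features are then masked out. All sizes and the score form below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, n_samples = 784, 128, 512   # MNIST-sized input layer (hypothetical)

X = rng.standard_normal((n_samples, n_in))
W = rng.standard_normal((n_in, n_out)) * 0.05

# Inverse input autocorrelation matrix, as maintained during RLS training.
P = np.linalg.inv(X.T @ X / n_samples + 1e-3 * np.eye(n_in))

# Illustrative importance score (NOT the paper's Eq. (25)): inputs with strong
# outgoing weights and large autocorrelation energy are kept.
score = np.linalg.norm(W, axis=1) / np.sqrt(np.diag(P))

xi = 0.4
n_prune = int(xi / 2 * n_in)              # effective ratio observed to be ~xi/2
pruned = np.argsort(score)[:n_prune]
mask = np.ones(n_in, dtype=bool)
mask[pruned] = False
X_pruned, W_pruned = X[:, mask], W[mask]
print(X_pruned.shape, W_pruned.shape)     # (512, 628) (628, 128)
```

Pruning input features this way removes entire pixel positions from the model, which is what Table 7 reports for the FNN's input layer.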
5 Conclusion
In this paper, we studied structured pruning algorithms for CNNs. To address the shortcomings of existing algorithms, we proposed an RLS-based iterative structured pruning algorithm. Our algorithm employs the RLS optimization to accelerate the convergence of the training and fine-tuning stages, combines inverse input autocorrelation matrices with weight matrices to evaluate and prune unimportant input channels or nodes in CNN layers, and uses the testing loss to automatically determine the repruning timing. We demonstrated the effectiveness of our algorithm in pruning VGG-16 and ResNet-50 on CIFAR-10 and CIFAR-100 and pruning an FNN on MNIST. The experimental results show that our algorithm can prune CNNs and FNNs multiple times in a small number of epochs. Compared with four popular pruning algorithms, our algorithm can effectively reduce both the FLOPs and the number of parameters for CNNs and FNNs with a small or even no reduction in accuracy. Furthermore, our algorithm can adaptively prune CNNs according to the learning task difficulty and has better adaptability for pruning different networks. Moreover, our algorithm can prune the original sample features.
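The RLS recursion underlying the training and fine-tuning stages is standard. A minimal single-output sketch (not the batched layer-wise variant used in this work) that applies the Sherman-Morrison identity, cited in the references, to update the inverse autocorrelation matrix without explicit inversion:

```python
import numpy as np

def rls_step(w, P, x, d, lam=0.99):
    """One recursive-least-squares update for a single linear unit.

    w : (n,) weights, P : (n, n) inverse input autocorrelation matrix,
    x : (n,) input sample, d : scalar target, lam : forgetting factor.
    """
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector (Sherman-Morrison)
    e = d - w @ x                    # a-priori prediction error
    w = w + k * e
    P = (P - np.outer(k, Px)) / lam  # rank-1 update of the inverse
    return w, P

# Fit y = 2*x0 - x1 from streaming noiseless samples.
rng = np.random.default_rng(0)
w, P = np.zeros(2), np.eye(2) * 100.0
for _ in range(200):
    x = rng.standard_normal(2)
    w, P = rls_step(w, P, x, 2 * x[0] - x[1])
print(np.round(w, 3))                # converges to ~[2, -1]
```

The matrix P maintained by this recursion is the same inverse input autocorrelation matrix that our importance evaluation combines with the weights, which is why the pruning criterion comes at essentially no extra cost during RLS training.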
Data Availability
All data included in this study are available from the corresponding author upon reasonable request.
References
Gabor M, Zdunek R (2023) Compressing convolutional neural networks with hierarchical tucker-2 decomposition. Appl Soft Comput 132:109856. https://doi.org/10.1016/j.asoc.2022.109856
Liu H, Liu T, Zhang Z, Sangaiah AK, Yang B, Li Y (2022) ARHPE: Asymmetric relation-aware representation learning for head pose estimation in industrial human-computer interaction. IEEE Trans Industr Inform 18(10):7107–7117. https://doi.org/10.1109/TII.2022.3143605
Liu H, Liu T, Chen Y, Zhang Z, Li Y-F (2022) EHPE: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans Multimedia, pp 1–12. https://doi.org/10.1109/TMM.2022.3197364
Liu T, Wang J, Yang B, Wang X (2021) NGDNet: Nonuniform gaussian-label distribution learning for infrared head pose estimation and on-task behavior understanding in the classroom. Neurocomputing 436:210–220. https://doi.org/10.1016/j.neucom.2020.12.090
Liu H, Fang S, Zhang Z, Li D, Lin K, Wang J (2022) MFDNet: Collaborative poses perception and matrix fisher distribution for head pose estimation. IEEE Trans Multimedia 24:2449–2460. https://doi.org/10.1109/TMM.2021.3081873
Liu H, Nie H, Zhang Z, Li Y (2021) Anisotropic angle distribution learning for head pose estimation and attention understanding in human-computer interaction. Neurocomputing 433:310–322. https://doi.org/10.1016/j.neucom.2020.09.068
Li Z, Liu H, Zhang Z, Liu T, Xiong NN (2022) Learning knowledge graph embedding with heterogeneous relation attention networks. IEEE Trans Neural Netw Learn Syst 33(8):3961–3973. https://doi.org/10.1109/TNNLS.2021.3055147
LeCun Y, Bengio Y, Hinton GE (2015) Deep learning. Nature 521(7553):436–444. https://doi.org/10.1038/nature14539
Li S, Sun Y, Yen GG, Zhang M (2021) Automatic design of convolutional neural network architectures under resource constraints. IEEE Trans Neural Netw Learn Syst, pp 1–15. https://doi.org/10.1109/TNNLS.2021.3123105
Liu H, Zheng C, Li D, Shen X, Lin K, Wang J, Zhang Z, Zhang Z, Xiong NN (2022) EDMF: Efficient deep matrix factorization with review feature learning for industrial recommender system. IEEE Trans Industr Inform 18(7):4361–4371. https://doi.org/10.1109/TII.2021.3128240
Kocacinar B, Tas B, Akbulut FP, Catal C, Mishra D (2022) A real-time cnn-based lightweight mobile masked face recognition system. IEEE Access 10:63496–63507. https://doi.org/10.1109/ACCESS.2022.3182055
Cheng J, Wang P, Li G, Hu Q, Lu H (2018) Recent advances in efficient computation of deep convolutional neural networks. Front Inf Technol Electron Eng 19(1):64–77. https://doi.org/10.1631/FITEE.1700789
Liang T, Glossner J, Wang L, Shi S, Zhang X (2021) Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 461:370–403. https://doi.org/10.1016/j.neucom.2021.07.045
Cheng Y, Wang X, Xie X, Li W, Peng S (2022) Channel pruning guided by global channel relation. Appl Intell 52(14):1–12. https://doi.org/10.1007/s10489-022-03198-9
Hasan MS, Alam R, Adnan MA (2023) Compressed neural architecture utilizing dimensionality reduction and quantization. Appl Intell 53(2):1271–1286. https://doi.org/10.1007/s10489-022-03221-z
Yu Z, Shi Y (2022) Kernel quantization for efficient network compression. IEEE Access 10:4063–4071. https://doi.org/10.1109/ACCESS.2022.3140773
Wang J, Zhu L, Dai T, Xu Q, Gao T (2021) Low-rank and sparse matrix factorization with prior relations for recommender systems. Appl Intell 51(6):3435–3449. https://doi.org/10.1007/s10489-020-02023-5
Chen Y, Wu H, Chen Y, Liu R, Ye H, Liu S (2021) Design of new compact multi-layer quint-band bandpass filter. IEEE Access 9:139438–139445. https://doi.org/10.1109/ACCESS.2021.3116807
Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: A survey. Int J Comput Vis 129:1789–1819. https://doi.org/10.1007/s11263-021-01453-z
Xu C, Gao W, Li T, Bai N, Li G, Zhang Y (2023) Teacher-student collaborative knowledge distillation for image classification. Appl Intell 53(2):1997–2009. https://doi.org/10.1007/s10489-022-03486-4
Yang W, Xiao Y (2022) Structured pruning via feature channels similarity and mutual learning for convolutional neural network compression. Appl Intell 52(12):14560–14570. https://doi.org/10.1007/s10489-022-03403-9
Yang C, Liu H (2022) Channel pruning based on convolutional neural network sensitivity. Neurocomputing 507:97–106. https://doi.org/10.1016/j.neucom.2022.07.051
LeCun Y, Denker JS, Solla SA (1989) Optimal brain damage. In: Touretzky DS (ed) Advances in neural information processing systems 2, NIPS Conference, Denver, Colorado, USA, November 27-30, 1989, pp 598–605. https://dl.acm.org/doi/10.5555/109230.109298
He Y, Dong X, Kang G, Fu Y, Yan C, Yang Y (2020) Asymptotic soft filter pruning for deep convolutional neural networks. IEEE Trans Cybern 50(8):3594–3604. https://doi.org/10.1109/TCYB.2019.2933477
Li G, Xu G (2021) Providing clear pruning threshold: A novel CNN pruning method via \(\ell _0\) regularisation. IET Image Process 15(2):405–418. https://doi.org/10.1049/ipr2.12030
Xu S, Chen H, Gong X, Liu K, Lü J, Zhang B (2021) Efficient structured pruning based on deep feature stabilization. Neural Comput Appl 33(13):7409–7420. https://doi.org/10.1007/s00521-021-05828-8
Wei H, Wang Z, Hua G, Sun J, Zhao Y (2022) Automatic group-based structured pruning for deep convolutional networks. IEEE Access 10:128824–128834. https://doi.org/10.1109/ACCESS.2022.3227619
Frankle J, Carbin M (2019) The lottery ticket hypothesis: Finding sparse, trainable neural networks. Paper presented at the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019
Li H, Kadav A, Durdanovic I, Samet H, Graf HP (2017) Pruning filters for efficient convNets. Paper presented at the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017
Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C (2017) Learning efficient convolutional networks through network slimming. Paper presented at the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017
Molchanov P, Tyree S, Karras T, Aila T, Kautz J (2017) Pruning convolutional neural networks for resource efficient inference. Paper presented at the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017
Molchanov P, Mallya A, Tyree S, Frosio I, Kautz J (2019) Importance estimation for neural network pruning. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019
Chen Y, Wen X, Zhang Y, Shi W (2021) CCPrune: Collaborative channel pruning for learning compact convolutional networks. Neurocomputing 451:35–45. https://doi.org/10.1016/j.neucom.2021.04.063
Li X (2018) Preconditioned stochastic gradient descent. IEEE Trans Neural Netw Learn Syst 29(5):1454–1466. https://doi.org/10.1109/TNNLS.2017.2672978
Zhang C, Song Q, Zhou H, Ou Y, Deng H, Yang LT (2021) Revisiting recursive least squares for training deep neural networks. Preprint at https://arxiv.org/abs/2109.03220
Chen Y, Hero AO (2012) Recursive \(\ell _{1,\infty }\) group lasso. IEEE Trans Signal Process 60(8):3978–3987. https://doi.org/10.1109/TSP.2012.2192924
Bruce AL, Goel A, Bernstein DS (2020) Recursive least squares with matrix forgetting. Paper presented at the 2020 American Control Conference, ACC 2020, Denver, CO, USA, July 1-3, 2020
Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann Math Stat 21:124–127. https://doi.org/10.1214/aoms/1177729893
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, Computer Science Department, University of Toronto
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. https://doi.org/10.1109/5.726791
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. Paper presented at the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015
Zhang G, Xu S, Li J, Guo AJX (2022) Group-based network pruning via nonlinear relationship between convolution filters. Appl Intell 52(8):9274–9288. https://doi.org/10.1007/s10489-021-02907-0
Acknowledgements
This work is supported by the National Natural Science Foundation of China (grant nos. 61762032 and 11961018).
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Yu, T., Zhang, C., Ma, M. et al. Recursive least squares method for training and pruning convolutional neural networks. Appl Intell 53, 24603–24618 (2023). https://doi.org/10.1007/s10489-023-04740-z