1 Introduction

Convolutional neural networks (CNNs) are the most widely used class of deep neural networks (DNNs) [1,2,3]. CNNs are well suited for computer vision tasks [4,5,6,7] since they can extract sample features from images at different abstraction levels through convolutional and pooling mechanisms [8]. However, CNNs generally have high computational and storage costs, which hinder their widespread application to some extent [9, 10]. In particular, in the past decade, mobile devices, such as smartphones, wearable devices and drones, have been increasingly used. There is a growing demand to deploy CNNs on these devices, but their computational and storage capacities are much lower than those of conventional computers [11]. Therefore, how to compress CNNs has become an important research focus in deep learning.

At present, five categories of model compression algorithms have been proposed for CNNs [12]. The first category is network pruning, which prunes redundant channels or nodes [13, 14]. The second category is parameter quantization, which reduces the bit width of all parameters to lower the computational and storage costs [15, 16]. The third category is low-rank factorization, which decomposes three-dimensional filters into two-dimensional filters [17]. The fourth category is filter compacting, which uses compact filters to replace loose and overparameterized filters [18]. The last category is knowledge distillation, in which knowledge is acquired from the original network and used to generate a smaller network [19, 20]. Among these categories, network pruning has received the most research attention [21]. Thus, we focus on this type of model compression algorithm in this paper.

Network pruning methods can be further classified into unstructured and structured pruning methods [22]. In theory, unstructured pruning methods can prune arbitrary redundant nodes in convolutional layers and achieve high compression ratios. However, they are difficult to implement since they destroy the form of the weight matrices. To address this issue, almost all existing unstructured pruning algorithms, such as Optimal Brain Damage [23], Soft Channel Pruning [24] and \(\ell _0\) Minimization [25], zero out the unimportant weights in simulation experiments rather than actually removing the redundant nodes. In contrast, structured pruning methods aim to remove unimportant channels. These methods preserve the structure of the weight matrices and are thus more practical and popular, although their pruning granularity is coarser.

Structured pruning methods usually include three stages: training, pruning and fine-tuning (also called retraining in some papers) [26, 27]. According to the number of pruning operations, these methods can be divided into one-shot structured pruning and iterative structured pruning [28]. The former performs pruning and fine-tuning only once and thus requires fewer epochs to obtain the compressed model. However, its compression ratio and accuracy rely heavily on the given pruning ratio. In other words, it is often difficult to obtain the optimal compressed model with one-shot pruning. In contrast, the latter performs multiple pruning and fine-tuning operations, which may lead to better results; however, multiple operations are very time-consuming, especially for large-scale neural networks. There is still much debate about what kind of structured pruning approach is best for different scenarios.

In recent years, researchers have proposed many structured pruning algorithms. For example, Li et al. proposed a one-shot pruning algorithm called the \(\ell _1\)-norm [29], which evaluates and prunes unimportant output channels by using \(\ell _1\) regularization for the weights of the convolutional layers. Liu et al. proposed another one-shot pruning algorithm called Network Slimming [30], which prunes channels by using \(\ell _1\) regularization for the scaling factors in the batch normalization layers. Molchanov et al. proposed two iterative pruning algorithms called Taylor FO [31] and Taylor SO [32], which use the first- and second-order Taylor expansions to estimate the contribution of each channel to the final loss, respectively, and remove the channels with scores smaller than a given threshold. Chen et al. proposed another iterative pruning algorithm called Collaborative Channel Pruning [33], which evaluates and removes unimportant channels by combining the convolution layer weights and batch normalization layer scaling factors.

Although researchers claim that these algorithms can effectively compress CNNs, they still have three common shortcomings. The first shortcoming is that they use stochastic gradient descent (SGD) to optimize CNNs during the training and fine-tuning stages. It is well known that SGD converges slowly and can be difficult to tune [34], so these algorithms require more training epochs. The second shortcoming is that they mainly use the weight magnitude to prune unimportant output channels. In fact, the training and fine-tuning results are influenced by the dataset, and the redundancy of the input features in each layer is the real reason channels can be pruned; the weight magnitude alone therefore cannot evaluate this redundancy accurately. The third shortcoming is that the pruning ratio is manually and empirically set to a fixed value by users, which may cause underpruning or overpruning. In addition to these three shortcomings, existing iterative structured pruning algorithms have another shortcoming in that the number of pruning operations and the repruning timing are manually set by users.

To overcome the above shortcomings, we propose a novel iterative structured pruning algorithm in this paper. In our previous work [35], we proposed a recursive least squares (RLS) optimization algorithm, which can be viewed as a special SGD algorithm with the inverse input autocorrelation matrix as the learning rate. Compared with SGD and Adam optimization, the RLS optimization has better convergence speed and quality. Our proposed algorithm is built on this optimization algorithm. In addition to using the RLS optimization to train and fine-tune CNNs, it combines inverse input autocorrelation matrices with weight matrices to evaluate and prune unimportant input channels or nodes, and it automatically performs the next pruning operation when the testing loss is tuned down to the last unpruned level. Our algorithm can also be used to prune feedforward neural networks (FNNs). We validate its effectiveness in pruning VGG-16, ResNet-50 and an FNN on the CIFAR-10, CIFAR-100 and MNIST datasets. Compared with existing iterative pruning algorithms, our algorithm prunes CNNs and FNNs multiple times within fewer epochs and with a smaller accuracy loss. In addition, it can adaptively prune CNNs according to the difficulty of the learning task.

The key contributions of this paper can be summarized as follows:

  1) We use the RLS optimization rather than SGD to accelerate our algorithm and all comparative pruning algorithms used in the experiments.

  2) We present a novel iterative structured pruning algorithm that combines inverse input autocorrelation matrices and weight matrices to evaluate and prune unimportant input channels or nodes.

  3) We suggest the testing loss as the repruning criterion of our algorithm and all comparative iterative pruning algorithms. Each repruning operation is performed when the testing loss is tuned down to the last unpruned level.

  4) We conduct extensive experiments to verify the effectiveness of our algorithm. Our algorithm can effectively reduce both the number of floating-point operations (FLOPs) and number of parameters for CNNs and FNNs with a small accuracy loss. In particular, the experiments on FNNs show that our algorithm can prune the original sample features.

The remainder of this paper is organized as follows: In Section 2, we review the RLS algorithm and the RLS optimization for CNNs. In Section 3, we introduce our algorithm in detail. In Section 4, we present the experimental settings and results. Finally, we summarize this paper in Section 5.

2 Background

In this section, we introduce the background knowledge and some notations used in this paper. We first review the derivation of the RLS algorithm and then review the learning mechanism of CNNs with the RLS optimization.

2.1 Recursive least squares

RLS is a popular adaptive filtering algorithm with fast convergence speed. This algorithm recursively determines the weights that minimize the weighted linear least squares loss function based on the input signal [36]. Compared with the linear least squares algorithm, it is more suitable for online learning.

Let \(\mathbb {X}_t=\{{{\textbf {x}}}_1,\cdots ,{{\textbf {x}}}_t\}\) denote all sample inputs from the starting step to the current step, and let \(\mathbb {Y}_t^* = \{y_1^{*},\cdots ,y_t^{*}\}\) denote the corresponding target outputs. On this basis, the quadratic minimization problem solved by the RLS algorithm over time t is defined as

$$\begin{aligned} {{\textbf {w}}}_t =\underset{{\varvec{w}}}{\arg \min } \frac{1}{2}\sum \limits _{i=1}^t\lambda ^{t-i}{(y_i^{*}-{\varvec{w}}^\text {T}{\varvec{x}}_i)}^2 \end{aligned}$$
(1)

where \({{\textbf {w}}}\) is the weight vector, and \(\lambda \in (0,1]\) is the forgetting factor, which enhances the importance of recent data over older data [37]. Setting \(\nabla _{\varvec{w}} \frac{1}{2}\sum _{i=1}^t\lambda ^{t-i}{(y_i^{*}-{\varvec{w}}^\text {T}{\varvec{x}}_i)}^2 =\varvec{0}\) and solving for \({\varvec{w}}\), we obtain

$$\begin{aligned} \varvec{w}=(\sum \limits _{i=1}^t\lambda ^{t-i}{\varvec{x}}_i{\varvec{x}}_i^\text {T})^{-1}\sum \limits _{i=1}^t\lambda ^{t-i}{\varvec{x}}_iy_i^{*} \end{aligned}$$
(2)

We define \({{\textbf {A}}}_t\) and \({{\textbf {b}}}_t\) as follows:

$$\begin{aligned} {{\textbf {A}}}_t =\sum \limits _{i=1}^t\lambda ^{t-i}{{\textbf {x}}}_i{{\textbf {x}}}_i^\text {T}\end{aligned}$$
(3)
$$\begin{aligned} {{\textbf {b}}}_t =\sum \limits _{i=1}^t\lambda ^{t-i}{{\textbf {x}}}_iy_i^{*}~ \end{aligned}$$
(4)

Based on (2) with the definitions (3) and (4), the solution \({{\textbf {w}}}_t\) to (1) can be written as

$$\begin{aligned} {{\textbf {w}}}_t = {{\textbf {A}}}_t^{-1}{{\textbf {b}}}_t \end{aligned}$$
(5)

To avoid calculating the inverse of \({{\textbf {A}}}_t\) in (5), we define the inverse input autocorrelation matrix \({{\textbf {P}}}_t = ({{\textbf {A}}}_t)^{-1}\). Equations (3) and (4) show that \({{\textbf {A}}}_t\) and \({{\textbf {b}}}_t\) can be computed recursively as follows:

$$\begin{aligned} {{\textbf {A}}}_t =\lambda {{\textbf {A}}}_{t-1}+{{\textbf {x}}}_t{{\textbf {x}}}_t^\text {T} \end{aligned}$$
(6)
$$\begin{aligned} {{\textbf {b}}}_t =\lambda {{\textbf {b}}}_{t-1}+{{\textbf {x}}}_ty_t^{*} ~ \end{aligned}$$
(7)

Applying the Sherman-Morrison matrix inversion formula [38] to (6), we obtain

$$\begin{aligned} {\textbf {P}}_t = \frac{1}{\lambda }{} {\textbf {P}}_{t-1} - \frac{1}{\lambda h_t}{} {\textbf {u}}_t({\textbf {u}}_t )^\text {T} \end{aligned}$$
(8)

where \({\textbf {u}}_t\) and \(h_t\) are defined as follows:

$$\begin{aligned} {\textbf {u}}_t = {\textbf {P}}_{t-1}{} {\textbf {x}}_t ~~ \end{aligned}$$
(9)
$$\begin{aligned} h_t = \lambda +{\textbf {u}}_t^\text {T}{} {\textbf {x}}_t \end{aligned}$$
(10)

Substituting (7) and (8) into (5), we obtain

$$\begin{aligned} {{\textbf {w}}}_t = {{\textbf {w}}}_{t-1} - \frac{1}{h_t}{{\textbf {u}}}_te_t \end{aligned}$$
(11)

where \(e_t\) is defined as

$$\begin{aligned} e_t = {{\textbf {w}}}^\text {T}_{t-1}{{\textbf {x}}}_t -y_t^{*} \end{aligned}$$
(12)

Finally, we obtain the RLS algorithm, which is defined by (8) to (12).
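To make the recursion concrete, the following minimal sketch implements one step of (8) to (12) for a single input-target pair. The function name and the initialization suggested in the trailing comment (\({{\textbf {w}}}_0=\varvec{0}\), \({{\textbf {P}}}_0={\textbf {I}}\)) are our own illustration rather than part of the derivation above.

```python
import torch

def rls_step(w, P, x, y_target, lam=1.0):
    """One recursive least squares update following (8)-(12).

    w: weight vector, shape (n,)
    P: inverse input autocorrelation matrix, shape (n, n)
    x: current input vector, shape (n,)
    y_target: scalar target output
    lam: forgetting factor in (0, 1]
    """
    u = P @ x                              # (9)
    h = lam + u @ x                        # (10)
    P = (P - torch.outer(u, u) / h) / lam  # (8)
    e = w @ x - y_target                   # (12): a priori error
    w = w - (e / h) * u                    # (11)
    return w, P

# Typical usage: start from w = torch.zeros(n) and P = torch.eye(n),
# then call rls_step once for every incoming pair (x_t, y_t*).
```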

2.2 CNNs with RLS optimization

The RLS optimization is a special type of SGD algorithm with the inverse input autocorrelation matrix as the learning rate [35]. Due to the fast convergence speed of the RLS algorithm, it can efficiently optimize CNNs. A CNN generally consists of an input layer followed by some number of convolutional layers, pooling layers and fully-connected layers [39]. Since the pooling layers have no learnable weights, we need to review the RLS optimization only for the convolutional and fully-connected layers.

Fig. 1 Forward-propagation learning of the \(m^{th}\) sample in the current minibatch in the \(l^{th}\) CNN layer. (a): \({\textbf {Y}}_{t(m,i)}^{l-1}\) and \(\textbf{Y}_{t(m,j)}^{l}\) are the \(i^{th}\) channel input and the \(j^{th}\) channel output, \(\textbf{W}_{t-1(i,j)}^{l}\) is the filter weight matrix between the \(i^{th}\) input channel and the \(j^{th}\) output channel, \(\textbf{R}_{t(m,i,u,v)}^{l-1}\) is the receptive field matrix in the \(i^{th}\) input channel for the output \(\textbf{Y}_{t(m,j,u,v)}^{l}\), \(C_{l-1}\) and \(C_{l}\) are the number of input and output channels, and \(f_l(\cdot )\) is the activation function. (b): \(\textbf{Y}_{t(m,i)}^{l-1}\) and \(\textbf{Y}_{t(m,j)}^{l}\) are the \(i^{th}\) node input and the \(j^{th}\) node output, \(\textbf{W}_{t-1(i,j)}^{l}\) is the weight between the \(i^{th}\) input node and the \(j^{th}\) output node, and \(N_{l-1}\) and \(N_{l}\) are the number of input and output nodes

Let \({\textbf {Y}}_{t}^{0}\) and \({\textbf {Y}}_{t}^{*}\) denote the input and the target output of the current training minibatch, respectively, and let L denote the total number of convolutional and fully-connected layers. The forward-propagation learning of the \(m^{th}\) sample in the current minibatch in the \(l^{th}\) CNN layer is illustrated in Fig. 1. For brevity, we omit the bias term in each layer. Based on these notations, we briefly introduce the RLS optimization rules for CNNs. According to [35], the recursive update rule of the inverse input autocorrelation matrix \({\textbf {P}}_t^l\) in the \(l^{th}\) layer is defined as

$$\begin{aligned} {\textbf {P}}_t^l \approx \frac{1}{\lambda } {\textbf {P}}_{t-1}^l - \frac{k}{\lambda h_t^l} {\textbf {u}}_{t}^l ({\textbf {u}}_{t}^l)^{\text {T}} \end{aligned}$$
(13)

where \(k>0\) is the average scaling factor and \({\textbf {u}}_t^l\) and \(h_t^l\) are defined as follows:

$$\begin{aligned} {\textbf {u}}_t^l={\textbf {P}}_{t-1}^l {{{\textbf {x}}}}_{t}^l ~~~~ \end{aligned}$$
(14)
$$\begin{aligned} h_{t}^l=\lambda +k({{{\textbf {x}}}}_{t}^l)^{\text {T}}{} {\textbf {u}}_t^l \end{aligned}$$
(15)

where \({{{\textbf {x}}}}_{t}^l\) is the average vector. If the \(l^{th}\) layer is a convolutional layer, \({\textbf {{x}}}_{t}^l \in \mathbb {R}^{C_{l-1} H_l W_l}\) is defined as

$$\begin{aligned} {{\textbf {x}}}_{t}^l=\frac{1}{M_t U_l V_l} \sum \limits _{m=1}^{M_t} \sum \limits _{u=1}^{U_l} \sum \limits _{v=1}^{V_l} (flatten({\textbf {R}}_{t(m,:,u,v,:,:)}^{l-1}))^{\text {T}} \end{aligned}$$
(16)

where \(H_l\) and \(W_l\) denote the height and width of the filters, \(M_t\) denotes the current minibatch size, \(U_l\) and \(V_l\) denote the height and width of the output channels, and \(flatten(\cdot )\) denotes reshaping the given matrix or tensor into a column vector. If the \(l^{th}\) layer is a fully-connected layer, \({{{\textbf {x}}}}_{t}^l\) is defined as

$$\begin{aligned} {{\textbf {x}}}_{t}^l=\frac{1}{M_t} \sum \limits _{m=1}^{M_t} ({\textbf {Y}}_{t(m,:)}^{l-1})^{\text {T}} \end{aligned}$$
(17)

Note that if the preceding layer is a convolutional or pooling layer, \(\textbf{Y}_{t(m,:)}^{l-1}\) denotes the flattened vector of the preceding layer's output. In the RLS optimization algorithm, the filter weight tensor \(\textbf{W}_{t-1}^{l}\) is reshaped into a weight matrix, denoted by the same symbol, by defining \(\mathbf {{W}}_{t-1(:,j)}^{l}=flatten(\textbf{W}_{t-1(:,j,:,:)}^{l})\). In addition, the algorithm uses momentum to accelerate the convergence rate. Thus, regardless of whether the \(l^{th}\) layer is a convolutional layer or a fully-connected layer, the recursive update rule of \(\textbf{W}_{t-1}^{l}\) is defined as follows:

$$\begin{aligned} \varvec{\Psi }_t^l = \alpha \varvec{\Psi }_{t-1}^l - \frac{\eta ^l}{ h_t^l} {\textbf {P}}_{t-1}^l {\mathbf {\nabla }}_{{\textbf {W}}_{t-1}^{l}} \end{aligned}$$
(18)
$$\begin{aligned} \textbf{W}_t^l \approx {\textbf {W}}_{t-1}^l + \varvec{\Psi }_t^l ~~~~~~~~~ \end{aligned}$$
(19)

where \(\varvec{\Psi }_t^l\) is the velocity matrix of the \(l^{th}\) layer at step t, \(\alpha \) is the momentum factor, \(\eta ^l > 0\) is the gradient scaling factor, and \(\mathbf {\nabla }_{\textbf{W}_{t-1}^{l}}\) is the equivalent gradient of the linear output loss function \(\mathcal {L}_t\) with respect to \({\textbf {W}}_{t-1}^{l}\). \(\mathcal {L}_t\) is defined as

$$\begin{aligned} \mathcal {L}_t =\frac{1}{2M_t} \left\| {\textbf {Z}}_{t}^{L}-{\textbf {Z}}_{t}^{*}\right\| _\textsc {F}^2 \end{aligned}$$
(20)

where \(\textbf{Z}_{t}^{L} =f_L^{-1}({\textbf {Y}}_{t}^L)\) is the linear output matrix and \(\textbf{Z}_{t}^{*}=f_L^{-1}({\textbf {Y}}_{t}^{*})\) is the desired linear output matrix corresponding to \(\textbf{Z}_{t}^{L}\). Note that the RLS optimization assumes that \(f_L(\cdot )\) is monotonic in the output layer [35]. In addition, the RLS optimization can be used for FNNs, since the above equations except for (16) can also be used for fully-connected layers and the last part of a CNN can generally be viewed as an FNN.
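As a minimal illustration, the sketch below applies (13) to (19) to a single layer whose filter tensor has already been flattened into a weight matrix. The function name, argument layout and default hyperparameter values (taken from the settings later reported in Section 4.1) are our own assumptions, and the equivalent gradient is assumed to have been computed by backpropagation as in [35].

```python
import torch

def rls_layer_update(W, P, Psi, x_bar, grad_W,
                     lam=1.0, k=0.1, alpha=0.5, eta=1.0):
    """Per-layer RLS update following (13)-(19).

    W:      flattened weight matrix of the layer, shape (n_in, n_out)
    P:      inverse input autocorrelation matrix, shape (n_in, n_in)
    Psi:    velocity (momentum) matrix, same shape as W
    x_bar:  average input vector x_t^l from (16) or (17), shape (n_in,)
    grad_W: equivalent gradient of the loss (20) w.r.t. W, shape (n_in, n_out)
    """
    u = P @ x_bar                                 # (14), uses P_{t-1}
    h = lam + k * (x_bar @ u)                     # (15)
    Psi = alpha * Psi - (eta / h) * (P @ grad_W)  # (18), uses P_{t-1}
    W = W + Psi                                   # (19)
    P = (P - (k / h) * torch.outer(u, u)) / lam   # (13)
    return W, P, Psi
```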

3 The proposed algorithm

In this section, we present our iterative structured pruning algorithm. We first introduce the theoretical foundation of our algorithm. Then, we describe our pruning strategy in detail and show the overall pseudocode of our algorithm.

3.1 Theoretical foundation

As discussed in Section 2.1, \({\textbf {A}}_t\) is the input autocorrelation matrix. Suppose that \({{\textbf {x}}}_t\) has n features, namely, \({{\textbf {x}}}_t=[{\textbf {x}}_{t(1)},{\textbf {x}}_{t(2)},\cdots ,{\textbf {x}}_{t(n)}]^\text {T}\). Then, \({{\textbf {x}}}_t {{\textbf {x}}}_t^\text {T}\) can be expressed as

$$\begin{aligned} {\textbf {x}}_t {\textbf {x}}_t^\text {T} = \left[ \begin{array}{cccc} {\textbf {x}}_{t(1)}{\textbf {x}}_{t(1)} &{} {\textbf {x}}_{t(1)}{\textbf {x}}_{t(2)} &{} \cdots &{} {\textbf {x}}_{t(1)}{\textbf {x}}_{t(n)} \\ {\textbf {x}}_{t(2)}{\textbf {x}}_{t(1)} &{} {\textbf {x}}_{t(2)}{\textbf {x}}_{t(2)} &{} \cdots &{} {\textbf {x}}_{t(2)}{\textbf {x}}_{t(n)} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\textbf {x}}_{t(n)}{\textbf {x}}_{t(1)} &{} {\textbf {x}}_{t(n)}{\textbf {x}}_{t(2)} &{} \cdots &{} {\textbf {x}}_{t(n)}{\textbf {x}}_{t(n)} \end{array} \right] \end{aligned}$$
(21)

Let \(s_{{\textbf {A}}_{t(i)}}\) denote the sum of the \(i^{th}\) row (or column) in \({\textbf {A}}_t\). According to (3), (6) and (21), \(s_{{\textbf {A}}_{t(i)}}\) can be defined as

$$\begin{aligned} s_{{{\textbf {A}}}_{t(i)}} = \sum \limits _{k=1}^t\lambda ^{t-k} {\textbf {x}}_{k(i)} s_{{{\textbf {x}}}_k}=\lambda s_{{{\textbf {A}}}_{t-1(i)}} + {\textbf {x}}_{t(i)} s_{{{\textbf {x}}}_t} \end{aligned}$$
(22)

where \(s_{{{\textbf {x}}}_t}\) is the sum of all features in \({{\textbf {x}}}_t\). In deep learning, we typically assume that all samples are independently and identically distributed [39]. CNNs are usually used for computer vision tasks, and they often use the ReLU activation function in the hidden layers; thus, we can suppose that \({{\textbf {x}}}_{t(i)}\ge 0\). In addition, the forgetting factor \(\lambda \) is generally close to 1. Therefore, \(s_{{{\textbf {x}}}_1},\cdots ,s_{{{\textbf {x}}}_t}\) are approximately equal, and \(s_{{{\textbf {A}}}_{t(i)}}\) is approximately proportional to the sum of \({{\textbf {x}}}_{1(i)},\cdots , {{\textbf {x}}}_{t(i)}\). In other words, if \(s_{{{\textbf {A}}}_{t(i)}}\) is small, the \(i^{th}\) features in \({{\textbf {x}}}_1,\cdots , {{\textbf {x}}}_t\) are probably small, and their influence on the outputs is probably small. Since \({\textbf {P}}_t\) is the inverse of \({\textbf {A}}_t\), we can easily draw the following conclusion: If the sum of the \(i^{th}\) row (or column) in \({\textbf {P}}_t\) is large, the importance of the \(i^{th}\) features in \({{\textbf {x}}}_1,\cdots , {{\textbf {x}}}_t\) will probably be small.

Thus, we can use \({\textbf {P}}_t^l\) to evaluate the importance of the input nodes in the \(l^{th}\) layer since \({\textbf {P}}_t^l\) has the same meaning as \({\textbf {P}}_t\). For fully-connected layers in CNNs, we can directly use this conclusion. However, for convolutional layers in CNNs, structured pruning methods aim to prune unimportant channels rather than nodes. Fortunately, according to (16) and Fig. 1(a), we know that \({{\textbf {x}}}_{t((i-1)H_l W_l+1)},{{\textbf {x}}}_{t((i-1)H_l W_l+2)},\cdots ,{{\textbf {x}}}_{t(iH_l W_l)}\) in \({{{\textbf {x}}}}_{t}^l\) come from the \(i^{th}\) input channel. Thus, we can use the sum of the \((i-1)H_l W_l+1\) to \(iH_l W_l\) rows (columns) in \({\textbf {P}}_t^l\) to evaluate the importance of the \(i^{th}\) input channel.

Fig. 2 Detailed pruning of each layer in a five-layer CNN with our algorithm. The \(l^{th}\) dashed border shows how \(\mathbb {I}_t^l(\xi )\) is computed and how the input channels or nodes are pruned in the \(l^{th}\) layer

Based on the above analysis, we define a vector \({\textbf {s}}_{{\textbf {P}}_t^l}\) to evaluate the importance of the input channels or nodes in the \(l^{th}\) layer. The \(i^{th}\) element in \({{\textbf {s}}}_{{{\textbf {P}}}_{t}^l}\) is defined as

$$\begin{aligned} {{\textbf {s}}}_{{{{\textbf {P}}}_t^l}(i)} = \left\{ \begin{aligned} \sum \limits _{j=N_{l(i)}^{cb}}^{N_{l(i)}^{ce}}\sum \limits _{k=1}^{N_{l}^{re}}{} {\textbf {P}}_{t(k,j)}^l ~~~~~l\le L_c \\ \sum \limits _{k=1}^{N_{l-1}}{} {\textbf {P}}_{t(k,i)}^l ~~~~~~~~~~l>L_c \end{aligned} \right. \end{aligned}$$
(23)

where \(N_{l(i)}^{cb}=(i-1)H_l W_l+1\), \(N_{l(i)}^{ce}=iH_l W_l\), \(N_{l}^{re}=C_{l-1}H_l W_l\), and \(L_c\) denotes the total number of convolutional layers. Equation (23) can be explained as follows: If the \(l^{th}\) layer is a convolutional layer (i.e. \(l\le L_c\)), \({\textbf {s}}_{{{{\textbf {P}}}_t^l}(i)}\) denotes the importance of the \(i^{th}\) input channel. If the \(l^{th}\) layer is a fully-connected layer (i.e. \(l>L_c\)), \({{\textbf {s}}}_{{{{\textbf {P}}}_t^l}(i)}\) denotes the importance of the \(i^{th}\) input node. Furthermore, the larger the value of \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}(i)}\) is, the more likely it is that the \(i^{th}\) input channel or node is unimportant.
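As a minimal sketch, (23) can be computed from \({\textbf {P}}_t^l\) as follows; the helper name and argument conventions are illustrative only.

```python
import torch

def importance_from_P(P, n_in_channels=None, HW=None):
    """Importance vector s_P of (23) for one layer.

    P:             inverse input autocorrelation matrix of the layer
    n_in_channels: C_{l-1}, passed only for convolutional layers (l <= L_c)
    HW:            H_l * W_l, passed only for convolutional layers
    A larger score means the input channel/node is more likely unimportant.
    """
    col_sums = P.sum(dim=0)                   # sum over all rows, per column
    if n_in_channels is None:                 # fully-connected layer (l > L_c)
        return col_sums
    # convolutional layer: group the C_{l-1}*H_l*W_l columns by input channel
    return col_sums.view(n_in_channels, HW).sum(dim=1)
```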

Algorithm 1 RLS Method for Training and Pruning CNNs

3.2 RLS-based pruning

As mentioned in Section 1, existing structured pruning algorithms have some shortcomings. In this subsection, we consider how to overcome these shortcomings and present our pruning algorithm.

Table 1 Comparison on the reductions in FLOPs, parameters and testing accuracy for all models pruned by our algorithm with different \(\xi \) values on different datasets

One shortcoming of existing algorithms is that they cannot prune the unimportant channels accurately since most algorithms use only the weight magnitude to evaluate the channel importance. To address this problem, we combine inverse input autocorrelation matrices with weight matrices to evaluate and prune the unimportant input channels in the convolutional layers and unimportant nodes in the fully-connected layers.

In Section 3.1, we defined \({{\textbf {s}}}_{{{{\textbf {P}}}_t^l}}\) by using the inverse input autocorrelation matrix \({{\textbf {P}}}_{t}^l\). Next, we define another vector \({{\textbf {s}}}_{{{{\textbf {W}}}_t^l}}\) by using the weight matrix \({{\textbf {W}}}_{t}^l\). Li et al. proposed the \(\ell _1\)-norm algorithm and demonstrated that the sum of the absolute filter weights can be used to evaluate the importance of the output channels [29]. It is well known that the output of the \((l-1)^{th}\) layer is the input of the \(l^{th}\) layer in CNNs. Thus, we can modify this approach to evaluate the importance of the input channels or nodes in the \(l^{th}\) layer. By using this approach, the \(i^{th}\) element in \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) is defined as

$$\begin{aligned} {{\textbf {s}}}_{{\textbf {W}}_t^{l}(i)} =\left\{ \begin{aligned} \sum \limits _{k=1}^{N_{l-1}^{re}} |{\textbf {W}}_{t(k,i)}^{l-1}|~~~~~~~2\le l\le L_c \\ \sum \limits _{k=1}^{N_{l-1}^{re}} |{\textbf {W}}_{t(k,c_{(i)})}^{l-1}|~~~~~ l = L_c+1\\ \sum \limits _{k=1}^{N_{l-1}} |{\textbf {W}}_{t(k,i)}^{l-1}|~~~~~~~~ l>L_c+1 \end{aligned} \right. \end{aligned}$$
(24)

where \(N_{l-1}^{re}=C_{l-2}H_{l-1} W_{l-1}\), \(c_{(i)}= \lfloor i/(U_{l-1}V_{l-1})\rfloor +1\), and \(|\cdot |\) denotes the absolute value of a real number. According to [29], the \(i^{th}\) input channel or node is more likely to be unimportant when \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}(i)}\) is smaller. Note that we only consider the convolutional and fully-connected layers in this paper since a pooling layer has no learnable weights and the same number of output channels as its preceding convolutional layer.
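The following sketch computes (24) from the previous layer's flattened weight matrix; the channel-major flattening assumed for the first fully-connected layer matches PyTorch's default ordering, and the helper itself is our own illustration.

```python
import torch

def importance_from_W(W_prev, layer_idx, L_c, UV=None):
    """Importance vector s_W of (24) for layer l >= 2.

    W_prev:    flattened weight matrix of layer l-1; column j corresponds to
               the j-th output channel or node of layer l-1
    layer_idx: index l of the current layer
    L_c:       total number of convolutional layers
    UV:        U_{l-1} * V_{l-1}, needed only when l = L_c + 1
    A smaller score means the input channel/node is more likely unimportant.
    """
    col_abs_sums = W_prev.abs().sum(dim=0)    # sum of |weights| per column
    if layer_idx != L_c + 1:
        return col_abs_sums                   # 2 <= l <= L_c or l > L_c + 1
    # first fully-connected layer: each flattened input node inherits the score
    # of the convolutional output channel it comes from, c(i) = i // UV
    return col_abs_sums.repeat_interleave(UV)
```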

Fig. 3 Comparison on retained parameters and testing loss versus the number of epochs for the VGG-16 model pruned by our algorithm with different \(\xi \) values on CIFAR-10 and CIFAR-100. Each marker point denotes that the model is pruned at the corresponding epoch

Fig. 4 Comparison on retained parameters and testing loss versus the number of epochs for the ResNet-50 model pruned by our algorithm with different \(\xi \) values on CIFAR-10 and CIFAR-100

Fig. 5 Comparison on retained parameters and testing loss versus the number of epochs for the three-layer FNN model pruned by our algorithm with different \(\xi \) values on MNIST

Next, we use \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}}\) and \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) to select unimportant input channels and nodes. Let \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) and \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\) denote the index sets of the top \(\xi \) unimportant input channels and nodes, which are determined by \({{\textbf {s}}}_{{\textbf {P}}_{t}^{l}}\) and \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) , respectively. Then, the index set of the pruned input channels or nodes in the \(l^{th}\) layer can be defined as

$$\begin{aligned} \mathbb {I}_t^l(\xi ) =\left\{ \begin{aligned} \mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi ) \cap \mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )~~~~~~~ l\ge 2 \\ \mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},0.5\xi )~~~~~~~~~~~~~~~~~~~~ l=1 \end{aligned} \right. \end{aligned}$$
(25)

where \(\xi \) is the preset pruning ratio. Note that we use only \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},0.5\xi )\) to prune the input channels in the first layer, because (24) shows that \({{\textbf {s}}}_{{\textbf {W}}_{t}^{l}}\) cannot be used to evaluate the importance of these channels. In addition, in the experiments presented in Section 4.2, we find that the size of \(\mathbb {I}_t^l(\xi )\) is approximately one half of the size of \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) or \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\) when \(l\ge 2\). Some readers may argue that our preset pruning ratio \(\xi \) is also set to a fixed value. However, according to (25), \(\mathbb {I}_t^l(\xi )\) is the intersection of \(\mathbb {A}_{t}^l ({{\textbf {s}}}_{{\textbf {P}}_t^{l}},\xi )\) and \(\mathbb {B}_{t}^l ({{\textbf {s}}}_{{\textbf {W}}_t^{l}},\xi )\), except in the first layer. Hence, our actual pruning ratio is lower than \(\xi \) during each pruning operation. In fact, in our experiments, we find that the actual pruning ratio gradually decreases as the iterative pruning process continues, and the final compression ratio of the CNNs is adaptively adjusted according to the difficulty of the learning task.
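A minimal sketch of (25) with illustrative names: the pruned index set is the intersection of the two top-\(\xi \) candidate sets, which is why the actual pruning ratio per operation stays below \(\xi \).

```python
import torch

def pruned_index_set(s_P, s_W=None, xi=0.4, first_layer=False):
    """Index set of input channels/nodes to prune in one layer, following (25)."""
    n = s_P.numel()
    if first_layer:                            # l = 1: only s_P is usable
        k = int(0.5 * xi * n)
        return set(torch.topk(s_P, k, largest=True).indices.tolist())
    k = int(xi * n)
    A = set(torch.topk(s_P, k, largest=True).indices.tolist())   # least important by s_P
    B = set(torch.topk(s_W, k, largest=False).indices.tolist())  # least important by s_W
    return A & B                               # intersection keeps the actual ratio below xi
```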

The remaining shortcomings of existing algorithms are addressed with the following two approaches: 1) We use the RLS optimization to improve the convergence rate of the training and fine-tuning stages. This is also convenient for the implementation of our pruning strategy. 2) We use the testing loss to determine the repruning timing at the end of a training epoch. Each repruning operation is performed only when the testing loss is tuned down to the last unpruned level.
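A minimal sketch of the second approach, with illustrative names: repruning is allowed at the end of an epoch only when the testing loss has returned to (or below) the level recorded just before the previous pruning operation, and only within the epoch range in which repruning is permitted.

```python
def should_reprune(test_loss, last_unpruned_loss, epoch, last_reprune_epoch):
    """Testing-loss criterion for triggering the next pruning operation.

    last_unpruned_loss is the testing loss recorded just before the previous
    pruning operation; last_reprune_epoch is the final epoch at which
    repruning is still allowed.
    """
    return epoch <= last_reprune_epoch and test_loss <= last_unpruned_loss
```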

Table 2 Comparison on the reductions in FLOPs, parameters and testing accuracy for the VGG-16 model pruned by different pruning algorithms on CIFAR-10 and CIFAR-100
Table 3 Comparison on the reductions in channels or nodes for each layer in the VGG-16 model pruned by different pruning algorithms on CIFAR-10 and CIFAR-100

Based on the above information, our iterative structured pruning algorithm is summarized in Algorithm 1, where \(epoch_{prune}\) denotes the first pruning epoch and loss denotes the testing loss defined by the mean squared error (MSE). Our algorithm can also be used to prune FNNs since it includes the training and pruning of fully-connected layers. In addition, the RLS optimization can replace the SGD optimization in existing pruning algorithms. To clarify how our algorithm prunes CNNs, the detailed pruning process of each layer in a five-layer CNN is illustrated in Fig. 2. The figure shows that, in theory, our algorithm can even prune the input layer; to the best of our knowledge, no existing pruning algorithm can do this.
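Since the pseudocode figure is not reproduced here, the following high-level sketch restates the iterative structure described in this section rather than copying Algorithm 1 verbatim; the three callables are placeholders for RLS training (Section 2.2), MSE testing-loss evaluation (20), and the per-layer pruning steps (23) to (25).

```python
def train_and_prune(rls_train_epoch, evaluate_test_loss, prune_once,
                    xi, epochs, epoch_prune, last_reprune_epoch):
    """High-level sketch of iterative RLS-based training and pruning.

    rls_train_epoch():    runs one epoch of RLS training
    evaluate_test_loss(): returns the current MSE testing loss
    prune_once(xi):       prunes every layer with preset ratio xi
    """
    last_unpruned_loss = float('inf')
    for epoch in range(1, epochs + 1):
        rls_train_epoch()
        loss = evaluate_test_loss()
        first_prune = (epoch == epoch_prune)
        # same criterion as should_reprune sketched above
        reprune = (epoch > epoch_prune and epoch <= last_reprune_epoch
                   and loss <= last_unpruned_loss)
        if first_prune or reprune:
            last_unpruned_loss = loss   # level to recover before the next repruning
            prune_once(xi)
    return last_unpruned_loss
```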

Table 4 Comparison on the reductions in FLOPs, parameters and testing accuracy for the ResNet-50 model pruned by different pruning algorithms on CIFAR-10 and CIFAR-100
Table 5 Comparison on the reductions in channels or nodes for each layer in the ResNet-50 model pruned by different pruning algorithms on CIFAR-10 and CIFAR-100

4 Experiments

In this section, we validate the effectiveness of our algorithm. We first introduce our experimental settings. Then, we evaluate the influence of the preset pruning ratio \(\xi \) on the performance of our algorithm. Finally, we present the comparison results of our algorithm versus four popular pruning algorithms.

4.1 Experimental settings

In our experiments, we use three benchmark datasets: CIFAR-10, CIFAR-100 [40] and MNIST [41]. CIFAR-10 and CIFAR-100 both include 60,000 \(32\times 32\) three-channel colour images, with 50,000 images used for training and 10,000 images used for testing. The CIFAR-100 classification task is more difficult than the CIFAR-10 classification task since CIFAR-100 has 100 image classes, while CIFAR-10 has only 10 image classes. MNIST consists of 10 classes of 70,000 \(28\times 28\) greyscale images, with 60,000 images used for training and 10,000 images used for testing.

Table 6 Comparison on the reductions in FLOPs, parameters and testing accuracy for the three-layer FNN model pruned by different pruning algorithms on MNIST
Table 7 Comparison on the reductions in nodes for each layer in the three-layer FNN model pruned by different pruning algorithms on MNIST

For the CIFAR-10 and CIFAR-100 datasets, we use VGG-16 and ResNet-50 models, which have the same architectures as the models presented in [42] and [43], respectively. The VGG-16 model has \(3.4\times 10^7\) parameters and requires \(3.3\times 10^8\) FLOPs, and the ResNet-50 model has \(2.3\times 10^7\) parameters and requires \(1.3\times 10^9\) FLOPs. To use (20) as the training loss function, we remove their Softmax activation functions. In addition, we prune only the second layer in each residual block of the ResNet-50 model, since the first and last layers use \(1\times 1\) filters to adjust the number of channels and account for only a small share of the parameters and FLOPs. For the MNIST dataset, we use a three-layer FNN model, which has 1024 ReLU nodes, 512 ReLU nodes and 10 Identity nodes. The FNN model has \(1.3\times 10^6\) parameters and requires \(1.3\times 10^6\) FLOPs.

We compare the performance of our algorithm with that of four popular pruning algorithms: \(\ell _1\)-norm, Network Slimming, Taylor FO and Taylor SO. The first two algorithms are one-shot pruning algorithms, and the last two are iterative pruning algorithms. These algorithms originally used SGD for training and fine-tuning and were designed to prune only convolutional layers. In some CNNs, such as VGG-16, fully-connected layers have more parameters than convolutional layers, and they are more likely to cause overfitting than convolutional layers. To ensure a fair comparison, we modify these algorithms to prune convolutional and fully-connected layers and replace the SGD optimization with the RLS optimization. In addition, for the Taylor FO and Taylor SO algorithms, we also use the testing loss as the repruning criterion.

All experiments are conducted using PyTorch on an NVIDIA GeForce 1080Ti GPU with a minibatch size of 64. All algorithms use the RLS optimization, in which \({\textbf {W}}_0^l\) is initialized to the PyTorch default value, and \(\lambda \), k, \(\alpha \), \(\eta ^l\), \(\varvec{\Psi }_0^l\) and \({\textbf {P}}_0^l\) are set or initialized to 1, 0.1, 0.5, 1, the zero matrix and the identity matrix, respectively. Their pruning ratios are determined and presented in Section 4.2. For the VGG-16, ResNet-50 and three-layer FNN models, all algorithms are run for 200, 300 and 200 epochs, all the first or one-shot pruning operations are performed at the end of the \(30^{th}\), \(60^{th}\) and \(30^{th}\) epoch, and all the repruning operations are performed before the \(160^{th}\), \(250^{th}\) and \(170^{th}\) epoch, respectively. This is because all iterative pruning algorithms require some epochs to fine-tune the models after the last repruning operation and the ResNet-50 model converges more slowly than the VGG-16 and FNN models.
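For reference, the settings above can be collected as follows; the dictionary layout is our own and the per-layer matrices are created separately.

```python
# RLS optimization settings reported in this subsection (a reference sketch).
rls_config = {
    "minibatch_size": 64,
    "lambda": 1.0,   # forgetting factor
    "k": 0.1,        # average scaling factor
    "alpha": 0.5,    # momentum factor
    "eta": 1.0,      # gradient scaling factor per layer
}
# Per layer: Psi_0^l is the zero matrix, P_0^l is the identity matrix,
# and W_0^l uses the PyTorch default initialization.
```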

4.2 Influence of the preset pruning ratio

The pruning ratio is perhaps the most important hyperparameter in pruning algorithms. In this subsection, we evaluate the influence of the preset pruning ratio \(\xi \) on the performance of our algorithm and determine the pruning ratios of the other four comparative pruning algorithms.

The comparison results of the reductions in FLOPs, parameters and testing accuracy for the VGG-16, ResNet-50 and three-layer FNN models pruned by our algorithm with different \(\xi \) values on CIFAR-10, CIFAR-100 and MNIST datasets are summarized in Table 1. As \(\xi \) increases, the FLOPs\(\downarrow \) and the Parameters\(\downarrow \) of all models increase significantly, but their pruned accuracies have small or even no reductions. These results prove that our algorithm can effectively prune the unimportant input channels and nodes in CNNs and FNNs. Moreover, the pruned VGG-16 and pruned ResNet-50 models have higher FLOPs\(\downarrow \) and Parameters\(\downarrow \) on CIFAR-10 than on CIFAR-100. In other words, by using our algorithm, the models retain more parameters for CIFAR-100 than for CIFAR-10. This proves that our algorithm can adaptively prune CNNs according to the learning task difficulty.

In addition, the comparison results of retained parameters and testing loss versus the number of epochs for all models pruned by our algorithm with different \(\xi \) values are shown in Figs. 3, 4 and 5, respectively. Our algorithm can prune all models multiple times within epochs 30 to 160 or 60 to 250. After each pruning operation, the testing loss is reduced within a few epochs. This proves that our algorithm with different \(\xi \) values retains the fast convergence speed of the RLS optimization. As \(\xi \) increases, the number of pruning operations gradually decreases in most cases. In addition, as the iterative pruning process continues, the actual pruning ratio gradually decreases. For example, for the VGG-16 model on CIFAR-10, our algorithm performs four pruning operations when \(\xi \) is 0.4. The actual pruning ratios of these pruning operations are 0.220, 0.220, 0.219 and 0.202, respectively. These results prove that our algorithm can prevent overpruning to some extent when \(\xi \) is too large.

Based on Table 1 and Figs. 3 to 5, considering the trade-off between the accuracy loss and the compression ratio, we set \(\xi \) to 0.4 and 0.5 in the following comparative experiments. According to the above statistics for the VGG-16 model on CIFAR-10, our actual pruning ratio is approximately one half of \(\xi \). Thus, to ensure a fair comparison, we set the pruning ratios of Taylor FO and Taylor SO to 0.2. However, in contrast to our algorithm, the \(\ell _1\)-norm and Network Slimming methods are one-shot pruning algorithms, meaning that they prune the CNNs and FNNs only once. Thus, we experimentally set their pruning ratios to 0.4 for the VGG-16 and ResNet-50 models and to 0.6 for the three-layer FNN model.

4.3 Comparison with other pruning algorithms

4.3.1 Pruning VGG-16 on CIFAR-10 and CIFAR-100

The comparison results of the reductions in FLOPs, parameters and testing accuracy for the VGG-16 model pruned by our algorithm and the other four algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 2, with the best results shown in bold. Our algorithm effectively reduces the FLOPs and the number of parameters for the VGG-16 model with a small accuracy loss. Under similar pruned accuracy conditions, the FLOPs\(\downarrow \) and Parameters\(\downarrow \) of our algorithm are significantly higher than those of the \(\ell _1\)-norm and Network Slimming algorithms. Taylor FO and Taylor SO also effectively reduce the number of parameters for the VGG-16 model. However, they cannot reduce the FLOPs as effectively as they reduce the number of parameters, and the FLOPs\(\downarrow \) of Taylor SO is even smaller than those of the \(\ell _1\)-norm and Network Slimming algorithms. In addition, in Section 4.2, we discuss that our algorithm can adaptively prune CNNs according to the learning task difficulty. Comparing the Parameters\(\downarrow \) values of the four comparative algorithms in Table 2, we find that only Taylor FO makes the VGG-16 model retain more parameters for CIFAR-100 than for CIFAR-10; the \(\ell _1\)-norm, Network Slimming and Taylor SO algorithms do not have this property.

In addition, the comparison results of the reductions in channels or nodes for each layer in the VGG-16 model pruned by all algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 3. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each VGG-16 layer more uniformly than Taylor FO and Taylor SO. This is the reason why our algorithm achieves both high FLOP reduction and high parameter reduction.

4.3.2 Pruning ResNet-50 on CIFAR-10 and CIFAR-100

The comparison results of the reductions in FLOPs, parameters and testing accuracy for the ResNet-50 model pruned by our algorithm and the other four algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 4, with the best results shown in bold. Our algorithm with \(\xi =0.5\) effectively reduces the FLOPs and the number of parameters for the ResNet-50 model with a small accuracy loss. Under similar pruned accuracy conditions, the FLOPs\(\downarrow \) and the Parameters\(\downarrow \) of the \(\ell _1\)-norm and Network Slimming algorithms are significantly lower than those of our algorithm. Moreover, Taylor FO and Taylor SO cannot reduce the FLOPs as effectively as they reduce the number of parameters. Although Taylor SO has the best FLOPs\(\downarrow \) and Parameters\(\downarrow \) on CIFAR-100, its pruned accuracy is reduced by 4.22%. Among the considered algorithms, our algorithm with \(\xi =0.4\) has the best pruned accuracies on CIFAR-10 and CIFAR-100. In contrast, the four comparative algorithms have considerably larger accuracy losses for pruning the ResNet-50 model than for pruning the VGG-16 model. This proves that our algorithm has better adaptability for pruning different CNNs. In addition, our algorithm and Taylor FO can make the ResNet-50 model retain more parameters for CIFAR-100, but the FLOPs\(\downarrow \) of Taylor FO on CIFAR-100 is only 23.58%.

In addition, the comparison results of the reductions in channels or nodes for each layer in the ResNet-50 model pruned by all algorithms on CIFAR-10 and CIFAR-100 are summarized in Table 5. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each ResNet-50 layer more uniformly than Taylor FO and Taylor SO.

4.3.3 Pruning FNNs on MNIST

The comparison results of the reductions in FLOPs, parameters and testing accuracy for the three-layer FNN model pruned by different pruning algorithms on MNIST are summarized in Table 6, with the best results shown in bold. Our algorithm achieves the best pruned accuracy, FLOPs\(\downarrow \) and Parameters\(\downarrow \). In particular, the FLOPs\(\downarrow \) and the Parameters\(\downarrow \) of our algorithm are significantly higher than those of the four comparative algorithms. These results prove that our algorithm can effectively prune FNNs with a small accuracy loss.

In addition, the comparison results of the reductions in nodes for each layer in the three-layer FNN model pruned by all algorithms on MNIST are summarized in Table 7. Our algorithm, \(\ell _1\)-norm and Network Slimming prune each FNN layer more uniformly than Taylor FO and Taylor SO. More importantly, the results show that our algorithm can not only prune the nodes in fully-connected hidden layers but also prune the original sample features in the input layer. In fact, according to (25), our algorithm can also prune the original channels of multichannel samples if the number of channels is greater than or equal to 2/\(\xi \). In Sections 4.3.1 and 4.3.2, our algorithm does not prune the input layers of the VGG-16 and ResNet-50 models because all images in CIFAR-10 and CIFAR-100 have only three channels.

5 Conclusion

In this paper, we studied structured pruning algorithms for CNNs. To address the shortcomings of existing algorithms, we proposed an RLS-based iterative structured pruning algorithm. Our algorithm employs the RLS optimization to accelerate the convergence rate of the training and fine-tuning stages, combines inverse input autocorrelation matrices with weight matrices to evaluate and prune unimportant input channels or nodes in CNN layers, and uses the testing loss to automatically determine the repruning timing. We demonstrated the effectiveness of our algorithm in pruning VGG-16 and ResNet-50 on CIFAR-10 and CIFAR-100 and pruning an FNN on MNIST. The experimental results show that our algorithm can prune CNNs and FNNs multiple times in a small number of epochs. Compared with four popular pruning algorithms, our algorithm can effectively reduce both the FLOPs and number of parameters for CNNs and FNNs with small or even no reduction in the accuracy. Furthermore, our algorithm can adaptively prune CNNs according to the learning task difficulty and has better adaptability for pruning different networks. Moreover, our algorithm can prune the original sample features.