1 Introduction

In the past few years, we have witnessed the rapid development of convolutional neural networks [1,2,3,4,5,6,7]. To achieve higher accuracy, the general strategy is to build deeper and more complicated networks [8,9,10,11,12]. However, such strategies are not efficient with respect to model size and speed. In many mobile and embedded applications, such as robotics, self-driving cars and augmented reality, recognition tasks need to be carried out in a timely fashion on a computationally limited platform [13,14,15,16].

There has been rising interest in building small and efficient neural networks in the recent literature [17,18,19,20,21,22,23,24,25,26,27,28]. The many different approaches can be broadly categorized into two groups: (1) training small networks directly [22,23,24,25,26,27, 29]; (2) compressing pre-trained networks [17,18,19,20,21, 28, 30].

The former aims to train a small network structure directly [22,23,24,25,26,27, 29]; the most popular family of methods is MobileNets, which has three versions. MobileNet-V1 adopts the depthwise separable convolution to greatly reduce the amount of computation and the number of parameters, thereby improving computational efficiency. Building on MobileNet-V1, MobileNet-V2 introduces the inverted residual structure with a linear bottleneck. Based on MobileNet-V1 and MobileNet-V2, MobileNet-V3 was proposed recently. However, these MobileNets do not consider the case of redundant filters. In fact, redundant filters consume computation during both forward and backward propagation, and convolutional neural networks generally exhibit high filter redundancy [8, 31]. Therefore, removing the redundant filters would reduce the running time of neural networks.

The latter aims to compress a pre-trained convolutional neural network (CNN), where the most popular approach is pruning. Pruning includes parameter pruning and channel/filter pruning. For most CNNs, convolutional layers are the most time-consuming part, while fully connected layers contain most of the network parameters. Therefore, parameter pruning aims to reduce storage, while channel/filter pruning aims to reduce computation cost. Parameter pruning generally suffers from irregular memory access, which limits the achievable efficiency gains; special hardware or software is needed to assist the computation, which may even increase computation time [19, 32,33,34]. To avoid these limitations, this paper focuses on channel/filter pruning, i.e., removing entire channels/filters [12, 18,19,20,21, 35, 36]; the benefits of removing redundant channels/filters can be seen in [12, 35, 36]. Lebedev and Lempitsky [18] and Wen et al. [19] employ group sparsity to select redundant filters, but the slow convergence and the slow generation of structured filters heavily reduce the pruning efficiency. Max response [20] uses the \(\ell _{1}\)-norm to compute the sum of absolute weights of each filter, where a high absolute weight sum indicates that the filter is important. Since max response measures the importance of filters one by one, it may ignore the correlations between different filters. To address this, channel pruning [21] uses the \(\ell _{1}\)-norm to indirectly select redundant filters by reconstructing the feature map of the next layer from the feature map of the current layer and the filters of the next layer; this requires solving a lasso problem and thus has a high computational complexity for obtaining the optimal solution.

In this paper, we propose a two-step feature map reconstruction method to prune redundant filters and channels. In the proposed method, both the reconstruction term and the regularization term employ the \(\ell _{2,1}\)-norm to carry out filter pruning under a robust reconstruction criterion. To the best of our knowledge, we are the first to propose a filter pruning method based on two-step feature map reconstruction, in which robust reconstruction and filter selection are performed simultaneously. Unlike most filter pruning methods, our method selects representative filters by two-step feature map reconstruction, so that the removed filters do not influence the following layers.

The remainder of this paper is organized as follows: In Sect. 2, we present the background. In Sect. 3, we present the proposed method and its optimal solution. In Sect. 4, we give the theoretical analysis of our method. In Sect. 5, we perform the experiments to demonstrate the effectiveness and efficiency of our method. Finally, a conclusion is drawn in Sect. 6.

2 Background

To prune a feature map with \(n_{i}\) channels, \(n_{i+1}\times n_{i}\times k_{h}\times k_{w}\) convolutional filters \({\varvec{W}}\) are applied on \(N\times n_{i}\times k_{h}\times k_{w}\) input volumes \({\varvec{X}}\) sampled from the feature map of the i-th layer, which produces an \(N\times n_{i+1}\) output matrix \({\varvec{Y}_{i+1}}\). Here, N is the number of samples, \(n_{i+1}\) is the number of output channels, and \(k_{h}\), \(k_{w}\) are the kernel sizes. For simplicity of presentation, the bias term is ignored in the filter pruning methods. To prune the input channels from \(n_{i}\) to a desired \(n_{i}^{'}\) (\(0\le n_{i}^{'} \le n_{i}\)) while minimizing the reconstruction error, the channel pruning method [21] is formulated as follows:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{\beta },\varvec{W}}{\mathop { \min }}\,{{\left\| \varvec{Y}_{i+1}-\sum _{c=1}^{n_{i}}\beta _{c}\varvec{X}_{c}\varvec{W}_{c}^{T} \right\| }_{F}^{2}}+\lambda {{\left\| \varvec{\beta }\right\| }_{1}},\\&\quad s.t., {{\left\| \varvec{\beta } \right\| }_{0}\le n_{i}^{'}},\forall i {{\left\| \varvec{W}_{c} \right\| }_{F}}=1. \end{aligned} \end{aligned}$$
(1)

\({{\left\| \cdot \right\| }_{F}}\) is the Frobenius norm. \({\varvec{X}_{c}}\) is the \(N\times k_{h}k_{w}\) matrix sliced from the c-th channel of the input volumes \({\varvec{X}}\), \(c=1,2,\ldots ,n_{i}\). \({\varvec{W}_{c}}\) is the \(n_{i+1}\times k_{h}k_{w}\) filter weights sliced from the c-th channel of \({\varvec{W}}\). \({\varvec{\beta }}\) is a coefficient vector of length \(n_{i}\) for channel selection, and \(\beta _{c}\) (the c-th entry of \({\varvec{\beta }}\)) is a scalar mask on the c-th channel (i.e., whether to drop the whole channel or not).
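To make (1) concrete, the following minimal sketch solves only the \(\varvec{\beta }\)-subproblem (with the filters \({\varvec{W}}\) held fixed) using scikit-learn's Lasso; the array shapes, the helper name and the random data are illustrative assumptions, not the reference implementation of [21].

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_channels_lasso(X, W, Y, lam):
    """Sketch of the beta-step of Eq. (1): with filters W fixed, pick channel
    coefficients beta by solving a Lasso problem (assumed shapes below).
    X: (N, n_in, kh*kw) sampled input volumes, one slice X[:, c, :] per channel.
    W: (n_out, n_in, kh*kw) filter weights, one slice W[:, c, :] per channel.
    Y: (N, n_out) sampled outputs of the original (unpruned) layer.
    """
    N, n_in, _ = X.shape
    # Per-channel contributions Z_c = X_c W_c^T, flattened so that
    # Y_flat ~= sum_c beta_c * Z_c becomes a standard Lasso design matrix.
    Z = np.stack([(X[:, c, :] @ W[:, c, :].T).ravel() for c in range(n_in)], axis=1)
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(Z, Y.ravel())
    beta = lasso.coef_                     # near-zero entries mark prunable channels
    keep = np.flatnonzero(np.abs(beta) > 1e-6)
    return beta, keep

# Tiny usage example with random data (sizes are made up)
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 16, 9))      # N=128 samples, 16 channels, 3x3 kernels
W = rng.standard_normal((32, 16, 9))       # 32 output filters
Y = sum(X[:, c, :] @ W[:, c, :].T for c in range(16))
beta, keep = select_channels_lasso(X, W, Y, lam=0.1)
print(len(keep), "channels kept")
```

In [21], this step alternates with a least-squares update of \({\varvec{W}}\) and \(\lambda \) is tuned until the \(\ell _{0}\) constraint is met; the sketch only shows the channel-selection step.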

Similar to the above channel pruning method [21], some other filter-level pruning methods [12, 20, 30, 35] have also been explored. The core of filter pruning is to measure the importance of each filter, and the major difference among filter pruning methods is the selection strategy. Max response [20] calculates the absolute weight sum of each filter (i.e., \({\sum {\varvec{W(i,:,:,:)}}}\), where i denotes the i-th filter, \(i\in \{1,2,\ldots ,n_{i+1}\}\)) as its importance score. ThiNet [12, 35] first uses a greedy strategy to search for a subset of feature map channels such that the output produced by this subset is almost the same as that produced by all the channels. More specifically, ThiNet searches for the subset by minimizing the following reconstruction error:

$$\begin{aligned} \begin{aligned}&\underset{{S}}{\mathop { \min }}\,{\sum _{d=1}^{N}{\left( {\varvec{Y}_{i+1}^{d}}-\sum _{c\in {S}}\varvec{X}_{c}^{d}{\varvec{W}_{c}^{d}}^{T}\right) ^{2}}},\\&\quad s.t., |S|=n_{i}\times r, S\subset \left\{ 1,2,\ldots ,n_{i}\right\} \end{aligned} \end{aligned}$$
(2)

where d indexes the N samples, r is the compression ratio, S is the subset of feature map channels, and |S| is the number of elements in the subset S.

After obtaining the subset S, the redundant channels of the feature map \({\varvec{X}_{c}^{d}}\) and the filter \({\varvec{W}_{c}^{d}}\) are removed. For simplicity, we denote the feature map and filter without redundancy by \({\varvec{\hat{X}}_{c}^{d}}\) and \({\varvec{\hat{W}}_{c}^{d}}\). ThiNet further minimizes the reconstruction error by assigning weights \(\varvec{q}\) to \({\varvec{\hat{W}}_{c}^{d}}\):

$$\begin{aligned} \begin{aligned}&\underset{{\varvec{q}}}{\mathop { \min }}\,{\sum _{d=1}^{N}{({\varvec{Y}_{i+1}^{d}}-\sum _{c\in {S}}\varvec{\hat{X}}_{c}^{d}{\varvec{\hat{W}}_{c}^{d}}^{T}\varvec{q})^{2}}},\\&\quad s.t., |S|=n_{i}\times r, S\subset \{1,2,\ldots ,n_{i}\}. \end{aligned} \end{aligned}$$
(3)
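The following is a minimal sketch in the spirit of (2) and (3), assuming the per-channel contributions have been pre-computed into a matrix Z with \(Z[d,c]=\varvec{X}_{c}^{d}{\varvec{W}_{c}^{d}}^{T}\); the greedy forward selection and the function names are simplifications of ThiNet rather than its exact procedure.

```python
import numpy as np

def greedy_select(Z, y, keep_ratio):
    """Greedy subset search in the spirit of Eq. (2).
    Z: (N, n_in) per-channel contributions, Z[d, c] = X_c^d . W_c^d
    y: (N,) target responses Y_{i+1}^d of the original layer.
    Returns the indices S of the kept channels."""
    N, n_in = Z.shape
    n_keep = max(1, int(round(n_in * keep_ratio)))
    S, residual = [], y.copy()
    for _ in range(n_keep):
        # Add the channel whose contribution best reduces the residual error.
        errs = [np.sum((residual - Z[:, c]) ** 2) if c not in S else np.inf
                for c in range(n_in)]
        best = int(np.argmin(errs))
        S.append(best)
        residual = residual - Z[:, best]
    return sorted(S)

def reweight_channels(Z, y, S):
    """Least-squares re-weighting in the spirit of Eq. (3):
    find q minimizing || y - Z[:, S] q ||^2."""
    q, *_ = np.linalg.lstsq(Z[:, S], y, rcond=None)
    return q
```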

It is worth noting that both the channel pruning method and the ThiNet method are data driven, which underpins the effectiveness of their filter selection strategies, whereas first-k and max response are non-data-driven methods. Besides, HRank [30], another data-driven method, is formulated as follows:

$$\begin{aligned} \begin{aligned}&\underset{{\delta _{ij}}}{\mathop { \min }}\,{\sum _{i=1}^{K}\sum _{j=1}^{n_{i}}{\delta _{ij}(\varvec{w_{j}^{i}})\sum _{t=1}^{g}Rank(o_{j}^{i}(t,:,:))}}, s.t., \sum _{j=1}^{n_{i}}\delta _{ij}=n_{i2}. \end{aligned} \end{aligned}$$
(4)

where K means the number of convolutional layers, \(n_{i}\) represents the number of filters in the i-th convolutional layer, \(\delta _{ij}\) is an indicator which is 1 if the j-th filter in the i-th layer (i.e., \(\varvec{w_{j}^{i}}\)) is unimportant or 0 if \(\varvec{w_{j}^{i}}\) is important, g means the number of input images, \(o_{j}^{i}(t,:,:)\) means the feature map generated by \(\varvec{w_{j}^{i}}\), and \(n_{i2}\) means the number of least important filters in the i-th layer.

3 Building model of filter pruning

Formally, for one input image, let \(n_i\) denote the number of input channels of the i-th convolutional layer and \(h_i\), \(w_i\) be the height and width of its input feature maps. The convolutional layer transforms the input feature map \(\varvec{y}_{i}\in \mathbb {R}^{n_i\times h_i \times w_i}\) into the output feature map \(\varvec{y}_{i+1}\in \mathbb {R}^{n_{i+1}\times h_{i+1} \times w_{i+1}}\), which is used as the input feature map of the next convolutional layer. This is achieved by applying \(n_{i+1}\) 3D filters \({\varvec{F}_{i,j}}\in \mathbb {R}^{n_{i}\times k \times k}\) (all the filters together constitute the filter matrix \({\varvec{F}_{i+1}}\in \mathbb {R}^{n_{i+1}\times n_{i}\times k \times k}\)) on the \(n_i\) input channels, in which each filter generates one feature map channel. The number of operations of the convolutional layer is \(n_{i+1}n_ik^{2}h_{i+1}w_{i+1}\). If a filter \({\varvec{F}}_{i,j}\) is pruned, its corresponding feature map \(x_{i+1,j}\) is removed, which saves \(n_{i}k^{2}h_{i+1}w_{i+1}\) operations. The kernels that apply on the removed feature map channel in the filters of the next convolutional layer are also removed, which saves an additional \(n_{i+2}k^{2}h_{i+2}w_{i+2}\) operations.
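As a quick sanity check of the operation counts above, a small calculation with made-up layer sizes (the numbers are illustrative, not taken from a specific network):

```python
# Operations of one convolutional layer: n_out * n_in * k^2 * h_out * w_out
def conv_flops(n_out, n_in, k, h_out, w_out):
    return n_out * n_in * k * k * h_out * w_out

# Illustrative sizes (assumed, not from any particular network)
n_i, n_i1, n_i2 = 64, 128, 256      # channels of layers i, i+1, i+2
k, h1, w1, h2, w2 = 3, 56, 56, 28, 28

total = conv_flops(n_i1, n_i, k, h1, w1)
# Pruning one filter of layer i+1 removes its output feature map channel:
saved_here = n_i * k * k * h1 * w1
# ...and removes the kernels acting on that channel in layer i+2:
saved_next = n_i2 * k * k * h2 * w2
print(total, saved_here, saved_next)
```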

Furthermore, if there are m input images, they produce feature maps such as the i-th feature map \(\varvec{y}_{i}\in \mathbb {R}^{m\times n_i\times h_i \times w_i}\) and the \((i+1)\)-th feature map \(\varvec{y}_{i+1}\in \mathbb {R}^{m\times n_{i+1}\times h_{i+1} \times w_{i+1}}\). For simplicity, we sample from \(\varvec{y}_{i}\) to generate \({\varvec{Y}_{i}}\in \mathbb {R}^{N_{i}\times n_{i}}\); the detailed sampling procedure follows [21]. Here, \(N_{i}\) is the number of samples of the i-th layer, and \(n_{i}\) is the number of channels of the feature map of the i-th layer. To prune the output channels from \(n_i\) to a desired \(n_i^{'}\) while minimizing the reconstruction error, we formulate the proposed objective function as follows:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{b},\varvec{A}}{\mathop { \min }}\,{{\left\| (\varvec{Y}_{i}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}_{i}^{T}-\varvec{b}\varvec{1}^{T}) \right\| }_{2,1}}+\lambda {{\left\| \varvec{A} \right\| }_{2,1}}.\\ \end{aligned} \end{aligned}$$
(5)

where \({\varvec{Y}_i}\) is \(N_{i}\times n_{i}\) matrix. \({\varvec{A}}\in \mathbb {R}^{n_{i}\times n_{i}}\) is a coefficient representation matrix. The designed objective function can make \({\varvec{A}}\) be column-sparse, and thus, it can indicate the redundancy of channels of feature map and filters in current layer (see Fig. 1).

Fig. 1: Illustration of redundant filters learned automatically by our method

Without loss of generality, we remove the layer index i, and thus, our objective function can be rewritten as follows:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{b},\varvec{A}}{\mathop { \min }}\,{{\left\| (\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}) \right\| }_{2,1}}+\lambda {{\left\| \varvec{A} \right\| }_{2,1}}.\\ \end{aligned} \end{aligned}$$
(6)

Using the iterative reweighting technique, problem (6) can be rewritten as

$$\begin{aligned} \begin{aligned}&\underset{\varvec{b},\varvec{A}}{\mathop { \min }}\,{{\left\| \left( \left( \varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}\right) -\varvec{A}\left( \varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}\right) \right) \sqrt{\varvec{W}_1}\right\| }_F^{2}}+\lambda {{\left\| \varvec{A}\sqrt{\varvec{W}_2} \right\| }_F^{2}},\\ \end{aligned} \end{aligned}$$
(7)

where \({\varvec{W}_2} \in \mathbb {R}^{n\times n}\) and \({\varvec{W}_1} \in \mathbb {R}^{N\times N}\) are two diagonal matrices, whose diagonal elements are \({\varvec{W}_2^{cc}=\frac{1}{2{{\left\| {(\varvec{A})^{c}} \right\| }_{2}}}}\) and \({\varvec{W}_1^{cc}=\frac{1}{2{{\left\| ((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))^{c} \right\| }_{2}}}}\), respectively. \({(\varvec{A})^{c}}\) denotes the c-th column of the matrix \({\varvec{A}}\). When \({{\left\| {(\varvec{A})^{c}} \right\| }_{2}=0}\), we let \({\varvec{W}_2^{cc}=\frac{1}{2{{\left\| {(\varvec{A})^{c}} \right\| }_{2}}+\zeta }}\), where \(\zeta \) is a very small constant. Similarly, when \({{\left\| ((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))^{c} \right\| }_{2}}\) \(=0\), we let \({\varvec{W}_1^{cc}=\frac{1}{2{{\left\| ((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))^{c} \right\| }_{2}}+\zeta }}\). In this way, the smaller \({\varvec{W}_1^{cc}}\) is, the more likely the c-th response is an outlier; the smaller \({\varvec{W}_2^{cc}}\) is, the more important the c-th filter is. Here, \({\sqrt{{\varvec{W}_{1}}}}\) weights the responses: clean responses are weighted more heavily, while outlier responses are weighted less heavily, which makes our method robust to outliers. On the other hand, the regularization term \({{\left\| \varvec{A}\sqrt{{\varvec{W}_{2}}}\right\| }_F^{2}}\) guides the selection of filters; through adjusting the parameter \(\lambda \), our method selects effective filters under the robust reconstruction criterion. Moreover, it can be seen that minimizing \({2tr(((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))\varvec{W}_{1}((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))^{T})}\) \({+2\lambda tr(\varvec{A}{\varvec{W}_{2}}\varvec{A}^{T})}\) forces \({{\left\| ((\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}))^{c} \right\| }_{2}}\) and \({{\left\| (\varvec{A})^{c} \right\| }_{2}}\) to be very small when \({\varvec{W}_1^{cc}}\) and \({\varvec{W}_2^{cc}}\) are large. Finally, some columns of \({(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})}\) and of \({\varvec{A}}\) may be close to zero, and thus a column-sparse \({(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})}\) and a column-sparse \({\varvec{A}}\) can be obtained.
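For clarity, a short sketch of how the diagonal reweighting matrices \({\varvec{W}_1}\) and \({\varvec{W}_2}\) can be formed from the current \(({\varvec{A}},\varvec{b})\); the value of \(\zeta \) is an assumption, and for simplicity it is added unconditionally rather than only when a column norm is exactly zero:

```python
import numpy as np

def reweight_matrices(Y, A, b, zeta=1e-8):
    """Diagonal entries of W1 (over the N residual columns) and W2 (over the n
    columns of A), following the reweighting rule stated in the text.
    zeta is added unconditionally here for numerical safety (simplification)."""
    Z = Y.T - b[:, None]                                   # (n, N)
    R = Z - A @ Z                                          # residual, (n, N)
    w1 = 1.0 / (2.0 * np.linalg.norm(R, axis=0) + zeta)    # length N
    w2 = 1.0 / (2.0 * np.linalg.norm(A, axis=0) + zeta)    # length n
    # In practice only the diagonals need to be stored; full matrices are
    # returned here just to match the notation of Eq. (7).
    return np.diag(w1), np.diag(w2)
```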

Our goal is to remove redundant output channels without loss of performance. After we judge the redundant channels and filters and prune them, we should ensure that the feature map of the next layer is almost preserved, so that the removed channels do not influence the final classification result. Therefore, we reconstruct the filters of the next layer from the currently remaining channels by linear least squares, with the following objective function:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{F}_{i+1}^{'}}{\mathop { \min }}\,{{\left\| \varvec{Y}_{i+1}-{(\varvec{Y}_{i}^{'})}{(\varvec{F}_{i+1}^{'})}^{T} \right\| }_{F}^{2}}.\\ \end{aligned} \end{aligned}$$
(8)

where \({\varvec{Y}_{i+1}}\) denotes the feature map of the \((i+1)\)-th layer, \({\varvec{Y}_{i}^{'}}\) denotes the feature map of the i-th layer after the removal of redundant channels, and \({\varvec{F}_{i+1}^{'}}\) denotes the filters of the \((i+1)\)-th layer after the removal of redundant channels. Here, \({\varvec{F}_{i+1}^{'}}\) is the \(n_{i+1}\times n_{i}kk\) reshaped \({\varvec{F}_{i+1}}\). It is worth noting that if r channels are redundant in \({\varvec{Y}_{i}}\), then \({\varvec{Y}_{i}^{'}\in \mathbb {R}^{N\times (n_{i}-r)}}\) and \({\varvec{F}_{i+1}^{'}\in \mathbb {R}^{n_{i+1}\times (n_{i}-r)}}\).
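A minimal sketch of this second step, i.e., solving (8) by linear least squares; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def reconstruct_next_filters(Y_next, Y_cur_pruned):
    """Solve min_F' || Y_next - Y_cur_pruned F'^T ||_F^2  (Eq. (8)).
    Y_next:       (N, n_next) sampled feature map of layer i+1,
    Y_cur_pruned: (N, m) sampled feature map of layer i with the redundant
                  channels removed (m = n_i - r).
    Returns the new filters F' of shape (n_next, m)."""
    # Least squares: Y_cur_pruned @ F'^T ~= Y_next, solved column by column.
    Ft, *_ = np.linalg.lstsq(Y_cur_pruned, Y_next, rcond=None)
    return Ft.T
```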

To sum up, the flowchart is given in Fig. 2, which mainly includes two steps: the first is to judge the redundant filters by reconstructing the feature map of the current layer, and the second is to learn new filters by reconstructing the feature map of the next layer. Our method proceeds layer by layer. For one layer, say the \((i+1)\)-th layer, the original computation cost is \(n_{i+1}n_{i}k^{2}h_{i+1}w_{i+1}\) FLOPs, while the remaining computation cost is \((n_{i+1}-r_f)(n_{i}-r_{c})k^{2}h_{i+1}w_{i+1}\) FLOPs, where \(r_{c}\) and \(r_{f}\) denote the numbers of removed input channels and removed filters, respectively.

Fig. 2: Flowchart of the proposed neural network compression method

Discussion: Some recent works [20, 21] also introduce a sparse norm, such as the \(\ell _{1}\)-norm [20] or Lasso [21]. However, we emphasize that we use a different formulation and a different idea. Channel pruning [21] uses the current filters and the previous feature map to reconstruct the feature map of the current layer with a sparsity constraint on each channel, but the computational complexity of its model is very high. Moreover, both methods [20, 21] need to be given the sparsity level \(n_{i}^{'}\). Different from them, we perform robust reconstruction of the feature map of the current layer: if the feature map contains redundancy, our model automatically identifies the redundant filters of its previous layer. Furthermore, we ensure that the remaining filters can recover the feature map of the next layer. Besides, [20, 21] use the \(\ell _{1}\)-norm to select redundant channels, while we use the \(\ell _{2,1}\)-norm to select redundant channels from the perspective of the feature map of the current layer.

3.1 The optimal solution of problem (6)

The optimal solution of problem (7) can be readily obtained by an iterative re-weighting method, which consists of the following two steps.

Step 1: Given \({\varvec{A}}\), we compute \(\varvec{b}\). The optimization problem (6) becomes,

$$\begin{aligned} \begin{aligned}&\underset{\varvec{b}}{\mathop { \min }}\,{{\left\| (\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})-\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}) \right\| }_{2,1}}.\\ \end{aligned} \end{aligned}$$
(9)

Setting the derivative of (9) with respect to \({\varvec{b}}\) to zero, we get \({\varvec{b}=\frac{\varvec{Y}^{T}{\varvec{W}_1}\varvec{1}}{\varvec{1}^{T}\varvec{W}_1 \varvec{1}}}\).

Step 2: Given \(\varvec{b}\), we compute \({\varvec{A}}\). The optimization problem (7) becomes,

$$\begin{aligned} \begin{aligned}&\underset{\varvec{A}}{\mathop { \min }}\,{{\left\| \left( \left( \varvec{Y}^{T}-\varvec{b}\varvec{1}^{T}\right) -\varvec{A}(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})\right) \sqrt{\varvec{W}_1}\right\| }_F^{2}}+\lambda {{\left\| \varvec{A}\sqrt{\varvec{W}_2} \right\| }_F^{2}}.\\ \end{aligned} \end{aligned}$$
(10)

Setting the derivative of (10) with respect to \({\varvec{A}}\) to zero, we get \(\varvec{A}=(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})\varvec{W}_1(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})^{T}(\lambda \varvec{W}_2+(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})\varvec{W}_1(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})^{T})^{-1}\).

Iterating the above two steps until convergence yields the optimal solution. Algorithm 1 gives more details.

Algorithm 1. Optimization Algorithm of Problem (6)

Input: Feature map \(\varvec{Y}\), parameter \(\lambda \);

   1: Initialize \({\varvec{W}_{1}}=\varvec{I}\), \({\varvec{W}_{2}}=\varvec{I}\) and \(\varvec{b}=\varvec{0}\);

   2: while not converge do

   2.1: Compute \(\varvec{A}=(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})\varvec{W}_1(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})^{T}\)

            \((\lambda \varvec{W}_2+(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})\varvec{W}_1(\varvec{Y}^{T}-\varvec{b}\varvec{1}^{T})^{T})^{-1}\);

   2.2: Compute \(\varvec{b}=\frac{\varvec{Y}^{T}{\varvec{W}_1}\varvec{1}}{\varvec{1}^{T}\varvec{W}_1 \varvec{1}}\);

   2.3: Compute \({\varvec{W}_{1}}\) with the diagonal entries \({\varvec{W}_1^{cc}}\) defined below (7);

   2.4: Compute \({\varvec{W}_{2}}\) with the diagonal entries \({\varvec{W}_2^{cc}}\) defined below (7);

      end while

Output: Representation matrix \(\varvec{A}\), optimal mean vector \(\varvec{b}\).
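A compact NumPy sketch of Algorithm 1 follows, combining the closed-form updates of \({\varvec{A}}\) and \(\varvec{b}\) with the reweighting of \({\varvec{W}_1}\) and \({\varvec{W}_2}\); the stopping rule, \(\zeta \) and the iteration cap are assumptions rather than values given in the paper:

```python
import numpy as np

def algorithm1(Y, lam, zeta=1e-8, tol=1e-6, max_iter=100):
    """Iteratively re-weighted solver for problem (6) (sketch).
    Y: (N, n) sampled feature map of the current layer; lam: trade-off parameter.
    Returns the representation matrix A (n, n) and the mean vector b (n,)."""
    N, n = Y.shape
    Yt = Y.T                                     # (n, N)
    W1, W2 = np.eye(N), np.eye(n)
    b = np.zeros(n)
    ones = np.ones(N)
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 2.1: closed-form update of A (with the current b, W1, W2)
        Z = Yt - np.outer(b, ones)               # (n, N)
        G = Z @ W1 @ Z.T                         # (n, n)
        A = G @ np.linalg.inv(lam * W2 + G)
        # Step 2.2: closed-form update of b
        b = (Yt @ W1 @ ones) / (ones @ W1 @ ones)
        # Steps 2.3-2.4: re-weighting (zeta added for numerical safety)
        Z = Yt - np.outer(b, ones)
        R = Z - A @ Z
        W1 = np.diag(1.0 / (2.0 * np.linalg.norm(R, axis=0) + zeta))
        W2 = np.diag(1.0 / (2.0 * np.linalg.norm(A, axis=0) + zeta))
        # Track the l_{2,1} objective of problem (6) to decide when to stop.
        obj = np.linalg.norm(R, axis=0).sum() + lam * np.linalg.norm(A, axis=0).sum()
        if abs(prev_obj - obj) < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj
    return A, b

# Columns of A with small l2 norm mark redundant channels/filters, e.g.:
# scores = np.linalg.norm(A, axis=0); redundant = np.argsort(scores)[:r]
```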

4 Theoretical analysis

4.1 Convergence analysis

Before giving the convergence proof of the optimization algorithm, we first present Lemma 1 [37].

Lemma 1

For any nonzero vectors \({\varvec{U}},\varvec{q}\in {\mathbb {R}^{d}}\),

$$\begin{aligned} \begin{aligned} {{\left\| \varvec{U} \right\| }_{2}}-\frac{\left\| \varvec{U} \right\| _{2}^{2}}{2{{\left\| \varvec{q} \right\| }_{2}}}\le {{\left\| \varvec{q} \right\| }_{2}}-\frac{\left\| \varvec{q} \right\| _{2}^{2}}{2{{\left\| \varvec{q} \right\| }_{2}}}. \end{aligned} \end{aligned}$$
(11)

Based on Lemma 1, we prove Theorem 1.

Theorem 1

Algorithm 1 will monotonically decrease the value of the objective function of the optimization problem (7) in each iteration and converge to a local optimal solution.

Proof

For simplicity, we denote the updated \(\varvec{b}\) and \({\varvec{A}}\) by \(\widetilde{\varvec{b}}\) and \({\widetilde{\varvec{A}}}\), and we let \(\varvec{x}_i\) denote the i-th column of \(\varvec{Y}^{T}\) (i.e., the i-th sampled response) and \(\varvec{a}_{i}\) denote the i-th column of \({\varvec{A}}\). Since the updated \(\widetilde{\varvec{b}}\) and \({\widetilde{\varvec{A}}}\) are the optimal solution of problem (7) with the current \({\varvec{W}_1}\) and \({\varvec{W}_2}\), according to the definition of \({\varvec{W}_1}\) and \({\varvec{W}_2}\), we have

$$\begin{aligned} \begin{aligned}&tr\left( \sum \limits _{i=1}^{N}{\frac{\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}}\right) +\lambda tr\left( \sum \limits _{i=1}^{n}{\frac{\left\| \widetilde{\varvec{a}_i} \right\| _{2}^{2}}{2\left\| \varvec{a}_{i} \right\| _{2}^{{}}}}\right) \\&\le \quad tr\left( \sum \limits _{i=1}^{N}{\frac{\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}}\right) +\lambda tr\left( \sum \limits _{i=1}^{n}{\frac{\left\| \varvec{a}_{i} \right\| _{2}^{2}}{2\left\| \varvec{a}_{i} \right\| _{2}^{{}}}}\right) . \\ \end{aligned} \end{aligned}$$
(12)

On the one hand, according to Lemma 1, we have

$$\begin{aligned} \begin{aligned}&\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{{}}-\frac{\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}} \\&\le \quad \left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}-\frac{\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}. \\ \end{aligned} \end{aligned}$$
(13)

Summing inequality (13) over \(i=1,\ldots ,N\), we have the following formulation:

$$\begin{aligned} \begin{aligned}&\sum \limits _{i=1}^{N}{\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{{}}}-\sum \limits _{i=1}^{N}{\frac{\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}} \\&\le \quad \sum \limits _{i=1}^{N}{\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}-\sum \limits _{i=1}^{N}{\frac{\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{2}}{2\left\| \varvec{x}_i-\varvec{A}\varvec{x}_i-\left( \varvec{I}-\varvec{A}\right) {\varvec{b}} \right\| _{2}^{{}}}}. \\ \end{aligned} \end{aligned}$$
(14)

On the other hand, according to Lemma 1, we have

$$\begin{aligned} \begin{aligned} \left\| \widetilde{\varvec{a}_i} \right\| _{2}-\frac{\left\| \widetilde{\varvec{a}_i} \right\| _{2}^{2}}{2\left\| \varvec{a}_{i} \right\| _{2}^{{}}}\le \left\| \varvec{a}_{i} \right\| _{2}^{{}}-\frac{\left\| \varvec{a}_{i} \right\| _{2}^{2}}{2\left\| \varvec{a}_{i} \right\| _{2}^{{}}}. \end{aligned} \end{aligned}$$
(15)

Similarly, summing inequality (15) over \(i=1,\ldots ,n\), we have the following formulation:

$$\begin{aligned} \begin{aligned}&\sum \limits _{i=1}^{n}\left( {\left\| \widetilde{\varvec{a}_i} \right\| _{2}-\frac{\left\| \widetilde{\varvec{a}_i} \right\| _{2}^{2}}{2\left\| \varvec{a}_i \right\| _{2}^{{}}}}\right) \\&\quad \le \sum \limits _{i=1}^{n}\left( {\left\| \varvec{a}_i \right\| _{2}}-\frac{\left\| \varvec{a}_i \right\| _{2}^{2}}{2\left\| \varvec{a}_i \right\| _{2}}\right) . \end{aligned} \end{aligned}$$
(16)

Combining (12), (14) and (16), we have

$$\begin{aligned} \begin{aligned}&\sum \limits _{i=1}^{N}\left\| \varvec{x}_i-\widetilde{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-\widetilde{\varvec{A}}\right) \widetilde{\varvec{b}} \right\| _{2}^{{}}+\lambda \left\| \widetilde{\varvec{A}} \right\| _{2,1}^{{}} \\&\quad \le \sum \limits _{i=1}^{N}\left\| \varvec{x}_i-{\varvec{A}}\varvec{x}_i-\left( \varvec{I}-{\varvec{A}}\right) {\varvec{b}} \right\| _{2}^{{}}+\lambda \left\| \varvec{A} \right\| _{2,1}^{{}}. \\ \end{aligned} \end{aligned}$$
(17)

That is, the objective value of problem (6) does not increase in each iteration. Since the objective of problem (6) has an obvious lower bound of 0, Algorithm 1 converges, which completes the proof of Theorem 1.

4.2 Computational complexity analysis

The main computational cost of solving problem (6) lies in two steps per iteration: the first step computes \(\varvec{b}\), and the second step computes \({\varvec{A}}\), whose computational complexity is at most \(O(n^3)\) owing to the \(n\times n\) matrix inversion. Therefore, the computational complexity of one iteration is at most \(O(n^3)\). If Algorithm 1 needs t iterations, the total computational complexity is on the order of \(O(t n^3)\).

5 Experiments

We prune the filters of three types of networks, i.e., VGG-16 [6], ResNet-50 [38] and MobileNet [22], and evaluate the pruned networks on ImageNet [39], CIFAR-10 [40] and CIFAR-100 [40]. ImageNet comprises 1.28 million training images and 50,000 validation images from 1000 classes. We fine-tune the networks on the training set and report the accuracy on the validation set with the shorter side of each image resized to 256. For data augmentation, we follow the standard practice [21] and perform random-size cropping to 224\(\times \)224 and random horizontal flipping; more experimental details can be found in [21]. CIFAR-10 consists of 10 classes with 6000 images per class, of which 50,000 images are used for training and 10,000 for validation. Similarly, CIFAR-100 consists of 100 classes with 600 images per class, again with 50,000 images for training and 10,000 for validation. On the CIFAR-10 and CIFAR-100 datasets, we fine-tune the networks with the training images resized to 32\(\times \)32 and with the per-pixel mean subtracted on both the training and validation sets. For data augmentation, we adopt random horizontal flipping.

Our method is compared with the classical first-k and max response [20] methods, as well as the state-of-the-art channel pruning [21], ThiNet [35] and HRank [30], which are similar to our method to some extent.

Implementation: Our method is performed on the network layer by layer. Since our method has a parameter \(\lambda \), it involves a parameter-selection process. More specifically, when our method performs parameter selection for one layer, the other layers are fixed at the baseline. A common way to determine the parameter is grid search: we vary the parameter within the range \(\{10^{-6},\) \(10^{-5},\ldots ,10^{6}\}\), and the value that yields the highest classification accuracy is chosen as the optimal parameter. After the parameters of our method for all layers of a network are determined, the compressed network is obtained. Theoretically, if the redundant filters of the i-th layer are removed, the updated filters of the \((i+1)\)-th layer can almost recover the feature map of the \((i+1)\)-th layer, and thus the final classification performance is preserved. For a fair comparison, all methods adopt the same speedup ratio. For example, all methods use a \(2\times \) speedup ratio on ResNet-50. More specifically, given the \(2\times \) speedup ratio on ResNet-50, the number of effective filters is first determined by the channel pruning method [21], and then the same number of effective filters is adopted by all the other methods, including ours.
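The per-layer grid search described above can be sketched as follows; solve and evaluate_accuracy are placeholders (the latter standing for pruning the current layer and measuring validation accuracy with the other layers fixed at the baseline), both assumptions of this sketch:

```python
import numpy as np

# Candidate values of the trade-off parameter, as described in the text.
LAMBDA_GRID = [10.0 ** p for p in range(-6, 7)]

def search_lambda(Y_layer, solve, evaluate_accuracy):
    """Per-layer grid search for lambda.
    Y_layer: sampled feature map of the layer being pruned.
    solve: callable(Y, lam) -> (A, b), e.g. the Algorithm 1 sketch in Sect. 3.1.
    evaluate_accuracy: callable(A, b) -> float, a placeholder that prunes this
    layer according to (A, b) while keeping the other layers at the baseline,
    then returns validation accuracy."""
    best_A, best_b, best_lam, best_acc = None, None, None, -np.inf
    for lam in LAMBDA_GRID:
        A, b = solve(Y_layer, lam)
        acc = evaluate_accuracy(A, b)
        if acc > best_acc:
            best_A, best_b, best_lam, best_acc = A, b, lam, acc
    return best_A, best_b, best_lam, best_acc
```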

5.1 VGG-16 pruning

5.1.1 Experimental results of single layer

We implement three methods to compress the VGG-16 network, and the experimental results are shown in Fig. 3. It can be seen for the Conv2_1 layer that, as the speedup ratio increases, the classification accuracy of the three methods drops dramatically. However, our method outperforms the other methods when the speedup ratio ranges from \(2\times \) to \(4\times \), where \(2\times \) means that the running time of the compressed network is 0.5 times that of the baseline network. In this range, the classification accuracy drops by about 0.1 to \(0.84\%\) compared with the baseline. More specifically, when the classification accuracy drops by \(0.84\%\), our method achieves a \(4\times \) speedup ratio (i.e., our FLOPs are \(25\%\) of the baseline).

Fig. 3: Result of single layer under different speedup ratios (without fine-tuning)

Fig. 4: Result of whole model under different speedup ratios on VGG-16 (without fine-tuning)

5.1.2 Experimental results across all layers

Guided by the single-layer experimental results, we observe that there is large redundancy in the first several layers of VGG-16, while its last layers are not very redundant. We therefore prune more filters in the shallow layers while keeping the original filters in the conv5_x layers; the detailed per-layer pruning configuration follows channel pruning [21]. The experimental results without fine-tuning are shown in Fig. 4: the three methods obtain similar results when the speedup ratio is small, and as the speedup ratio increases, the advantage of our method becomes more pronounced. The experimental results with fine-tuning are shown in Tables 1, 2 and 3. It can be seen that, with the same speedup ratio (and thus the same FLOPs and number of parameters (#Param) for all methods), our method outperforms the other methods in terms of top-1 classification accuracy.

Table 1 Fine-tuning results (accuracy drops) of VGG-16 on ImageNet with \(4\times \) speedup ratio
Table 2 Fine-tuning results (accuracy drops) of VGG-16 on CIFAR-10 with \(4\times \) speedup ratio
Table 3 Fine-tuning results (accuracy drops) of VGG-16 on CIFAR-100 with \(4\times \) speedup ratio

5.2 ResNet pruning

We also apply the pruning methods to ResNet, a multi-path network whose structure is more complex than that of VGG-16. Through the single-layer experiments, we observe that there is large redundancy in the shallow layers. We therefore prune the branch2a and branch2b layers but keep the branch2c layer in this network; the detailed pruning configuration on ResNet-50 follows channel pruning [21]. We compare the five pruning methods (i.e., first-k, max response [20], channel pruning [21], ThiNet [35] and HRank [30]) with a \(2\times \) speedup ratio, and the experimental results are shown in Tables 4, 5 and 6, where a negative value means the accuracy is higher than the baseline. It can be seen that, with the same speedup ratio, our method is comparable with channel pruning, ThiNet and HRank, and outperforms the classical first-k and max response methods. For example, in Table 4, our method obtains better top-1 classification accuracy than first-k, max response, channel pruning and HRank, but worse than ThiNet; however, we obtain better top-5 classification accuracy than ThiNet, so we consider our method comparable with ThiNet. In Table 6, our method obtains worse top-1 classification accuracy than channel pruning and HRank, while we obtain better top-5 classification accuracy than HRank; therefore, we consider our method comparable with HRank.

Table 4 Fine-tuning results (accuracy drops) of ResNet-50 on ImageNet with \(2\times \) speedup ratio
Table 5 Fine-tuning results (accuracy drops) of ResNet-50 on CIFAR-10 with \(2\times \) speedup ratio
Table 6 Fine-tuning results (accuracy drops) of ResNet-50 on CIFAR-100 with \(2\times \) speedup ratio

5.3 MobileNet pruning

As a lightweight network, MobileNet does not have a high degree of redundancy. The detailed pruning configuration on MobileNet follows the strategy used in channel pruning [21]. We compare the five pruning methods (i.e., first-k, max response [20], channel pruning [21], ThiNet [35] and HRank [30]) with a \(1.5\times \) speedup ratio. The experimental results in Table 7 show that our method outperforms the other methods at the same speedup ratio.

Table 7 Fine-tuning results (accuracy drops) of MobileNet on ImageNet with \(1.5\times \) speedup ratio

6 Conclusion

In this paper, we propose a two-step feature map reconstruction method to prune redundant filters and channels, which is used to compress CNNs such as VGG-16, ResNet-50 and MobileNet. The experimental results on different networks and datasets demonstrate the effectiveness of our method.