Filter pruning-based two-step feature map reconstruction

In deep neural network compression, channel/filter pruning is widely used to compress a pre-trained network by identifying redundant channels/filters. In this paper, we propose a two-step filter pruning method that identifies the redundant channels/filters layer by layer. The first step is a filter selection scheme based on the $\ell_{2,1}$-norm that reconstructs the feature map of the current layer. More specifically, the filter selection scheme solves a joint $\ell_{2,1}$-norm minimization problem, i.e., both the regularization term and the feature map reconstruction error term are measured by the $\ell_{2,1}$-norm. The $\ell_{2,1}$-norm regularization performs the channel/filter selection, while the $\ell_{2,1}$-norm reconstruction error term provides robust reconstruction. In this way, the proposed filter selection scheme learns a column-sparse coefficient representation matrix that indicates the redundancy of filters. Since pruning the redundant filters in the current layer might dramatically influence the output feature map of the following layer, the second step updates the filters of the following layer so that its output feature map approximates that of the baseline. Experimental results demonstrate the effectiveness of the proposed method. For example, our pruned VGG-16 on ImageNet achieves a 4× speedup with a 0.95% top-5 accuracy drop, our pruned ResNet-50 on ImageNet achieves a 2× speedup with a 1.56% top-5 accuracy drop, and our pruned MobileNet on ImageNet achieves a 2× speedup with a 1.20% top-5 accuracy drop.


Introduction
In the past few years, we have witnessed the rapid development of convolutional neural networks [1][2][3][4][5][6][7]. To achieve higher accuracy, the general strategy is to build deeper and more complicated networks [8][9][10][11][12]. However, these strategies are not efficient with respect to model size and speed. In many applications on mobile terminal devices, such as robotics, self-driving cars and augmented reality, recognition tasks need to be carried out in a timely fashion on a computationally limited platform [13][14][15][16].
To address this problem, there are two kinds of approaches: designing a compact network and compressing a pre-trained network. The former aims to train a small network structure [22-27,29], where the popular method is MobileNets, which have three versions. More specifically, MobileNet-V1 adopts depthwise separable convolution to greatly reduce the amount of computation and the number of parameters, thereby improving computational efficiency. Based on MobileNet-V1, MobileNet-V2 introduces the inverted residual structure with a linear bottleneck. MobileNet-V3 was then proposed based on MobileNet-V1 and MobileNet-V2. However, the above MobileNets do not consider redundant filters. In fact, redundant filters consume computation during forward and backward propagation. Generally speaking, convolutional neural networks have high filter redundancy [8,31]. Therefore, removing the redundant filters reduces the running time of neural networks.
The latter aims to compress a pre-trained convolutional neural network (CNN), where the popular method is pruning. Pruning includes parameter pruning and channel/filter pruning. For most CNNs, convolutional layers are the most time-consuming part, while fully connected layers contain most of the network parameters. Therefore, parameter pruning aims to reduce storage, while channel/filter pruning aims to reduce computation cost. Generally, parameter pruning leads to irregular memory access, which limits the achievable efficiency gain. Therefore, special hardware or software is needed to assist with the calculation, which may increase computation time [19,32-34]. To avoid these limitations of parameter pruning, this paper focuses on channel/filter pruning, which removes entire channels/filters [12,18-21,35,36]; the benefits of removing redundant channels/filters are discussed in [12,35,36]. Lebedev and Lempitsky [18] and Wen et al. [19] employ group sparsity to select the redundant filters, but the slow convergence and the slow generation of structured filters heavily reduce the pruning efficiency. Max response [20] uses the $\ell_1$-norm to calculate the sum of the absolute weights of a filter, and a high absolute weight sum means that the filter is important. Since max response measures the importance of filters one by one, it may ignore the correlations between different filters. To this end, channel pruning [21] uses the $\ell_1$-norm to select the redundant filters indirectly: it reconstructs the feature map of the next layer from the feature map of the current layer and the filters of the next layer, which requires solving a lasso problem and thus has a high computational complexity.
In this paper, we propose a two-step feature map reconstruction method to prune redundant filters and channels. In the proposed method, both the reconstruction term and the regularization term employ the $\ell_{2,1}$-norm, so that filter pruning is learned under robust reconstruction. To the best of our knowledge, we are the first to propose a filter pruning method based on two-step feature map reconstruction, in which robust reconstruction and filter selection are performed simultaneously. Unlike most filter pruning methods, our method selects the representative filters by two-step feature map reconstruction, so that the removed filters do not influence the following layers.
The remainder of this paper is organized as follows: In Sect. 2, we present the background. In Sect. 3, we present the proposed method and its optimal solution. In Sect. 4, we give the theoretical analysis of our method. In Sect. 5, we perform the experiments to demonstrate the effectiveness and efficiency of our method. Finally, a conclusion is drawn in Sect. 6.

Background
To prune a feature map with $n_i$ channels, $n_{i+1} \times n_i \times k_h \times k_w$ convolutional filters $W$ are applied to $N \times n_i \times k_h \times k_w$ input volumes $X$ sampled from the feature map of the $i$-th layer, which produces an $N \times n_{i+1}$ output matrix $Y_{i+1}$. Here, $N$ is the number of samples, $n_{i+1}$ is the number of output channels, and $k_h$, $k_w$ are the kernel sizes. For simplicity, the bias term is ignored in the filter pruning methods. To prune the input channels from $n_i$ to a desired $n_i'$ ($0 \le n_i' \le n_i$) while minimizing the reconstruction error, the channel pruning method [21] learns a coefficient vector $\beta$ of length $n_i$ for channel selection, where $\beta_c$ (the $c$-th entry of $\beta$) is a scalar mask for the $c$-th channel (i.e., whether to drop the whole channel or not). Similar to the above channel pruning method [21], some other filter-level pruning methods [12,20,30,35] have also been explored. The core of filter pruning is to measure the importance of each filter, and the major difference between the methods is the selection strategy. Max response [20] calculates the absolute weight sum of each filter (i.e., $W(i,:,:,:)$, where $i \in \{1, 2, \ldots, n_{i+1}\}$ indexes the filters) as its importance score. ThiNet [12,35] first uses a greedy strategy to search for a subset of feature map channels such that the output produced by this subset is almost the same as that produced by all the channels. More specifically, ThiNet searches for the subset $S$ by minimizing a reconstruction error, where $d$ is the sampling number, $r$ is the compression ratio, $S$ is the subset of feature map channels, and $|S|$ is the number of elements in $S$. After obtaining the subset $S$, the redundant channels of the feature map $X_c^d$ and filter $W_c^d$ are removed. For simplicity, we denote the feature map and filter without redundancy by $\hat{X}$ and $\hat{W}$, respectively.
It is worth noting that both the channel pruning method and the ThiNet method use data-driven filter selection strategies, while first k and max response are non-data-driven methods. Besides, HRank [30] is another data-driven method, which scores each filter by the rank of the feature map it generates. In HRank, $K$ is the number of convolutional layers, $n_i$ is the number of filters in the $i$-th convolutional layer, $o_j^i(I,:,:)$ is the feature map generated by the filter $w_j^i$, and $n_{i2}$ is the number of least important filters in the $i$-th layer.
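The max response criterion described above can be illustrated with a short NumPy sketch. The function names, the toy filter tensor and the number of filters kept are illustrative assumptions and do not come from [20].

```python
import numpy as np

def max_response_scores(W):
    """Sum of absolute weights (l1-norm) of each filter W[i, :, :, :]."""
    return np.abs(W).reshape(W.shape[0], -1).sum(axis=1)

def keep_top_filters(W, n_keep):
    """Indices of the n_keep highest-scoring filters, in ascending index order."""
    scores = max_response_scores(W)
    keep = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep)

# Toy layer: 8 filters, 4 input channels, 3x3 kernels (arbitrary sizes).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
W[2] *= 0.01   # make filters 2 and 5 nearly zero so they score lowest
W[5] *= 0.01
kept = keep_top_filters(W, n_keep=6)
```

Because each filter is scored in isolation, two highly correlated filters both receive high scores, which is the limitation noted in the Introduction.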

Building model of filter pruning
Formally, for one input image, let $n_i$ denote the number of input channels of the $i$-th convolutional layer, and let $h_i$ and $w_i$ be the height and width of the input feature maps. The convolutional layer transforms the input feature map $y_i \in \mathbb{R}^{n_i \times h_i \times w_i}$ into the output feature map $y_{i+1} \in \mathbb{R}^{n_{i+1} \times h_{i+1} \times w_{i+1}}$, which is used as the input feature map of the next convolutional layer. This is achieved by applying $n_{i+1}$ 3D filters $F_{i,j} \in \mathbb{R}^{n_i \times k \times k}$ (all the filters together constitute the filter matrix $F_{i+1} \in \mathbb{R}^{n_{i+1} \times n_i \times k \times k}$) to the $n_i$ input channels, where each filter generates one feature map channel. The number of operations of the convolutional layer is $n_{i+1} n_i k^2 h_{i+1} w_{i+1}$. If a filter $F_{i,j}$ is pruned, its corresponding feature map channel $y_{i+1,j}$ is removed, which saves $n_i k^2 h_{i+1} w_{i+1}$ operations. The kernels of the next convolutional layer that are applied to the removed feature map channel are also removed, which saves an additional $n_{i+2} k^2 h_{i+2} w_{i+2}$ operations.
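As a quick check of the operation counts above, the following sketch computes the savings from pruning one filter. The helper names and the layer sizes are arbitrary illustrative values, not taken from the experiments.

```python
def conv_ops(n_out, n_in, k, h_out, w_out):
    """Operation count n_out * n_in * k^2 * h_out * w_out of one conv layer."""
    return n_out * n_in * k * k * h_out * w_out

def ops_saved_by_pruning_one_filter(n_i, n_ip1, n_ip2, k, h_ip1, w_ip1, h_ip2, w_ip2):
    """Savings in the current layer and in the next layer when one filter is pruned."""
    # Current layer loses one output channel.
    saved_here = conv_ops(n_ip1, n_i, k, h_ip1, w_ip1) - conv_ops(n_ip1 - 1, n_i, k, h_ip1, w_ip1)
    # Next layer loses one input channel.
    saved_next = conv_ops(n_ip2, n_ip1, k, h_ip2, w_ip2) - conv_ops(n_ip2, n_ip1 - 1, k, h_ip2, w_ip2)
    return saved_here, saved_next

# Hypothetical sizes: n_i = 64, n_{i+1} = 128, n_{i+2} = 128, k = 3, 56x56 outputs.
saved_here, saved_next = ops_saved_by_pruning_one_filter(64, 128, 128, 3, 56, 56, 56, 56)
# saved_here == 64 * 9 * 56 * 56 and saved_next == 128 * 9 * 56 * 56
```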
Furthermore, if there are $m$ input images, they produce the $i$-th feature map $y_i \in \mathbb{R}^{m \times n_i \times h_i \times w_i}$ and the $(i+1)$-th feature map $y_{i+1} \in \mathbb{R}^{m \times n_{i+1} \times h_{i+1} \times w_{i+1}}$. For simplicity, we sample from $y_i$ and generate $Y_i \in \mathbb{R}^{N_i \times n_i}$; the detailed sampling procedure can be found in [21]. Here, $N_i$ is the number of samples of the $i$-th layer, and $n_i$ is the number of channels of the feature map of the $i$-th layer. To prune the output channels from $n_i$ to a desired $n_i'$ while minimizing the reconstruction error, we formulate the proposed objective function as problem (5), a joint $\ell_{2,1}$-norm minimization over a coefficient matrix $A$ and a vector $b$. The designed objective function makes $A$ column-sparse, and thus, $A$ indicates the redundancy of the feature map channels and filters in the current layer (see Fig. 1). Without loss of generality, we remove the layer index $i$, which gives problem (6).
Using some mathematical techniques, problem (6) can be rewritten as problem (7), which involves two diagonal matrices $W_2 \in \mathbb{R}^{n \times n}$ and $W_1 \in \mathbb{R}^{N \times N}$ with diagonal elements $W^2_{cc}$ and $W^1_{cc}$. The smaller $W^1_{cc}$ is, the more likely the $c$-th response is an outlier. The smaller $W^2_{cc}$ is, the more important the $c$-th filter is. Here, $\sqrt{W_1}$ gives the weights of the responses: the clean responses are weighted more heavily, while the responses that are outliers are weighted less heavily. This makes our method robust to outliers. On the other hand, the regularization term $A\sqrt{W_2}$ guides the selection of filters. By adjusting the parameter $\lambda$, our method selects the effective filters under the robust reconstruction criterion. Moreover, the minimization forces the corresponding columns of the residual and of $A$ to be very small when the entries of $W_1$ and $W_2$ are large. Finally, some columns of $(Y^T - b\mathbf{1}^T) - A(Y^T - b\mathbf{1}^T)$ and $A$ may be close to zero, and thus, a column-sparse $A$ is obtained. Our goal is to remove some redundant output channels without loss of performance.
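As a hedged sketch, one formulation that is consistent with the definitions above (the residual $(Y^T - b\mathbf{1}^T) - A(Y^T - b\mathbf{1}^T)$, the weighted terms $\sqrt{W_1}$ and $A\sqrt{W_2}$, and the column-sparsity of $A$) is given below. The column-wise definition of the $\ell_{2,1}$-norm, the factor $1/2$ in the weights, and the symbols $E$, $e_c$ and $a_c$ are assumptions introduced for illustration.

```latex
% E collects the reconstruction residuals, one column per sampled response
E = (Y^{T} - b\mathbf{1}^{T}) - A\,(Y^{T} - b\mathbf{1}^{T}), \qquad
\|M\|_{2,1} = \sum_{c} \|m_c\|_2 \quad (m_c \text{ is the } c\text{-th column of } M)

% joint l_{2,1}-norm objective (assumed form of problem (6))
\min_{A,\,b} \; \|E\|_{2,1} + \lambda \|A\|_{2,1}

% re-weighted form (assumed form of problem (7)); e_c and a_c are columns of E and A
\min_{A,\,b} \; \big\|E\sqrt{W_1}\big\|_F^2 + \lambda \big\|A\sqrt{W_2}\big\|_F^2, \qquad
W^{1}_{cc} = \frac{1}{2\|e_c\|_2}, \quad W^{2}_{cc} = \frac{1}{2\|a_c\|_2}
```

With these weights, a response with a large residual $\|e_c\|_2$ receives a small $W^1_{cc}$, and a column of $A$ with a large norm receives a small $W^2_{cc}$, which matches the interpretation of the two weight matrices given in the text.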
After we design an algorithm to identify the redundant channels and filters and prune them, we should ensure that the feature map of the next layer is almost preserved, so that the removed channels do not influence the final classification result. Therefore, we reconstruct the filters of the next layer from the remaining channels by linear least squares, in which $Y_{i+1}$ is the feature map of the $(i+1)$-th layer, $\hat{Y}_i$ is the feature map of the $i$-th layer after the removal of redundant channels, and $\hat{F}_{i+1}$ denotes the filters of the $(i+1)$-th layer after the removal of redundant channels. Here, $\hat{F}_{i+1}$ is $F_{i+1}$ restricted to the remaining channels and reshaped to $n_{i+1} \times n_i' kk$. It is worth noting that if $r$ channels are removed from the $i$-th layer, the corresponding $r$ input channels of every filter in the $(i+1)$-th layer are removed as well.
To sum up, the flowchart is given in Fig. 2, which mainly includes two steps: the first step identifies the redundant filters by reconstructing the feature map of the current layer, and the second step learns the new filters by reconstructing the feature map of the next layer. Our method proceeds layer by layer. For one layer such as the $(i+1)$-th layer, the original computation cost is $n_{i+1} n_i k^2 h_{i+1} w_{i+1}$ FLOPs, while the remaining computation cost is $(n_{i+1} - r_f)(n_i - r_c) k^2 h_{i+1} w_{i+1}$ FLOPs, where $r_f$ and $r_c$ are the numbers of removed filters and removed input channels, respectively.
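The second step can be sketched with an ordinary least-squares solve. This is a minimal illustration that assumes the sampled inputs of the next layer are already arranged as patch rows of length $n_i' kk$; the function name, the synthetic data and the fully redundant removed channel are hypothetical.

```python
import numpy as np

def reconstruct_next_layer_filters(X_kept, Y_next):
    """Solve min_F ||Y_next - X_kept @ F.T||_F^2.

    X_kept: (N, n_kept * k * k) patches built from the remaining channels.
    Y_next: (N, n_out) responses of the unpruned next layer.
    Returns F with shape (n_out, n_kept * k * k).
    """
    F_t, *_ = np.linalg.lstsq(X_kept, Y_next, rcond=None)
    return F_t.T

# Synthetic example: 3 kept channels and 1 removed channel, 3x3 kernels.
rng = np.random.default_rng(0)
N, kept_cols, removed_cols, n_out = 400, 27, 9, 5
X_kept = rng.normal(size=(N, kept_cols))
X_removed = X_kept @ rng.normal(size=(kept_cols, removed_cols))  # fully redundant
F_kept = rng.normal(size=(n_out, kept_cols))
F_removed = rng.normal(size=(n_out, removed_cols))
Y_next = X_kept @ F_kept.T + X_removed @ F_removed.T

F_new = reconstruct_next_layer_filters(X_kept, Y_next)
err_sliced = np.linalg.norm(Y_next - X_kept @ F_kept.T)   # only drop the channel
err_updated = np.linalg.norm(Y_next - X_kept @ F_new.T)   # drop and refit filters
```

In this toy setting, simply slicing the old filters leaves a large reconstruction error, while refitting them on the remaining channels recovers the next-layer feature map almost exactly, which is the motivation for the second step.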
Discussion: Some recent works [20,21] also introduce a sparse norm, such as the $\ell_1$-norm [20] or Lasso [21]. However, we emphasize that our formulation and underlying idea are different. Lasso [21] uses the current filters and the previous feature map to reconstruct the feature map of the current layer and adds a sparse constraint on each channel, but the computational complexity of this model is very high. Moreover, both methods [20,21] require the sparsity level $n_i'$ to be given. Different from Lasso, we perform robust reconstruction of the feature map of the current layer. If the feature map is redundant, our model automatically identifies the redundant filters of the previous layer. Furthermore, we ensure that the remaining filters can recover the feature map of the next layer. Besides, they [20,21] use the $\ell_1$-norm to select the redundant channels, while we use the $\ell_{2,1}$-norm to select the redundant filters.

The optimal solution of problem (6)
The global optimal solution of problem (7) can be easily obtained by using an iterative re-weighting method, which includes the following two steps.
Step 1: Given $A$, we compute $b$. The optimization problem (6) reduces to problem (9). Setting the derivative of (9) with respect to $b$ to zero, we obtain the update of $b$.
Step 2: Given $b$, we compute $A$. The optimization problem (7) reduces to problem (10). Setting the derivative of (10) with respect to $A$ to zero, we obtain the update of $A$.
Iterating the above two steps will reach the global optimal solution. Algorithm 1 gives more details.
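A minimal sketch of the two-step iteration, assuming the re-weighted objective has the form $\|E\sqrt{W_1}\|_F^2 + \lambda\|A\sqrt{W_2}\|_F^2$ with $E = (Y^T - b\mathbf{1}^T) - A(Y^T - b\mathbf{1}^T)$ and weights $1/(2\|\cdot\|_2)$ on the columns of $E$ and $A$. Under that assumption, setting the derivatives to zero gives a $W_1$-weighted mean for $b$ and $A = S(S + \lambda W_2)^{-1}$ with $S = Z W_1 Z^T$ and $Z = Y^T - b\mathbf{1}^T$. The function names, the smoothing constant `eps` and the identity initialization of the weights are illustrative choices and are not taken from Algorithm 1.

```python
import numpy as np

def l21_columns(M, eps):
    """Smoothed sum of the l2-norms of the columns of M."""
    return np.sqrt((M ** 2).sum(axis=0) + eps).sum()

def reweighted_filter_selection(Y, lam=1.0, n_iter=30, eps=1e-6):
    """Iterative re-weighting for the assumed objective; Y has shape (N, n)."""
    N, n = Y.shape
    Yt = Y.T                      # n x N, one column per sampled response
    w1 = np.ones(N)               # diagonal of W1 (response weights)
    w2 = np.ones(n)               # diagonal of W2 (filter weights)
    history = []
    for _ in range(n_iter):
        # Step 1: b is the W1-weighted mean of the responses.
        b = Yt @ w1 / w1.sum()
        Z = Yt - b[:, None]
        # Step 2: A = S (S + lam * W2)^{-1} with S = Z W1 Z^T (both symmetric).
        S = (Z * w1) @ Z.T
        A = np.linalg.solve(S + lam * np.diag(w2), S).T
        # Re-weighting from the current residual columns and columns of A.
        E = Z - A @ Z
        w1 = 0.5 / np.sqrt((E ** 2).sum(axis=0) + eps)
        w2 = 0.5 / np.sqrt((A ** 2).sum(axis=0) + eps)
        history.append(l21_columns(E, eps) + lam * l21_columns(A, eps))
    return A, b, history

rng = np.random.default_rng(0)
Y = rng.normal(size=(300, 6))
Y[:5] += 20.0                     # a few outlier responses
Y[:, 4] = Y[:, 0] + Y[:, 1]       # channel 4 is redundant by construction
A, b, history = reweighted_filter_selection(Y, lam=1.0)
column_norms = np.linalg.norm(A, axis=0)   # small norm suggests a redundant channel
```

The list `history` records the smoothed objective after each iteration, so it can be used to check numerically that the value does not increase from one iteration to the next.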

Convergence analysis
Before giving the convergence proof of the optimization algorithm, we need to first give Lemma 1 [37].

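Lemma 1 For any nonzero vectors $U, q \in \mathbb{R}^d$, the following inequality holds: $\|U\|_2 - \frac{\|U\|_2^2}{2\|q\|_2} \le \|q\|_2 - \frac{\|q\|_2^2}{2\|q\|_2}$.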
Based on Lemma 1, we prove Theorem 1.

Theorem 1 Algorithm 1 will monotonically decrease the value of the objective function of the optimization problem (7) in each iteration and converge to a local optimal solution.
Proof For simplicity, we denote the updated $b$ and $A$ by $\tilde{b}$ and $\tilde{A}$. Since the updated $\tilde{b}$ and $\tilde{A}$ are the optimal solution of problem (5), according to the definition of $W_1$ and $W_2$, we have inequality (12). On the one hand, according to Lemma 1, we have inequality (13); using matrix calculus, (13) can be written as (14). On the other hand, according to Lemma 1, we have inequality (15); similarly, using matrix calculus, (15) can be written as (16). By combining (12) and (14) with (16), we obtain (17), so the objective value does not increase in each iteration. Since the objective of problem (5) has an obvious lower bound 0, Algorithm 1 converges.
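As a hedged sketch of the argument, assume that the $\ell_{2,1}$-norm sums the $\ell_2$-norms of columns, and write $e_c$, $a_c$ for the columns of the residual and of $A$ before an update and $\tilde{e}_c$, $\tilde{a}_c$ for the columns after it. These symbols are introduced here for illustration only. The chain of inequalities then reads:

```latex
% (i) the update does not increase the re-weighted objective
\sum_c \frac{\|\tilde e_c\|_2^2}{2\|e_c\|_2} + \lambda \sum_c \frac{\|\tilde a_c\|_2^2}{2\|a_c\|_2}
\;\le\; \sum_c \frac{\|e_c\|_2^2}{2\|e_c\|_2} + \lambda \sum_c \frac{\|a_c\|_2^2}{2\|a_c\|_2}

% (ii) Lemma 1 applied to each residual column, then summed
\sum_c \Big( \|\tilde e_c\|_2 - \frac{\|\tilde e_c\|_2^2}{2\|e_c\|_2} \Big)
\;\le\; \sum_c \Big( \|e_c\|_2 - \frac{\|e_c\|_2^2}{2\|e_c\|_2} \Big)

% (iii) Lemma 1 applied to each column of A, then summed
\sum_c \Big( \|\tilde a_c\|_2 - \frac{\|\tilde a_c\|_2^2}{2\|a_c\|_2} \Big)
\;\le\; \sum_c \Big( \|a_c\|_2 - \frac{\|a_c\|_2^2}{2\|a_c\|_2} \Big)

% (i) + (ii) + lambda * (iii): the l_{2,1} objective does not increase
\sum_c \|\tilde e_c\|_2 + \lambda \sum_c \|\tilde a_c\|_2
\;\le\; \sum_c \|e_c\|_2 + \lambda \sum_c \|a_c\|_2
```

Lemma 1 requires nonzero vectors, so this sketch assumes that no column of the residual or of $A$ is exactly zero during the iterations.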

Computational complexity analysis
The main computational cost of problem (6) comes from two steps in each iteration: the first step computes $b$, whose computational complexity is $O(n^3)$; the second step computes $A$, whose computational complexity is also at most $O(n^3)$. Therefore, the computational complexity of one iteration is up to $O(n^3)$. If Algorithm 1 needs $t$ iterations, the total computational complexity is on the order of $O(tn^3)$.

Experiments
We prune the filters of three types of networks, i.e., VGG-16 [6], ResNet-50 [38] and MobileNet [22], and evaluate them on ImageNet [39], CIFAR-10 [40] and CIFAR-100 [40]. ImageNet comprises 1.28 million training images and 50000 validation images from 1000 classes. We fine-tune the networks on the training set and report the accuracy on the validation set, with the shorter side of the images resized to 256. For data augmentation, we follow the standard practice [21] and perform random-size cropping to 224×224 and random horizontal flipping; more experimental details can be found in [21]. CIFAR-10 consists of 10 classes with 6000 images per class, of which 50000 images are used for training and 10000 for validation. Similarly, CIFAR-100 consists of 100 classes with 600 images per class, of which 50000 images are used for training and 10000 for validation. On the CIFAR-10 and CIFAR-100 datasets, we fine-tune the networks on training images of size 32×32, with the per-pixel mean subtracted on the training and validation sets. For data augmentation, we adopt random horizontal flipping.
Our method is compared with the classical first k and max response [20] methods and with the state-of-the-art channel pruning [21], ThiNet [35] and HRank [30] methods, which are similar to our method to some extent.
Implementation: Our method is applied to the network layer by layer. Our method has one parameter $\lambda$, and thus involves parameter selection. More specifically, when parameter selection is performed on a layer, the other layers are fixed as in the baseline. One common way to determine the parameter is grid search: we vary $\lambda$ within the range $\{10^{-6}, 10^{-5}, \ldots, 10^{6}\}$, and the value that yields the highest classification accuracy is selected. After the parameters on all the layers of a network are determined, the compressed network is obtained. Theoretically, if the redundant filters of the $i$-th layer are removed, the updated filters of the $(i+1)$-th layer can almost recover the feature map of the $(i+1)$-th layer, and thus, the final classification performance is preserved. For a fair comparison, all the methods adopt the same speedup ratio. For example, all the methods use a 2× speedup ratio on ResNet-50. More specifically, for the 2× speedup ratio on ResNet-50, the number of effective filters is first obtained by the channel pruning method [21], and then the same number of effective filters is adopted by all the other methods, including ours.
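The per-layer grid search over $\lambda$ described above can be sketched as follows. The callback `evaluate`, which stands for pruning one layer with a given $\lambda$ and measuring the classification accuracy, is a hypothetical placeholder, and the toy scoring function exists only to make the example runnable.

```python
import math

def grid_search_lambda(evaluate, exponents=range(-6, 7)):
    """Try lambda = 10^p for each exponent p and return (best_lambda, best_score).

    evaluate(lam) must return a score where higher is better (e.g., accuracy).
    """
    best_lam, best_score = None, float("-inf")
    for p in exponents:
        lam = 10.0 ** p
        score = evaluate(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score

# Toy stand-in for "prune the layer with lam, then measure accuracy".
def toy_accuracy(lam):
    return -abs(math.log10(lam) - 2.0)

best_lam, best_score = grid_search_lambda(toy_accuracy)
```

Because the other layers are kept at the baseline during the search, one such search per layer is enough, and the cost grows linearly with the number of layers and with the 13 candidate values.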

Experimental results of single layer
We implement three methods to compress the VGG-16 network, and the experimental results are shown in Fig. 3. For the Conv2_1 layer, the classification accuracy of the three methods drops dramatically as the speedup ratio increases. However, our method outperforms the other methods when the speedup ratio is between 2× and 4×, where 2× means that the running time of the compressed network is 0.5 times that of the baseline network. In this range, the classification accuracy drops by about 0.1% to 0.84% compared with the baseline. More specifically, when the classification accuracy drops by 0.84% compared with the baseline, our method reaches a 4× speedup ratio (i.e., our FLOPs are 25% of the baseline).

Experimental results across all layers
Guided by the experimental results of single layers, we observe that there is large redundancy in the first several layers of VGG-16, while its last layers are not very redundant. To this end, we prune more filters in the shallow layers while keeping the original filters in the conv5_x layers; the detailed pruning configuration follows channel pruning [21]. The experimental results without fine-tuning are shown in Fig. 4, which shows that the three methods obtain similar results when the speedup ratio is small. As the speedup ratio increases, the advantage of our method becomes more evident. The experimental results with fine-tuning are shown in Tables 1, 2 and 3, where bold font indicates the best result. It can be seen that, with the same speedup ratio (so that the FLOPs and the number of parameters (#Param) of all the methods are the same), our method outperforms the other methods with respect to top-1 classification accuracy.

ResNet pruning
We also apply the pruning methods to ResNet, which is a multi-path network whose structure is more complex than that of VGG-16.
Through the single-layer experiments, we observe that there is large redundancy in the shallow layers. To this end, we prune the branch2a and branch2b layers but keep the branch2c layer in this network. The detailed pruning configuration on ResNet-50 follows channel pruning [21]. We compare our method with five pruning methods (i.e., first k, max response [20], channel pruning [21], ThiNet [35] and HRank [30]) at a 2× speedup ratio, and the experimental results are shown in Tables 4, 5 and 6, where a negative value means that the accuracy is higher than the baseline. It can be seen that, with the same speedup ratio, our method is comparable with channel pruning, ThiNet and HRank, and outperforms the classical first k and max response methods. For example, in Table 4, our method obtains better top-1 classification accuracy than first k, max response, channel pruning and HRank, but worse than ThiNet. However, we obtain better top-5 classification accuracy than ThiNet. Therefore, we say our method is comparable with ThiNet. In Table 6, our method obtains worse top-1 classification accuracy than channel pruning and HRank, while we obtain better top-5 classification accuracy than HRank. Therefore, we say our method is comparable with HRank.

MobileNet pruning
As a lightweight network, MobileNet does not have a high degree of redundancy. The detailed pruning configuration on MobileNet follows the strategy used in channel pruning [21]. We compare our method with five pruning methods (i.e., first k, max response [20], channel pruning [21], ThiNet [35] and HRank [30]) at a 1.5× speedup ratio. The experimental results in Table 7 show that our method outperforms the other methods at the same speedup ratio.

Conclusion
In this paper, we propose a two-step feature map reconstruction method to prune redundant filters and channels, which is used to compress CNNs such as VGG-16, ResNet-50 and MobileNet.