Differentiable channel pruning guided via attention mechanism: a novel neural network pruning approach

Neural network pruning offers great prospects for facilitating the deployment of deep neural networks on devices with limited computational resources. Neural architecture search (NAS) provides an efficient way to automatically seek an appropriate neural architecture design for the compressed model. It is observed that existing NAS-based pruning methods usually lack layer information when searching for the optimal neural architecture. In this paper, we propose a new NAS approach, namely, a differentiable channel pruning method guided via attention mechanism (DCP-A), where the adopted attention mechanism is able to provide layer information to guide the optimization of the pruning policy. The training process is differentiable with Gumbel-softmax sampling, while parameters are optimized under a two-stage training procedure. The neural network block with the shortcut is specially designed, which helps to prune the network not only in its width but also in its depth. Extensive experiments are performed to verify the applicability and superiority of the proposed method. Detailed analysis with visualization of the pruned model architecture shows that our proposed DCP-A learns explainable pruning policies.


Introduction
Deep neural networks (DNNs) have achieved remarkable accomplishments in a variety of applications such as pattern recognition [5,18,31,41,60], and have also shown sustained superiority in comparison to other methods. However, the large number of model parameters and the high performance demand on GPUs have also brought about great challenges in storage and time costs. Therefore, much research attention has been devoted to the problem of running DNNs on computationally limited devices such as mobile equipment and embedded devices. As a rather popular approach, neural network pruning offers a great prospect for facilitating the deployment of DNNs on computational-resource-limited devices. In general, the widely applied neural network pruning approaches can be divided into two categories, namely, weight pruning [8,10,20,63,64] and channel pruning [4,6,12,21,61,62]. Since weight pruning cannot harvest obvious acceleration for modern networks due to its unstructured operation manner, we focus on channel pruning in this paper.
There are two types of channel pruning methods, i.e., criterion- and NAS-based channel pruning methods. The main procedure of the criterion-based channel pruning techniques is to first determine the basic criterion and then prune filters layer by layer, which requires us to manually set the pruning ratio for each layer. In practice, the pruning ratio is usually set to be equal for each layer so as to simplify the entire process. Unfortunately, such a simplification could lead to poor performance due to the fact that different layers possess different redundancies. On the other hand, neural architecture search provides a powerful tool to automatically seek an efficient neural architecture. So far, extensive studies have been conducted on the neural architecture search problem with the aim to explore the optimal network structures in a large design space while taking into account the trade-off among the model size, the speed, and the accuracy. Note that, when utilizing traditional NAS-based methods, we usually confront the difficulty of searching a large space at an unacceptable searching speed. Although some effort has been devoted to reducing the searching space [12,22,24,28,38], the layer information, i.e., the information of filters in one layer, has seldom been taken into consideration when it comes to the optimization of the pruning policy. Basically, most criterion-based pruning methods fail to take the correlation among layers into consideration, while NAS-based methods usually ignore the information of individual filters in a layer.
Fig. 1 An illustration of DCP-A where the policy logits are designed for all filters in the network and, in each layer, the importance scores are assessed by a special mechanism to obtain the information among filters and then guide the optimization of policy logits
Inspired by the above discussion, in this paper, we propose a new differentiable channel pruning framework guided via attention mechanism (DCP-A), shown in Fig. 1, where a policy is used to determine the pruning decision and the importance scores in a layer are used to guide the optimization of the policy logit. The importance scores can be obtained by any pruning criterion. In this paper, we consider the importance scores obtained by the $\ell_1$ norm, the $\ell_2$ norm and the attention mechanism. Among these, the policy logit guided by the attention mechanism gives the best experimental result. The attention mechanism is a concept derived from cognitive psychology that allows models to devote limited resources to more important channels [35]. The pruning-or-not policy is sampled from the policy logit, which is defined for each filter in the network. To obtain layer information, the attention score with an attention-guided loss is adopted to regulate the optimization of the policy logit. Hence, the attention score provides the correlation of filters in a layer and, meanwhile, the attention-guided loss limits the searching space for the pruning policy. Moreover, a two-stage training procedure is proposed to ensure that the introduced attention modules are well-trained and easily removed (without increasing the final FLOPs of the pruned network).
The main contributions of this paper can be highlighted as follows: (1) a new NAS-based differentiable channel pruning framework is proposed, where importance scores obtained by different mechanisms (including the attention mechanism) are adopted to provide layer information for the optimization of the pruning policy logit; (2) a two-stage training procedure with designed training objectives is proposed to optimize the network parameters, the policy logits and the attention modules; (3) for networks with a shortcut structure (e.g. ResNet), the proposed DCP-A algorithm is capable of pruning networks not only in width but also in depth; (4) the proposed DCP-A can be easily extended to the multi-model case; (5) via extensive experiments, the effectiveness and efficiency of the proposed DCP-A framework are demonstrated on different databases, and detailed analysis is provided through structure visualization to show that the pruning policies learned by DCP-A are explainable.
The remainder of this paper is organized as follows. In "Related work" section, we introduce the related works of model pruning, neural architecture search and attention mechanism. In "Methodology" section, we describe our DCP-A framework in detail. The experimental study and the corresponding analysis are presented in "Experiments" section. "Conclusion" section gives the conclusions of this paper.

Related work
In terms of its objectives, model pruning can be generally classified into two categories, namely, weight pruning and channel pruning. On one hand, weight pruning directly removes connections in filters, which might lead to unstructured sparsity and, furthermore, make it difficult to accelerate the inference with general-purpose hardware. On the other hand, channel pruning prunes entire filters so that existing basic linear algebra subprograms (BLAS) libraries can still be deployed, thereby achieving better acceleration. Considering how to design the pruning policy, we can roughly divide channel pruning methods into criterion-based pruning and NAS-based pruning.

Criterion-based pruning
Generally, criterion-based pruning methods assess the importance of filters by utilizing filter weights or filter activations. In [21], the importance of a filter has been calculated by the corresponding absolute weights sum, according to which the unimportant filters have been pruned. Filters with small $\ell_2$ norm have been slightly pruned in [13]. In [14], filters near geometric median have been pruned with the most replaceable contribution. In [4], three criteria have been utilized to find the important filters for satisfying the least replacing loss, the diversity and the high entropy of weights. It is worth noticing that all the aforementioned criterion-based methods use manual settings for the pruning ratio for layers.

NAS-based pruning
In early results concerning NAS, the optimal network structures have been found by resorting to the reinforcement learning [68] or evolutionary algorithms [55] which would consume substantial computation costs. Gradient-based NAS methods [29,53,54] have been exploited to reduce the cost by making the searching mechanism differentiable or approximately differentiable to enhance the searching efficiency. In [25], a partial order pruning method has been developed to automatically search the architectures with the best tradeoff between speed and accuracy. In [28], channel number in each layer has been searched based on the artificial bee colony algorithm. In [26], the designed hypernetwork has taken the latent vectors as the input and generated the weight parameters of the backbone network. It should be pointed out that, however, the aforementioned methods only take global network information into account, and there is still a lack of layer information when conducting the searching.

Attention mechanism
In [16,48], attention modules have been proposed to help DNNs focus on important channels and achieve a better performance. Recently, the attention mechanism has been considered in model pruning as an importance evaluation criterion of filters. In [52], an attention module has been embedded into the model to generate scaling factors for channels, which are regarded as channel importance scores. In [6], a long short-term memory has been introduced to generate a strategy indicating the number of pruning filters for each layer. In this strategy, attention blocks have been embedded in the network, and filters with lower attention scores have been forbidden in a feed-forward manner. In both the methods mentioned above, the attention score has been used directly to rank the filters in a layer.

Approach overview
For a network that needs to be pruned, it is our goal to learn a pruning policy that determines the filters to be pruned with the least performance loss. An attention module with an attention score is utilized to evaluate the importance of filters in the layer. Note that attention modules are not expected to directly influence the optimization of network parameters because they will be removed from the pruned network to avoid increasing the FLOPs. Therefore, we define pruning policies for all filters in the whole network and use the attention module as a guiding tool only.
In Fig. 2, an overview of our proposed DCP-A training approach is illustrated, which consists of two stages in the training epochs: (1) the stage of training parameters of the network, and (2) the stage of optimizing attention modules (Squeeze-and-Excitation block used in this paper) as well as policy logits. To be more specific, such a two-stage approach is explained as follows.
(a) Stage one: In the first stage, policy logits and parameters of attention blocks are fixed, while the parameters of the network are free (to be optimized). It should be mentioned that attention modules do not participate in feed-forward in this stage, and only the average attention score of each attention block is recorded.
(b) Stage two: In the second stage, the parameters of the network are fixed, while the parameters of attention blocks and policy logits are set to be free (to be optimized). Here, attention modules are activated for updating parameters. Attention scores obtained in the previous stage will be utilized as a guidance for optimizing the policy logits.
By repeating two stages alternately during training, optimal pruning pattern can be learned, resulting in a well-pruned network. Gumbel-softmax trick is utilized to make the training process differentiable. The details of our approach will be described in the following.

Attention mechanism
In this paper, the Squeeze-and-Excitation (SE) block proposed in [16] is employed to obtain the attention scores. The SE module (also known as the channel attention module) is able to select the most useful feature among channels, thereby improving the effectiveness of the feature representations. Moreover, SE block is an effective attention block that can be flexibly embedded into most existing network structures and, consequently, the SE block has been widely used in computer vision applications [45]. An SE block contains two parts in its structure, namely, squeeze and excitation.
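(a) Squeeze: In the squeeze part, the global information of each feature channel is obtained by an average pooling layer. Assume that the input of the $l$th SE block is $X_l = [x_l^1, x_l^2, \ldots, x_l^C] \in \mathbb{R}^{H \times W \times C}$, then the average global information of each channel is defined as
$$z_l^k = A(x_l^k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_l^k(i, j), \qquad (1)$$
where $A(\cdot)$ is the global average pooling function, and $x_l^k(i, j)$ represents the pixel value.
(b) Excitation: In the excitation part, the global information $z_l = [z_l^1, \ldots, z_l^C]$ is fused as follows to obtain the attention score $S_l$ of each channel:
$$S_l = \delta\left(W_2\, \sigma(W_1 z_l)\right), \qquad (2)$$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C \times 1 \times 1}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r} \times 1 \times 1}$ are the weights that capture the correlation of channels; $r$ is the reduction ratio; $\sigma$ represents the activation function ReLU; and $\delta$ denotes the activation function Sigmoid.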
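The squeeze and excitation steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the two $1 \times 1$ convolutions are written as bias-free matrix products, and the tensor sizes and random weights are hypothetical.

```python
import math
import random


def squeeze(x):
    """Global average pooling: x is a list of C channels, each an H x W grid."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def excitation(z, w1, w2):
    """Attention scores: Sigmoid(W2 * ReLU(W1 * z)); biases are omitted (assumption)."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]  # ReLU
    return [1.0 / (1.0 + math.exp(-s)) for s in matvec(w2, hidden)]  # Sigmoid


# Hypothetical sizes: C = 4 channels, 3 x 3 feature maps, reduction ratio r = 2.
random.seed(0)
C, H, W, r = 4, 3, 3, 2
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
scores = excitation(squeeze(x), w1, w2)  # one score in (0, 1) per channel
```

Each entry of `scores` lies strictly between 0 and 1, so it can be read as a per-channel importance within the layer.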
In the literature, it has been shown that the SE block possesses the ability to generate importance scores for channels, thereby enhancing the network performance. As shown in Fig. 3, pruning filters in one layer can be performed based on the attention scores. For example, we can set the threshold to be 0.5, and prune the filters with attention scores less than 0.5 (nearly half of the filters). However, in the whole network, such a technique is not applicable anymore as the attention score only reflects the relationship of filters in the same layer. In Fig. 4, it can be seen that attention scores of different layers are widely separated, while those in the same layer are concentrated within a very small range. Obviously, the network pruning would fail if we were to directly set a threshold for attention scores to prune the whole network.
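The failure of a single global threshold can be illustrated with a toy example. The scores below are hypothetical and only mimic the situation described for Fig. 4, where each layer's scores are tightly clustered but the clusters of different layers are far apart.

```python
# Hypothetical attention scores, tightly clustered within each layer.
layer_scores = {
    "layer_a": [0.31, 0.33, 0.30, 0.34],
    "layer_b": [0.52, 0.55, 0.51, 0.54],
    "layer_c": [0.81, 0.83, 0.80, 0.84],
}


def kept_per_layer(scores_by_layer, threshold):
    """Count, per layer, the filters whose score reaches the threshold."""
    return {
        name: sum(s >= threshold for s in scores)
        for name, scores in scores_by_layer.items()
    }


# A global threshold of 0.6 removes two layers entirely and prunes nothing
# in the third one.
kept = kept_per_layer(layer_scores, threshold=0.6)
```

With these numbers, `layer_a` and `layer_b` lose all their filters while `layer_c` keeps all of them, which is the degenerate outcome that a per-filter policy is meant to avoid.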
With the purpose of conquering the above-mentioned difficulty, we define a policy of pruning-or-not for each filter in the network.

Network pruning policy
Assume that a neural network has $L$ layers with weights $\{W_1, \ldots, W_L\}$, where, for the $l$th layer, $C_{in}^l$ and $C_{out}^l$ represent the sizes of input and output channels, respectively.
For the $k$th filter $f_{l,k}$ in the $l$th layer, we introduce a binary-valued variable $u_{l,k}$ to determine whether it is pruned or not. It should be mentioned that $u_{l,k}$ is sampled from a discrete probability distribution, and back-propagation through this sampling is not possible because of its non-differentiability. Hence, we employ the Gumbel-Softmax trick [17] to substitute the original non-differentiable sample (from a discrete distribution) with a differentiable sample (from a corresponding Gumbel-Softmax distribution) [12,44]. We use $\pi_{l,k} = [1 - \alpha_{l,k}, \alpha_{l,k}]$ to represent the distribution vector of $u_{l,k}$, where the logit $\alpha_{l,k}$ indicates the possibility of pruning $f_{l,k}$. Then, in Gumbel-softmax sampling, $u_{l,k}$ is generated as
$$u_{l,k} = \arg\max_{j \in \{0,1\}} \left( \log \pi_{l,k}(j) + G_{l,k}(j) \right), \qquad (3)$$
where $G_{l,k} = -\log(-\log U_{l,k})$ is a standard Gumbel distribution with $U_{l,k}$ sampled from a uniform i.i.d. distribution $U(0, 1)$. Then, the one-hot vector of $u_{l,k}$ is reformulated to the soft decision $v_{l,k}$ with the reparameterization trick as follows:
$$v_{l,k}(j) = \frac{\exp\left( (\log \pi_{l,k}(j) + G_{l,k}(j)) / \tau \right)}{\sum_{i \in \{0,1\}} \exp\left( (\log \pi_{l,k}(i) + G_{l,k}(i)) / \tau \right)}, \qquad (4)$$
where $j \in \{0, 1\}$ and $\tau$ is the softmax temperature. For $\tau > 0$, the Gumbel-softmax distribution is smooth (approaching a uniform distribution as $\tau \to \infty$), so that $\alpha_{l,k}$ can be optimized with gradient descent. When $\tau \to 0$, $v_{l,k}$ becomes one-hot.
Fig. 4 An illustration of attention score distributions for different layers. If a threshold is set to be 0.6, then filters of three layers (red, blue and green) will be completely pruned. In contrast, filters of one layer (purple) will be entirely reserved
Algorithm 1 DCP-A
...
6:     Sample policy with $\theta_\pi$ and $\tau$ based on (4).
7:     Optimize $\theta_W$.
8:     Record attention score $S$.
9:   end for
10:   Free $\theta_{SE}$ and $\theta_\pi$.
11:   Fix $\theta_W$.
12:   for $x$ in $X_2$ do
13:     Sample policy with $\theta_\pi$ and $\tau$ based on (4).
14:     Optimize $\theta_{SE}$ and $\theta_\pi$ with $S$.
15:   end for
16:   Update $\tau$
17: end for
18: Prune network with $\gamma$ and updated policy logit $\theta_\pi^*$.
19: Train the pruned network and obtain $\theta_W^*$.
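The Gumbel-softmax relaxation described above can be sketched for a single filter in plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, a small constant guards the logarithms, and the noise is passed in explicitly so that the effect of the temperature can be checked deterministically.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample as -log(-log(U)), with U ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def soft_decision(alpha, tau, noise):
    """Soft decision [v(0), v(1)] for one filter with pi = [1 - alpha, alpha]."""
    eps = 1e-12  # guards log(0); an implementation detail, not from the text
    pi = [1.0 - alpha, alpha]
    logits = [(math.log(p + eps) + g) / tau for p, g in zip(pi, noise)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
noise = [sample_gumbel(rng), sample_gumbel(rng)]
v_sampled = soft_decision(alpha=0.7, tau=5.0, noise=noise)

# With the noise switched off, the temperature alone controls the sharpness.
v_smooth = soft_decision(alpha=0.7, tau=5.0, noise=[0.0, 0.0])
v_sharp = soft_decision(alpha=0.7, tau=0.01, noise=[0.0, 0.0])
```

At a temperature of 5 the two entries of `v_smooth` stay close to each other, whereas at 0.01 `v_sharp` is numerically one-hot, which matches the annealing of $\tau$ towards 0 during training.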

Training objectives
For training objectives, the accuracy losses are $L(\theta_W)$, $L(\theta_{SE})$ and $L(\theta_\pi)$, which represent the accuracy losses of optimizing the parameters of the network, the SE blocks and the policy logits, respectively.
In consideration of the pruning mission, a sparsity regularization $L_{sparsity}(\theta_\pi)$ is adopted to ensure the possibility of pruning filters, in which $w_l$ represents the influence imposed by the $l$th layer on the FLOPs of pruning filters. In most existing techniques, only $L_{sparsity}(\theta_\pi)$ and $L(\theta_\pi)$ are used to optimize $\theta_\pi$. Since the attention score is introduced to provide the layer information in this paper, an attention-score-guided loss $L_{guided}$ should also be taken into consideration as an objective, in which $S_l \in \mathbb{R}^{1 \times 1 \times C}$ is the average attention score of the SE block obtained in the first stage, and $dist(\cdot)$ measures the cosine distance. The training objectives in stage two are illustrated in Fig. 5.
Finally, the total loss function combines the accuracy loss with $L_{sparsity}$ and $L_{guided}$, where $\lambda_1$ and $\lambda_2$ control the weights of $L_{sparsity}$ and $L_{guided}$, respectively, and $\theta_W$, $\theta_{SE}$ and $\theta_\pi$ are optimized alternately during training. The whole DCP-A framework is described in Algorithm 1.
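As an illustration only, the guided term and the weighted total can be sketched as follows. The exact expressions are assumptions: the cosine distance is taken as one minus the cosine similarity, the guided loss is summed over layers between the per-layer policy logits and the recorded attention scores, and the total is a plain weighted sum. The sparsity term is passed in as a number because its precise form is not specified here, and all function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Assumed form of dist: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def guided_loss(logits_by_layer, scores_by_layer):
    """Assumed guided loss: sum of per-layer distances between logits and scores."""
    return sum(
        cosine_distance(logits, scores)
        for logits, scores in zip(logits_by_layer, scores_by_layer)
    )


def total_loss(accuracy_loss, sparsity_loss, guided, lam1=0.5, lam2=0.5):
    """Assumed weighted sum; 0.5 is the classification setting for both weights."""
    return accuracy_loss + lam1 * sparsity_loss + lam2 * guided


# Hypothetical two-layer example: the first layer's logits follow its scores
# exactly (distance 0), the second layer's logits are orthogonal (distance 1).
g = guided_loss([[1.0, 2.0], [1.0, 0.0]], [[2.0, 4.0], [0.0, 1.0]])
loss = total_loss(accuracy_loss=1.0, sparsity_loss=2.0, guided=g)
```

The guided term vanishes when the logits of a layer are proportional to its attention scores, so it constrains only the relative ordering of filters within a layer, not their absolute values.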

Architectural design
Since network block with shortcut has been widely used nowadays, in this paper, two types of block architecture (basic block and bottleneck block) are considered with special design.
As shown in Fig. 6, for basic block consisting of two convolutional layers and a shortcut, we use the same policy logit for layers in the same block. For a bottleneck block with three convolutional layers, we use the same policy for the input and middle layer, and a new policy for the output layer. The pruning ratios of layers in the same block are set to be the same. Note that shortcut is protected in our method. Due to the special architecture of shortcut, the output equals the input if the policies are zero vectors, which is equivalent to skipping the whole block. Hence, protecting shortcut will help DCP-A skip network block and change the depth of the network.
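The block-skipping effect can be sketched with a toy residual block. This is a simplified illustration, not the actual network code: each channel is reduced to a single number, the residual branch is a hypothetical function, and any activation applied after the addition is ignored.

```python
def block_forward(x, residual_fn, keep_mask):
    """Shortcut plus masked residual branch: y = x + mask * F(x), per channel."""
    residual = residual_fn(x)
    return [xi + m * ri for xi, m, ri in zip(x, keep_mask, residual)]


def toy_residual(x):
    """Hypothetical stand-in for the convolutional branch of the block."""
    return [2.0 * v + 1.0 for v in x]


x = [1.0, 2.0, 3.0]
partly_pruned = block_forward(x, toy_residual, keep_mask=[1, 0, 1])
fully_pruned = block_forward(x, toy_residual, keep_mask=[0, 0, 0])
```

When every entry of the mask is zero, `fully_pruned` equals the input, so the block reduces to its shortcut and can be removed, which is how pruning in width turns into pruning in depth.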

Extension to multi-model pruning
As shown in Fig. 7, "Widen-Compression" is provided in DCP-A for multi-model pruning case. Assuming that the original layer in one model has 4 filters, the policy logit will be widened to 8 (doubled). When both strategies A and B choose to reserve a filter in the same position (e.g. the 7th position in Fig. 7), this filter will be shared in pruned models. Hence, DCP-A can help design the shared structure in multi-model pruning.

Experiments
Our implementation is in PyTorch [37] with an NVIDIA 2080Ti GPU. Experiments on different databases have proved the effectiveness of our method. We also exhibit various details of pruned model visually to further explore the rationality of DCP-A.

Databases
We evaluate our established DCP-A framework on the following databases: (1) CIFAR-10 and CIFAR-100 [19], each of which contains 60,000 color images, with 50,000 training images and 10,000 testing images; (2) ILSVRC-2012 [39] (ImageNet), which is a large-scale dataset containing 1.28 million training images and 50,000 validation images of 1,000 classes; and (3) NYU-v2 [42], which comprises 1,449 labeled images taken from video sequences of a variety of indoor scenes recorded by both RGB and depth cameras, including 795 images for training and 654 images for validation. We use the 40-class annotation for semantic segmentation. During the training, we resize the input images to 224 × 224 and test on the full resolution 256 × 512.

Performance metrics
To evaluate the network compression and testing performance, the following measures are applied:
Acc.: The accuracy of testing on image classification. Acc. ↓ (%) is the accuracy drop between the pruned and the baseline models. The smaller, the better. For CIFAR-10, top-1 accuracy is provided, while for ILSVRC-2012, both top-1 and top-5 accuracies are reported.
FLOPs: The overall number of floating point operations (FLOPs) is used as an indicator of computation costs. We use FLOPs ↓ (%) to describe the percentage of reduced FLOPs.
Pixel Acc.: Pixel Accuracy (Pixel Acc) on semantic segmentation. The higher, the better. It is defined as follows:
$$\text{Pixel Acc} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}},$$
where $p_{ij}$ means the number of pixels belonging to the $i$th class but predicted to be in the $j$th class; $k$ is the number of classes.
mIoU: Mean Intersection over Union (mIoU) on semantic segmentation. The higher, the better. It is defined as follows:
$$\text{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}.$$
$\Delta_T$: Following [36,44], a single relative performance with respect to the baseline is defined for semantic segmentation over multiple metrics $M$, where $|M|$ represents the number of metrics.
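Both segmentation metrics can be computed from a confusion matrix whose entry in row i and column j counts the pixels of class i predicted as class j. The following is a minimal sketch with a hypothetical two-class matrix; skipping classes that never occur (empty union) is an implementation choice made here, not something stated in the text.

```python
def pixel_accuracy(conf):
    """Correctly classified pixels divided by all pixels."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def mean_iou(conf):
    """Average over classes of intersection / union."""
    k = len(conf)
    ious = []
    for i in range(k):
        intersection = conf[i][i]
        actual = sum(conf[i])                        # pixels belonging to class i
        predicted = sum(conf[j][i] for j in range(k))  # pixels predicted as class i
        union = actual + predicted - intersection
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Hypothetical confusion matrix for two classes.
conf = [[3, 1],
        [1, 5]]
acc = pixel_accuracy(conf)   # 8 correct pixels out of 10
miou = mean_iou(conf)        # mean of 3/5 and 5/7
```

When every class is present, `mean_iou` divides by the number of classes k, as in the usual definition.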

Network architecture
We mainly focus on pruning ResNet [11] which has less redundancy than VGG-net [43]. An illustration of pruned MobileNet structure has also been provided in "Pruned result visualization" section.

Training setting
For image classification, we train the parameters of the network and attention blocks with the Stochastic Gradient Descent (SGD) optimizer, an initial learning rate of 0.1, a momentum of 0.9, a batch size of 256 and a weight decay of 0.0005. Following [44,53], Adam is used for optimizing the policy logit and the constant learning rate is set to be 0.01. τ is initialized as 5 and then decayed to near 0. The loss constraint weights λ 1 and λ 2 are both set to be 0.5. The network is trained for 50 epochs on CIFAR and for 10 epochs on ImageNet to learn the policy logit. After training, we can obtain the optimal policy logit of the network. Then, we prune the network according to the limit on FLOPs. Attention blocks will be removed from the pruned network, hence they will not increase the FLOPs. Pruned models will be trained for 200 epochs on CIFAR. A pre-trained model is used on ImageNet and the total number of epochs is 100. The baseline training schedule follows [14]. The learning rate is divided by 5 at epochs 60, 120 and 160. For segmentation, the learning rate of network parameters is set to be constant (0.001) with weight decay (0.0001), batch size (8), training epoch (50) for optimizing and training epoch (50) for warm-up. λ 1 and λ 2 are set to be 0.01 and 0.1, respectively. The total number of re-training epochs is 300. τ is also initialized as 5.
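The step schedule for the learning rate on CIFAR can be written as a small function. This is a sketch under one assumption: epochs are counted from 0 and the reduction takes effect at the milestone epoch itself; the function name and signature are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), factor=5.0):
    """Divide the base learning rate by `factor` once per milestone reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr / (factor ** drops)


# Learning rate at the start of each phase of the 200-epoch re-training.
phase_rates = [learning_rate(e) for e in (0, 60, 120, 160)]
```

Under this assumption the rate takes four values over the run, starting at 0.1 and ending at 0.1 / 125.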
At training time, we randomly split the original training database into two sub-training databases for two stages.

Different guidance
In DCP-A, attention score provides layer information for optimizing policy logit. For comparison, we test DCP with another layer information calculation as well as without layer information. As exhibited in Table 1, DCP-WOL represents performing DCP without layer information. DCP-L1-norm and DCP-L2-norm describe replacing attention score $S_l$ with the $\ell_1$ norm and $\ell_2$ norm of weights, respectively. The results show that layer information has a positive impact on facilitating network performance (DCP-L1-norm, DCP-L2-norm versus DCP-WOL). Moreover, DCP with attention (DCP-A) performs best because the attention mechanism learns better layer information.
Bold indicates the best results of compressed models under similar compression ratios

Pruning on CIFAR-10
ResNet has a special design for CIFAR that contains basic blocks, while we use the same policy logit for layers in the same block as mentioned. We test DCP-A for ResNet with depth 32, 56, 110 on CIFAR-10 and compare the results with state-of-the-art

Pruning on CIFAR-100
We also provide similar experiments on CIFAR-100 with ResNet-56 and show the results in Table 3. It can be seen that DCP-A can achieve better results than other methods.
Fig. 9 An illustration of learned policy logit distributions for different layers. Compared to Fig. 4, if a threshold is set to be 0.6, each block will prune a proper number of filters

Pruning on semantic segmentation
We test DCP-A for semantic segmentation application on NYU-v2 database. The Deeplab-ResNet [2] with atrous convolution is used as a baseline network. For comparison, we apply uniform pruning with different FLOPs limitation as uniform baselines. DCP-A also outperforms the uniform baselines on semantic segmentation as shown in Table 4.

Pruning on ImageNet
The proposed framework is then tested on ILSVRC-2012 with ResNet-50. ResNet-50 has a standard bottleneck block and we use the same policy for the input and middle layer, and a new policy for the output layer. The results are described in Table 5 and compared with state-of-the-art methods.

Pruning in multi-model
Finally, the proposed framework is tested in the multi-database case (CIFAR-10 and CIFAR-100) and compared with FPGM and MFIS. Note that FPGM is performed on CIFAR-10 and CIFAR-100 separately. MFIS is a multi-task pruning method and the multi-task pruning results are adopted for comparison. As shown in Table 6, DCP-A performs better on CIFAR-100 while MFIS gives better results on CIFAR-10. Specifically, for depth 32, although MFIS shows the best result with an accuracy drop of 0.22% on CIFAR-10, DCP-A improves the performance by 5.19% on CIFAR-100, which is much better than MFIS and FPGM. 'ALL' shows the average accuracy decline on all databases. We can see that the proposed DCP-A achieves the best results on all depths.

Pruned result visualization
Our approach designs the pruned network automatically, and experiments on several databases have demonstrated the effectiveness of DCP-A. Next, we examine the learned pruning results. Details of the pruning results are exhibited in the following to further explore our method.

Attention score guidance
Figure 8 shows the attention score and policy logit in one layer, where the values of both are normalized into [0, 1] to exhibit the variation tendency clearly. We can see that the two lines maintain a similar trend for channels, which can be clearly observed in the dashed boxes. The pruned filters in the same layer will be similar when using the attention score or the policy logit as the pruning criterion. Hence, the decision of pruning filters has been affected by the attention score-guided loss. This means that the attention score acts as a guidance for optimizing the policy logit in DCP-A training. Figure 4 illustrates why we do not directly use the attention score as a pruning criterion. Here, for comparison, an illustration of policy logit distributions has been presented in Fig. 9 with the same layers. In contrast to the attention score, the policy logit can be utilized for pruning the whole network. For example, if we set the threshold to be 0.6, then each layer will prune a proper number of filters under this constraint.

Pruned results with different limitations
In the following, we show that DCP-A does not require repeated optimizing processes of policy logits under different FLOPs constraints. As exhibited in Fig. 10, the top figure shows the accuracies of pruned networks before re-training with different FLOPs constraints. To verify that the pruned network structures can lead to good performances, we retrain 5 pruned networks under FLOPs reduction constraints from 40% to 60%, and show the results in the bottom figure.
It can be seen that all the pruned networks achieve acceptable performances with different limitations.

Pruned network structure
The pruned network structures for ResNet-56 and MobileNet v2 [40] under different FLOPs limitations have been exhibited in Figs. 11 and 12. We can observe that significant peaks exist in the pruned network when there is a down-sampling operation with a stride-2 depth-wise convolution. Such a phenomenon also occurs in MetaPruning [30] when pruning MobileNet, which is mainly because the network tries to make up for the loss of information caused by the resolution degradation in the feature map size. Hence, this indicates that our DCP-A can learn an explainable policy for network architecture.

Skip block
Skipped blocks arise when pruning the network, as shown in Fig. 13. According to the learned pruning policy, all filters in the 3rd block and 7th block will be pruned and only the shortcut will be reserved. This equals skipping the 3rd block and 7th block. Hence, DCP-A can shrink the network structure not only in width but also in depth when pruning a network with shortcuts.

Conclusion
In this paper, a new differentiable channel pruning framework guided via an attention mechanism has been proposed and verified with experiments. The attention mechanism has been adopted as a guidance to provide layer information for policy optimization. The training process is differentiable with Gumbel-softmax sampling, and a two-stage training procedure has been proposed to optimize the network parameters, policy logit and attention modules alternately. A special design has been provided for network blocks with shortcuts, and it has been shown that protecting the shortcut can assist DCP-A in pruning the network not only in width but also in depth. Detailed analysis has been given with pruned model visualization. Limitations also exist in the proposed method. More guidance mechanisms can be considered in addition to attention guidance. Moreover, DCP-A can be extended to multi-task pruning. In the future, we will (1) consider more guidance mechanisms with layer information [32,46,47,49,56,57,66], (2) introduce control strategies to enhance the model robustness [3,27,33,50,51,58], and (3) extend our approach to other complicated multi-task learning problems [1,23,34,59,65,67].