Coresets based asynchronous network slimming

Pruning effectively reduces the parameters of neural networks and accelerates inference, facilitating deep learning in resource-limited scenarios. This paper proposes an asynchronous pruning method for multi-branch networks, built on our previous work on channel coreset construction, to achieve module-level pruning. Firstly, this paper accelerates coreset based pruning by batch sampling with a sampling probability determined by our designed importance function. Secondly, this paper gives asynchronous pruning solutions with an in-place distillation of feature maps for deployment on multi-branch networks such as ResNet and SqueezeNet. Thirdly, this paper provides an extension to neuron pruning by grouping weights as channels. In tests on the sensitivity of different layers to channel pruning, our method outperforms comparison schemes on object detection networks, indicating the advantage of data-independent channel selection in maintaining precision. In tests of the asynchronous pruning solutions on multi-branch classification networks, our method further decreases FLOPs with a small accuracy decline on ResNet and obtains a small accuracy increase on SqueezeNet. In tests on neuron pruning, our method achieves an accuracy comparable to existing coreset based pruning methods under two solutions of precision recovery.


Introduction
Owing to its ability to reduce parameters and accelerate inference, neural network compression is suited to facilitating deep learning in resource-limited scenarios such as mobile systems or embedded devices. Pruning is widely developed in industry and can be combined with other compression methods such as quantization [8] or knowledge distillation [20] to attain more compact networks.
The construction efficiency is an evaluator as important as the accuracy and the inference efficiency, involving the complexity and the running time of pruning. To enhance the construction efficiency, it is feasible for inference-time pruning to work in a one-shot way that compresses multiple layers once and finetunes until the precision is restored, rather than iterating pruning and retraining. For instance, the filter pruning method in [17] discards filters according to the sort of every filter's $\ell_1$-norm and then executes retraining once. Since pruning and the weights reconstruction are performed merely on a batch of data during inference-time pruning, the compressed network without finetuning is prone to over-fit on the dataset utilized in compression. To alleviate such influences and offer a better initialization for finetuning, our method attempts to design a data-independent selection rule for inference-time pruning.
Gang Dong, Yaqian Zhao and Rengang Li contributed equally to this work. Wenfeng Yin: yinwenfeng@inspur.com. 1 State Key Laboratory of High-end Server & Storage Technology, Inspur Electronic Information Industry Co., Ltd., Jinan, 250101, China; 2 Shandong Massive Information Technology Research Institute, Jinan, China.
As elaborated in [13], although the compressed network is comparable with the complete network in average accuracy, its generalization is degraded on some difficult classes or instances of a long-tail distribution. This issue makes pruning a way to expose the weakness of a network's generalization and inspires related research such as the contrastive learning method against the imbalance of learned representations [15]. The coreset theory has been applied in fine-grained pruning [2] with discussions on the generalization bound of compressed networks. The work in [2] constructs coresets to prune weights and neurons of fully-connected neural networks with the generalization bound declared in [1], assuring the performance of the compressed network for arbitrary data drawn from the distribution. The coreset based pruning in [23] offers a guarantee of approximation error which is provably adequate for any future test data, enabling data-independent pruning. This paper proposes a coreset based asynchronous pruning method. Our method makes the following contributions to data-independent module-level pruning and to the construction efficiency of pruning.
1. To increase the construction efficiency of pruning, the coreset is generated by repeated batch sampling without duplication, using a sampling probability determined by our designed importance function.
2. Our method achieves module-level pruning in multi-branch networks by asynchronous channel pruning, which compresses the shape and the number of filters in a specific layer in different rounds by different channel selection rules. Particularly on ResNet50, our method is able to compress the input and output channels of identity blocks without importing extra operations into the forward inference.
3. An in-place distillation of feature maps is designed for pruning output channels of identity blocks on ResNet50.
4. Our method is extended to neuron pruning by treating grouped weights as channels. Tests on fully connected layers of VGG16 show that batch sampling with our designed importance function is as effective as the existing coreset based neuron pruning method.

Related work
Inference-time pruning. To learn compact structures via training, training based pruning adds various sparsity regularizations to the training loss, such as the $\ell_1$-norm based [16,24] importance criteria. In addition, Taylor expansion has been adopted in [22,33] as an importance criterion to minimize the loss change caused by pruning. The sparsity regularization is calculated directly in batch normalization (BN) layers [19] or in convolution layers by group lasso [14,31] to measure the importance of neurons or structures. Liu et al [19] calculates the $\ell_1$-norm based sparsity regularization using channel-wise scaling factors in BN layers. Differently, inference-time pruning discards structures by heuristic selection criteria such as the reconstruction-based rule [4] during forward inferences, and then restores the accuracy by updating weights through feature map reconstruction or finetuning [25,28]. FPGM in [10] iterates between pruning the whole network and finetuning the compressed network to retain accuracy. FPGM prunes the most replaceable filters, namely those nearest to the geometric median of the filters in a layer. FilterSketch in [18] works in a one-shot way that prunes the network once and finetunes until the accuracy is recovered. FilterSketch preserves filters which retain sufficient covariance information of the filters in the pre-trained network, by solving the matrix sketch problem. Compensating the accuracy loss by feature map reconstruction reduces the finetuning times and thus lightens the computation burden of finetuning pruned networks. Gou et al [11] updates weights of compressed layers by solving feature map reconstruction with linear least squares, requiring only a few rounds of finetuning for accuracy recovery. Nevertheless, extra computations are introduced into the inference for pruning input channels of one identity block on ResNet in [11].
Our method makes the first effort to prune input channels and output channels of downsampling blocks and identity blocks on ResNet, without importing extra computations or operations into the inference in situations of inference-time pruning.
Coresets based pruning. The application of coreset theory has been extended from computational geometry to diverse machine learning algorithms including k-means clustering and neural network pruning. Dubey et al [5] discards filters through activation-based pruning and employs coreset theory to construct an efficient filter coreset representation. Braverman et al [3] suggests a streaming algorithm to construct coresets for metric k-means clustering, based on the generic framework for coreset construction in [6]. This framework converts a coreset to an ε-approximation which is calculated by non-uniform sampling. In this framework, the importance of each point determines the sampling distribution and the coreset size relies on the sum of the points' importance. Braverman et al [3] further reduces the coreset size to be near-linear in the total importance.
On the basis of the off-line and streaming coreset constructions in [3], data-driven and data-independent pruning methods [2,23] arise. Baykal et al [2] constructs data-driven coresets of weights, while [23] generates a data-independent coreset of neurons. The method in [2] separately builds coresets for positive and negative weights and is extendable to prune neurons. Although the above coreset based methods achieve different scales of sparsity such as neurons, weights or filters, they have not applied the coreset theory directly in structural pruning. In contrast, our method designs an asynchronous pruning solution that constructs the coreset of channels. The pruning method in [23] computes one neuron's importance by the upper bound of activation values, assuring that the total importance is finite. Our method modifies the function of sample importance, making the bound of the total importance theoretically controllable by the compression ratio [32]. The coreset in [23] is a multiset generated by iteratively sampling a single sample. Our method constructs the coreset of channels without duplication by batch sampling, so as to meet the compression ratio and meanwhile improve the construction efficiency.
Knowledge distillation. Knowledge distillation is the procedure of transferring knowledge from a complete network to the compressed network. Distillation algorithms apply response-based knowledge, feature-based knowledge [27] or relation-based knowledge [29] under the teacher-student architecture [7] to facilitate training the compressed network. Knowledge distillation is deployed in such ways as offline distillation, online distillation and self-distillation. Besides model distillation, knowledge distillation has been expanded to data distillation [26] and dataset distillation [30]. Model distillation in [12] builds an ensemble of models as the teacher model. Instead, data distillation [26] ensembles the predictions of one pretrained network on multiple transformations of unlabeled data as automatic annotations to train a student network. Dataset distillation refines knowledge from a large scale dataset into several distilled images whose effectiveness is comparable to that of the dataset on training a network.
Inspired by dataset distillation, our method attempts to distill feature maps to recover the precision of compressed networks. Specifically, feature maps are distilled with channel reduction, forcing the compressed network to approximate a part of the feature maps of the complete network. In an in-place way, [34] trains a series of sub-networks with pseudo labels produced by the complete network in every iteration, requiring no extra inputs or operations. Similarly, our method distills feature maps by reconstructing the outputs of compressed layers according to parts of the complete layer's outputs with linear least squares.

Coreset based channel selection rule
Input data $x$ with $N$ channels is imported to a layer with weights $\theta = (W_{:,1,:,:}, \ldots, W_{:,N,:,:})$, where $W_{:,i,:,:}$ denotes the weights of the $i$th input channel and is abbreviated as $W_i$, and $*$ denotes convolution operations. To achieve data-independent pruning, our method attempts to construct an $(\varepsilon, \delta)$-coreset $\tilde{\theta}$ for the weights set $\theta$ based on a user-specified additive $\varepsilon$-error and probability $\delta$, with $\varepsilon, \delta \in (0, 1)$. According to the derivations in [23], it is feasible to construct an $(\varepsilon, \delta)$-coreset with additive $\varepsilon$-error in situations where activation functions are adopted as the loss function of a query space.
Our method constructs the $(\varepsilon, \delta)$-coreset of input channels by non-uniform sampling whose probability distribution is related to the importance of one channel's weights $W_i$ [3]. The importance function is marked as $s(\cdot)$ and $s_i$ is short for one channel's importance $s(W_i)$. Within a layer, the importance of every channel is accumulated as the total importance $t$, $t = \sum_{W_i \in \theta} s(W_i)$. For the $i$th input channel, its sampling probability $p_i$ is defined as the ratio of its importance to the total importance, $p_i = s_i / t$. This way of coreset construction holds on the condition that the channel importance $s_i$ and the total importance $t$ are bounded.
As in [32], our method designs a bounded channel importance function to acquire a controllable total importance. For the $i$th channel, the designed channel importance is the product of the primary channel importance $g_i$ and the assigned fraction $v_i$, i.e. $s_i = g_i \cdot v_i$. The primary importance $g_i$ is set to 1 for every input channel in convolutional neural networks except MobileNet. The fraction allocated for non-uniform sampling in the $l$th layer is formed as in (1), where $N_l$ is the number of input channels in the $l$th layer, and $a_l$ is the number of preserved channels determined by $N_l$ and the user-specified compression ratio. The compression ratio is predefined before pruning to satisfy restrictions on FLOPs or inference delays. The fraction $v_i$ is assigned to a channel according to the descending sort of the $\ell_1$-norm of every input channel's weights $W_i$. The first $a_l$ channels with large norms are offered large fractions $v_i$. Since neither the assignment of $v_i$ nor the calculation of $g_i$ relates to input data, our method constructs a data-independent coreset of input channels. The total importance $t$ is then controlled by the user-specified compression ratio within a bounded range, where the equality is valid when $a_l = N_l - 1$.
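As an illustration of the rule above, the following minimal sketch computes the sampling probabilities $p_i = s_i / t$ from $s_i = g_i \cdot v_i$ with $g_i = 1$. The function name and the two fraction values `high` and `low` are hypothetical placeholders: they only mimic the rule that the first $a_l$ channels by $\ell_1$-norm receive larger fractions, and they are not the fraction formula of this paper.

```python
import numpy as np

def channel_sampling_probabilities(weights, keep, high=1.0, low=0.1):
    """Return p_i = s_i / t for the input channels of a convolution layer.

    weights: array of shape (out_channels, in_channels, kh, kw).
    keep:    number of preserved channels a_l.
    high/low: placeholder fractions v_i (assumption, not the paper's formula).
    """
    n_in = weights.shape[1]
    # l1-norm of every input channel's weights W_i = weights[:, i, :, :]
    l1 = np.abs(weights).sum(axis=(0, 2, 3))
    order = np.argsort(-l1)            # channels in descending order of l1-norm
    v = np.full(n_in, low)
    v[order[:keep]] = high             # first a_l channels get the larger fraction
    g = np.ones(n_in)                  # primary importance, 1 for every channel
    s = g * v                          # channel importance s_i = g_i * v_i
    t = s.sum()                        # total importance
    return s / t                       # data-independent sampling probabilities
```

Because only the weights enter the computation, the resulting distribution does not depend on any input batch.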

Asynchronous pruning for multi-branch structures
To further increase the compression ratio of parameters or FLOPs, our method provides specific solutions of asynchronous channel pruning for multi-branch networks. Our method constructs compression units which are connected to different modules or stay within one module. It takes two rounds of channel pruning to process the compression units belonging to a module, with a different channel selection rule in each round. Therefore our compression method is called asynchronous channel pruning. For the first round, our method designs a layer-wise channel pruning flow via coreset constructions in Fig. 1.

I. Acceleration of channel pruning within a layer
As in Fig. 1, our method sequentially discards channels according to the constructed coreset $C_\theta$ and recovers precision by optimizing the feature map reconstruction. The channel selection procedure and the feature map reconstruction are performed by the following steps.
Step 1. The channel selection procedure constructs the channel coreset by simultaneously sampling a batch of channels rather than a single channel to accelerate pruning. The channel coreset is constructed for $R$ rounds and then a merged coreset $C_\theta$ is generated according to the histogram which counts the times each channel is selected. The merged channel coreset consists of the channels with high selected frequency in the histogram.
Step 2. Channels outside the merged coreset $C_\theta$ are discarded from the input channels $C^l_{in}$ of the $l$th layer, generating compressed weights $\tilde{\theta}$. Meanwhile the output channels $C^{l-1}_{out}$ of the $(l-1)$th layer are pruned in accordance with $C_\theta$.
Step 3. Compressed weights $\tilde{\theta}$ of the merged channel coreset are updated by solving the optimization with linear least squares.
Step 1 is equivalent to constructing a channel subset whose size is within $R$ times the amount required by the compression ratio. The importance of a channel in this subset is an accumulation of the re-scalings of its original importance over every sampling [3], where $K$ records the number of times a channel is sampled. In our situation, given that $p_i = s_i / t$ and $s_i = v_i$, the updated importance is positively correlated only with the sampled frequency $K$. To simplify the calculation, our method treats the sampled frequency $K$ of a channel as the selection criterion to extract channels from the channel subset into the coreset.
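The three steps and the frequency-based selection can be sketched as follows. This is a minimal illustration under stated assumptions rather than our released implementation: the layer is treated as a linear map (for example a 1×1 convolution flattened over spatial positions), and the function names `merged_channel_coreset` and `reconstruct_weights` are hypothetical.

```python
import numpy as np

def merged_channel_coreset(p, keep, rounds, rng):
    """Steps 1-2: batch-sample `keep` channels per round without duplication,
    count selections in a histogram and keep the most frequent channels."""
    counts = np.zeros(len(p), dtype=int)
    for _ in range(rounds):
        # one batch of distinct channels drawn with probabilities p
        batch = rng.choice(len(p), size=keep, replace=False, p=p)
        counts[batch] += 1
    # sampled frequency K is the selection criterion; stable sort breaks ties by index
    order = np.argsort(-counts, kind="stable")
    return np.sort(order[:keep]), counts

def reconstruct_weights(x, w, kept):
    """Step 3: update the weights of the kept input channels by linear least squares.

    x: (samples, n_in) layer inputs, w: (n_out, n_in) original weights.
    Returns weights of shape (n_out, len(kept)).
    """
    y = x @ w.T                                        # feature maps of the complete layer
    solution, *_ = np.linalg.lstsq(x[:, kept], y, rcond=None)
    return solution.T
```

Since the least-squares solution minimizes the reconstruction error over all weights on the kept channels, its error on the batch is never larger than that of simply slicing the original weights.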

II. Module-level pruning and in-place distillation
Asynchronous solution for SqueezeNet. For module-level pruning on SqueezeNet, in addition to the compression unit across modules, our method constructs compression units within modules, which comprise the output channels of squeeze layers and the input channels of expand layers, as shown in Fig. 2. For fire module $i$, its compression unit across modules is processed by the layer-wise channel pruning stated above in the first pruning round, and its compression unit within the module is compressed by the random channel pruning method in the second pruning round. This asynchronous channel pruning flow is successively executed on every fire module. In this way, compression is executed on both the shape and the number of filters in expand layers.

Asynchronous flow of ResNet
Our method designs different asynchronous pruning flows for downsampling blocks and identity blocks on ResNet. As depicted in Fig. 3, compression units are constructed between layers on residual branches or layers on downsampling branches. The compression units in green, which consist of bottleneck layers in downsampling blocks or identity blocks, are pruned according to the proposed channel pruning flow in the first pruning round. The compression units in blue, which respectively contain downsampling layers, first bottleneck layers and last bottleneck layers, are pruned by the random channel pruning method in the second pruning round. Downsampling blocks and identity blocks differ in whether the second pruning round of a specific block is executed immediately. For downsampling blocks, the second pruning round is performed right after the first round. For $N$ stacked identity blocks, the second pruning round of the first identity block is not carried out until the first pruning round of the $N$th identity block is finished. In other words, all the first pruning rounds of the stacked blocks are taken as the first step, and all the second pruning rounds are completed as the second step.
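The ordering of the two rounds can be made concrete with a small scheduling sketch. The function name, the block labels and the way a stack of identity blocks is delimited (it ends at the next downsampling block or at the end of the list) are assumptions made for illustration only.

```python
def asynchronous_schedule(blocks):
    """Return the order of (block, round) pruning operations.

    blocks: list of (name, kind) with kind in {"downsample", "identity"}.
    Downsampling blocks run round 2 right after round 1; stacked identity
    blocks run all their round-1 operations before any round-2 operation.
    """
    schedule = []
    pending = []  # identity blocks whose second round is still deferred
    for name, kind in blocks:
        if kind == "downsample":
            # a new downsampling block closes the previous identity stack
            schedule.extend((n, 2) for n in pending)
            pending.clear()
            schedule.append((name, 1))
            schedule.append((name, 2))
        else:
            schedule.append((name, 1))
            pending.append(name)
    schedule.extend((n, 2) for n in pending)
    return schedule
```

For one downsampling block followed by two identity blocks, the sketch yields both rounds of the downsampling block first, then the first rounds of the two identity blocks, and finally their second rounds.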

Feature map distillation
During the second pruning round of $N$ stacked identity blocks in ResNet, the feature maps of the last bottleneck layer in every identity block are distilled in an in-place way. One channel set $C_\theta$, generated by random sampling while pruning the input channels of the first identity block, is shared with the remaining blocks for pruning both the input channels of the first bottleneck layer and the output channels of the last bottleneck layer. According to the channel set $C_\theta$, the feature maps of the selected channels are chosen as the knowledge to be transferred from the last bottleneck layer with complete output channels to the compressed layer. After the output channels are compressed, the weights of the pruned last bottleneck layer are updated by minimizing the reconstruction loss with respect to the selected channels' feature maps. Then the feature maps of the last bottleneck layer are re-obtained from the same inputs with the updated weights, which achieves the distillation of feature maps from complete channels to compressed channels.
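A minimal sketch of this in-place distillation is given below, assuming the last bottleneck layer acts as a linear map over flattened spatial positions (it is a 1×1 convolution in ResNet50). The function name and the argument names are hypothetical; `x_pruned` stands for the inputs of the layer after its input channels have been compressed.

```python
import numpy as np

def distill_feature_maps(x_full, w_full, x_pruned, channel_set):
    """Fit the pruned layer to the selected channels of the complete layer.

    x_full:   (samples, n_in) inputs of the complete layer.
    w_full:   (n_out, n_in) weights of the complete layer.
    x_pruned: (samples, n_in_pruned) inputs of the compressed layer.
    channel_set: indices of the preserved output channels.
    """
    teacher = x_full @ w_full.T           # feature maps with complete output channels
    target = teacher[:, channel_set]      # knowledge: feature maps of selected channels
    # minimize the reconstruction loss ||x_pruned @ W^T - target|| by least squares
    solution, *_ = np.linalg.lstsq(x_pruned, target, rcond=None)
    w_student = solution.T                # (len(channel_set), n_in_pruned)
    # feature maps re-obtained from the same inputs with the updated weights
    return w_student, x_pruned @ w_student.T
```

No extra layer or operation is attached to the network; only the weights of the compressed layer change.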

Neuron pruning via coresets
Our coreset based pruning method is extendable to compressing fully connected layers. Weights linked to a neuron in the $l$th layer, $W^l_{i,:}$, are treated as one output channel of the $l$th layer, and weights connected to a neuron in the $(l-1)$th layer, $W^l_{:,i}$, are regarded as one input channel of the $l$th layer. By grouping weights as channels, the formulas of random probability sampling are usable for constructing coresets of neurons. Specifically, the fraction $v_i$ is computed from the number of neurons $N_l$ and the required number of preserved neurons $a_l$ in a fully connected layer, as in (1). The fraction $v_i$ is allocated in accordance with the sort of the $\ell_1$-norm of the weights $W^l_{:,i}$ that belong to the $i$th input channel. The primary importance $g_i$ is adjusted to the proportion of the Frobenius norm of one input channel's outputs, $h_i$, in the average norm of every input channel's outputs, $\bar{h}$. Given a batch of input data $x$, $h_i$ is computed as $h_i = \sum_{x_j \in x} \| f_\theta(W_{:,i} * x_{ij}) \|_F^2$ and $\bar{h}$ is calculated as $\bar{h} = \frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{x_j \in x} \| f_\theta(W_{:,i} * x_{ij}) \|_F^2$, where $x_{ij}$ is the $j$th output data of the $i$th neuron in the $(l-1)$th layer. Setting the primary importance $g_i$ as $h_i / \bar{h}$, the neuron coreset is constructed by sampling with the probability $\tilde{p}_i$. The precision of the compressed layer is recovered either by weights reconstruction with the proposed feature map distillation, as in channel pruning, or by weights renewal, as in the existing method [23]. The method in [23] updates compressed weights as a re-scaling of their original values. Following this weights renewal rule, new weights are obtained as $\tilde{W}_{:,i} = \frac{K}{m p_i} W_{:,i}$, where $K$ counts the times an input channel is sampled and $m$ is the number of samplings with duplication. Since our method extracts channels by batch sampling without duplication, our method simplifies the calculation as $\tilde{W}_{:,i} = \frac{K}{g_i} W_{:,i}$.
With this extension to neuron pruning, our pruning method is able to compress both convolution layers and fully connected layers in a network, further increasing the compression rate of parameters and FLOPs.
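The data-dependent primary importance and the weights renewal rule of [23] can be sketched as below. The sketch assumes that $f_\theta$ is a ReLU activation and that a fully connected layer stores its weights with shape (outputs, inputs); both function names are hypothetical and the code only illustrates the formulas, it is not our released implementation.

```python
import numpy as np

def primary_importance(x, w):
    """g_i = h_i / h_bar for the input channels of a fully connected layer.

    x: (batch, n_in) outputs of the previous layer, w: (n_out, n_in) weights.
    """
    # contrib[j, i, :] = relu(w[:, i] * x[j, i]), output of input channel i on sample j
    contrib = np.maximum(x[:, :, None] * w.T[None, :, :], 0.0)
    h = (contrib ** 2).sum(axis=(0, 2))   # squared Frobenius norm summed over the batch
    return h / h.mean()                   # proportion in the average norm

def renew_weights(w, counts, p, m):
    """Weights renewal of [23]: scale each sampled input channel by K / (m * p_i).

    counts: times K each input channel was sampled, p: sampling probabilities,
    m: number of samplings with duplication. Returns (new weights, kept indices).
    """
    kept = np.flatnonzero(counts)
    scale = counts[kept] / (m * p[kept])
    return w[:, kept] * scale[None, :], kept
```

Channels that were never sampled are dropped, and the re-scaling compensates for the probability with which each surviving channel was drawn.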

Experiments
Our method is further developed on a PyTorch project 1 and experimentally evaluated in this section with the following settings.
• Datasets: Tests are executed on three prevalent object detection networks on the COCO 2017 dataset, four popular classification networks on ImageNet 2012 and two common classification networks on CIFAR-10. The validation dataset for classification is generated by extracting 10 images per category from the entire ImageNet 2012 training dataset. In total, 10000 images corresponding to one thousand categories are randomly extracted into our validation dataset, which is employed for compressing original networks and testing pruned networks.
• Comparison schemes: Our method is compared with three existing methods: ThiNet [21], the Lasso regression based pruning method [11] and the random channel pruning method. The four schemes are respectively marked as Coreset, ThiNet, Lasso and Random. Owing to its simplicity and effectiveness, random channel pruning [21] is reproduced as the reference for the performance of our designed random probability sampling during coreset constructions. All comparison schemes are implemented for channel selection and rely on weight reconstruction to restore precision.

Channel pruning in classification networks
I. Sensitivity of different layers to channel pruning
Given the different sensitivity of every layer to pruning, our method is employed to compress a single convolution layer or block at a time. The accuracy of compressing a layer or block is measured on VGG16, ResNet50, MobileNet-v2 and SqueezeNet1_0. Pruning is performed on squeeze layers of SqueezeNet1_0 and on middle layers of blocks in ResNet50 and MobileNet-v2. The batch size for layer compression is as small as 4 pictures, to test the dependency on the data applied in the channel selection and the weights reconstruction. Table 1 lists results on a subset randomly drawn from the training dataset of ImageNet 2012 in each test. Accuracy declines across the training dataset and the validation dataset are given in the brackets. It is observed that our method outperforms comparison schemes on VGG16 and MobileNet-v2. On ResNet50, our method achieves an accuracy second to comparison schemes with a small variance across datasets. The average delay of our method in pruning one layer or block is tested to measure the construction efficiency. Figure 4 illustrates results obtained by averaging the time delays of every layer while pruning the whole network. Our method costs less time on pruning than ThiNet and the Lasso regression based method, and its delay is similar to that of random channel pruning. This indicates that our method improves the construction efficiency of pruning.

II. Accuracy of pruning multiple layers
The accuracy of pruning multiple layers without finetuning is measured to test the feasibility of retraining compressed networks after multiple pruning iterations rather than after every pruning iteration. Figure 5 illustrates the top 1 accuracy under different compression ratios, which restrict the number of pruned layers. Channel pruning is executed on the first 5 convolution layers of VGG16, middle layers of the first 5 blocks in ResNet50, middle layers of the first 4 blocks in MobileNet-v2, and squeeze layers of the first 6 fire modules in SqueezeNet1_0. Our method achieves the smallest accuracy decline among all tested methods on ResNet and MobileNet-v2 under different compression ratios. From our observation, as the number of compressed layers increases, the discrepancy between our method and comparison schemes becomes large. By reducing the accuracy decline of pruning multiple layers, our method decreases the finetuning times needed to recover precision, accelerating the pruning of the entire network.

III. Finetuned networks with channel pruning
The top 1 accuracy of finetuned compressed networks is tested to evaluate the influence of pruning on the original network. Whole networks are pruned by our channel pruning method and then finetuned with 89 iterations. As listed in Table 2, the accuracy of the compressed SqueezeNet is increased after pruning. Even without compression, there is an accuracy decline between the original network and the individual network of required widths, whose channel numbers are consistent with the compression ratios. As in [34], individual networks such as MobileNet-v2 and ResNet50 with 0.5× widths respectively have an accuracy of 65.4% and 72%. Our pruning method achieves acceptable accuracy declines compared to original networks.

Channel pruning in object detection networks
Our pruning method is applied to compress the backbones of object detection networks including Yolo-v3, Retinanet and MaskRCNN. The average precision of pruning the middle layers of an identity block in the backbone, denoted as AP, is separately measured block by block. Figure 6 depicts the decline of AP at an Intersection-Over-Union (IoU) of 0.5 under different compression ratios. Since deep layers are specific to the task and sensitive to pruning, all tested methods cause large precision declines at deep layers. The random channel pruning method is able to obtain small declines, indicating the effectiveness of feature map reconstruction in recovering precision. The AP difference between one pruning method and random channel pruning reflects the loss generated by channel selection rules. Although the AP of our method is smaller than that of the random pruning method at some deep layers, our method acquires an AP greater than those of the other two comparison schemes. Table 3 lists the measured declines in the indicators AP and AR caused by pruning the middle layers of the first block at a compression ratio of 0.5 without finetuning. Our channel

Asynchronous pruning in multi-branch structures
Asynchronous pruning solutions on ResNet50 and SqueezeNet1_0 are evaluated under different compression ratios without finetuning. Figure 7 illustrates the top 1 accuracy of compressed networks whose first N blocks or fire modules (1 ≤ N ≤ 5) are pruned. The dashed line denotes the original network's accuracy. Compared to random channel pruning, our asynchronous pruning is effective in maintaining precision as the compression ratio varies. Then asynchronous pruning solutions on ResNet50 are tested with and without the feature map distillation, to evaluate the influence of feature map distillation on recovering precision. Figure 8 depicts the accuracy variation of a specific block (layer3.0) measured 12 times on the validation dataset and the training dataset. The size of each circle is determined by the value of the accuracy variation across datasets. The abbreviation coreset_ab denotes results without feature map distillation. The average accuracy increment with feature map distillation is greater than that without feature map distillation. Besides, pruned networks with different compression ratios are finetuned for 89 iterations. As listed in Table 4, the top 1 accuracy of the compressed SqueezeNet is increased by 1.52% at a compression ratio of 0.25. The comparison between Tables 2 and 4 shows that the proposed asynchronous pruning method enhances the FLOPs compression rate with small accuracy declines. Especially for ResNet50, the FLOPs are further decreased by 0.661G with an accuracy decline of 2.37% at the compression ratio 0.5. As in [34], the accuracy decline of individual networks with 0.5× widths is more than 4% compared to the original ResNet network. The proposed asynchronous pruning method achieves a smaller accuracy decline with FLOPs similar to those of individual networks. The asynchronous pruning flow is portable to identity blocks consisting of two convolution layers in ResNet20 or ResNet56.
Table 5 lists finetuned results after channel pruning by our method and comparison schemes on the CIFAR-10 dataset. Results of comparison schemes [9,17,31] are obtained by Distiller 2. The scheme CRank performs channel ranking and pruning as described in [17]. The indicator sparsity is calculated as the ratio of the number of pruned parameters to the number of total parameters. Our method sets the channel compression ratio as 0.5, and achieves a sparsity greater than those of comparison schemes.

Accuracy of neuron pruning
The extension of our method to neuron pruning is tested on fully connected layers of VGG16. Table 6 lists the delay and top 1 accuracy of compressing the second fully connected layer, denoted as VGG-fc2. The network is pruned on the validation dataset and tested using a random subset of the training dataset of ImageNet 2012, generating the accuracy decline across datasets in brackets. In tests recovering precision by the weights renewal rules of coreset theory, two neuron pruning methods in [23] and [2] are reproduced, respectively marked as Neural and CoreNet+. Our method outperforms CoreNet+ and obtains an accuracy similar to that of Neural. In tests compensating accuracy declines by weights reconstruction, our method achieves a greater top 1 accuracy than ThiNet and saves more time during pruning than ThiNet. In either way of precision recovery, our method performs as effectively as Neural, indicating that the modified importance function in our method is adequate for neuron pruning.
2 Distiller is available at https://github.com/IntelLabs/distiller

Conclusion
This paper proposed an asynchronous pruning method based on the application of coreset theory in neural network compression. Our method adopts a channel selection rule which adapts the probability calculation of random probability sampling in coreset constructions using our devised channel importance function. This modification makes the total importance controllable by the specified compression ratio while constructing an ε-coreset. The designed channel selection rule is implemented by batch sampling, improving the construction efficiency of pruning. Our asynchronous pruning solutions for multi-branch structures achieve module-level pruning on ResNet without importing extra operations into forward inference. Our method also applies an in-place feature map distillation to recover precision after pruning the input and output channels of an identity block. Solutions for deploying our method on more kinds of multi-branch networks are included in our future experiments.

Declarations
Competing interests The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.