1 Introduction

Because it reduces parameters and accelerates inference, neural network compression facilitates deep learning in resource-limited scenarios such as mobile systems or embedded devices. Pruning is widely used in industry and can be combined with other compression methods such as quantization [8] or knowledge distillation [20] to obtain even more compact networks.

Construction efficiency, which covers the complexity and running time of pruning, is as important an evaluation criterion as accuracy and inference efficiency. To improve construction efficiency, inference-time pruning can work in a one-shot way that compresses multiple layers at once and finetunes until the accuracy is restored, rather than iterating between pruning and retraining. For instance, the filter pruning method in [17] discards filters according to the ranking of each filter's 1-norm and then retrains once. Since pruning and weight reconstruction are performed on only a batch of data during inference-time pruning, the compressed network without fine-tuning is prone to over-fit the data used for compression. To alleviate this effect and provide a better initialization for finetuning, our method designs a data-independent selection rule for inference-time pruning.

As elaborated in [13], although a compressed network is comparable to the complete network in average accuracy, its generalization degrades on difficult classes or on instances in the long tail of the distribution. This issue turns pruning into a way to expose weaknesses in a network's generalization and inspires related research such as contrastive learning against imbalance of learned representations [15]. Coreset theory has been applied to fine-grained pruning [2] together with discussions of the generalization bound of compressed networks. The work in [2] constructs coresets to prune weights and neurons of fully-connected neural networks with the generalization bound stated in [1], assuring the performance of the compressed network for arbitrary data drawn from the distribution. The coreset based pruning in [23] offers a guarantee on the approximation error that provably holds for any future test data, enabling data-independent pruning.

This paper proposes a coreset based asynchronous pruning method. Our method makes the following contributions toward data-independent module-level pruning and improved construction efficiency of pruning.

  1.

    To increase the construction efficiency of pruning, the coreset is generated by batch sampling without duplication over multiple rounds, using a sampling probability determined by our designed importance function.

  2.

    Our method achieves module-level pruning in multi-branch networks by asynchronous channel pruning, which compresses the shape and the number of filters of a specific layer in different rounds using different channel selection rules. In particular, on ResNet50 our method is able to compress the input and output channels of identity blocks without introducing extra operations into the forward inference.

  3.

    An in-place distillation of feature maps is designed for pruning the output channels of identity blocks on ResNet50.

  4.

    Our method is extended to neuron pruning by treating grouped weights as channels. Tests on fully connected layers of VGG16 show that batch sampling with our designed importance function is as effective as the existing coreset based neuron pruning method.

2 Related work

Inference-time pruning

To learn compact structures during training, training based pruning adds various sparsity regularizers to the training loss, such as 1-norm based importance criteria [16, 24]. In addition, Taylor expansion has been adopted in [22, 33] as an importance criterion to minimize the loss change caused by pruning. The sparsity regularization is calculated directly in batch normalization (BN) layers [19] or in convolution layers via group lasso [14, 31] to measure the importance of neurons or structures. Liu et al. [19] calculate the 1-norm based sparsity regularization using channel-wise scaling factors in BN layers.
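As a concrete illustration of this kind of regularizer (a minimal sketch based on the description above, not code from [19]; the penalty coefficient is illustrative), the penalty can be computed as an L1 term over the scaling factors of every BN layer and added to the training loss:

```python
# Minimal sketch of a channel-wise sparsity regularizer on BN scaling factors,
# assuming a standard PyTorch model with nn.BatchNorm2d layers.
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, coeff: float = 1e-4):
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # gamma of each channel
    return coeff * penalty  # added to the task loss during training
```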

In contrast, inference-time pruning discards structures during forward inference using heuristic selection criteria such as the reconstruction-based rule [4], and then restores accuracy by updating weights through feature map reconstruction or finetuning [25, 28]. FPGM [10] iterates between pruning the whole network and finetuning the compressed network to retain accuracy; it prunes the filters whose contribution is most replaceable, namely those nearest to the geometric median of the filters in a layer. FilterSketch [18] works in a one-shot way that prunes the network once and finetunes until the accuracy is recovered; it preserves the filters that retain sufficient covariance information of the filters in the pre-trained network by solving a matrix sketch problem. Compensating the accuracy loss by feature map reconstruction reduces the number of finetuning passes and thus lightens the computational burden of finetuning pruned networks. Gou et al. [11] update the weights of compressed layers by solving the feature map reconstruction with linear least squares, requiring only a few finetuning passes for accuracy recovery. Nevertheless, pruning the input channels of an identity block on ResNet introduces extra computations into the inference in [11]. Our method makes the first effort to prune the input and output channels of downsampling blocks and identity blocks on ResNet without introducing extra computations or operations into the inference under the inference-time pruning setting.

Coresets based pruning

The application of coreset theory has been extended from computational geometry to diverse machine learning algorithms, including k-means clustering and neural network pruning. Dubey et al. [5] discard filters through activation-based pruning and employ coreset theory to construct an efficient filter coreset representation. Braverman et al. [3] suggest a streaming algorithm to construct coresets for metric k-means clustering, based on the generic framework for coreset construction in [6]. This framework reduces coreset construction to computing an ε-approximation via non-uniform sampling: the importance of each point determines the sampling distribution, and the coreset size depends on the sum of the points' importance. Braverman et al. [3] further reduce the coreset size to be near-linear in the total importance.

On the basis of the off-line and streaming coreset constructions in [3], data-driven and data-independent pruning methods [2, 23] arise. Baykal et al. [2] construct data-driven coresets of weights, while [23] generates a data-independent coreset of neurons. The method in [2] separately builds coresets for positive and negative weights and is extendable to neuron pruning. Although the above coreset based methods achieve different scales of sparsity, such as neurons, weights or filters, they have not applied coreset theory directly to structural pruning, whereas our method designs an asynchronous pruning solution that constructs a coreset of channels. The pruning method in [23] computes a neuron's importance from the upper bound of its activation values, which ensures that the total importance is finite. Our method modifies the importance function so that the bound on the total importance is theoretically controllable by the compression ratio [32]. The coreset in [23] is a multiset generated by iteratively sampling a single element; our method constructs the coreset of channels without duplication by batch sampling, so as to meet the compression ratio while improving the construction efficiency.

Knowledge distillation

Knowledge distillation describes the procedure of transferring knowledge from a complete network to the compressed network. Distillation algorithms apply response-based knowledge, feature-based knowledge [27] or relation-based knowledge [29] under the teacher-student architecture [7] to facilitate training the compressed network. Knowledge distillation is deployed as offline distillation, online distillation or self-distillation. Beyond model distillation, it has also been expanded to data distillation [26] and dataset distillation [30]. Model distillation in [12] builds an ensemble of models as the teacher model. Data distillation [26] instead ensembles the predictions of one pre-trained network on multiple transformations of unlabeled data as automatic annotations to train a student network. Dataset distillation refines the knowledge of a large-scale dataset into a few distilled images whose effectiveness for training a network is comparable to that of the full dataset.

Inspired by dataset distillation, our method attempts to distill feature maps to recover the precision of compressed networks. Specifically, feature maps are distilled with channel reduction, forcing the compressed network to approximate a part of the feature maps of the complete network. In an in-place way, [34] trains a series of sub-networks with pseudo labels produced by the complete network in every iteration, requiring no extra inputs or operations. Similarly, our method distills feature maps by reconstructing the outputs of compressed layers against parts of the complete layer's outputs with linear least squares.

3 Method

3.1 Coreset based channel selection rule

Input data x with N channels is fed into a layer with weights 𝜃 = (W:,1,:,:,…,W:,N,:,:), where W:,i,:,: denotes the weights of the i th input channel and is abbreviated as Wi. The feature map of this layer is written as \(Y = {\sum }_{i = 1}^{N} {{y_{i}}}\), where yi is calculated as yi = f𝜃(Wi ∗ x) and ∗ denotes the convolution operation. To achieve data-independent pruning, our method attempts to construct an (ε,δ)-coreset \(\hat \theta \) for the weight set 𝜃 given a user-specified additive ε-error and probability δ, ε,δ ∈ (0,1). According to the derivations in [23], it is feasible to construct an (ε,δ)-coreset with additive ε-error when activation functions are adopted as the loss function of a query space.
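For concreteness, one common way to write such an additive-error guarantee (a paraphrase in our notation, stated here as an assumption rather than a verbatim result from [23]) is

$$ \Pr\Bigl[\, \bigl|\, f_{\theta}(x) - f_{\hat\theta}(x) \,\bigr| \le \varepsilon \ \text{ for every query } x \,\Bigr] \ \ge\ 1 - \delta, $$

where \(f_{\hat\theta}\) denotes the layer evaluated on the weighted coreset \(\hat\theta\).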

Our method constructs the (ε,δ)-coreset of input channels by non-uniform sampling whose probability distribution depends on the importance of each channel's weights Wi [3]. The importance function is denoted s(⋅) and si is short for one channel's importance s(Wi). Within a layer, the importances of all channels are accumulated into the total importance t, \(t = {\sum }_{{W_{i}} \in \theta } {s({W_{i}})} \). For the i th input channel, its sampling probability pi is defined as the ratio of its importance to the total importance, si/t. This way of constructing the coreset holds on the condition that the channel importance si and the total importance t are bounded.

As in [32], our method designs a bounded channel importance function to obtain a controllable total importance. For the i th channel, the designed channel importance is the product of the primary channel importance gi and the assigned fraction vi, i.e. si = givi. The primary importance gi is set to 1 for every input channel in convolutional neural networks except MobileNet. The fraction allocated for non-uniform sampling in the l th layer is

$$ v_{i} = \begin{cases} 1/(a_{l} + 1), & i \le a_{l} \\ 1/\bigl((a_{l} + 1)(N_{l} - a_{l})\bigr), & a_{l} < i \le N_{l}, \end{cases} $$
(1)

where Nl is the number of input channels in the l th layer and al is the number of preserved channels determined by Nl and the user-specified compression ratio. The compression ratio is predefined before pruning to satisfy restrictions on FLOPs or inference delay. The fraction vi is assigned to a channel according to the descending order of the 1-norm of each input channel's weights Wi: the first al channels, which have the largest norms, receive the larger fraction. Since neither the assignment of vi nor the calculation of gi depends on input data, our method constructs a data-independent coreset of input channels. The total importance t is then bounded in terms of the user-specified compression ratio, \(\frac {{{N_{l}}}}{{({a_{l}} + 1)({N_{l}} - {a_{l}})}} \le t \le \frac {{{N_{l}}}}{{{a_{l}} + 1}}\), where the two bounds coincide when al = Nl − 1.
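A minimal PyTorch sketch of this sampling distribution is given below (illustrative code, not the released implementation; the rounding of al from the compression ratio is an assumption).

```python
# Sketch of Eq. (1): per-input-channel sampling probabilities p_i = s_i / t,
# assuming a standard Conv2d weight of shape (C_out, C_in, k, k).
import torch

def channel_sampling_probs(conv_weight: torch.Tensor, compression_ratio: float):
    n_l = conv_weight.shape[1]                                   # number of input channels N_l
    a_l = min(n_l - 1, max(1, round(n_l * compression_ratio)))   # preserved channels (assumed rounding)

    # 1-norm of each input channel's weights W_i, used only to rank the channels.
    l1 = conv_weight.abs().sum(dim=(0, 2, 3))
    order = torch.argsort(l1, descending=True)

    # Fractions v_i from Eq. (1): the a_l largest-norm channels get 1/(a_l + 1),
    # the remaining N_l - a_l channels share the smaller fraction.
    v = torch.empty(n_l)
    v[order[:a_l]] = 1.0 / (a_l + 1)
    v[order[a_l:]] = 1.0 / ((a_l + 1) * (n_l - a_l))

    g = torch.ones(n_l)      # primary importance g_i = 1 (non-MobileNet case)
    s = g * v                # channel importance s_i = g_i * v_i
    return s / s.sum()       # sampling probabilities p_i = s_i / t
```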

3.2 Asynchronous pruning for multi-branch structures

To further increase the compression ratio of parameters or FLOPs, our method provides specific asynchronous channel pruning solutions for multi-branch networks. Our method constructs compression units that either connect different modules or stay within one module. The compression units belonging to a module are processed in two rounds of channel pruning with different channel selection rules; our compression method is therefore called asynchronous channel pruning. For the first round, our method applies the layer-wise channel pruning flow based on coreset construction shown in Fig. 1.

Fig. 1: Workflow of the coreset based channel pruning

I. Acceleration of channel pruning within a layer

As shown in Fig. 1, our method sequentially discards channels according to the constructed coreset \(C_{\bar \theta }\) and recovers precision by optimizing the feature map reconstruction. The channel selection procedure and the feature map reconstruction are performed with the following steps.

Step 1:

The channel selection procedure constructs the channel coreset by sampling a batch of channels simultaneously, rather than a single channel at a time, to accelerate pruning. The channel coreset is constructed for R rounds, and a merged coreset \(C_{\bar \theta }\) is then generated from the histogram counting how many times each channel has been selected. The merged channel coreset consists of the channels with the highest selection frequency in the histogram.

Step 2:

Channels outside the merged coreset \(C_{\bar \theta }\) are discarded from the input channels \(C^{l}_{in}\) of the l th layer, generating the compressed weights \(\bar \theta \). Meanwhile, the output channels \(C^{l-1}_{out}\) of the (l − 1) th layer are pruned in accordance with \(C_{\bar \theta }\).

Step 3:

The compressed weights \(\bar \theta \) of the merged channel coreset are updated by solving the optimization problem \({\min \limits } \left \| {Y - \sum \nolimits _{{W_{i}} \in \bar \theta } {{x_{i}}*{W_{i}}}} \right \|_{F}^{2}\) with linear least squares.

Step 1 is equivalent to constructing a channel subset whose size is at most R times the amount required by the compression ratio. The importance function \(s_{i}^{\prime }\) of a channel in this subset accumulates a re-scaling of its original importance over every sampling round [3], as \(s_{i}^{\prime }= \sum \nolimits _{k =1}^{K} \frac {{{s_{i}}}}{{R{a_{l}} {p_{i}}}}\), where K records the number of times the channel is sampled. In our setting, given that pi = si/t and si = vi, the updated importance is positively correlated only with the sampled frequency K, since \(s_{i}^{\prime }= \sum \nolimits _{k = 1}^{K} \frac {{{s_{i}}}}{{R{a_{l}} {p_{i}}}}=\frac {Kt}{R{a_{l}}}\). To simplify the calculation, our method treats the sampled frequency K of a channel as the selection criterion for extracting channels from the channel subset into the coreset.
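The sampling and merging in Step 1 can be sketched as follows (illustrative code under the same assumptions; torch.multinomial with replacement=False realizes batch sampling without duplication within each round).

```python
# Sketch of Step 1: draw a_l channels per round without duplication, repeat R rounds,
# and keep the a_l most frequently selected channels as the merged coreset.
import torch

def build_channel_coreset(probs: torch.Tensor, a_l: int, rounds: int):
    histogram = torch.zeros_like(probs)
    for _ in range(rounds):
        picked = torch.multinomial(probs, num_samples=a_l, replacement=False)
        histogram[picked] += 1
    merged = torch.topk(histogram, k=a_l).indices   # selection criterion: frequency K
    return merged, histogram
```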

II. Module-level pruning and in-place distillation

Asynchronous solution for SqueezeNet

For module-level pruning on SqueezeNet, in addition to the compression units across modules, our method constructs compression units within modules, which comprise the output channels of squeeze layers and the input channels of expand layers, as shown in Fig. 2. For fire module i, the compression unit across modules is processed with the layer-wise channel pruning described above in the first pruning round, and the compression unit within the module is compressed with the random channel pruning method in the second pruning round. This asynchronous channel pruning flow is executed successively on every fire module. In this way, both the shape and the number of filters in expand layers are compressed.

Fig. 2: Asynchronous pruning solution on SqueezeNet. The compression unit consisting of different fire modules is pruned prior to the compression unit within one fire module

Asynchronous flow of ResNet

Our method designs different asynchronous pruning flows for downsampling blocks and identity blocks on ResNet. As depicted in Fig. 3, compression units are constructed between layers on residual branches or layers on downsampling branches. The compression units in green, which consist of bottleneck layers in downsampling blocks or identity blocks, are pruned according to the proposed channel pruning flow in the first pruning round. The compression units in blue, which respectively contain downsampling layers, first bottleneck layers and last bottleneck layers, are pruned with the random channel pruning method in the second pruning round. Downsampling blocks and identity blocks differ in whether the second pruning round of a block is executed immediately. For downsampling blocks, the second pruning round is performed right after the first round. For N stacked identity blocks, the second pruning round of the first identity block is not carried out until the first pruning round of the N th identity block is finished; in other words, all the first pruning rounds of the stacked blocks form the first step, and all their second pruning rounds form the second step.

Fig. 3: Asynchronous pruning solution for downsampling blocks and identity blocks on ResNet. The compression units in green are sequentially pruned prior to the compression units in blue

Feature map distillation

During the second pruning round of the N stacked identity blocks in ResNet, the feature maps of the last bottleneck layer in every identity block are distilled in an in-place way. One channel set \(C_{\bar \theta }\), generated by random sampling while pruning the input channels of the first identity block, is shared with the remaining blocks for pruning both the input channels of the first bottleneck layer and the output channels of the last bottleneck layer. The feature maps of the channels in \(C_{\bar \theta }\) are chosen as the knowledge to be transferred from the last bottleneck layer with complete output channels to the compressed layer. After the output channels are compressed, the weights of the pruned last bottleneck layer are updated by minimizing the reconstruction loss against the selected channels' feature maps. The feature maps of the last bottleneck layer are then re-obtained from the same inputs with the updated weights, which realizes the distillation of feature maps from complete channels to compressed channels.
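A minimal sketch of this least-squares reconstruction is shown below (the data layout is an assumption: for the 1 × 1 last bottleneck convolution, every spatial position of a calibration batch is treated as one sample; names are illustrative).

```python
# Update the pruned last bottleneck layer so that its outputs match the selected
# channels of the complete layer's feature maps Y on a calibration batch.
import torch

def reconstruct_weights(X: torch.Tensor, Y: torch.Tensor,
                        kept_in: torch.Tensor, kept_out: torch.Tensor) -> torch.Tensor:
    # X: (samples, C_in)  inputs to the layer, flattened over spatial positions
    # Y: (samples, C_out) feature maps of the complete layer on the same inputs
    # kept_in / kept_out: channel indices retained by the shared channel set
    X_kept = X[:, kept_in]                                  # pruned inputs
    Y_target = Y[:, kept_out]                               # distillation targets
    # Solve min_W || Y_target - X_kept W^T ||_F^2 with linear least squares.
    sol = torch.linalg.lstsq(X_kept, Y_target).solution     # (len(kept_in), len(kept_out))
    return sol.T                                            # reconstructed weight matrix
```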

3.3 Neuron pruning via coresets

Our coreset based pruning method is extendable to compressing fully connected layers. The weights linked to a neuron in the l th layer, \(W^{l}_{i,:}\), are treated as one output channel of the l th layer, and the weights connected to a neuron in the (l − 1) th layer, \(W^{l}_{:,i}\), are regarded as one input channel of the l th layer. By grouping weights as channels, the formulas of random probability sampling can be used to construct coresets of neurons. Specifically, the fraction vi is computed from the number of neurons Nl and the required number of preserved neurons al in a fully connected layer, as in (1). The fraction vi is allocated according to the ranking of the 1-norm of the weights \(W^{l}_{:,i}\) belonging to the i th input channel. The primary importance gi is adjusted to the proportion of the Frobenius norm of one input channel's outputs hi to the average norm of every input channel's outputs \(\bar h\). Given a batch of input data x, hi is computed as \({\sum \nolimits _{{x_{j}} \in x} {\left \| {f_{\theta }({W_{:,i}}*{x_{ij}})} \right \|_{F}^{2}} }\) and \(\bar h\) is calculated as \({\frac {1}{N_{l-1}} \sum \nolimits _{i = 1}^{N_{l-1}} {\sum \nolimits _{{x_{j}} \in x} {\left \| {f_{\theta }({W_{:,i}}*{x_{ij}})} \right \|_{F}^{2}} } }\), where xij is the j th output of the i th neuron in the (l − 1) th layer. With the primary importance gi set to \(h_{i} / \bar h\), the neuron coreset is constructed by sampling with the probability pi = givi/t, as sketched below.
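A minimal sketch of this primary importance computation follows (the activation f𝜃 and the tensor shapes are assumptions for illustration).

```python
# Sketch of g_i = h_i / h_bar for the input channels (columns) of a fully connected layer.
import torch

def neuron_primary_importance(weight: torch.Tensor, prev_acts: torch.Tensor,
                              act=torch.relu) -> torch.Tensor:
    # weight: (out_features, in_features) of layer l
    # prev_acts: (batch, in_features), the outputs x_ij of the neurons in layer l-1
    contrib = act(prev_acts.unsqueeze(1) * weight.unsqueeze(0))  # f(W_{:,i} * x_ij), (batch, out, in)
    h = contrib.pow(2).sum(dim=(0, 1))                           # h_i per input channel
    return h / h.mean()                                          # g_i = h_i / h_bar
```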

The precision of the compressed layer is recovered either by weight reconstruction with the proposed feature map distillation, as in channel pruning, or by weight renewal as in the existing method [23]. The method in [23] updates the compressed weights as a re-scaling of their original values. Following this weight renewal rule, the new weights are obtained as \({W^{\prime }_{:,i}}=\frac {K}{m{p_{i}}}{W_{:,i}}\), where K counts the number of times an input channel is sampled and m is the number of samples drawn with duplication. Since our method extracts channels by batch sampling without duplication, the calculation simplifies to \({W^{\prime }_{:,i}}=\frac {{K}}{{g_{i}}}{W_{:,i}}\), as sketched below. With this extension to neuron pruning, our method is able to compress both convolution layers and fully connected layers in a network, further increasing the compression rate of parameters and FLOPs.
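The simplified renewal rule can be sketched as follows (kept, histogram and g are assumed to come from the coreset construction and importance computation above; names are illustrative).

```python
# Rescale the preserved columns by K / g_i, following the simplified renewal rule.
import torch

def renew_weights(weight: torch.Tensor, kept: torch.Tensor,
                  histogram: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # weight: (out_features, in_features); kept: indices of preserved input channels
    scale = histogram[kept] / g[kept]              # K / g_i for every kept channel
    return weight[:, kept] * scale.unsqueeze(0)    # compressed, rescaled weight matrix
```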

4 Experiments

Our method is implemented on top of a PyTorch project (Footnote 1) and experimentally evaluated in this section with the following settings.

  • Datasets: Tests are executed on three prevalent object detection networks on the COCO 2017 dataset, four popular classification networks on ImageNet 2012 and two common classification networks on CIFAR-10. The validation dataset for classification is generated by randomly extracting 10 images per category from the ImageNet 2012 training dataset, i.e. 10,000 images covering the one thousand categories, and is employed for compressing the original networks and testing the pruned networks.

  • Comparison schemes: Our method, marked as Coreset, is compared with three existing methods: ThiNet [21], the Lasso regression based pruning method [11] and the random channel pruning method, marked as ThiNet, Lasso and Random respectively. Due to its simplicity and effectiveness, random channel pruning [21] is reproduced as the reference for the performance of the random probability sampling designed in our coreset construction. All comparison schemes are implemented for channel selection and rely on weight reconstruction to restore precision.

4.1 Channel pruning in classification networks

I. Sensitivity of different layers to channel pruning

Given the different sensitivity of each layer to pruning, our method is employed to compress a single convolution layer or block at a time. The accuracy after compressing one layer or block is measured on VGG16, ResNet50, MobileNet-v2 and SqueezeNet1_0. Pruning is performed on the squeeze layers of SqueezeNet1_0 and the middle layers of blocks in ResNet50 and MobileNet-v2. The batch size for layer compression is as small as 4 images, to test the dependency on the data used for channel selection and weight reconstruction. Table 1 lists results on a subset randomly drawn from the training dataset of ImageNet 2012 in each test; the accuracy decline between the training and validation datasets is given in brackets. Our method outperforms the comparison schemes on VGG16 and MobileNet-v2. On ResNet50, our method achieves an accuracy second to the comparison schemes, with a small variance across datasets.

Table 1 Accuracy of compressing a layer or block by the compression ratio 0.5 without finetuning

The average delay of pruning one layer or block with our method is measured to evaluate the construction efficiency. Figure 4 reports the time delays averaged over all layers while pruning the whole network. Our method costs less time for pruning than ThiNet and the Lasso regression based method, and its delay is similar to that of random channel pruning. This indicates that our method improves the construction efficiency of pruning.

Fig. 4: Delay of pruning one layer or block by the compression ratio 0.5 without finetuning

II. Accuracy of pruning multiple layers

The accuracy of pruning multiple layers without finetuning is measured to test the feasibility of retraining compressed networks after multiple pruning iterations rather than after every iteration. Figure 5 shows the Top 1 accuracy at different compression ratios, which restrict the number of pruned layers. Channel pruning is executed on the first 5 convolution layers of VGG16, the middle layers of the first 5 blocks in ResNet50, the middle layers of the first 4 blocks in MobileNet-v2, and the squeeze layers of the first 6 fire modules in SqueezeNet1_0. Our method achieves the smallest accuracy decline among all tested methods on ResNet50 and MobileNet-v2 at different compression ratios, and the gap between our method and the comparison schemes grows as the number of compressed layers increases. By reducing the accuracy decline of pruning multiple layers, our method decreases the finetuning passes needed to recover precision, accelerating pruning of the entire network.

Fig. 5: Top 1 accuracy of four networks with multiple layers or blocks compressed by different compression ratios

III. Finetuned networks with channel pruning

The top 1 accuracy of finetuned compressed networks is tested to evaluate the influence of pruning on the original network. Whole networks are pruned by our channel pruning method and then finetuned for 89 iterations. As listed in Table 2, the accuracy of the compressed SqueezeNet increases after pruning. Even without compression, there is an accuracy gap between the original network and an individual network of the required width, whose channel numbers match the compression ratio. As reported in [34], individual networks such as MobileNet-v2 and ResNet50 with 0.5 × width reach accuracies of 65.4% and 72%, respectively. Our pruning method achieves acceptable accuracy declines compared to the original networks.

Table 2 Results of finetuned networks compressed by channel pruning with a compression ratio 0.5

4.2 Channel pruning in object detection networks

Our pruning method is applied to compress the backbones of object detection networks, including Yolo-v3, RetinaNet and Mask R-CNN. The average precision (AP) after pruning the middle layers of an identity block in the backbone is measured block by block. Figure 6 depicts the decline of AP at an Intersection-over-Union (IoU) of 0.5 for different compression ratios. Since deep layers are task-specific and sensitive to pruning, all tested methods cause large precision declines at deep layers. The random channel pruning method obtains small declines, indicating the effectiveness of feature map reconstruction in recovering precision. The AP difference between a pruning method and random channel pruning reflects the loss caused by its channel selection rule. Although the AP of our method is lower than that of the random pruning method at some deep layers, our method achieves a higher AP than the other two comparison schemes.

Fig. 6: Average precision decline of pruning the middle layers of an identity block in the backbone at IoU = 0.5 by different compression ratios

Table 3 lists the declines in AP and AR caused by pruning the middle layers of the first block at a compression ratio of 0.5 without finetuning. Our channel pruning method achieves the smallest decline in AP and AR for both bounding box detection and instance segmentation.

Table 3 Results of compressing one block in the backbone network by a compression ratio 0.5 without finetuning

4.3 Asynchronous pruning in multi-branch structures

The asynchronous pruning solutions on ResNet50 and SqueezeNet1_0 are evaluated at different compression ratios without finetuning. Figure 7 shows the top 1 accuracy of compressed networks whose first N blocks or fire modules (1 ≤ N ≤ 5) are pruned; the dashed line denotes the original network's accuracy. Compared to random channel pruning, our asynchronous pruning is effective in maintaining precision as the compression ratio varies.

Fig. 7: Top 1 accuracy of compressed networks without finetuning at different compression ratios by asynchronous pruning

The asynchronous pruning solution on ResNet50 is then tested with and without feature map distillation, to evaluate its influence on recovering precision. Figure 8 depicts the accuracy variation of a specific block (layer3.0) measured 12 times on the validation and training datasets. The size of each circle is determined by the accuracy variation across datasets, and the abbreviation coreset_ab denotes results without feature map distillation. The average accuracy increment with feature map distillation is greater than that without it.

Fig. 8: Accuracy variation of asynchronous pruning on stacked blocks with and without the feature map distillation

Besides, the pruned networks with different compression ratios are finetuned for 89 iterations. As listed in Table 4, the top 1 accuracy of the compressed SqueezeNet increases by 1.52% at a compression ratio of 0.25. Comparing Tables 2 and 4 shows that the proposed asynchronous pruning method enhances the FLOPs compression rate with small accuracy declines. In particular for ResNet50, the FLOPs are further decreased by 0.661G with an accuracy decline of 2.37% at the compression ratio 0.5. As reported in [34], the accuracy decline of individual networks with 0.5 × width is more than 4% compared to the original ResNet; the proposed asynchronous pruning method achieves a smaller accuracy decline at a FLOPs level similar to such individual networks.

Table 4 Results of finetuned networks compressed by asynchronous pruning

The asynchronous pruning flow is portable to identity blocks consisting of two convolution layers in ResNet20 or ResNet56. Table 5 lists the finetuned results after channel pruning by our method and the comparison schemes on the CIFAR-10 dataset. Results of the comparison schemes [9, 17, 31] are obtained with Distiller (Footnote 2). The scheme CRank performs channel ranking and pruning as described in [17]. The sparsity indicator is calculated as the ratio of pruned parameters to total parameters. Our method sets the channel compression ratio to 0.5 and achieves a sparsity greater than those of the comparison schemes.

Table 5 Results of finetuned networks compressed on the CIFAR-10 dataset

4.4 Accuracy of neuron pruning

The extension of our method to neuron pruning is tested on the fully connected layers of VGG16. Table 6 lists the delay and top 1 accuracy of compressing the second fully connected layer, denoted as VGG-fc2. The network is pruned on the validation dataset and tested on a random subset of the ImageNet 2012 training dataset, yielding the accuracy decline across datasets shown in brackets. In the tests recovering precision with the weight renewal rules of coreset theory, two neuron pruning methods from [23] and [2] are reproduced, marked as Neural and CoreNet+ respectively. Our method outperforms CoreNet+ and obtains an accuracy similar to that of Neural. In the tests compensating accuracy declines with weight reconstruction, our method achieves a higher top 1 accuracy than ThiNet and spends less time on pruning. In either way of precision recovery, our method performs as effectively as Neural, indicating that the modified importance function in our method is adequate for neuron pruning.

Table 6 Results of compressing one full connected layer by different compression methods without finetuning

5 Conclusion

This paper proposed an asynchronous pruning method based on applying coreset theory to neural network compression. Our method adopts a channel selection rule that adapts the probability calculation of random probability sampling in coreset construction with our devised channel importance function. This modification makes the total importance controllable by the specified compression ratio while constructing an ε-coreset. The designed channel selection rule is implemented by batch sampling, improving the construction efficiency of pruning. Our asynchronous pruning solutions for multi-branch structures achieve module-level pruning on ResNet without introducing extra operations into the forward inference, and our method applies an in-place feature map distillation to recover precision after pruning the input and output channels of an identity block. Deploying our method on more kinds of multi-branch networks is left to future experiments.