Progressive multi-level distillation learning for pruning network

Although classification methods based on deep neural networks have achieved excellent results, they are difficult to apply in real-time scenarios because of their high memory footprints and prohibitive inference times. Compared with unstructured pruning, structured pruning techniques reduce the runtime computation cost of the model more effectively, but they inevitably reduce its precision. Traditional methods use fine-tuning to recover the performance lost to pruning. However, there is still a large gap between the pruned model and the original one. In this paper, we use progressive multi-level distillation learning to compensate for the loss caused by pruning. The pre-pruning and post-pruning networks serve as the teacher and student networks, respectively. The proposed approach exploits the complementary properties of structured pruning and knowledge distillation, which allows the pruned network to learn the intermediate and output representations of the teacher network, thus reducing the impact of pruning on the model. Experiments demonstrate that our approach performs better on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets with different pruning rates. For instance, GoogLeNet achieves nearly lossless pruning on the CIFAR-10 dataset at a 60% pruning rate. Moreover, this paper also shows that using the proposed distillation learning method during the pruning process achieves more significant performance gains than using it after pruning is completed.


Introduction
Deep neural networks (DNNs) have shown impressive results on tasks such as image classification [1][2][3], object detection [4][5][6], and natural language processing [7]. As network models have developed, building ever more sophisticated networks to achieve higher accuracy has become a trend [8,9]. These large, complex networks, however, do not run efficiently on mobile devices or Internet of Things devices. Therefore, reducing the model's computational burden while preserving high accuracy is one of the main problems facing DNNs. Pruning, knowledge distillation, quantization, and lightweight networks have become established ways to reduce the considerable computational resources required [10].
Corresponding author: Wu Zhang (zhangwu@ahau.edu.cn). Affiliations: (1) School of Information and Computer, Anhui Agricultural University, 130 Changjiang West Road, Shushan District, Hefei, Anhui, China; (2) Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, Anhui, China.
Pruning methods [11] make the model simpler and more efficient by eliminating redundant parameters or connections according to some importance measure, which is why pruning is a popular technique. Although this technique can minimize the size of the model while largely maintaining performance, some precision loss is unavoidable in pruned networks. The goal of knowledge distillation is to guide student learning with a more robust teacher model, which enables simpler student models to acquire a certain degree of the teacher's skill [12]. This, however, requires researchers to manually select models for both teacher and student.
Knowledge distillation is an effective way to compensate for the loss of precision due to pruning. However, using knowledge distillation only after the pruning has been completed, and not while it is in progress, may result in suboptimal model performance. In addition, most previous studies have focused on how to improve the performance of unstructured pruning [13,14], while there have been few studies on structured pruning [15,16]. In fact, unstructured pruning needs special software libraries or hardware to speed up the network model, whereas structured pruning can compress the network without such support [17]. Therefore, it is more realistic to combine structured pruning with distillation learning.
To address the image classification problem, we present a new method of progressive multi-level distillation for structured pruning. In this paper, the original and pruned networks are treated as the teacher and student models, which avoids the need for manual selection of a teacher model. Moreover, we take full advantage of the characteristics of structured pruning, using each pruned block of the student network and its corresponding teacher block as the input to a distillation loss based on feature representation. The number of such block pairs gradually increases as pruning progresses, forming a progressive distillation. In addition to feature learning, our proposed multi-level distillation learning includes response representation-based learning, which allows the student to mimic the logits output of the teacher model. In this way, our approach effectively reduces accuracy losses, allowing the pruned network to minimize the size of the model and the computational resources required within an acceptable range of accuracy degradation.
The contributions of this paper are as follows.
1. This paper proposes a progressive multi-level distillation learning approach for structured pruning networks. We also validate the proposed method on different pruning rates, pruning methods, network models, and three public datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet).
2. Compared with other knowledge distillation methods, our proposed method better restores the accuracy of the structured pruning network and improves the model's performance after each pruning.
3. We conduct ablation experiments to further understand each loss's contribution to our proposed framework.
4. We show that distillation learning during pruning, rather than after pruning, improves model performance without additional inference time.

Network pruning
Earlier studies on pruning focused on the granularity of individual neurons or weights, i.e., unstructured pruning. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] assessed the significance of weights on the basis of the second-order derivatives of the loss function. More directly, Han et al. [20] treated parameters as significant (insignificant) depending on whether they were larger (smaller) than a given threshold. While leading to high compression ratios, these methods only changed the weight matrix from dense to sparse. Unstructured pruning does not yield the expected speedup without specialized software libraries or hardware support [17].
In structured pruning, on the other hand, the pruning granularity is a whole structure, such as a filter or channel. For example, Li et al. [21] ranked the filters of each layer according to the sum of the absolute filter weights (i.e., the L1-norm) to determine their importance. Zhuang et al. [22] considered sparse filters non-critical, and removing unimportant filters by imposing a scaling factor on the Batch Normalization (BN) layer was also an efficient approach [23]. In a recent study, Lin et al. [24] concluded that the rank of the feature map is more representative of the amount of information contained in a filter, which can lead to promising results.
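As a concrete illustration of the L1-norm criterion of Li et al. [21], the following minimal sketch ranks the filters of a single layer and selects the weakest ones for removal. The function name `l1_filter_ranking`, the flat-list filter layout, and the toy weights are assumptions made for this example; it is not the implementation used in the experiments.

```python
def l1_filter_ranking(filters, prune_rate):
    """Return (kept, pruned) filter indices for one layer.

    filters: list of filters, each given as a flat list of weights
             (hypothetical layout chosen for this sketch).
    prune_rate: fraction of filters to remove, e.g. 0.5.
    """
    # Importance of a filter = sum of the absolute values of its weights.
    norms = [sum(abs(w) for w in f) for f in filters]
    n_prune = int(len(filters) * prune_rate)
    # Sort filter indices from least to most important.
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = sorted(order[:n_prune])
    kept = sorted(order[n_prune:])
    return kept, pruned


# Toy layer with four filters of three weights each.
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, -0.4, 0.3],
    [0.05, 0.0, -0.02],
]
kept, pruned = l1_filter_ranking(layer, 0.5)
print("kept:", kept, "pruned:", pruned)
```

Because whole filters are removed, the remaining layer is a smaller dense layer, which is why structured pruning needs no sparse-computation support.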

Knowledge distillation
The original knowledge distillation work [25] argued that one-hot labels limit the performance of the network model, and that the soft labels of a more robust network provide richer information, which allows knowledge to be transferred from a larger teacher network to a smaller student network, thereby bridging the gap between them. Besides the knowledge in the output logits, intermediate representations within the teacher, in the form of feature maps, can also be learned by the student model. FitNet [26] first proposed distillation learning from the knowledge of a single intermediate layer. AT [27] extended this idea by extracting knowledge from multiple intermediate layers of the teacher model to guide student learning, and by using L2-normalization on each feature map to ensure consistent dimensions for each pair of feature maps. However, knowledge from deeper intermediate layers may over-regularize the student, while knowledge from shallower layers may provide little guidance [12], which results in inefficient knowledge transfer. Relation-based distillation learning further explores the relationships between different layers or data samples. Yim et al. [28] used the relationship between layers of the teacher network as the learning target of the student model. SP [29] aimed to preserve pairwise similarity in the student rather than mimicking the teacher's representation space, so that the student could better capture the relationships between instances. Furthermore, in addition to the classification applications mentioned above, knowledge distillation methods have also proven effective in more complex tasks such as object detection [30,31].
Model pruning and knowledge distillation are two independent parts of model compression. How to combine these two methods is one of the problems worth discussing. The simplest way to combine them is to use knowledge distillation after the completion of pruning [15,32]. However, we have shown that the use of distillation learning in the fine-tuning process of pruning can yield better results, as demonstrated in "Two combined strategies". Furthermore, it is also necessary to validate the efficacy of distillation learning for structured pruning networks on various model architectures and public datasets.

Quantization
The memory footprint and inference time of a model can be effectively decreased by reducing the number of bits used to represent the original weights. This technique is known as quantization. Gong et al. [33] showed that quantizing the weights using K-means clustering could compress the network model by a factor of 8-16 with minimal or no performance impairment. In the extreme case, weights could be represented as one-bit data to form a binarized network [34], significantly reducing computational consumption. Han et al. [35] integrated pruning, quantization, and Huffman coding for deep model compression, providing a solution for deployment on devices with low energy consumption.
Figure 1 gives an overview of the progressive multi-level distillation learning approach for structured pruning. In the process of structured pruning, the original network and the pruned network are treated as the teacher and student models, respectively, and the proposed method is used in the fine-tuning process. Compared with using knowledge distillation only after pruning is completed, our approach increases the training time but improves the performance of the model. Although the structure of the network model (i.e., the number of channels) changes constantly with pruning, our experiments show that we can improve the performance after every pruning without adjusting the hyperparameters. The algorithm flow is illustrated in Algorithm 1. The proposed approach is described in more detail in the following sections.
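To make the word "progressive" concrete, the sketch below only enumerates, for each pruning stage, which blocks contribute feature pairs to the distillation loss, following the description of Fig. 1. It is not a reproduction of Algorithm 1: the function name `progressive_schedule` and the dictionary layout are assumptions for illustration, and the actual pruning and fine-tuning steps are omitted.

```python
def progressive_schedule(num_blocks):
    """List, for every pruning stage, the blocks used for feature distillation.

    Stage k corresponds to the student obtained after pruning blocks 1..k;
    every block pruned so far is paired with its unpruned teacher block.
    """
    stages = []
    for stage in range(1, num_blocks + 1):
        stages.append({
            "student": f"Student #{stage}",
            "feature_pairs": list(range(1, stage + 1)),
        })
    return stages


# Three blocks, as in the overview figure.
for entry in progressive_schedule(3):
    print(entry["student"], "distills blocks", entry["feature_pairs"])
```

Under this reading, only shallow features guide the student early on, and deeper features are added as more blocks are pruned.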

Progressive feature distillation
As mentioned in "Knowledge distillation", intermediate knowledge from deep layers can easily over-regularize the student model, while intermediate knowledge from shallow layers may not provide enough guidance. Therefore, effectively transferring the knowledge of the teacher model to the student is a critical issue. As shown in Fig. 2, unlike FitNet [26] and AT [27], which distill knowledge from fixed intermediate blocks, we exploit the characteristic of structured pruning in which each block is pruned in turn, so that each block that is pruned becomes a mentee and the corresponding unpruned block in the teacher model becomes its mentor. Although there is a significant deviation between the pruned block and the original one, the corresponding feature pairs can effectively transfer intermediate knowledge to achieve better performance recovery. As illustrated in Fig. 1, when pruning begins, the number of pruned blocks is small, and only shallow intermediate knowledge is used as a guide. As the number of pruned blocks increases, the corresponding loss of information increases, so that the deep intermediate knowledge becomes useful, avoiding over-regularization of the student model and compensating for the loss of representation power caused by pruning.
Fig. 1 An overview of the proposed progressive multi-level distillation method for structured pruned networks. In the figure, the network is divided into three blocks; we take the network after pruning each block as a student and the well-trained network as a teacher. Students #1 to #3 represent the student models obtained after sequentially pruning the first to the third block of the teacher model. As pruning progresses, more intermediate features are used for knowledge extraction, which maximizes the use of the pruning properties for distillation learning. The adaptation layer makes the feature map dimension of the student block the same as that of the teacher block. Note that a block can contain more than one convolution layer
In the pruning of the student model, structured pruning removes the non-significant channels, which leads to a discrepancy in the number of channels between the two models. Using an adaptation layer consisting of a pointwise convolution (1 × 1 kernel) and a BN layer, we map the student channels to their corresponding teacher counterparts, allowing for more efficient knowledge extraction and reducing the differences in feature maps between the pruned and the original model. The distillation loss of an individual block is the distance $D_p(F_t, r(F_s))$, in which $F_s$ denotes a feature map of the student model and $F_t$ denotes the feature map of its corresponding teacher block. $r(\cdot)$ is a regressor consisting of a 1 × 1 convolutional layer and a BN layer. $D_p$ measures the $L_2$ distance between the student and teacher feature maps. The overall distillation loss based on feature representation combines the individual block losses over the $B$ pruned blocks, where $B$ is the number of pruned blocks. This loss makes it possible for the student model to learn the features of the teacher model efficiently during the structured pruning process.
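The block-wise feature loss can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the BN layer of the regressor is omitted, feature maps are stored as `[channels][positions]` lists, the block losses are combined by a plain sum, and all names (`pointwise_conv`, `l2_distance`, `feature_distillation_loss`) are hypothetical rather than taken from the original implementation.

```python
import math


def pointwise_conv(feature, weight):
    """Apply a 1x1 convolution: mix channels independently at every position.

    feature: [C_in][P] values, with P = H*W flattened spatial positions.
    weight:  [C_out][C_in] channel-mixing matrix (BN is omitted in this sketch).
    """
    n_pos = len(feature[0])
    return [
        [sum(w * feature[c][p] for c, w in enumerate(row)) for p in range(n_pos)]
        for row in weight
    ]


def l2_distance(a, b):
    """Euclidean distance between two feature maps of identical shape."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def feature_distillation_loss(student_feats, teacher_feats, adapters):
    """Sum the per-block distances over all pruned blocks (assumed reduction)."""
    return sum(
        l2_distance(pointwise_conv(fs, w), ft)
        for fs, ft, w in zip(student_feats, teacher_feats, adapters)
    )


# Toy block: the student keeps 1 of the teacher's 2 channels, 2 positions each.
student = [[[1.0, 2.0]]]
teacher = [[[1.0, 2.0], [2.0, 4.0]]]
good_adapter = [[[1.0], [2.0]]]  # maps 1 student channel onto 2 teacher channels
poor_adapter = [[[1.0], [1.0]]]

print(feature_distillation_loss(student, teacher, good_adapter))
print(feature_distillation_loss(student, teacher, poor_adapter))
```

The adapter exists only to make the comparison possible during training; it is not part of the pruned network at inference time.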

Output logits distillation learning
Multi-level distillation learning has been shown to perform better than single knowledge distillation methods for image classification [36] and object detection [37]. Therefore, we extend this concept to the pruning process. Apart from the feature representation-based knowledge distillation described above, our approach also includes distillation learning that mimics the output logits: mimicking the softened teacher outputs allows the student to learn more from the teacher model. We use the Kullback-Leibler divergence between the student and teacher outputs as the distillation loss for output imitation. The temperature $\tau$ softens the outputs of each student-teacher pair. This method enables the student model to learn the predictions of the high-performance teacher model more efficiently, which can significantly reduce the classification error rate. The softened softmax function (Eq. (3)) is $X_{ij} = \exp(x_{ij}/\tau) / \sum_{k} \exp(x_{ik}/\tau)$, where $x_{ij}$ represents the student output logit for the $j$th class of the $i$th batch sample. $X_{ij}$ and $X^{T}_{ij}$ represent the softened softmax outputs of the student model and the teacher model for the $j$th class of the $i$th batch sample, respectively, and the overall output imitation loss is the Kullback-Leibler divergence between them. The temperature hyperparameter $\tau$ determines the degree of softening of the output. $X^{T}_{ij}$ is also calculated by Eq. (3), using the teacher logits.
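The output imitation loss can be illustrated with the short sketch below. It assumes the divergence is taken as KL(teacher || student), averaged over the batch, and without any extra temperature-dependent scaling factor; none of these details is fixed by the text above, and the function names are hypothetical.

```python
import math


def softened_softmax(logits, tau):
    """Softmax of logits divided by the temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(student_logits, teacher_logits, tau):
    """Batch mean of KL(teacher || student) on softened outputs (assumed form)."""
    batch_loss = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softened_softmax(s_row, tau)
        p_t = softened_softmax(t_row, tau)
        batch_loss += sum(t * math.log(t / s) for s, t in zip(p_s, p_t) if t > 0)
    return batch_loss / len(student_logits)


teacher_logits = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
student_logits = [[2.0, 1.5, 0.5], [1.0, 1.2, 0.8]]

print(kl_distillation_loss(student_logits, teacher_logits, tau=4.0))
print(kl_distillation_loss(teacher_logits, teacher_logits, tau=4.0))
```

A larger temperature flattens both distributions, so the student also receives a signal from the teacher's non-maximal classes.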

Total loss
In addition to the feature and output imitation learning described above, each student model is trained with the classical cross-entropy between the ground-truth labels and the student outputs, which helps the model fit the given dataset. For the $i$th batch sample, this loss is $-\sum_{j} Y_{ij} \log X_{ij}$, where $X_{ij}$ here represents the student softmax output (without temperature softening) for the $j$th class and $Y_{ij}$ denotes the ground-truth label for the $j$th class. Our proposed progressive multi-level distillation learning is a weighted combination of the three losses mentioned above, with weights α, β, and γ. Only the parameters of the student network are updated during the training phase, which allows better accuracy recovery of the pruned model. A grid search gives the optimal weights α = 0.25, β = 0.1, and γ = 0.9, and we use these hyperparameters in all subsequent experiments. Note that the proposed method does not increase the inference time of the model, and it is orthogonal to techniques such as quantization.
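The weighted combination can be sketched as below. The text does not state which of α, β, and γ multiplies which loss term, so the assignment used in the example call (cross-entropy, output imitation, feature distillation, in that order) is purely an assumption for illustration, as are the function names and the toy loss values.

```python
import math


def cross_entropy(probs, one_hot):
    """Classical cross-entropy for one sample: -sum_j Y_j * log(X_j)."""
    return -sum(y * math.log(x) for x, y in zip(probs, one_hot) if y > 0)


def total_loss(loss_ce, loss_kd, loss_fd, weights):
    """Weighted sum of the three losses.

    weights: (w_ce, w_kd, w_fd). How these map to alpha, beta, and gamma
    is NOT specified in the text; the order here is a hypothetical choice.
    """
    w_ce, w_kd, w_fd = weights
    return w_ce * loss_ce + w_kd * loss_kd + w_fd * loss_fd


ce = cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])
# Toy values for the two distillation terms.
print(total_loss(ce, 0.05, 1.2, weights=(0.25, 0.1, 0.9)))
```

Since the teacher and the adaptation layers are used only to compute training losses, the deployed student has the same inference cost as a pruned model trained without distillation.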

Experiments
The effectiveness of this method is evaluated by comparing it with the existing methods. See "Implementation details" for details of implementation. In "Main results", the superiority of our approach is demonstrated in publicly available datasets. The effects of the ablation experiments and different combination strategies will be discussed in later sections.

Implementation details
We perform L1-norm [21] pruning and HRank [24] pruning on VGGNet-16 [38], ResNet-56 [39], and GoogLeNet [40]. The locations of the selected feature distillation blocks are shown in Fig. 3. In addition, to enable a more comprehensive assessment of the usability of the proposed method, we also validate it under different layer pruning rates: 60%, 70%, and the appropriate pruning rate (APR) given by HRank, as shown in Table 1. All experiments are performed using PyTorch on an NVIDIA GeForce GTX 1080Ti GPU. The resource costs of the models at various pruning rates on CIFAR-10 are shown in Table 2.
In order to demonstrate the effectiveness of the proposed approach, we compare it with the following representative approaches. Baseline is the result of pruning without the use of a distillation method. Details are as follows.
(a) KD [25]: Uses the KL divergence to bring the softmax outputs of the teacher and student closer, so as to transfer knowledge and reduce the classification error of the student model.
(b) FitNet [26]: Extracts the knowledge of a single intermediate layer of the well-trained teacher network and uses it to guide the student's learning. Knowledge distillation is accomplished by optimizing the distance between the student and teacher intermediate layer features.
(c) AT [27]: Improves student network performance by transferring the attention map of the teacher network so that the student can learn more useful information.
(d) SP [29]: Uses pairwise activation similarity in each mini-batch to train the student. This encourages the student model to maintain pairwise similarity in its representation space without mimicking the teacher's representation space.

CIFAR-10/100
CIFAR-10 [41] has 50,000 training and 10,000 test images divided into 10 classes. CIFAR-100 [41] has the same number of training and test images as CIFAR-10; the difference is that these images are classified into 100 categories. To obtain the pre-trained model to be pruned, we use an SGD optimizer with a momentum of 0.9, a weight decay of 0.0005, and an initial learning rate of 0.1, train for 350 epochs, and multiply the learning rate by 0.1 at epochs 175 and 262. The batch size is set to 64. After each layer is pruned, we fine-tune for 40 epochs with a learning rate of 0.01, which is divided by 10 at epochs 5 and 10.
Table 3 shows that, even on the more easily classified CIFAR-10 dataset, pruning still impairs the performance of the model, even at lower pruning rates. Compared with other distillation methods, our method allows the model to recover the most lost accuracy during the fine-tuning phase. In particular, on the GoogLeNet model with a 60% pruning rate, almost lossless pruning can be achieved (only a 0.04% decrease in accuracy compared with the teacher). On ResNet-56 at a 60% pruning rate, our method improves accuracy by only 0.28%, but the other methods improve it by at most 0.11%. GoogLeNet achieves the largest improvement, 0.93%, at APR, while other methods achieve at most 0.4%. This result provides evidence that our framework transfers knowledge better.
Figure 4 compares the running time of VGGNet with different knowledge distillation approaches at a 60% pruning rate on CIFAR-10. The running time of FitNet [26] is close to that of the proposed approach, but it improves the performance by only 0.15% (ours improves it by 0.64%). SP [29] and AT [27] do not effectively compensate for the loss of accuracy while consuming significant runtime resources. The KD [25] method requires less runtime but has a relatively limited precision recovery. Figure 4 shows that our approach achieves the best accuracy recovery while using fewer resources than SP and AT.
Note: [·] represents the pruning rate at each layer, and *n means the same pruning rate in the following n layers.
Table 4 shows that on the CIFAR-100 dataset, which is more challenging to classify, our method achieves its largest boost on VGGNet: 2.09% over the baseline at a 60% pruning rate. While ResNet-56 shows only a 0.41% accuracy improvement at a 70% pruning rate, the other methods improve by 0.13% at most. Compared with the results on CIFAR-10, the proposed method generally improves more on CIFAR-100, which may be explained by the larger difference in performance between the teacher and student models. However, as described in [42], a better-performing teacher model does not always lead to better distillation, as we can observe on ResNet-56 at a 70% pruning rate for both datasets (note that the difference in performance between teacher and baseline on CIFAR-100 is much larger).
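The step schedules used for pre-training and fine-tuning on CIFAR-10/100 can be written as a small helper. This is a minimal sketch assuming 0-indexed epochs and that the decay takes effect once the epoch index reaches a milestone, which mirrors the usual multi-step convention; the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a given epoch under step decay.

    The rate is multiplied by gamma once for every milestone already reached
    (assumption: the drop applies from the milestone epoch onward).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Pre-training: 350 epochs, initial rate 0.1, decayed at epochs 175 and 262.
pretrain = [step_lr(e, 0.1, (175, 262)) for e in range(350)]
# Fine-tuning after each layer pruning: 40 epochs, rate 0.01, decayed at 5 and 10.
finetune = [step_lr(e, 0.01, (5, 10)) for e in range(40)]

print(pretrain[0], pretrain[175], pretrain[262])
print(finetune[0], finetune[5], finetune[10])
```

Under this schedule, 30 of the 40 fine-tuning epochs run at the lowest learning rate, so most of the recovery happens with small parameter updates.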
Both Tables 3 and 4 show that our method achieves effective and superior results on different pruning rates, models, and pruning methods and allows for continuous improvement.

Tiny-ImageNet
Tiny-ImageNet [43] consists of 100,000 training and 10,000 validation images in 200 classes, and we resize its inputs to 32 × 32. To obtain a pre-trained model, we use an SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005, train for 120 epochs with a learning rate of 0.01, and multiply the learning rate by 0.1 at epochs 30, 60, and 90. The fine-tuning strategy after pruning is the same as for CIFAR-10/100.
Table note: We report the Top-1 accuracy (%) of the results. The teacher refers to the pre-trained model without pruning. Baseline is the result obtained without the use of any distillation learning method.
As shown in Table 5, on the larger Tiny-ImageNet dataset, some knowledge distillation methods do not work as well as on CIFAR-10/100. However, our method still recovers the lost performance, which is clearly observed at a 60% pruning rate. The proposed method also obtains better results than the other methods at a 70% pruning rate, allowing the model to recover 0.52% accuracy. This result shows that our method remains effective even on larger datasets that are more difficult to classify.

Improving each pruning
As described in the previous section, our approach achieves promising results at the end of pruning. However, we also want it to improve the performance after each pruning step. The pruning of each convolution layer removes the less important filters, which leads to a lower precision. As shown in Figs. 5, 6 and 7, our method still works during the pruning process, which means that the proposed approach remains practical even if the pruning has not been completed. Moreover, the performance of the student is better than that of the teacher model during the early stages of pruning, which also shows the ability of our approach to better combine structured pruning and distillation learning.

Ablation study
To further analyze the contribution of each of our proposed losses, we add the ablated components step by step and observe their effects. We perform the experiments on VGGNet-16 with a 70% pruning rate on CIFAR-10, as shown in Table 6, where baseline refers to the pruning process without our method. It can be observed that the method proposed in "Progressive feature distillation" contributes the largest improvement in model performance, and that the proposed progressive mechanism based on the pruning process further improves feature distillation.
In conclusion, the weighted combination of the proposed components can be used to compensate for the loss of performance due to pruning as much as possible.

Two combined strategies
The strategies for combining distillation learning and pruning can be broadly divided into two categories: applying distillation after pruning is completed and applying it during pruning. Fig. 8 shows the performance and time comparison of our proposed method under these two strategies. Using distillation during the pruning process leads to higher precision recovery, but it also requires more training time. Thanks to its progressive distillation process, our method takes less time and achieves greater performance gains than AT [27]. FitNet [26] takes less time to train, but its accuracy improvement is not even as good as that obtained by using our distillation method only after the completion of pruning. This result demonstrates that our method can recover the decreased accuracy within a relatively short training time without affecting the inference speed of the model.

Conclusion and future work
We propose a progressive multi-level distillation learning method to alleviate the accuracy drop of structured pruned networks. This method takes advantage of the characteristics of structured pruning, which allows the pruned network to learn more information from the teacher network. Experiments on different datasets, model architectures, and pruning rates show that the proposed approach achieves better performance than other approaches, and that the accuracy of the model is improved after every pruning. Further experiments demonstrate that applying the proposed method during the pruning process enhances the model performance more effectively than applying it afterwards. Our approach is efficient in training time and does not affect inference time. Our study provides a valuable approach to better integrate pruning and distillation learning.
In future work, we hope to extend this idea to more complicated tasks such as object detection and semantic segmentation. Furthermore, it is also worth exploring how to better combine pruning, knowledge distillation, and other compression techniques such as quantization.