Introduction

Deep neural networks have achieved notable results in recent years in image classification [1,2,3], object detection [4,5,6], and natural language processing (NLP) [7,8,9], among other areas of application. However, improving network performance often means using deeper and larger models with more parameters [11]. Because edge devices have limited computational power, such state-of-the-art models are hard to deploy on them efficiently [12]. Methods such as pruning [13], quantization [14], distillation [15], and low-rank decomposition [16] have been developed to address this issue, aiming to make the model smaller without loss of accuracy.

The key idea behind knowledge distillation is to transfer information from large or ensemble models to smaller student models. Current distillation methods are categorized into response-based, feature-based, and relationship-based distillation [17]. Response-based distillation improves student performance by emulating the teacher's final prediction, which is simple and efficient but ignores the information in the intermediate layers of the teacher model [15]. Feature-based approaches improve student performance by emulating the output of intermediate or final layers, which provides a clear direction for improvement [18]. Previous research has proposed various methods for extracting knowledge from intermediate layers. These methods carefully design the knowledge representation so that student performance keeps improving, but balancing multiple pieces of knowledge requires complex hyperparameter tuning, and the increasing complexity of the knowledge makes it difficult to explain the student's gains [18,19,20,21,22,23]. For the last layer, a recent study by Baruch et al. shared the classifier between the teacher and student models, extracting the dark knowledge within it and achieving competitive results [25]. However, the feature dimensions of the student network are not always the same as those of the teacher network, so sharing classifiers requires a projector to resolve the feature-dimension mismatch. Previous studies have largely neglected the critical role of this projector. Therefore, our study focuses on the feature-matching process prior to classifier sharing through a projector integration strategy.

In this paper, student inference is performed using the discriminative classifier of a pre-trained teacher model, in anticipation of improved model performance. However, because the feature dimensions of the teacher and student models usually differ, a simple projector is added after the student feature encoder for dimensional matching. According to ensemble learning theory [25,26,27], ensemble strategies play a pivotal role in improving model generalization, and successful applications in fault detection have demonstrated their effectiveness [28]. At the same time, projectors with different initializations produce different feature transformations. We therefore devise a simple projector integration method to further enhance the performance of the student model. Figure 1 illustrates the distillation framework of this paper. Our proposed method not only leads to better performance but also eliminates the need for complex knowledge representations and hyperparameter tuning procedures. Extensive experiments show that the performance of the student model is significantly improved by using the teacher classifier and continues to improve as the number of integrated projectors increases.

Fig. 1

The distillation framework proposed in this paper. The model is divided into two parts, an Encoder and a Classifier. The first layer of the final classifier uses a projection function to map features into the joint space for matching; only the Encoder and Projector of the student model are updated during training, and the teacher's final classifier is shared with the student for the final inference

The main contributions of the present paper are as follows.

1. This paper proposes a generalized distillation method that has only a single loss term and does not require hyperparameter tuning to balance multiple losses;

2. A simple and effective projector integration method is designed for feature alignment, and student performance improves consistently as the number of integrated projectors increases. The sizes of the projector parameters and of the final classifier in different models are also investigated;

3. We have conducted a large number of experiments on the CIFAR-100 and Tiny-ImageNet datasets, and the method proposed in this paper is always competitive in terms of both accuracy and convergence speed compared with the latest methods.

Related work

Knowledge distillation (KD) is an essential model compression method because it allows a smaller model to approach the accuracy of a larger one while remaining fast and compact. Hinton et al. [15] were the first to transfer knowledge from a large, complex model (the teacher) to a small, simple model (the student). As shown in Fig. 2a, during training the soft predictions of the teacher model are used as an additional training objective to guide the student model, improving its performance.

Fig. 2

a–c represent three distillation techniques. a Vanilla KD smooths the logits after the teacher's final classification and computes a distillation loss to guide the update of the student model. b Feature distillation extracts intermediate features of the model and generates gradients to guide the update of the student model. Distillation methods a and b do not require a projector. c KD with projector: the gradient comes from the alignment loss on the final feature map together with the vanilla KD loss. The Encoder, Projector, and Classifier are updated during training

The Vanilla Knowledge Distillation method is described symbolically as follows. We first define the notation used in this section: \(y\) is a one-hot label and \(x\) a training sample from an \(N\)-class classification dataset. The features encoded by the second-to-last layer of the student model are denoted \(f^{s} = \{ s_{1} ,s_{2} ,s_{3} ,\ldots,s_{i} ,\ldots,s_{b} \} \in R^{d \times b}\), where \(d\) and \(b\) are the student's feature dimension and the batch size. A feature \(s_{i}\) is fed to the classifier with weights \(W^{s} \in R^{N \times d}\) to obtain logits \(g^{s} = W^{s} s_{i} \in R^{N}\) and the classification prediction \(p^{s} = \sigma (g^{s} /T) \in R^{N}\) after the softmax function σ(·) at temperature \(T\). The corresponding teacher features are \(f^{t} = \left\{ {t_{1} ,t_{2} ,t_{3} ,\ldots,t_{i} ,\ldots,t_{b} } \right\} \in R^{m \times b}\), where \(m\) is the feature dimension of the teacher [15].

$$ L_{KD} = L_{CE} (y,p^{s} ) + T^{2} L_{KL} (p^{t} ,p^{s} ) $$
(1)

As shown in Eq. (1), the loss function used in vanilla knowledge distillation consists of two parts. The first part is the traditional cross-entropy loss \(L_{CE}\), which measures the discrepancy between the student model's output and the true label and trains the student's predictive capability; here the temperature \(T\) is 1. The other part is \(L_{KL}\), the additional supervisory signal contributed by the soft targets of the teacher model. These soft targets are the probability distributions output by the teacher and may provide richer information that helps the student model learn more; here the temperature \(T\) is typically greater than 1 [14, 35].
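For concreteness, a minimal PyTorch sketch of the loss in Eq. (1) is shown below. It assumes integer class labels and raw logits from both networks; the function and argument names are illustrative and are not taken from any released implementation.

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    """Eq. (1): hard-label cross-entropy (temperature 1) plus the
    T^2-scaled KL divergence between softened distributions."""
    # Hard-label term: standard cross-entropy with integer class labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL(p^t || p^s) at temperature T; the T^2 factor keeps
    # its gradient magnitude comparable to the hard-label term.
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")
    return ce + (T ** 2) * kl
```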

Hinton's success has led to various logit-based improvements. Response-based knowledge typically relies on the output of the last layer, and feature-based knowledge drawn from intermediate layers is a natural extension of logit-based knowledge [17]. To further combat performance degradation in distillation and make KD more practical for model compression, more information from the intermediate layers has come into use.

FitNets [18] minimizes the L2 distance between the feature maps generated by intermediate layers of the student and teacher networks. The intermediate output of the teacher's feature extractor is used as hints for the student, realizing feature distillation for the first time. Several studies have further optimized FitNets-style knowledge extraction [23, 29]. Subsequent studies devised different knowledge representations to transfer knowledge, such as sample relations encoded by pairwise similarity matrices [21, 30], the maximization of mutual information [18], or relations modeled by contrastive learning [20]. To make full use of the intermediate features of teacher models, recent approaches focus on feature association [31, 32]. Figure 2b shows the structure of these methods; while they elaborate the knowledge representation, there is no unified account of how the knowledge above positively affects the student model, and the iterative hyperparameter tuning complicates the distillation process. Furthermore, these studies have shown that the last feature in the network distills better [21, 32]. One possible explanation is that the last feature representation is directly connected to the model's classifier and therefore directly affects the model's performance [32].

Projection techniques have been introduced in some feature distillation methods, as shown in Fig. 2c, where a simple 1 × 1 convolutional kernel or fully connected layer transforms the features after the student encoder [21, 32]. However, these methods do not attribute their success to the projector. A few studies attempted to improve the projector, such as Factor Transfer (FT) [34] and Overhaul of Feature Distillation (OFD) [35], but did not produce competitive results. Thus, the critical role of the projection has been overlooked for some time. We also note that when a model must handle multiple tasks with different data distributions, the essential operation is to freeze the shallow shared layers as an encoder and refine the final classifier. Therefore, the discriminative information contained in the teacher's classifier is just as important as the feature-extraction portion of the encoder. Moreover, the success of the SH-KD method in HeadSharingKD [25] also provides evidence that introducing the teacher classifier into the student model is effective. Drawing on HeadSharingKD and ensemble learning theory, this paper proposes a simple distillation framework [21, 25,26,27,28].

Figure 1 shows the distillation framework proposed in this paper, which improves upon the original distillation framework through two key operations: Teacher-Classifier-Share (abbreviated TCS) and Projector Integration. TCS replaces the classifier of the student model with the classifier of the more powerful teacher model; for the student to use this better-performing classifier, a projector for feature alignment must be added after the student's encoder. The more similar the student and teacher features are, the smaller the gap between their performances; thus, the projector is crucial to the methodology proposed in this paper. Considering that differently initialized projectors provide different feature transformations, we propose the Projector Integration method, which further improves the projection and thereby the performance of the student model.

The proposed method

Improved distillation with projector integration

To enable the student to make inferences with the teacher classifier, a projector \(Proj( \cdot )\) is needed to convert the student or teacher features. The projector is usually applied to the student: if it were applied to the teacher model to align with the student, the original rich discriminative structure of the teacher's features would be destroyed. We therefore add the projector \(Proj(s_{i} ) = \tau (Ws_{i} )\), where \(\tau ( \cdot )\) is the \({\text{ReLU}}\) function and \(W \in R^{m \times d}\) is a weight matrix. As a reference point, SRRL combines feature-based and logit-based losses through a projection to improve distillation performance; however, hyperparameter tuning has an enormous impact on the distillation results, and tuning the additional coefficients increases the computational cost. To reduce the computational complexity of distillation, we employ the simple Direction Alignment (DA) loss [23, 36, 37]:

$$ L_{DA} = \frac{1}{2b}\sum\limits_{i = 1}^{b} {\left\| {\frac{{{\text{Proj}}(s_{i} )}}{{\left\| {{\text{Proj}}(s_{i} )} \right\|_{2} }} - \frac{{t_{i} }}{{\left\| {t_{i} } \right\|_{2} }}} \right\|}_{2}^{2} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {{\text{Proj}}(s_{i} ),t_{i} } \right\rangle }}{{\left\| {{\text{Proj}}(s_{i} )} \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} $$
(2)

where \(\left\| \cdot \right\|_{2}\) denotes the L2-norm and \(\left\langle { \cdot , \cdot } \right\rangle\) denotes the inner product of two vectors. However, differently initialized projectors produce different transformed features, and integrating multiple projectors should, in theory, provide a better transformation. Moreover, because the projectors use the ReLU function to allow non-linear feature extraction, the projected student features may contain zeros, whereas teacher features obtained through the average pooling operation commonly used in CNNs cannot be zero; integrating a set of multiple projectors effectively mitigates this problem. To verify this, we integrate the projectors using the following method:

$$ {\text{Proj}}_{\text{Int}} (s_{i} ) = \frac{1}{q}\sum\limits_{k = 1}^{q} {\text{Proj}}_{k} (s_{i} ) $$
(3)

where \(q\) is the number of projectors and \({\text{Proj}}_{k} ( \cdot )\) denotes the transformed features of the k-th projector. We then measured the difference between the student and teacher features when integrating different numbers of projectors as follows:

$$ M_{DA} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {s_{i} ,t_{i} } \right\rangle }}{{\left\| {s_{i} } \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} $$
(4)

Figure 3a depicts the \(M_{DA}\) values obtained for student models without a projector and for student models integrating different numbers of projectors, each run with different random seeds. The results show that the \(M_{DA}\) values for students without projectors are significantly lower than those for students equipped with projectors and that the \(M_{DA}\) values gradually increase as the number of projectors grows. In addition, we measured the cosine similarity between classes in the student feature space:

$$ M_{BC} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\sum\limits_{j = 1}^{{c_{i} }} {\frac{{\left\langle {s_{i} ,s_{j} } \right\rangle }}{{c_{i} \left\| {s_{i} } \right\|_{2} \left\| {s_{j} } \right\|_{2} }}} } $$
(5)

where \(s_{j}\) is the j-th sample belonging to a class different from that of \(s_{i}\), and \(c_{i}\) is the total number of such \(s_{j}\) for \(s_{i}\). Figure 3b shows the \(M_{BC}\) results of three runs with different random seeds when integrating different numbers of projectors. The results show that the student's ability to differentiate features gradually increases as the number of integrated projectors grows.
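The diagnostics in Eqs. (4) and (5) can be computed directly from batched features. The sketch below is an illustrative PyTorch version, assuming features flattened to shape (batch, dim) and, for \(M_{DA}\), equal student and teacher dimensions as in the ResNet32 × 4–ResNet8 × 4 pair; the function names are ours.

```python
import torch.nn.functional as F

def direction_alignment_gap(student_feats, teacher_feats):
    """M_DA of Eq. (4): one minus the mean cosine similarity between
    matched student and teacher features (dimensions must agree)."""
    cos = F.cosine_similarity(student_feats, teacher_feats, dim=1)
    return 1.0 - cos.mean()

def between_class_separation(student_feats, labels):
    """M_BC of Eq. (5): one minus the average cosine similarity between
    each sample and the other-class samples in the batch."""
    f = F.normalize(student_feats, dim=1)
    sim = f @ f.t()                                    # pairwise cosine similarities
    diff_class = (labels.unsqueeze(0) != labels.unsqueeze(1)).float()
    # For each sample i, average over its c_i other-class samples.
    per_sample = (sim * diff_class).sum(dim=1) / diff_class.sum(dim=1).clamp(min=1)
    return 1.0 - per_sample.mean()
```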

Fig. 3

a, b The left figure shows the direction alignment loss between teacher and student features without a projector and with different numbers of integrated projectors. The right figure shows the average inter-class cosine similarity in feature space for students with different numbers of integrated projectors. These results were obtained with the teacher–student pair ResNet32 × 4–ResNet8 × 4 on CIFAR-100. The feature dimensions of ResNet32 × 4 and ResNet8 × 4 are the same, i.e., \(m = d\)

In summary, with only a single projector there is still a gap between the feature distributions of the teacher and the student; therefore, this paper uses Projector Integration. With Projector Integration, the Modified Direction Alignment (MDA) loss is as follows:

$$ L_{{{\text{MDA}}}} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {{\text{Proj}}_{{{\text{Int}}}} \left( {s_{i} } \right),t_{i} } \right\rangle }}{{\left\| {{\text{Proj}}_{{{\text{Int}}}} \left( {s_{i} } \right)} \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} . $$
(6)
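A minimal PyTorch sketch of the projector ensemble of Eq. (3) and the MDA loss of Eq. (6) follows. The class and function names are ours; each single projector follows the definition \(Proj(s_{i}) = \tau(Ws_{i})\) above, i.e., a linear layer followed by ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectorEnsemble(nn.Module):
    """Horizontal projector integration of Eq. (3): q independently
    initialized linear+ReLU projectors whose outputs are averaged."""
    def __init__(self, student_dim, teacher_dim, q=3):
        super().__init__()
        self.projectors = nn.ModuleList(
            [nn.Sequential(nn.Linear(student_dim, teacher_dim), nn.ReLU())
             for _ in range(q)]
        )

    def forward(self, student_feats):                  # (batch, student_dim)
        outs = [proj(student_feats) for proj in self.projectors]
        return torch.stack(outs, dim=0).mean(dim=0)    # (batch, teacher_dim)

def mda_loss(projected_feats, teacher_feats):
    """Modified Direction Alignment loss of Eq. (6): one minus the mean
    cosine similarity between projected student and teacher features."""
    cos = F.cosine_similarity(projected_feats, teacher_feats, dim=1)
    return 1.0 - cos.mean()
```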

Improved distillation with teacher-classifier-share

One of the most critical operations in our approach is Teacher-Classifier-Share (TCS), which directly reuses a pre-trained teacher classifier for student inference instead of training a new one. TH-KD has shown that attaching the teacher's classifier to the student's backbone and freezing its parameters guides the feature extraction process and leads to improvements [25]. In addition, when a model must handle multiple tasks with different data distributions, a basic approach is to freeze or share some shallow layers as feature extractors while the last layer is fine-tuned to learn task-specific information [38,39,40].


Algorithm 1: Knowledge Distillation based on Projector Integration and Classifier Sharing

In this multitask setting, existing work assumes that task-invariant information can be shared, while task-specific information must be captured separately, usually by the final classifier. Similarly, we assume that most capability-specific information is contained in the deep layers, and reusing these layers gives direct access to this task-specific information. Figure 1 therefore shows the final distillation architecture, in which we directly reuse the pre-trained teacher classifier for student inference instead of training a new one. This removes the need for label information to compute a cross-entropy loss and leaves the feature alignment loss as the sole source of gradients, as in Eq. (7), where \(\alpha\) is a scaling hyperparameter. The details of our method are shown in Algorithm 1.

$$ L_{{{\text{total}}}} = \alpha L_{{{\text{MDA}}}} $$
(7)
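The following sketch illustrates Algorithm 1 under the assumptions above: student_encoder, teacher_encoder, and teacher_classifier are hypothetical modules returning flattened features and logits, projectors is the ensemble from the previous sketch, and mda_loss is reused from it. Only the student encoder and the projectors receive gradients, and no labels are needed during distillation; the frozen teacher classifier performs the final inference.

```python
import torch

def distill_step(student_encoder, projectors, teacher_encoder, images,
                 optimizer, alpha=400.0):
    """One training step of Algorithm 1: only the student encoder and the
    projector ensemble are updated; no label information is used."""
    with torch.no_grad():
        t_feats = teacher_encoder(images)              # frozen teacher features
    s_feats = student_encoder(images)                  # student features
    projected = projectors(s_feats)                    # map into the teacher's space
    loss = alpha * mda_loss(projected, t_feats)        # Eq. (7): the only loss term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def student_predict(student_encoder, projectors, teacher_classifier, images):
    """Inference with Teacher-Classifier-Share: projected student features
    are fed to the frozen, pre-trained teacher classifier."""
    logits = teacher_classifier(projectors(student_encoder(images)))
    return logits.argmax(dim=1)
```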

Experiments

We performed extensive experiments to demonstrate the efficacy and superiority of our proposed method. The “Implementation details” section describes the experimental setup and the baseline models; the “Main results” section compares our method with several representative methods on publicly available datasets and demonstrates its superiority. We then experimentally verify the important roles of projector integration and Teacher-Classifier-Share. The “Hyperparameter effect” section discusses the effects of different loss terms and hyperparameters on the distillation results.

Implementation details

Datasets. Two representative benchmark datasets were chosen for performance evaluation: CIFAR-100 [42] and Tiny-ImageNet [42]. The CIFAR-100 dataset contains 50,000 training images and 10,000 test images from 100 classes at a 32 × 32 image size. The Tiny-ImageNet dataset contains 100,000 training images and 10,000 validation images from 200 classes at a 64 × 64 image size, which we resize to 32 × 32 in our experiments. Standard image augmentation is used for both datasets, and all images are normalized by channel mean and standard deviation.
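The augmentation is only described as standard with per-channel normalization, so the exact pipeline is not specified. A common torchvision recipe consistent with that description is sketched below; the CIFAR-100 channel statistics are widely used approximate values, and the Tiny-ImageNet statistics are placeholders.

```python
from torchvision import transforms

# Commonly used CIFAR-100 per-channel statistics (approximate).
CIFAR100_MEAN = (0.5071, 0.4865, 0.4409)
CIFAR100_STD = (0.2673, 0.2564, 0.2762)

cifar_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD),
])

# Tiny-ImageNet images (64 x 64) are resized to 32 x 32 as in the experiments;
# its own channel statistics would replace the placeholders below.
tiny_train_transform = transforms.Compose([
    transforms.Resize(32),
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD),  # placeholder statistics
])
```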

Baselines. To demonstrate the advanced performance of our proposed method, we chose several representative state-of-the-art distillation methods of different types for comparison; their details are given below:

KD [15]: This logit-based approach transfers knowledge by matching the softened softmax outputs of teacher and student with a KL-divergence loss. FitNet [18]: It extracts a single intermediate representation learned by the teacher as knowledge and uses it to guide the student's learning at the corresponding intermediate layer. AT [24]: This approach uses attention maps learned from the teacher network as distilled knowledge for the student network, improving student performance. SP [22]: This approach argues that students do not need to mimic the teacher's representation space but should instead be encouraged to preserve pairwise similarities in their own representation space. VID [19]: This approach formulates knowledge transfer as the maximization of mutual information between the teacher and student networks. RKD [25]: It transfers knowledge from teacher to student using relational information among samples to optimize student performance. CRD [21]: It linearly projects the student's features into the teacher's space and trains the student to capture more of the teacher's description of the data. SRRL [23]: It decouples representation learning from classification and uses a convolutional kernel to transform the student features.

Training details. All training followed the settings of previous work [21], using the same hyperparameters as the corresponding distillation methods; the hyperparameter settings for SRRL were supplied by its authors [23]. A variety of representative networks were selected to form teacher–student pairs [10, 15, 43,44,45,46]. On the CIFAR-100 dataset, we used an SGD optimizer with 0.9 Nesterov momentum, a batch size of 64, and 240 training epochs; the initial learning rate was 0.01 for MobileNet and ShuffleNet and 0.05 for the other models, the learning rate was multiplied by 0.1 at the 150th, 180th, and 210th epochs, and the distillation temperature \(T\) was set to 4. On the Tiny-ImageNet dataset, an SGD optimizer with 0.9 Nesterov momentum was used; to obtain a pre-trained teacher model, the batch size was set to 64, 120 epochs were trained, the initial learning rate was 0.05, and the learning rate was multiplied by 0.1 at the 30th, 60th, and 90th epochs. The remaining settings are the same as for CIFAR-100. In addition to the results reported from a single run on Tiny-ImageNet, experiments on the CIFAR-100 dataset were run three times with different random seeds, and the average accuracy is reported; the best results are bolded in the tables. The number of integrated projectors was set to 3, the hyperparameter \(\alpha\) was set to 400, and the hyperparameters of the different methods were fixed across all experiments for a fair comparison. All experiments were run on the same NVIDIA GeForce RTX 3080Ti GPU.
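Under the assumptions of the earlier sketches (student_encoder, projectors, teacher_encoder, distill_step, and a train_loader are hypothetical names), the CIFAR-100 schedule above might be set up as follows; the weight decay is not stated in the text and is an assumed common value.

```python
import torch

# CIFAR-100 schedule: SGD with Nesterov momentum 0.9, batch size 64,
# 240 epochs, learning rate x0.1 at epochs 150, 180, and 210.
# lr = 0.01 for MobileNet/ShuffleNet students and 0.05 otherwise.
params = list(student_encoder.parameters()) + list(projectors.parameters())
optimizer = torch.optim.SGD(params, lr=0.05, momentum=0.9, nesterov=True,
                            weight_decay=5e-4)          # weight decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 180, 210], gamma=0.1)

for epoch in range(240):
    for images, _ in train_loader:                      # labels are not used
        distill_step(student_encoder, projectors, teacher_encoder,
                     images, optimizer, alpha=400.0)
    scheduler.step()
```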

Main results

Comparison of test accuracy

Tables 1 and 2 give the test accuracies of 12 different teacher–student pairs on the CIFAR-100 dataset. The teacher models in Table 1 are all ResNet-32 × 4, and the student models include architectures similar to the teacher as well as entirely different models, including representative lightweight models. Experiments are performed on these network combinations using different distillation methods, and their accuracies are reported. The test accuracies in the tables show that our method consistently outperforms the competition on the CIFAR-100 dataset. On the two teacher–student combinations “ResNet-32 × 4&MobileNetV2” and “ResNet-32 × 4&ResNet-8 × 4”, our method improves over the KD method by 2.97% and 2.79%, respectively; even compared with the best-performing SRRL method, our method still gains 1.59% and 1.30% in accuracy. Furthermore, for the network pair “VGG13&VGG8”, the accuracy of the student model exceeds that of the teacher model after distillation with our method. Following the self-distillation literature, a plausible explanation is that the student model becomes more robust while aligning the features and thus yields better results [47, 48].

Table 1 Top-1 test accuracy of ResNet32 × 4 as a teacher for a variety of knowledge distillation methods on the CIFAR-100 dataset
Table 2 Top-1 test Accuracy of different knowledge distillation methods on the CIFAR-100 dataset using different network pairs

To further demonstrate the effectiveness of our proposed method, we conduct experiments on the larger and more complex Tiny-ImageNet dataset with representative network pairs. The results of different distillation methods on Tiny-ImageNet are presented in Table 3. As can be seen, some distillation methods perform poorly on Tiny-ImageNet, whereas CRD, SRRL, and our method still perform well. However, whereas CRD requires transforming the features of both the student and teacher networks, our method maps student features into the teacher space for feature matching, which is one reason it achieves better performance. The distilled ShuffleNetV2 (8.2M) achieves nearly the same accuracy as ResNet32 × 4 (29.6M) with only about one-quarter of its parameters.

Table 3 Test accuracy of different knowledge distillation methods on Tiny-ImageNet dataset a ResNet-32 × 4&ResNet8 × 4, b ResNet-110 × 2&VGG8, c ResNet-110&MobileNetV2 × 2, d WRN-40–4&ShuffleNetV2

Figure 4a, b shows the test accuracy of the above pairs over the training epochs. Compared with the other methods, our method converges more rapidly and consistently outperforms the other distillation methods in the experiment across all epochs.

Fig. 4

a, b Test accuracy of ResNet-32 × 4&ResNet8 × 4 and ResNet-110&MobileNetV2 × 2, respectively, over the training epochs. The blue line represents our method, and the left axis is Top-1 accuracy

Analysis of projector integration

This section analyzes the performance of different projector types, compares the effect of the horizontal and vertical integration methods on model performance, and finally discusses the effect of the added projectors on the number of model parameters. All results are obtained on the CIFAR-100 dataset with the “ResNet-32 × 4&ResNet-8 × 4” combination. In some feature distillation methods, projectors are used to make student and teacher features as similar as possible, and most of these projectors transform the features via simple \(1 \times 1\) convolution kernels or linear layers. Table 4 shows the test accuracy and final test loss when the teacher classifier is used for final inference after mapping the student features with different types of single projectors. The results show that the linear projection has strong potential, outperforming the convolution-based projectors by 1.16% and 0.70% in accuracy, respectively. Since a single linear projector already outperforms the 1 × 1Conv–1 × 1Conv variant, we opt for the linear projector for projector integration.

Table 4 Comparison of accuracy of different types of single projectors

Horizontal integration of projectors. We integrate linear projectors horizontally in the layer before the classifier and take the teacher model's classifier directly for the student model. Table 5 shows the Top-1 classification accuracy for different numbers of horizontally integrated projectors. Models using multiple horizontally integrated projectors were significantly better than a single projector in Top-1 accuracy, and accuracy improved consistently as the number of projectors increased, saturating at around 4 projectors. This experiment demonstrates the simplicity and efficiency of the horizontally integrated projector method, which significantly improves distillation performance.

Table 5 Top-1 classification accuracy on CIFAR-100 using different teacher–student pairs with the horizontally integrated projector. 1-Proj, 2-Proj, 3-Proj, 4-Proj, and 5-Proj denote the number of linear projectors integrated in a layer

Vertical integration of projectors. Integration can also be achieved by cascading projectors in depth to increase model depth; we therefore test whether integrating projectors in the depth direction yields better results. Table 6 shows how distillation performance changes as linear projections are stacked stepwise. In this table, 2-MLP, 3-MLP, 4-MLP, and 5-MLP are multilayer perceptrons in which each layer outputs m-dimensional features followed by ReLU activation. The results in Table 6 show that simply stacking projectors in the depth direction not only fails to improve accuracy but also reduces projection efficiency, degrading the student model's performance. Therefore, we ultimately choose horizontal projector integration for our approach.

Table 6 Top-1 Classification Accuracy on the CIFAR-100 using different teacher–student pairs in the Vertically Integrated Projector. 1-MLP, 2-MLP, 3-MLP, 4-MLP, 5-MLP indicate the number of linear projectors integrated in the depth direction

Adding projectors after the student model's encoder introduces additional parameters, and the projector size differs for different teacher–student combinations. The sizes of the projectors in various teacher–student combinations are shown in Table 7. Although they differ in size, the overall increase in the number of parameters is manageable, allowing more advanced performance at a low cost. Moreover, our integration strategy can flexibly change the number of integrated projectors to balance the additional parameters against performance. Because of this trade-off, we typically integrate 3 projectors rather than the 4 that gave the best results in the horizontal integration experiments. Even with only 3 integrated projectors, the results in Tables 1, 2, and 3 show that our method is superior to the other methods in both performance and convergence speed.

Table 7 Projector and classifier size for different teacher–student combinations

Analysis of teacher-classifier-share

This section discusses the critical role of the teacher classifier in distillation. To this end, we designed several experiments. Similar to the previous setup, we performed projector integration horizontally in the first layer of the classifier and then, without TCS, trained a student classifier to perform the final classification.

Tables 8 and 9 report the Top-1 accuracy for horizontally integrated projectors when inference uses the teacher classifier versus a retrained student classifier. With the same number of integrated projectors, the accuracy of students using the teacher's classifier is significantly better than that of retraining the student's classifier. VGG8 achieves 74.45% accuracy even with a single projector when using the teacher's classifier, which is 0.1% higher than the best result of 74.35% obtained by retraining the student classifier with three integrated projectors. For the same number of integrated projectors, the accuracy gap between models with and without the TCS operation is around 1%. Finally, we note that models using the teacher classifier have a higher potential to consistently improve student performance as the number of integrated projectors increases.

Table 8 Using the “VGG13&VGG8” network combination on the CIFAR-100 dataset
Table 9 Using the “ResNet-32 × 4&ResNet-8 × 4” network combination on the CIFAR-100 dataset

Hyperparameter effect

This section discusses the effect of the hyperparameters on the performance of the student model. In our initial experiments the loss did not have only one term; as in other distillation methods [14, 20, 22], we trained the student model with a joint objective. \(L_{\text{Initial}}\) is the initial loss, which consists of the vanilla distillation loss \(L_{KD}\) and the integrated-projection direction alignment loss \(L_{MDA}\), with \(\alpha\) and \(\beta\) as two hyperparameters balancing the two losses. The formula is as follows:

$$ L_{{{\text{Initial}}}} = \alpha L_{{{\text{MDA}}}} + \beta L_{{{\text{KD}}}} $$
(8)

We used two teacher–student pairs, “ResNet-32 × 4&ResNet-8 × 4” and “VGG13&VGG8”, for experiments on the CIFAR-100 dataset, fixing \(\alpha\) at 50 and training with different values of \(\beta\). The results are shown in Fig. 5b. Increasing \(\beta\) does not improve the model's performance; instead, accuracy decreases continuously as \(\beta\) grows, suggesting that the vanilla KD loss contributes little once projector integration is used. When \(L_{KD}\) is removed, i.e., when \(\beta\) is 0, both teacher–student pairs obtain the best distillation results. We therefore drop the \(L_{KD}\) loss, so that the final loss function simplifies to Eq. (7).

Fig. 5

a, b show the effects of varying the hyperparameters, respectively. All results in the figure were obtained by integrating 3 projectors and using the teacher classifier

We chose two teacher–student pairs, “ResNet-32 × 4&ResNet-8 × 4” and “WRN-40-2&MobileNetV2”, to observe the effect of different values of \(\alpha\) on distillation on the CIFAR-100 dataset, with \(\alpha\) ranging from 25 to 600. The results are shown in Fig. 5a. They indicate that our method is not sensitive to the hyperparameter \(\alpha\): good distillation results are obtained for any value from 50 to 500. For the “ResNet-32 × 4&ResNet-8 × 4” combination, the best distillation result is obtained at \(\alpha = 400\), where the distilled student reaches 77.21% accuracy, and accuracy starts to decrease once \(\alpha\) exceeds 500. However, \(\alpha\) cannot be too small; when its value is too small, the model collapses during training. As long as \(\alpha\) lies within this reasonable interval, the method proposed in this paper outperforms the other methods.

Conclusion and future work

In this paper, we explore a simple and effective distillation method that exploits the value of the teacher classifier through straightforward parameter reuse. We also investigate the critical role of feature matching in distillation: because feature dimensions usually do not match, students in most cases cannot directly use the better-performing teacher classifier, so feature matching plays a crucial role in our method, and the quality of the projection directly affects the final performance of the student. Building on the positive roles of the integrated projector and the teacher classifier in distillation, we conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets, and our method performs competitively with other state-of-the-art methods across different teacher–student combinations.

On the other hand, integrating projectors increases the complexity of the model, and despite the manageable number of additional parameters introduced by the integrated projectors, developing an efficient projector-free distillation scheme remains a challenging problem. Combining unstructured pruning techniques with projector pruning is a worthwhile way to reduce the parameters added by the projectors. In addition, the method proposed in this paper applies only to supervised distillation tasks such as image classification, machine translation, and data prediction; developing a distillation method for unsupervised learning scenarios is also a challenging direction for research.