Introduction

In recent years, numerous deep neural networks such as AlexNet [1], VGG [2], GoogleNet [3], ResNet [4], DenseNet [5] and Transformer [6] have achieved state-of-the-art performance in the field of computer vision. While deeper neural networks often yield higher accuracy, they also have more parameters and greater storage requirements, resulting in increased computational costs. To mitigate this challenge, Refs. [7,8,9] use network pruning to remove redundant parameters, Refs. [10,11,12,13,14] utilize knowledge distillation to compress model sizes, and Refs. [15,16,17] apply weight quantization to reduce parameter bit precision. However, these methods typically rely on static inference, where all samples must traverse the entire network to obtain prediction results. In contrast, adaptive inference [18,19,20,21,22,23,24] dynamically adjusts the computational resources based on the input samples, thereby significantly improving inference efficiency. Combining adaptive inference with these static compression techniques can further improve network efficiency.

The most common approach for implementing adaptive inference is to construct a dynamic multi-exit network, which adjusts its structure based on input sample difficulty [24]. A multi-exit network incorporates multiple intermediate classifiers (ICs) at varying depths within the backbone network, enabling a quick classification of simple samples by shallow classifiers and handling more difficult samples with deep classifiers. This adaptive strategy not only enhances the network’s inference performance but also boosts its overall efficiency during inference.

To improve the adaptive inference efficiency of multi-exit networks, the self-distillation technique is commonly employed to enhance the classification accuracy of shallow classifiers. Given that the final classifier typically possesses deeper layers, more parameters, and stronger feature fitting capabilities, methods such as IMTA [19] and BYOL [21] often treat it as the teacher, transferring valuable knowledge to the shallow classifiers. However, through experimentation, we have observed instances where certain samples are accurately identified by the intermediate classifiers but misclassified by the final classifier. For example, when testing a ResNet18-based multi-exit network on the CIFAR100 dataset, approximately 5% of the test samples are correctly classified by the first classifier but are misclassified by the final classifier. This experiment demonstrates that different classifiers can extract varied knowledge and features due to discrepancies in a network’s structure and parameters. Relying solely on the final classifier as a teacher to transfer knowledge risks overlooking the distinctive knowledge offered by the intermediate classifiers.

To comprehensively extract and transfer effective knowledge from all the classifiers of a multi-exit network, we propose a novel multi-level collaborative self-distillation learning strategy (MLCSD). Initially, MLCSD aggregates logit results and feature maps from all the classifiers into logit-based and feature-based pools. Subsequently, an attention mechanism computes weight coefficients matching each classifier, and the pools are multiplied by these coefficients to obtain logit-based and feature-based teachers corresponding to each classifier. In contrast to traditional self-distillation, which only utilizes the final classifier as the teacher, MLCSD explores distinctive knowledge from all the classifiers to construct more comprehensive and effective teachers. These new teachers then transfer the knowledge back to each classifier. This collaborative learning approach promotes knowledge transfer between classifiers and enhances the overall classification performance of the network.

By conducting experiments on various multi-exit networks and diverse datasets, we validate the effectiveness and generality of the MLCSD strategy. The results consistently show that MLCSD outperforms the traditional self-distillation strategy.

Our main contributions can be summarized as follows:

  1. We propose the MLCSD, which constructs more comprehensive and effective teachers by extracting knowledge from all the classifiers. It can improve the inference efficiency of the multi-exit network without increasing computational costs.

  2. We employ various backbone networks and intermediate classifiers to construct multi-exit networks, thereby validating the effectiveness and generality of the MLCSD strategy on these networks.

  3. We conduct experiments on three datasets, and the experimental results show the effectiveness of the MLCSD in two typical adaptive inference applications: anytime prediction and budgeted batch classification.

Related work

Computationally efficient deep networks

Lightweight network model

In general, networks with more parameters and higher computational costs tend to outperform those with fewer parameters and lower computational costs. However, the excessive computations associated with deep networks pose deployment challenges in practical applications, particularly in time-sensitive and resource-limited scenarios. A direct and effective way to improve network computing efficiency is through the design of lightweight models. SqueezeNet [25] reduces the necessary number of channels and parameters by using compression and expansion layers. MobileNet [26] combines depthwise and pointwise convolution to form a depthwise separable convolution to replace general convolutions, reducing the computational costs. ShuffleNet [27] further reduces the computational costs by proposing pointwise group convolution and channel shuffling techniques. EspNet [28] combines pointwise convolution with spatial pyramid dilated convolution, reducing the number of parameters and computations while increasing the receptive field. GhostNet [29] first employs traditional convolution to generate feature maps with fewer channels, then further uses a depthwise convolution to reduce the computational costs, and finally integrates two groups of feature maps for inference. These methods mainly enhance standard convolutions in terms of the channel numbers and sparse connections between the convolution channels, aiming to reduce the network parameter sizes and improve the network inference speeds without sacrificing network performance.

Model compression and acceleration

Compressing the number of parameters or the scale of an existing network is also an effective way to improve network computational efficiency. Reference [7] leverages network pruning to remove parameters that contribute less to the network, thereby reducing the total number of parameters and accelerating the inference speed. Reference [10] utilizes knowledge distillation technology to encourage student networks to mimic teacher networks, enabling the student networks to achieve superior generalization performance and higher inference accuracy. Reference [15] quantizes network weights and uses low-precision bits to store the weights and activation outputs, which can markedly compress the network. Reference [28] applies a low-rank decomposition to the network’s convolution kernels and rearranges the parameter order to reduce memory consumption. These methods compress existing models by exploiting the redundancies of neural networks in different aspects, effectively reducing network parameters and computational costs to improve computational efficiency.

Adaptive inference networks

Adaptive inference [30] offers an effective mechanism for dynamically balancing accuracies and computational costs. One intuitive approach involves cascading multiple models with different complexities [31, 32]. When the inference confidence of one model meets the preset threshold, the sample exits; otherwise, it proceeds to the subsequent, more complex network for inference. However, these cascade networks are independent and entail higher training times and storage costs. During testing, difficult samples undergo successive processing by several models, leading to increased computational costs.

A more efficient approach involves adding multiple output branches to one backbone network and dynamically adjusting the network structure through a width or depth adaptation, striking a balance between accuracy and computational costs. References [33, 34] achieve adaptive inference in the width direction by dynamically activating neurons with auxiliary branch structures. Moe [35] considers multiple branches of a network built in parallel as experts, and selectively activates them to complete a width adaptation. HydraNet [36] replaces the last-stage convolution blocks of a convolution network with multiple branches, and selectively executes these branches during testing to achieve an adaptive width inference. In addition to width adaptation, the MSDNET [18] adds multiple intermediate classifiers at different depths to DenseNet [5]. Based on the MSDNET, RANNET [20] incorporates the sample spatial resolution adaptation module to further improve network efficiency. These studies on adaptive inference primarily focus on designing more sophisticated network architectures to improve network inference efficiency.

Table 1 Preliminary experiments of the multi-exit network

Knowledge distillation for adaptive inference

Knowledge Distillation [37] serves as a crucial technique for training efficient network models, typically transferring knowledge from a well-trained large-scale teacher model to a small-scale student model. KD [10] regards the softening result of the teacher network prediction logits as the knowledge containing the understanding of the class distribution information among the data. FitNet [11] considers the intermediate hidden layer features of the network as knowledge and uses the Euclidean distance loss to minimize the distance between the teacher and student features. AT [38] extends the constraints of FitNet and uses the teacher network’s attention map as distillation knowledge, achieving better results than FitNet. FSP [39] defines the relationship between the network feature layers as distillation knowledge and takes the inner product between the two layers as the learning objective. In addition to learning the knowledge within the instances, CC [40], SP [41], and RKD [42] also take the correlation between instances as transferable knowledge, enhancing the student models by calculating the characteristic matrix between the instances. These methods improve the performance of the student models by defining and transferring different types of knowledge.

Multi-exit networks integrate these distillation techniques for self-distillation. BYOL [21] and IMTA [19] first add multiple intermediate classifiers to the network, and then use the deepest classifier to distill the shallow classifier to improve the network’s performance. Reference [43] trains a more efficient multi-exit network by encouraging shallow classifiers to simulate a deeper classifier. Reference [44] proposes a one-stage online distillation framework to enhance the target network using multi-exit network integration instead of a complex two-stage training program. However, these methods use only the deepest layer classifier in the multi-exit network as the teacher, neglecting the effective knowledge contained in the other classifiers. This paper addresses this limitation by aggregating effective knowledge from each classifier to construct collaborative teachers. This collaborative learning approach enhances the classification accuracy of each classifier, thereby improving the inference efficiency of multi-exit networks.

Method

Motivation

Traditional knowledge distillation has conventionally focused on transferring knowledge from deeper and larger networks to shallower and smaller networks. Researchers have assumed that only the deepest classifier in a multi-exit network can transfer knowledge to the shallow classifier [21]. However, our experiments reveal that the shallow classifiers in the multi-exit network also possess transferable features and knowledge. To quantify the differences in the feature expressions and the knowledge between the shallow and deep classifiers in a multi-exit network, we conduct comparative experiments on the CIFAR100 dataset.

Fig. 1
figure 1

The network architecture of MLCSD-Net. a We divide the backbone network into four feature extraction stages based on depth. b Add a feature reduction layer, maxpooling layer, and fully connected layer to each feature extraction stage to form a multi-exit network. c The weight module calculates the contribution weight coefficients of each classifier. d Logit-based teachers transfer logit-based knowledge to the logits of each classifier. e The feature-based teacher transfers feature-based knowledge to the last feature layer of each classifier

We use ResNet18 and ResNet34 as the backbone for the multi-exit network and add seven additional intermediate classifiers to the backbone network. In Table 1, “Ex1–Ex8” denotes eight exits, “SE” represents the original single-exit network, “1*1-conv” signifies a direct predicted intermediate classifier structure with a 1*1 convolution, and “Interpolate-conv” represents an intermediate classifier structure with multiple feature reduction layers. Table 1 shows the classification accuracy of each classifier in the multi-exit network, as well as the proportion of Ex1–Ex7 correctly classified but misclassified by Ex8 (the values in parentheses).

When employing the 1*1-conv module as the intermediate classifier, the classification accuracy of the final classifier, Ex8, is lower than that of the single-exit model. Conversely, utilizing the interpolate-conv module results in the classification accuracy of Ex8 exceeding that of the single-exit model. The results indicate that adding appropriate intermediate classifiers to the backbone network can help improve the predictive performance of the final classifier. Despite sharing a backbone network, each classifier extracts distinctive features due to differences in depth, parameter quantity, parameter weights, and feature scales. Consequently, the multi-exit network exhibits two key characteristics: (1) The shallow classifiers possess unique features and knowledge that are absent in the deepest classifier and can transfer knowledge to other classifiers. (2) An appropriate intermediate classifier structure can enhance the final classifier’s performance. Leveraging these insights, we propose a novel multi-level collaborative self-distillation strategy (MLCSD), which extracts effective knowledge from all the classifiers to construct teachers, thereby enhancing each classifier’s classification accuracy. Additionally, we design an appropriate intermediate classifier structure for MLCSD-Net, facilitating the construction of an efficient multi-exit network and improving the overall inference efficiency.

Overall framework

This paper proposes a novel multi-level collaborative self-distillation network (MLCSD-Net) based on a multi-exit network, as depicted in Fig. 1. MLCSD-Net consists of two main components: a multi-exit network module and a multi-level collaborative self-distillation module. The multi-exit network module is the basic module, consisting of a backbone network and eight classifiers. The backbone network is organized into four feature extraction stages, denoted as Stages 1–4, each equipped with two classifiers. Each classifier contains four feature extraction layers, one Maxpooling layer, and one FC layer. Each classifier outputs prediction logits, which are supervised with a cross-entropy loss \(L_{CLS}\). For more details, please refer to section “Multi-exit network architectures”.

The multi-level collaborative self-distillation module is the key component of the MLCSD-Net, designed to enhance classification performance. This module consists of a classifier weight encoding module (CWEM), a multi-level collaborative logit-based self-distillation module (MCLSM), and a multi-level collaborative feature-based self-distillation module (MCFSM). CWEM inputs the results from the Maxpooling layer of each classifier into the weight encoding module, generating attention weight coefficients that correspond to each classifier (as denoted by the weight module in Fig. 1). MCLSM initially aggregates the logits from all the classifiers to formulate a logit-based pool (depicted as the orange rectangular block in Fig. 1). The logit-based pool is then multiplied by the weight coefficients generated in the CWEM to construct logit-based teachers tailored to each classifier. Ultimately, the logit-based teachers transfer logit-based knowledge to the corresponding classifier after softmax softening. The loss function for the MCLSM is computed using the Kullback–Leibler (KL) divergence loss, denoted as \(L_{KD}\). Further details are available in section “Multi-level collaborative logit-based self-distillation”.

The MCFSM builds a feature-based knowledge pool by collecting all the classifiers’ last feature maps (the red feature block in Fig. 1). This feature-based knowledge pool is multiplied by the weight coefficients generated in the CWEM to dynamically construct feature-based teachers. The feature-based teacher guides the feature alignment of the classifier through the L2 (MSE) loss. The loss function for the MCFSM is denoted as \(L_{HT}\). Additional insights are provided in section “Multi-level collaborative feature-based self-distillation”. The comprehensive loss function of MLCSD-Net encompasses \(L_{CLS}\), \(L_{KD}\), and \(L_{HT}\). Please refer to section “Loss function” for more details.

Multi-exit network architectures

Standard convolutional networks make predictions only at the final stage, necessitating computation over the entire network for outputting the results. In contrast, multi-exit networks have several classifiers at different depths. When the predictions from a shallow classifier surpass the predefined threshold, the prediction process can be terminated, avoiding the need for further computations in subsequent deeper networks, and conserving the computational resources. Suppose we add \(K-1\) classifiers at different depths of the backbone network to form a multi-exit network with a total of \(K\) classifiers. The predictive logits \(z_{i}\) for the output of the \(i_{th}\) classifier can be defined as \(z_{i}=E_{i}\left( x, \theta _{i}\right) \), where \(x\) is the input image, \(\theta _{i}\) represents all the parameters contained in the \(i_{th}\) classifier, and \(E_{i}\) represents a fully connected layer. The final predicted probability value can be calculated by softmax:

$$\begin{aligned} \textbf{p}_{\textbf{i}}={\text {softmax}}\left( z_{i}\right) \end{aligned}$$
(1)

The total classification loss for all branch classifiers can be expressed as:

$$\begin{aligned} L_{CLS}=\sum _{i=1}^{K} \lambda _{i} L_{CE}\left( p_{i}, y\right) \end{aligned}$$
(2)

where \(L_{CE}\) is the cross-entropy loss, \(y\) is the classification label of input sample \(x\), \(\lambda _{i}\) is the weight coefficient of the \(i_{th}\) classifier. Consistent with the setting in Ref. [18], we set all the weight coefficients of the classifiers to 1.
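To make the multi-exit formulation concrete, the following PyTorch sketch shows one way to attach exit heads to shared backbone stages and compute the joint loss of Eq. (2); the names `MultiExitNet`, `backbone_stages`, and `exits` are illustrative placeholders rather than the paper's implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiExitNet(nn.Module):
    """Sketch of a multi-exit wrapper: shared backbone stages, one exit per stage."""
    def __init__(self, backbone_stages, exits):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)  # backbone split by depth
        self.exits = nn.ModuleList(exits)             # one classifier head per stage

    def forward(self, x):
        logits = []
        for stage, exit_head in zip(self.stages, self.exits):
            x = stage(x)                  # shared feature extraction
            logits.append(exit_head(x))   # z_i = E_i(x, theta_i)
        return logits                     # K logit tensors, shallow to deep

def classification_loss(logits_list, targets, lambdas=None):
    """L_CLS = sum_i lambda_i * CE(p_i, y); all lambda_i = 1, as in the paper."""
    if lambdas is None:
        lambdas = [1.0] * len(logits_list)
    return sum(w * F.cross_entropy(z, targets) for w, z in zip(lambdas, logits_list))
```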

Convolutional networks learn fine-grained and local features in the shallow layers, while gradually discerning coarse-grained and global features in the deeper layers through stacked convolution modules. Integrating a fully connected prediction layer directly into the shallow layers of the backbone network poses challenges in achieving high classification accuracy due to insufficient depth and receptive field. Moreover, the spatial noise and erroneous predictions generated during inference may propagate to the deep classifiers through the backbone network, adversely affecting the subsequent classifiers and compromising the overall network performance.

To compensate for the coarse-grained feature that shallow classifiers lack, we designed a simple and low-computational intermediate classifier (SLIC). SLIC comprises feature reduction layers and a fully connected layer. The feature reduction layer employs an improved bottleneck structure consisting of a 3*3 convolution with a stride of 2 and a 1*1 convolution. The 3*3 convolution is designed to extract features and reduce the size of the feature map, while the 1*1 convolution adjusts the number of channels. Compared to BYOL [21], the improved SLIC not only extracts sufficient coarse-grained features but also reduces the number of parameters, achieving a balance between model complexity and performance.

The number of feature reduction layers required by the intermediate classifiers at different depths varies based on the current classifier’s feature dimension \(F_{i}\) and the final classifier’s feature dimension \(F_{k}\). Given \(F_{i}/F_{k}=2^{n}\), \(n\) represents the number of reduction layers that the current classifier will stack. For example, in the ResNet18 network with the CIFAR100 dataset, the feature map size is 32 in the first stage and 4 in the final stage. Thus, the first-stage intermediate classifier needs to stack three feature reduction layers. In the second stage, only two feature reduction layers need to be stacked. All the classifiers undergo four downsampling operations, depicted by the four color blocks in Fig. 1. This approach ensures uniform feature dimensions across all classifier outputs, facilitating the extraction of the coarse-grained features and subsequent distillation operations. Due to its lightweight nature, SLIC allows the multi-exit network to extract sufficiently coarse-grained features at a minimal computational cost, thereby enhancing classification performance.
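Below is a minimal sketch of how a SLIC exit head could be assembled from the description above, with the number of reduction layers derived from the ratio of feature-map sizes; the channel widths, normalization layers, and the use of global max pooling are assumptions, not details confirmed by the paper.

```python
import math
import torch.nn as nn

def reduction_layer(in_ch, out_ch):
    """One SLIC feature reduction layer: a 3*3 stride-2 convolution halves the
    spatial size and a 1*1 convolution adjusts the channel count (the BN/ReLU
    placement is assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def build_slic(in_ch, out_ch, feat_size, final_feat_size, num_classes):
    """Stack n = log2(F_i / F_k) reduction layers so that every exit reaches the
    same spatial resolution, then apply max pooling and a linear classifier."""
    n = int(math.log2(feat_size // final_feat_size))
    layers, ch = [], in_ch
    for _ in range(n):
        layers.append(reduction_layer(ch, out_ch))
        ch = out_ch
    return nn.Sequential(
        *layers,
        nn.AdaptiveMaxPool2d(1),    # the paper's Maxpooling layer; global pooling is assumed
        nn.Flatten(),
        nn.Linear(ch, num_classes)  # the fully connected prediction layer
    )
```

For the first-stage exit of ResNet18 on CIFAR100 (feature size 32, final size 4), `build_slic` would stack three reduction layers, matching the example above.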

Multi-level collaborative self-distillation module

The initial self-distillation network uses the deepest classifier to help the shallow classifiers improve their classification accuracy. Since the deepest classifier generally achieves the best classification accuracy, it is believed to contain the most transferable knowledge. We name this knowledge transfer mode “One for all”. However, while “One for all” can enhance the performance of shallow classifiers, its benefits to deeper classifiers are limited, and it can sometimes even degrade their performance. For example, as shown in Table 2, the accuracy of ResNet18 (One for all) decreases from 78.63% to 78.54% for Ex7 and from 78.70% to 78.53% for Ex8 after distillation. To address this issue, we propose a multi-level collaborative self-distillation (MLCSD) learning strategy that enables multiple classifiers to learn from each other. MLCSD extracts effective knowledge from all the classifiers to construct teacher models, which then transfer knowledge to each classifier to enhance its classification accuracy. We name this knowledge transfer method “All for all”. The MLCSD module comprises three components: the classifier weight encoding module, multi-level collaborative logit-based self-distillation, and multi-level collaborative feature-based self-distillation, which are detailed below.

Classifier weight encoding module

Since each classifier possesses distinctive knowledge, it may contribute differently during knowledge distillation. Therefore, we design a classifier weight encoding module (CWEM) to dynamically generate the importance weight coefficients for each classifier. The module consists of a 1*1 convolution layer, a Batch Normalization layer, and a softmax activation layer. The max-pooling results of each classifier are fed into this module, which outputs a weight matrix \(W_{ec}^{i}\) with a dimension of \(K*B*1\), where \(i\) denotes the \(i_{th}\) classifier.
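A possible realization of the CWEM is sketched below. The paper specifies a 1*1 convolution, batch normalization, and a softmax that yield a \(K*B*1\) weight matrix \(W_{ec}^{i}\) for each classifier; treating the exits as convolution channels and producing all \(K\) weight sets from a single projection are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class ClassifierWeightEncoder(nn.Module):
    """Sketch of CWEM: a 1*1 convolution, batch normalization, and a softmax
    producing, for every teacher i, a weight matrix W_ec^i of shape (K, B, 1)."""

    def __init__(self, num_exits):
        super().__init__()
        self.num_exits = num_exits
        # project the K pooled exit features to K*K per-sample scores (assumption)
        self.proj = nn.Conv1d(num_exits, num_exits * num_exits, kernel_size=1)
        self.bn = nn.BatchNorm1d(num_exits * num_exits)

    def forward(self, pooled_feats):
        # pooled_feats: (K, B, D) -- max-pooled features of the K classifiers
        K = self.num_exits
        x = pooled_feats.permute(1, 0, 2)              # (B, K, D)
        x = self.bn(self.proj(x))                      # (B, K*K, D)
        scores = x.mean(dim=2).view(-1, K, K)          # (B, K_teacher, K_source)
        weights = torch.softmax(scores, dim=2)         # normalize over source exits
        return weights.permute(1, 2, 0).unsqueeze(-1)  # (K_teacher, K_source, B, 1)
```

Emitting a separate weight set for every teacher mirrors the per-teacher coefficients reported later in Table 15.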

Multi-level collaborative logit-based self-distillation

In the multi-level collaborative logit-based self-distillation module (MCLSM), the logits \(p_{1}, \ldots , p_{K}\) of all the classifiers are fed into the logit-based pool \(L^{pool}\) to form a knowledge base \(L^{pool}\in {\mathbb {R}}^{K\times B\times C}\), where \(K\) is the number of classifiers, \(B\) is the batch size, and \(C\) is the number of categories of the dataset. The weight matrix \(W_{ec}^{i}\) is multiplied by the logit-based pool \(L^{pool}\) to obtain the logit-based teachers, each with a dimension of \(K\times B\times C\). This process is described by Formula (3):

$$\begin{aligned} \textrm{pt}_{i}=W_{e c}^{i} \odot L^{\text{ pool } } \end{aligned}$$
(3)

where \( W_{ec}^{i}\) is a learning matrix, \(\odot \) represents the multiplication of corresponding elements, \({pt}_{i}\) represents the logit-based teacher corresponding to the \({i}_{th}\) classifier. We mine the effective knowledge of each classifier and dynamically combine this knowledge through the contribution matrix to construct new logit-based teachers.

After constructing the logit-based teachers, each classifier’s logits can simulate the teacher by using \(KL\) divergence to learn the teacher’s understanding of the abstract relationship between the sample categories. The distillation loss of all the classifiers can be expressed as:

$$\begin{aligned} L_{K D}=\sum _{i=1}^{K} \theta _{i} \tau ^{2} K L\left( p_{i}^{\tau }, p t_{i}^{\tau }\right) \end{aligned}$$
(4)

where \(p_{i}^{\tau }={\text {softmax}}\left( \frac{z_{i}}{\tau }\right) \), \({z}_{i}\) is the logits result of the \({i}_{th}\) classifier, \({\tau }\) is the temperature coefficient in the knowledge distillation, and \({\theta }_{i}\) is the weight factor of the \({i}_{th}\) classifier. According to Ref. [18], we set the weight factors of all the classifiers to 1.
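The following sketch assembles the logit-based pool, the teachers of Eq. (3), and the distillation loss of Eq. (4). Reducing the weighted pool over the exit dimension to obtain a \(B\times C\) teacher per classifier, and detaching the teacher from the gradient graph, are our assumptions; the paper does not spell out these details.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(logits_list, weights, tau=4.0, thetas=None):
    """Sketch of MCLSM (Eqs. 3-4). logits_list holds K tensors of shape (B, C);
    weights[i] is the (K, B, 1) matrix W_ec^i used to build the teacher of the
    i-th exit. Summing the weighted pool over the exit dimension to get a
    (B, C) teacher is an assumption of this sketch."""
    K = len(logits_list)
    if thetas is None:
        thetas = [1.0] * K                                   # theta_i = 1, as in the paper
    logit_pool = torch.stack(logits_list, dim=0)             # (K, B, C) logit-based pool
    loss = 0.0
    for i, z_i in enumerate(logits_list):
        teacher = (weights[i] * logit_pool).sum(dim=0)       # (B, C) logit-based teacher pt_i
        log_p_student = F.log_softmax(z_i / tau, dim=1)
        p_teacher = F.softmax(teacher.detach() / tau, dim=1) # detaching the teacher is an assumption
        loss = loss + thetas[i] * (tau ** 2) * F.kl_div(
            log_p_student, p_teacher, reduction="batchmean")
    return loss
```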

Multi-level collaborative feature-based self-distillation

The results of FitNet [11] emphasize that the hidden feature layers of the network also contain valuable knowledge, and each classifier can also benefit from feature-based knowledge distillation. However, our experiments reveal that distilling features from the final classifier to the shallow classifiers results in a performance decline rather than an improvement (as shown in Table 14). For instance, the average accuracy using final classifier logits is 76.35% (Interpolate-conv+FLSM), whereas incorporating feature-based self-distillation (Interpolate-conv+FLSM+FFSM) decreases the average accuracy to 76.19%. This decrease occurs because the number of convolutional layers stacked by each classifier varies, leading to differences in feature scales and effective receptive fields. Forcing alignment between the shallow and final classifier feature layers may transfer hidden uncertainties from the final layer to the shallow layer, adversely affecting the performance of all the classifiers.

To solve the above problem, we design a multi-level collaborative feature-based self-distillation module (MCFSM) that enhances the representation ability of the intermediate feature layer. In the MCFSM, the last feature maps before the fully connected layer of all the classifiers are fed into the feature-based pool \(F^{pool}\in {\mathbb {R}}^{K\times B\times H\times W}\), where \(K\) is the number of classifiers, \(B\) is the batch size, and \(H*W\) is the resolution of the feature map. We take the contribution coefficient matrix \(W_{ec}^{i}\) in the MCLSM as the weight, and multiply it by the feature-based knowledge pool \(F^{pool}\) to construct feature-based teachers. The feature-based teachers are represented as follows:

$$\begin{aligned} \textrm{ft}_{i}=W_{e c}^{i} \odot F^{\text{ pool } } \end{aligned}$$
(5)

where \(W_{e c}^{i}\) is the weight matrix calculated in the MCLSM, \(\odot \) is the multiplication of corresponding elements, \({ft}_{i}\) represents the feature-based teacher corresponding to the \({i}_{th}\) classifier, and the dimension of the feature-based teacher is \(K\times H\times W\). We mine the effective features from each classifier and dynamically combine these features through the contribution matrix to generate new feature-based teachers. The adaptive feature-based teacher of each classifier can help the student learn more structural knowledge and improve each classifier’s accuracy.

Table 2 Classification accuracy of MLCSD-Net on the CIFAR100 dataset

We utilize the L2 loss to constrain the distance between student and teacher features. The feature-based distillation loss for all the classifiers is shown as follows:

$$\begin{aligned} L_{H T}=\sum _{i=1}^{K} \sigma _{i} L_{2}\left( f_{i}, f t_{i}\right) \end{aligned}$$
(6)

where \(f_{i}\) is the feature map of the \({i}_{th}\) classifier and \(\sigma _{i}\) is the weight factor of the \({i}_{th}\) classifier. According to Ref. [18], we set the weight factors of all the classifiers to 1.
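Analogously to the logit branch, a sketch of the feature-based pool, the teachers of Eq. (5), and the loss of Eq. (6) is given below; flattening each exit's last feature map and summing the weighted pool over the exit dimension are again assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(feats_list, weights, sigmas=None):
    """Sketch of MCFSM (Eqs. 5-6). feats_list holds the K last feature maps
    (one per exit, all with the same per-sample shape thanks to SLIC);
    weights[i] is the same W_ec^i matrix used in the MCLSM."""
    K = len(feats_list)
    if sigmas is None:
        sigmas = [1.0] * K                                              # sigma_i = 1, as in the paper
    feat_pool = torch.stack([f.flatten(1) for f in feats_list], dim=0)  # (K, B, H*W)
    loss = 0.0
    for i, f_i in enumerate(feats_list):
        teacher = (weights[i] * feat_pool).sum(dim=0)                   # (B, H*W) feature-based teacher ft_i
        loss = loss + sigmas[i] * F.mse_loss(f_i.flatten(1), teacher.detach())  # L2 alignment
    return loss
```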

Loss function

The total training loss of the MLCSD-Net consists of the cross-entropy classification loss \({L}_{CLS}\) of the multi-exit module, the multi-level collaborative logit-based self-distillation loss \({L}_{KD}\) of the MCLSM, and the multi-level collaborative feature-based self-distillation loss \({L}_{HT}\) of the MCFSM, which can be expressed as:

$$\begin{aligned} L=\gamma L_{C L S}+\alpha L_{K D}+\beta L_{H T} \end{aligned}$$
(7)

where \(L\) is the total training loss, and \(\gamma \), \(\alpha \), and \(\beta \) are three hyper-parameters used to balance the three loss terms of MLCSD-Net.
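A short sketch of Eq. (7), composing the loss helpers sketched in the previous sections; the default coefficients follow the grid-search values reported later in the hyper-parameter study (\(\gamma =1\), \(\alpha =5\), \(\tau =4\), \(\beta =10\)).

```python
def total_loss(logits_list, feats_list, targets, weights,
               gamma=1.0, alpha=5.0, beta=10.0, tau=4.0):
    """Sketch of Eq. (7): L = gamma*L_CLS + alpha*L_KD + beta*L_HT, reusing the
    helper functions from the earlier sketches."""
    return (gamma * classification_loss(logits_list, targets)
            + alpha * logit_distillation_loss(logits_list, weights, tau=tau)
            + beta * feature_distillation_loss(feats_list, weights))
```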

Experiments

We select six backbone networks for MLCSD-Net: ResNet18, ResNet34, ResNet50, ResNet101, MobileNetV2, and ShuffleNetV2. These six networks include four commonly used residual networks and two mainstream lightweight networks. To evaluate the effectiveness of MLCSD-Net, we conduct extensive experiments on three classic classification datasets: CIFAR10, CIFAR100, and Tiny-ImageNet. All the experimental results are averaged over three runs.

Datasets

CIFAR10 and CIFAR100 datasets

The CIFAR10 and CIFAR100 [45] datasets comprise 50,000 training and 10,000 test images with a fixed spatial resolution of 32\(\times \)32. These images are equally distributed over 10 and 100 classes, respectively. Following the settings in Ref. [18], we select 5000 images from the training set as the validation set for confidence threshold selection in adaptive inference. We adopt the standard data-augmentation techniques in [4], which include randomly cropping the images to 32 \(\times \) 32 pixels after padding 4 pixels on each boundary, random horizontal flipping, and normalization using the channel mean and standard deviation.

Tiny-ImageNet dataset

The Tiny-ImageNet dataset is derived from ImageNet [46] and contains 200 image categories. It has 500 training examples, 50 validation examples, and 50 test examples per class. All the images in the dataset are preprocessed and downsampled to 64 \(\times \) 64 pixels, while their original size is 256 \(\times \) 256. The data augmentation method is the same as that used for the CIFAR dataset.

Table 3 Classification accuracy of MLCSD-Net on the CIFAR10 dataset
Table 4 Classification accuracy of MLCSD-Net on the Tiny-ImageNet dataset

Implementation details

In this paper, all the experiments are implemented in PyTorch 1.7.0 in the Python 3.7.0 environment and performed on one NVIDIA Tesla V100 GPU with 32 GB memory. The proposed MLCSD-Net and the compared methods are trained end-to-end using the stochastic gradient descent (SGD) optimizer, with an initial learning rate of 0.05 and a mini-batch size of 64. Each model is trained for 240 epochs, and the learning rate decays by a factor of 0.1 every 30 epochs after epoch 150. Top-1 accuracy is the performance evaluation metric for the model.
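For reference, a training-loop sketch matching the schedule above is shown below (MultiStepLR milestones at epochs 150, 180, and 210 correspond to a decay of 0.1 every 30 epochs after epoch 150); the momentum and weight-decay values are assumptions, since the text does not state them.

```python
import torch

def train_mlcsd(model, train_loader, loss_fn, epochs=240):
    """Training-loop sketch: SGD with an initial learning rate of 0.05, decayed
    by 0.1 at epochs 150, 180, and 210. Momentum and weight decay are assumed
    values not stated in the text."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 180, 210], gamma=0.1)
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model, images, targets)  # e.g. the Eq. (7) total loss
            loss.backward()
            optimizer.step()
        scheduler.step()
```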

Adaptive inference setting

MSDNET [18] is the first paper to apply multi-exit networks to the field of adaptive inference. Consistent with the settings of MSDNET, this paper conducts adaptive inference experiments in two distinct scenarios, anytime prediction and budgeted batch classification.

In the anytime prediction scenario, the network can produce predictions at any given point in time, making a series of predictions between the initial and final classifications. Test samples pass through all the classifiers sequentially until the time budget is exhausted, the latest inference results are returned, or the output results meet a predefined threshold. When the time budget is limited, samples are mainly predicted by shallow classifiers. Conversely, when the time budget is sufficient, samples are predicted by a deeper classifier, which offers better performance.

In the budgeted batch classification scenario, samples are sequentially fed into each classifier for prediction. If the confidence of the predicted result of the \(i_{th}\) classifier exceeds its threshold, the current result is considered final and the sample does not proceed to subsequent classifiers. Given the computational budget, the prediction threshold for each classifier can be calculated from the validation set. During testing, the predictive confidence of each sample is compared to the threshold. If the predictive confidence exceeds the predefined threshold of the current classifier, the prediction is concluded. If none of the classifiers’ confidence surpasses the threshold, the final classifier provides the prediction. The average classification accuracy of all the test samples is then calculated to determine the inference accuracy under the current budget. For more detailed information about budgeted batch classification, please refer to Ref. [18].
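A sketch of the confidence-thresholded early-exit rule described above is shown below; it assumes single-sample inference for clarity and reuses the convention that the model returns a list of logits ordered from the shallowest to the deepest exit.

```python
import torch

@torch.no_grad()
def budgeted_batch_predict(model, x, thresholds):
    """Early-exit sketch: query exits from shallow to deep and stop at the first
    exit whose top-1 confidence reaches its threshold; otherwise fall back to
    the deepest classifier."""
    logits_list = model(x)                       # list of K logit tensors, shallow to deep
    for logits, thr in zip(logits_list, thresholds):
        probs = torch.softmax(logits, dim=1)
        conf, pred = probs.max(dim=1)
        if conf.item() >= thr:                   # assumes a batch of one sample
            return pred.item()
    return logits_list[-1].argmax(dim=1).item()  # deepest exit gives the final answer
```

In a real deployment the backbone stages would be executed incrementally so that deeper stages are skipped once an exit fires; running the full forward pass here only keeps the sketch short.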

Multi-level collaborative self-distillation network

To validate the effectiveness of MLCSD-Net, we perform experiments on three datasets: CIFAR10, CIFAR100, and Tiny-ImageNet. The experimental results are presented in Tables 2, 3 and 4. In these tables, “Backbone” refers to the selected backbone network, and “SE” is the result of training the corresponding backbone network as a single-exit model. “Method” indicates the training method used, including “MENM”, “One for all”, and “All for all”. “MENM” represents the basic multi-exit network module based on the SLIC structure proposed in this paper. “One for all” represents the distillation technology using the deepest classifier to distill the shallow classifiers, and “All for all” represents the distillation technology using the MLCSD proposed in this paper. “F(G)” represents computational costs. “Ex1–Ex8” represents the classification results of the eight classifiers of a multi-exit network. To facilitate network performance comparison, we use “Avg” to denote the average classification accuracy of the eight classifiers.

From Tables 2, 3 and 4, we can draw the following conclusions: (1) The accuracy of the final classifier exceeds that of the corresponding single-exit network after adding the proposed intermediate classifiers, which indicates the effectiveness of our design. (2) The “One for all” distillation technology exhibits a more obvious improvement on the shallow classifiers (Ex1–Ex4), while the deep classifiers (Ex5–Ex8) show no marked improvement or even a slight decline. (3) “All for all” enhances the classification performance of each classifier, including the deep classifiers. (4) Averaging the performance of the six networks on the CIFAR100 dataset, “All for all” achieves 1.36% higher accuracy than “MENM”, and “All for all” surpasses “One for all” by 0.87%.

Fig. 2
figure 2

Experimental results of anytime prediction mode on the CIFAR100 dataset (ResNet18 and ResNet34 as the backbone network)

Fig. 3
figure 3

Experimental results of the anytime prediction mode on the Tiny-ImageNet dataset (ResNet18 and ResNet34 as the backbone network)

We calculate and compare the time complexity of the different models. The self-distillation strategy is applied only during the training phase; the construction of teachers and the knowledge transfer process increase the network’s training cost. However, during the inference phase, the network does not generate additional parameters or incur extra computational costs. Consequently, both self-distillation networks and MLCSD-Nets have the same time complexity as the baseline multi-exit networks. Therefore, MLCSD can further improve the overall performance of adaptive inference networks without increasing computational costs. Detailed experimental results are shown in Table 2. We also calculate the p-values for MLCSD-Net (ResNet18) compared to the self-distillation method and the baseline multi-exit network using a two-tailed Welch’s t-test [47]. The p-values are \( 2.58 \times 10^{-5} \) and \( 8.31 \times 10^{-8} \), respectively. Both values are below the 0.05 threshold, indicating that the improvement from the multi-level self-distillation learning strategy is statistically significant.

Anytime prediction results

In the anytime prediction scenario, we perform experiments on the ResNet18 and ResNet34 networks using the CIFAR100 and Tiny-ImageNet datasets. The experimental results are illustrated in Figs. 2 and 3. The horizontal axis represents the network’s computational costs, and the vertical axis represents the classification accuracy. The blue line represents the basic multi-exit network, the black line represents the “One for all” distillation technology, and the red line represents the “All for all” distillation technology.

From the figures, it is evident that the accuracy of the “All for all” approach surpasses that of the “One for all” approach across all budgets, particularly at the lowest and highest budgets. For example, on the CIFAR100 dataset, when the budget is between 0.05 and 0.3 GFLOPs, the accuracy of ResNet18 (All for all) improves by 1.96–3.43% compared to that of ResNet18 (MENM). On the Tiny-ImageNet dataset, when the budget is 0.61 GFLOPs, ResNet34 (One for all) exhibits a 0.90% increase in accuracy compared to ResNet34 (MENM), while ResNet34 (All for all) achieves an even larger improvement of 1.32%. We also perform experiments on the ResNet50, ResNet101, MobileNetV2, and ShuffleNetV2 networks, as well as on the CIFAR10 dataset. The detailed experimental results are shown in the appendix.

Budgeted batch classification results

In the budgeted batch classification scenario, we first train the best network model on the training set, and then use the given budget to select the best performance classification result on the validation set. This enables us to determine the threshold for each branch classifier based on the minimum predictive confidence of the correctly classified samples. For instance, when evaluating the “All for all” network based on ResNet18 with 0.1 GFLOPs, if Ex1’s output threshold \(\theta _{1}=0.93\), only samples with predicted results greater than 0.93 are retained for an accuracy calculation during testing. The remaining samples continue to be processed by subsequent classifiers.
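The threshold rule described above can be sketched as follows: for every exit, the threshold is the minimum top-1 confidence among its correctly classified validation samples; the budget-dependent selection of exits is omitted and left as an assumption of this sketch.

```python
import torch

@torch.no_grad()
def calibrate_thresholds(model, val_loader):
    """Per-exit threshold = minimum top-1 confidence among the validation
    samples that the exit classifies correctly (the rule described above)."""
    per_exit_conf = None
    for images, targets in val_loader:
        logits_list = model(images)
        if per_exit_conf is None:
            per_exit_conf = [[] for _ in logits_list]
        for i, logits in enumerate(logits_list):
            probs = torch.softmax(logits, dim=1)
            conf, pred = probs.max(dim=1)
            per_exit_conf[i].extend(conf[pred.eq(targets)].tolist())
    # an exit with no correct validation sample keeps a threshold of 1.0
    return [min(c) if c else 1.0 for c in per_exit_conf]
```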

Fig. 4
figure 4

Experimental results of the budgeted batch model of CIFAR100 and Tiny-ImageNet datasets (ResNet18 and ResNet34 as the backbone network). The left figure is the result of CIFAR100 dataset, the right figure is the result of Tiny-ImageNet dataset

Figure 4 illustrates the budgeted batch classification experiment results on the CIFAR100 and Tiny-ImageNet datasets with ResNet18 and ResNet34 as the backbone networks. In the figure, the blue and orange solid lines represent the results of the “One for all” and “All for all” networks based on ResNet18, while the red and black dotted lines represent the results of the “One for all” and “All for all” networks based on ResNet34. As shown in Fig. 4, the “All for all” method achieves the highest accuracy under identical budget conditions. Specifically, on the CIFAR100 dataset, when the budget is 0.3 GFLOPs, the classification accuracy of ResNet18 (All for all) is 2.3% higher than that of ResNet18 (One for all). When the budget is 0.5 GFLOPs, the classification accuracy of ResNet18 (All for all) is 2.8% higher than that of ResNet18 (One for all). Similar results are observed on the Tiny-ImageNet dataset.

Ablation study

We evaluate the performance of MLCSD-Net’s MENM, MCLSM, and MCFSM on the CIFAR100 and Tiny-ImageNet datasets using ResNet18 as the backbone network. The results are shown in Tables 5 and 6. In the tables, “MENM” represents the multi-exit network module that uses the SLIC structure, “MCLSM” represents the multi-level collaborative logit-based self-distillation module, and “MCFSM” represents the multi-level collaborative feature-based self-distillation module.

Table 5 Results of the ablation experiment on the CIFAR100 dataset (ResNet18 as the backbone network)
Table 6 Results of ablation experiment on Tiny-ImageNet dataset (ResNet18 as the backbone network)

We take the multi-exit network with a 1*1 convolution as its intermediate classifier as the baseline. Compared with the baseline, MENM enhances the classification accuracy by 5.7% and 5.8% on the two datasets, respectively. Adding MCLSM to MENM further enhances the accuracy by 1.75% and 2.1%, respectively. Incorporating MCFSM on top of MENM and MCLSM leads to additional increases of 0.25% and 0.5%, respectively. These results demonstrate that MLCSD-Net improves the average accuracy by 8.07% and 7.18% over the baseline network on the two datasets. The ablation experiments confirm the effectiveness of each module in enhancing the classification accuracy of the multi-exit networks.

Comparative experiments of different intermediate classifier structures

To investigate the impact of different intermediate classifiers on network performance, we conduct several comparative experiments on CIFAR100 using ResNet18 as the backbone network. The experimental results are shown in Table 7. In addition to the classification accuracy, we also focus on the changes in computational costs and the depth of the classifiers. In Table 7, “1*1-conv” is the 1*1 convolutional direct prediction structure, “Interpolate-conv” is the SLIC structure designed in this paper, and “Bottleneck-conv” is a multilayer bottleneck structure [21]. “F(G)” represents the computational costs, “P(M)” represents the number of parameters, and “Depth” represents the average depth of all the branch networks.

Table 7 Results of intermediate classifiers with different structures
Table 8 Results of different teacher integration strategies

When using the 1*1-conv structure, the classification accuracy of Ex1–Ex4 is suboptimal, and the accuracy of Ex8 decreases by 1.0% compared to that of the original ResNet18. When using the Interpolate-conv structure proposed in this paper, the classification accuracy of the shallow layers is markedly improved. Ex1 and Ex2 show an average increase of 16.28%, reducing the performance gap with the deep classifiers. Although the FLOPs and depth of the network increase by 25% and 14%, respectively, the average accuracy of all the classifiers improves by 6.06%. When using the more complex Bottleneck structure for prediction, FLOPs and depth increase by 48% and 39%, respectively, but the average accuracy of all the classifiers increases by only 6.71%. Moreover, the Bottleneck structure results in all the classifiers having the same depth as the backbone network, preventing the formation of a dynamic-depth neural network and requiring more storage space and higher computational costs.

To balance computational efficiency and classification accuracy, we choose the SLIC structure as the intermediate classifier. Additionally, we combine the three intermediate classifier structures with MLCSD, which improves the average accuracy of the network by 1.73%, 2.01%, and 2.21%, respectively. These experimental results reflect the generality and effectiveness of MLCSD.

Comparison with the other teacher integration strategy

To verify the effectiveness of the MLCSD, we perform experiments with different teacher integration strategies on the CIFAR100 using ResNet18 as the backbone network. The results are presented in Table 8. In this table, “MENM” represents the multi-exit network employing the SLIC structure, which can be regarded as the baseline method in this comparative experiment. “FLSM” represents the traditional self-distillation strategy, where the deepest classifier acts as the teacher. “Minimum” represents selecting the classifier with the minimum loss among all the classifiers as the teacher. Compared to the baseline, “FLSM” enhances the average classification accuracy by 0.96%, and “Minimum” enhances it by 0.19%. The above methods attempt to select a single classifier from among all the available classifiers to transfer knowledge and improve network performance.

In the field of knowledge distillation, combining knowledge from multiple networks to achieve multi-teacher distillation often yields better performance. When knowledge distillation is performed on a multi-exit network, the distilled classifier can treat the remaining classifiers as teachers. Therefore, we apply the multi-teacher integration strategy [48,49,50] to multi-exit networks. “Average” represents averaging the results of all the classifiers [48], “Deeper” represents that all the distilled classifiers only learn from the classifiers that are deeper than themselves, “Attention” represents that each teacher’s weight is calculated using the entropy value of the predicted results [49], “Confidence” represents that the weight of each teacher is calculated based on the magnitude of the loss [50], and “MCLSM” represents the multi-level collaborative logit-based self-distillation proposed in this paper.

For a fair and objective comparison, we only distill the logits values of the classifiers. Compared to the baseline, “Average” enhances the average classification accuracy by 1.10%, “Deeper” enhances it by 0.81%, “Attention” reduces it by 1.22%, “Confidence” increases it by 1.01%, and “MCLSM” increases it by 1.75%. The experimental results demonstrate that the MLCSD is more effective for multi-exit network distillation than the traditional multi-teacher integration strategies.

Results of multi-level collaborative self-distillation strategy on other adaptive inference networks

To further validate the effectiveness of the MLCSD, we apply it to the classical adaptive inference networks MSDNET and RANNET. For both networks, the structure is set to base = 2, step = 3, block = 8. Experiments are performed on the CIFAR100 dataset. As shown in Table 9, the average accuracies of MSDNET and RANNET improve by 2.0% and 1.67%, respectively, without increasing the network inference cost. These results demonstrate the effectiveness of the MLCSD, highlighting its potential for deployment in other adaptive inference networks.

Table 9 Results of the integration of multi-level collaborative self-distillation learning strategy on other adaptive inference networks
Table 10 Results of the multi-level collaborative self-distillation learning strategy with other distillation methods

Results of multi-level collaborative self-distillation strategy on other distillation method

With the development of distillation techniques, many new distillation methods have been designed, such as the correlation consistent distillation method CC [40], the contrastive representation distillation method CRD [51], and the decoupled knowledge distillation method DKD [13]. We replace the logit-based distillation with these new knowledge distillation methods. Experiments are conducted on the CIFAR100 dataset using ResNet18 as the backbone network, and the results are shown in Table 10. In this table, MENM represents a multi-exit network without using a distillation technique, while CC, CRD, and DKD represent self-distillation experiments based on the three new distillation methods. CC-MLCSD, CRD-MLCSD, and DKD-MLCSD denote experiments integrating MLCSD with the three new distillation methods.

The experimental results show that compared to MENM, the CC, CRD, and DKD methods achieve average accuracy improvements of 0.52%, 0.73%, and 1.06%, respectively. When applying MLCSD to these three new distillation methods, the average classification accuracies of CC-MLCSD, CRD-MLCSD, and DKD-MLCSD increase by 0.27%, 0.31%, and 0.83%, respectively. These results demonstrate that MLCSD can be applied to other distillation techniques with good versatility.

Table 11 Results of MLCSD-Net with different number of intermediate classifiers
Table 12 Results of MLCSD-Net (ResNet-18) with different hyper-parameters

Results of different number of intermediate classifiers

In the design of MLCSD-Net, we adhere to the configurations established in MSDNET [18] by incorporating seven intermediate classifiers into the backbone network. Consequently, each stage of the backbone network contains two classifiers. A multi-exit network with eight classifiers provides more exit options, enhancing the network’s flexibility and making it better suited for adaptive inference tasks such as anytime prediction and budgeted batch classification. To verify the impact of the number of intermediate classifiers on network performance, we add three intermediate classifiers to the backbone network, forming a four-exit network. The classifiers Ex1, Ex2, Ex3, and Ex4 in the four-exit network have the same network structure and size as Ex2, Ex4, Ex6, and Ex8 in the eight-exit network. We conduct experiments on CIFAR100 using ResNet18 and ResNet50 as the backbone networks, and the experimental results are shown in Table 11. The results indicate that, despite varying numbers of intermediate classifiers within the same backbone network, the classification accuracy of the classifiers at the corresponding positions remains consistent. Furthermore, the improvements observed after applying the distillation techniques are comparable. The experimental results of ResNet18 (All for all) and ResNet50 (All for all) confirm that MLCSD improves the inference efficiency of the network.

Results of different hyper-parameters

We also evaluate the hyperparameters in MLCSD-Net (ResNet-18) on the CIFAR-100 dataset, and the results are shown in Table 12. Here, “r” is the coefficient of the classification loss, “a” is the weight of the logit-based self-distillation loss, “T” is the temperature coefficient of distillation, and “b” is the coefficient of the feature-based self-distillation loss. By comparing and analyzing the results, we can see that the coefficient “r” greatly influences network performance. When \(r=0.1\), the average accuracy of MLCSD-Net is 71.99%. When \(r=1\), the network is fully trained, and the average accuracy increases to 75.39%, a 3.40% improvement over \(r=0.1\). When “a” ranges from 0.2 to 10, the average accuracy is between 76.25% and 77.14%. Notably, when \(a = 5\), the network achieves the best performance. With fixed values of \(r=1\) and \(a=5\), MLCSD-Net achieves the best accuracy of 77.40% when \(T=4\) and \(b=10\). These results show that hyperparameter selection substantially affects network performance. For a fair comparison, all experiments are performed under consistent settings, and the most appropriate hyper-parameters are obtained via grid search [52].
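A generic grid-search sketch over these hyper-parameters is given below; `train_and_eval` is a hypothetical callable that trains MLCSD-Net with one configuration and returns its average validation accuracy, and the example grid values are illustrative.

```python
from itertools import product

def grid_search(train_and_eval, grid):
    """Generic grid search over the hyper-parameters r (gamma), a (alpha),
    T (tau), and b (beta); `train_and_eval` is a user-supplied callable that
    returns the average validation accuracy for one configuration."""
    best_cfg, best_acc = None, -1.0
    for r, a, T, b in product(grid["r"], grid["a"], grid["T"], grid["b"]):
        acc = train_and_eval(gamma=r, alpha=a, tau=T, beta=b)
        if acc > best_acc:
            best_cfg, best_acc = {"r": r, "a": a, "T": T, "b": b}, acc
    return best_cfg, best_acc

# Illustrative grid (example values only, not the paper's full search space):
# grid = {"r": [0.1, 1], "a": [0.2, 1, 5, 10], "T": [2, 4], "b": [1, 10]}
```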

Discussion

Traditional self-distillation and multi-level collaborative self-distillation

Knowledge distillation typically involves selecting a large-scale, pretrained model as the teacher to impart its knowledge to a student network without further training the teacher. In self-distillation, the deepest classifier generally serves as the teacher to provide knowledge, while the shallow classifier serves as the student to receive knowledge. However, since all the classifiers share a common backbone network, the gradients and the errors generated during training will be propagated through the backbone network, affecting each classifier’s classification accuracy. This observation motivates an exploration of the impact of the multi-exit network structure on distillation.

We perform two groups of self-distillation comparison experiments on different intermediate classifier structures using ResNet18 as the backbone network. The experimental results are presented in Table 13, where “1*1-conv” represents direct prediction using a 1*1 convolution, “Interpolate-conv” represents prediction using the SLIC structure designed in this paper, and the coefficient “a” represents the weight coefficient of the logit-based self-distillation loss. The results reveal that while the accuracy of the shallow classifiers improves markedly with the 1*1-conv intermediate classifier structure, the accuracy of Ex8 decreases. This is because the deepest classifier encourages the shallow classifier to learn more coarse-grained features through distillation, markedly enhancing the shallow classifier’s performance. However, the backbone part of the shallow classifier is mainly responsible for extracting fine-grained features and transferring them forward to the deeper convolution layer. After distillation, the backbone network struggles to extract both the coarse-grained and fine-grained features simultaneously, causing feature conflicts that propagate to the deeper layers and leading to a performance decline in Ex8.

Table 13 Results of self-distillation with different intermediate classifier structures
Table 14 Results of various self-distillation strategies

When the weight of the distillation loss is greater, the accuracy of Ex8 decreases more significantly. Although the middle layer classifiers are negatively affected by this conflict, they still benefit from knowledge distillation, resulting in minor performance changes. With the SLIC structure, the backbone network of the shallow classifier continues to extract fine-grained features, while its feature reduction layer extracts coarse-grained features. In this scenario, self-distillation does not disrupt the backbone network’s feature extraction, enhancing the accuracy of both Ex8 and the shallow classifiers. These results suggest that self-distillation is effective when conducted within the same hierarchical level but has limitations otherwise. Thus, in designing the SLIC, we use sufficient downsampling operations to balance the predictive performance and efficiency.

In contrast, MLCSD dynamically constructs suitable teachers for all the classifiers, even if the classifiers are not at the same level, which can improve the classification performance of each classifier. Applying MLCSD to multi-exit networks with three different intermediate classifier structures shows that it can enhance the classification performance of each classifier. The experimental results in Table 7 demonstrate the superior applicability of the MLCSD.

Ensemble self-distillation and multi-level collaborative self-distillation

In self-distillation experiments, the outputs of all the classifiers are generally averaged as a teacher to enhance the network’s performance through ensemble self-distillation [21]. To compare the performance differences between ensemble self-distillation and multi-level collaborative self-distillation (MLCSD), we conduct experiments on the CIFAR100 dataset using ResNet18 as the backbone network. The results are shown in Table 14. In the table, “IC” represents the structure of the intermediate classifier, “1*1-conv” is a direct prediction using a 1*1 convolution, and “Interpolate-conv” is prediction using the SLIC structure designed in this paper. “KD” represents the distillation method, “FLSM” is the traditional logit-based self-distillation, “Ensemble” is the self-distillation that integrates the logits of all the classifiers, and “MCLSM” is the multi-level collaborative logit-based self-distillation proposed in this paper.

The experimental results show that ensemble self-distillation achieves high performances only when the intermediate classifier structure is “Interpolate-conv”, whereas collaborative self-distillation is not limited by this constraint. For example, compared with the 1*1-conv (FLSM) mode, the average accuracy of the 1*1-conv (Ensemble) mode decreases by 1.13%, and the average accuracy of the 1*1-conv (MCLSM) mode increases by 0.44%. In contrast, compared with Interpolate-conv (FLSM) mode, the average accuracy of Interpolate-conv (Ensemble) improves by 0.22%, and that of Interpolate-conv (MCLSM) improves by 0.79%. This discrepancy is attributed to the shallow classifiers under the 1*1-conv structure lacking coarse-grained features and not achieving a good classification performance. When all the classifiers are combined, the performance of the ensemble teacher diminishes. Unlike ensemble self-distillation, collaborative self-distillation does not simply average the output of all the classifiers but dynamically calculates more appropriate teachers by mining each classifier’s effective features and knowledge. Compared to ensemble distillation, collaborative distillation can construct more effective teachers and has broader applicability.

Table 15 Contribution coefficients of each classifier in the construction of teachers with MLCSD

Table 15 shows the weight matrix values for constructing teachers with the multi-level collaborative self-distillation learning strategy. “Teacher-Ex1” to “Teacher-Ex8” represent the corresponding teachers for the eight classifiers, and “Ex1–Ex8” represent the average contribution coefficients of the eight classifiers. In contrast to ensemble self-distillation, which averages all the classifier results (a uniform weight of 0.125), the table shows that each classifier contributes a different coefficient when constructing the corresponding teachers.

Effect of feature-based self-distillation

To verify the impact of feature-based self-distillation on network performance, we conduct comparative experiments on the CIFAR100 dataset using ResNet18 as the backbone network. The results are presented in Table 14. In this table, “FLSM+FFSM” denotes the combination of final-classifier logit-based self-distillation and final-classifier feature-based self-distillation, while “MCLSM+MCFSM” represents the combination of multi-level collaborative logit-based self-distillation and multi-level collaborative feature-based self-distillation proposed in this paper. For example, compared with the 1*1-conv (FLSM) mode, the average accuracy of the 1*1-conv (FLSM + FFSM) mode is reduced by 0.22%. In contrast, compared with the 1*1-conv (MCLSM) mode, the average accuracy of the 1*1-conv (MCLSM + MCFSM) mode is improved by 0.22%.

The experimental results indicate that the addition of feature-based self-distillation does not enhance performance when combined with traditional logit-based methods. However, it enhances accuracy when integrated within the MLCSD strategy. This underscores the efficacy of our proposed multi-level collaborative approach in leveraging feature-based self-distillation to boost classifier performance across the network.

Limitations

This paper extends the self-distillation strategy to improve the classification accuracy of all the classifiers from the perspective of network training. However, this method does not truly establish connections between the branch networks, nor does it enable deep classifiers to correct the errors of shallow classifiers. Thus, developing a multi-exit network that can gradually correct the errors of shallow classifiers is a future research direction. Furthermore, the current implementation has focused solely on image classification tasks. To establish the method’s versatility and robustness, future work will involve extending our approach to other computer vision tasks.

Fig. 5
figure 5

Experimental results of the anytime prediction mode on the CIFAR100 dataset (ResNet50 and ResNet101 as the backbone network)

Fig. 6
figure 6

Experimental results of the anytime prediction mode on the Tiny-ImageNet dataset ( MobileNetV2 and ShuffleNetV2 as the backbone network)

Fig. 7
figure 7

Experimental results of the anytime prediction mode on the CIFAR10 dataset (ResNet18 and ResNet34 as the backbone network)

Conclusion

This paper proposes an adaptive inference network called MLCSD-Net based on a multi-exit network structure. MLCSD-Net achieved excellent performance in both anytime prediction and budgeted batch classification. To build an efficient network architecture, we designed an intermediate classifier using an improved bottleneck structure to balance the model’s accuracy with its computational cost. To train MLCSD-Net efficiently, we employed multi-level collaborative self-distillation (MLCSD) to mine and transfer effective knowledge among the classifiers, thereby enhancing the classification performance of each classifier. MLCSD exhibited high generality and adaptability across the multi-exit networks with diverse intermediate classifier structures. Additionally, we applied MLCSD to classic adaptive inference networks such as MSDNET and RANNET, further improving their classification performance. On the CIFAR10, CIFAR100, and Tiny-ImageNet datasets, compared with traditional self-distillation, MLCSD-Net based on ResNet18 as the backbone network achieved improvements in classification accuracy of 0.29%, 1.18%, and 0.84%, respectively.