Abstract
Multi-exit architectures allow early-stop inference to reduce computational cost, making them well suited to resource-constrained scenarios. Recent works combine the multi-exit architecture with self-distillation to simultaneously achieve high efficiency and decent performance at different network depths. However, existing methods mainly transfer knowledge from deep exits or a single ensemble to guide all exits, without considering that inappropriate learning gaps between students and teachers may degrade model performance, especially at shallow exits. To address this issue, we propose Multi-exit self-distillation with Appropriate TEachers (MATE) to provide diverse and appropriate teacher knowledge for each exit. In MATE, multiple ensemble teachers are obtained from all exits with different trainable weights. Each exit subsequently receives knowledge from all teachers, while focusing mainly on its primary teacher to keep an appropriate gap for efficient knowledge transfer. In this way, MATE achieves diversity in knowledge distillation while ensuring learning efficiency. Experimental results on CIFAR-100, TinyImageNet, and three fine-grained datasets demonstrate that MATE consistently outperforms state-of-the-art multi-exit self-distillation methods across various network architectures.
Abstract (Chinese)
The multi-exit architecture allows early-stop inference to reduce computational cost, which makes it usable in resource-constrained settings. Recent studies combine the multi-exit architecture with self-distillation to simultaneously achieve high efficiency and good performance at different network depths. However, existing methods mainly transfer knowledge from deep exits or a single ensemble to guide all exits, without considering that inappropriate learning gaps between students and teachers may degrade model performance, especially at shallow exits. To address this issue, we propose a multi-exit self-distillation method with appropriate teachers, which provides diverse and appropriate teacher knowledge for each exit. In our method, multiple ensemble teachers are obtained from all exits according to different trainable ensemble weights. Each exit receives knowledge from all teachers while focusing mainly on its corresponding primary teacher, so as to maintain an appropriate learning gap and achieve efficient knowledge transfer. In this way, our method achieves diverse knowledge distillation while ensuring learning efficiency. Experimental results on CIFAR-100, TinyImageNet, and three fine-grained datasets demonstrate that our method consistently outperforms state-of-the-art multi-exit self-distillation methods across various network architectures.
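To make the training scheme described above concrete, the following is a minimal sketch of how a MATE-style loss could be assembled in PyTorch. It assumes a multi-exit backbone that returns one logits tensor per exit; the class name, the softmax-normalized trainable ensemble weights, and the fixed primary-teacher coefficient are illustrative assumptions rather than the authors' exact formulation.

# A minimal, illustrative sketch of a MATE-style loss (not the authors' exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MATELoss(nn.Module):
    def __init__(self, num_exits, temperature=3.0, primary_coef=0.7):
        super().__init__()
        # One trainable weight vector per ensemble teacher (one teacher per exit);
        # these parameters must be passed to the optimizer along with the network.
        self.ensemble_weights = nn.Parameter(torch.zeros(num_exits, num_exits))
        self.num_exits = num_exits
        self.T = temperature
        self.primary_coef = primary_coef  # emphasis on each exit's primary teacher

    def forward(self, exit_logits, targets):
        # exit_logits: list of [batch, classes] tensors, ordered shallow to deep.
        stacked = torch.stack(exit_logits, dim=0)            # [E, B, C]

        # Hard-label cross-entropy on every exit.
        ce = sum(F.cross_entropy(z, targets) for z in exit_logits)

        # Build one ensemble teacher per exit from all exits with trainable weights.
        w = F.softmax(self.ensemble_weights, dim=1)           # [E, E]
        teachers = torch.einsum('te,ebc->tbc', w, stacked)    # [E, B, C]

        # Train the ensemble weights through a cross-entropy loss on each teacher.
        ce_teacher = sum(F.cross_entropy(teachers[t], targets)
                         for t in range(self.num_exits))

        # Distill every teacher into every exit, emphasizing the primary teacher;
        # teachers are detached here so student gradients do not alter them.
        kd = stacked.new_zeros(())
        off_coef = (1.0 - self.primary_coef) / max(self.num_exits - 1, 1)
        for i, student in enumerate(exit_logits):
            log_p = F.log_softmax(student / self.T, dim=1)
            for t in range(self.num_exits):
                q = F.softmax(teachers[t].detach() / self.T, dim=1)
                coef = self.primary_coef if t == i else off_coef
                kd = kd + coef * F.kl_div(log_p, q, reduction='batchmean') * self.T ** 2

        return ce + ce_teacher + kd

In this sketch the ensemble weights are optimized jointly with the network, so each teacher can adapt its composition during training, while every exit is pulled most strongly toward its own primary teacher to keep the learning gap appropriate.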
Data availability
The data that support the findings of this study are openly available. The data based on CIFAR-100 are available from https://www.cs.toronto.edu/~kriz/cifar.html.
The data based on TinyImageNet are available from http://cs231n.stanford.edu/tiny-imagenet-200.zip.
The data based on CUB-200-2011 are available from https://www.vision.caltech.edu/datasets/cub_200_2011/. The data based on Stanford Dogs are available from http://vision.stanford.edu/aditya86/ImageNetDogs/. The data based on FGVC-Aircraft are available from https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/.
Acknowledgements
The authors would like to thank the Supercomputing Center of Hangzhou City University, China, for providing advanced computing resources.
Author information
Contributions
Wujie SUN designed the research, processed the data, and drafted the paper. Defang CHEN, Can WANG, Deshi YE, Yan FENG, and Chun CHEN helped organize the paper. All the authors revised and finalized the paper.
Ethics declarations
All the authors declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (No. U1866602) and the Starry Night Science Fund of Zhejiang University Shanghai Institute for Advanced Study, China (No. SN-ZJU-SIAS-001).
About this article
Cite this article
Sun, W., Chen, D., Wang, C. et al. Multi-exit self-distillation with appropriate teachers. Front Inform Technol Electron Eng 25, 585–599 (2024). https://doi.org/10.1631/FITEE.2200644