Abstract
In real-world scenarios, we often want a model that is both small and accurate. Smaller neural networks require less computational power and run faster than larger ones or ensembles, but their representational capacity suffers as a result. Several techniques can improve the predictive ability of such models without increasing their complexity. Knowledge distillation, one of the most popular, adds an extra term to the loss function: during training of the student network, it minimizes the Kullback-Leibler divergence between the temperature-smoothed output distributions of the teacher and the student. We compare existing studies in the field of knowledge distillation and, specifically, explore how the gap in predictive ability between teacher and student affects the student's improvement. The results show a clear dependency. Experiments in the image domain exhibit the same tendency as the accuracy of the teacher's predictions increases. These insights make it possible to choose the teacher network's complexity so that knowledge is distilled into the student most efficiently in terms of the student's accuracy-complexity tradeoff.
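For reference, the distillation objective sketched in the abstract can be written in a few lines. The following is a minimal illustration assuming PyTorch; the function name kd_loss and the default temperature T and mixing weight alpha are illustrative choices, not values taken from the paper.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between the temperature-smoothed teacher and student
    # output distributions; the T**2 factor keeps the gradient scale
    # comparable across temperatures (Hinton et al., 2015).
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    # Convex combination of the hard-label and distillation terms.
    return (1 - alpha) * ce + alpha * kl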
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Terekhov, V., Ishkov, D. (2022). The Phenomenon of Resonance in Knowledge Distillation: Learning Students by Non-strong Teachers. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds) Advances in Neural Computation, Machine Learning, and Cognitive Research V. NEUROINFORMATICS 2021. Studies in Computational Intelligence, vol 1008. Springer, Cham. https://doi.org/10.1007/978-3-030-91581-0_4
Print ISBN: 978-3-030-91580-3
Online ISBN: 978-3-030-91581-0
eBook Packages: Intelligent Technologies and Robotics (R0)