Abstract
In real-world scenarios, we often want a model that is both small and accurate. Smaller neural networks require less computational power and run faster than larger ones or ensembles, but their representational capacity suffers as a result. Several techniques can improve the predictive ability of such models without increasing their complexity. Knowledge distillation, one of the most popular, adds an extra term to the loss function: during training of the student network, it minimizes the Kullback-Leibler divergence between the temperature-smoothed output distributions of the teacher and the student. We compare existing studies in the field of knowledge distillation and, specifically, explore how the gap in predictive ability between teacher and student affects the student's improvement. The results show a clear dependency. Experiments in the image domain exhibit the same tendency as the accuracy of the teacher's predictions increases. These insights make it possible to choose the teacher network's complexity so that knowledge is distilled into the student most efficiently in terms of the student's accuracy-complexity tradeoff.
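For reference, the distillation objective sketched in the abstract can be written in a few lines. The following is a minimal illustration assuming PyTorch; the function name kd_loss and the default temperature T and mixing weight alpha are illustrative choices, not values taken from the paper.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between the temperature-smoothed teacher and student
    # output distributions; the T**2 factor keeps the gradient scale
    # comparable across temperatures (Hinton et al., 2015).
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    # Convex combination of the hard-label and distillation terms.
    return (1 - alpha) * ce + alpha * kl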
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Terekhov, V., Ishkov, D. (2022). The Phenomenon of Resonance in Knowledge Distillation: Learning Students by Non-strong Teachers. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds) Advances in Neural Computation, Machine Learning, and Cognitive Research V. NEUROINFORMATICS 2021. Studies in Computational Intelligence, vol 1008. Springer, Cham. https://doi.org/10.1007/978-3-030-91581-0_4
Print ISBN: 978-3-030-91580-3
Online ISBN: 978-3-030-91581-0
eBook Packages: Intelligent Technologies and Robotics (R0)