
The Phenomenon of Resonance in Knowledge Distillation: Learning Students by Non-strong Teachers

  • Conference paper
Advances in Neural Computation, Machine Learning, and Cognitive Research V (NEUROINFORMATICS 2021)

Abstract

In real-world scenarios, we often want a model that is both small and accurate. Smaller neural networks require less computational power and run faster than larger ones or ensembles; as a drawback, however, their representational capacity suffers. Several techniques make it possible to improve the predictive ability of such models while keeping their complexity fixed. Knowledge distillation, one of the most popular of these techniques, introduces an additional term into the loss function: during training of the student network, it minimizes the Kullback-Leibler divergence between the smoothed (temperature-scaled) output distributions of the teacher and the student. We compare existing studies in the field of knowledge distillation and, specifically, explore how the gap in predictive ability between teacher and student influences the student's improvement. The results show a clear dependency. Experiments in the image domain demonstrate the same tendency as the accuracy of the teacher's predictions increases. These insights make it possible to choose the teacher network's complexity so that knowledge is distilled into the student network most efficiently in terms of the student's accuracy-complexity tradeoff.
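
For illustration, the following is a minimal sketch of the distillation loss described above, written in PyTorch. The temperature T and mixing weight alpha are illustrative defaults, not the settings used in the paper, and the function name is hypothetical.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Hard-label term: ordinary cross-entropy with the ground-truth labels.
        ce = F.cross_entropy(student_logits, labels)
        # Soft-label term: KL divergence between the temperature-smoothed
        # teacher and student output distributions; the T^2 factor keeps its
        # gradient magnitude comparable to that of the hard-label term.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        # Weighted sum of the soft-label and hard-label terms.
        return alpha * kd + (1.0 - alpha) * ce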



Author information

Corresponding author

Correspondence to Valery Terekhov.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Terekhov, V., Ishkov, D. (2022). The Phenomenon of Resonance in Knowledge Distillation: Learning Students by Non-strong Teachers. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds) Advances in Neural Computation, Machine Learning, and Cognitive Research V. NEUROINFORMATICS 2021. Studies in Computational Intelligence, vol 1008. Springer, Cham. https://doi.org/10.1007/978-3-030-91581-0_4
