Gradient Methods for Optimizing Metaparameters in the Knowledge Distillation Problem

Gorpinich, M.; Bakhteev, O. Yu.; Strijov, V. V.

doi:10.1134/S00051179220100071

THEMATIC ISSUE
Published: 20 December 2022

Volume 83, pages 1544–1554, (2022)

Automation and Remote Control

M. Gorpinich¹,
O. Yu. Bakhteev² &
V. V. Strijov²

Abstract

The paper investigates the distillation problem for deep learning models. Knowledge distillation is a metaparameter optimization problem in which information from a model of more complex structure, called the teacher model, is transferred to a model of simpler structure, called the student model. The paper proposes a generalization of the distillation problem to the case where the metaparameters are optimized by gradient methods. Metaparameters are the parameters of the distillation optimization problem. The loss function for this problem is the sum of a classification term and the cross-entropy between the responses of the student model and the teacher model. Choosing optimal metaparameters for the distillation loss function is computationally expensive. The properties of the optimization problem are investigated in order to predict the metaparameter update trajectory. The trajectory of the gradient optimization of the metaparameters is analyzed, and their values are predicted using linear functions. The proposed approach is illustrated by computational experiments on the CIFAR-10 and Fashion-MNIST datasets as well as on synthetic data.
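
The abstract does not state the loss function explicitly, so the following is only a minimal sketch of what such a loss might look like, under stated assumptions. It follows the widely used formulation of Hinton, Vinyals, and Dean: a weight `beta` balances the classification term against the student-teacher cross-entropy, and a `temperature` softens both response distributions. Treating `beta` and `temperature` as the metaparameters, and all function names below, are assumptions made for illustration and are not taken from the paper. The helper `fit_line` illustrates the idea of predicting a metaparameter value with a linear function of the optimization step; the trajectory values used are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted for numerical stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, beta, temperature):
    """Weighted sum of a classification term and a student-teacher term.

    beta and temperature stand in for the metaparameters (an assumption).
    """
    n_classes = len(student_logits)
    one_hot = [1.0 if k == label else 0.0 for k in range(n_classes)]
    # Classification term: cross-entropy between the true label and the student.
    classification = cross_entropy(one_hot, softmax(student_logits))
    # Distillation term: cross-entropy between softened teacher and student responses.
    teacher_term = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1.0 - beta) * classification + beta * teacher_term


def fit_line(steps, values):
    """Least-squares fit of values ~ slope * step + intercept."""
    n = len(steps)
    mean_s = sum(steps) / n
    mean_v = sum(values) / n
    var = sum((s - mean_s) ** 2 for s in steps)
    cov = sum((s - mean_s) * (v - mean_v) for s, v in zip(steps, values))
    slope = cov / var
    return slope, mean_v - slope * mean_s


# Loss for one object with three classes (illustrative numbers only).
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0],
                         label=0, beta=0.5, temperature=2.0)

# Hypothetical trajectory of one metaparameter over four update steps,
# extrapolated one step ahead with the fitted line.
slope, intercept = fit_line([0, 1, 2, 3], [0.10, 0.15, 0.20, 0.25])
predicted = slope * 4 + intercept
```

With the made-up trajectory above, the fitted line extrapolates to roughly 0.30 at step 4. In a gradient-based setting, `beta` and `temperature` would themselves be updated by differentiating a validation loss with respect to them, which this sketch does not attempt.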

Funding

This work was supported by K.V. Rudakov’s Academic Scholarship and by the Russian Foundation for Basic Research, project no. 20-07-00990.

Author information

Authors and Affiliations

Moscow Institute of Physics and Technology, Dolgoprudnyi, Moscow oblast, 141701, Russia
M. Gorpinich
Dorodnicyn Computing Centre, Russian Academy of Sciences, Moscow, 119333, Russia
O. Yu. Bakhteev & V. V. Strijov

Corresponding authors

Correspondence to M. Gorpinich, O. Yu. Bakhteev or V. V. Strijov.

Additional information

Translated by V. Potapchouck

About this article

Cite this article

Gorpinich, M., Bakhteev, O.Y. & Strijov, V.V. Gradient Methods for Optimizing Metaparameters in the Knowledge Distillation Problem. Autom Remote Control 83, 1544–1554 (2022). https://doi.org/10.1134/S00051179220100071

Received: 17 February 2022
Revised: 23 June 2022
Accepted: 29 June 2022
Published: 20 December 2022
Issue Date: October 2022
DOI: https://doi.org/10.1134/S00051179220100071
